Apache Kafka 101 in a nutshell
- The data structure that Kafka uses to store its messages is the log.
- Kafka uses the distributed log concept. A log represents a series of decisions such as “ knowing what previously happened and what I am to do next”.
- Every message must have a timestamp, if one isn’t provided via code. Kafka will provide one automatically.
- Data stored in the log is retained for 7 days.
- The log is immutable, i.e can’t be changed/deleted. It can only append data. To reiterate logs store data, each event record is written at a particular offset in the log. See diagram (fig 1) for a visual representation after reading about producers and consumers.
- One or many producers can append entries and as noted earlier logs are immutable. Therefore entries can never change or delete existing data.
- Multiple producers can write to logs.
- One or many consumers can read the data “at their own pace” by keeping track of the offset in the logs. This allows consumers to know where they last read data and where to proceed.
- Multiple consumers can read from the log.
- For example lets say a consumer read data from the log at point 3 (offset=3), when the consumer next reads from that log, it will know and will move onto an another point say 4 (offset=4).
- Producers write the data for the consumers to “read”/consume.
- Think of a topic as a category such as “/products”, “/transactions” etc
- The topic is where the producers write to and consumers read from.
- A topic can have one or more partitions.
- The offset numbers are always unique within a partition
- Topics can be split into multiple partitions, that can reside on multiple machines. This is part of what makes Kafka so fast! as they can be load can be balanced.
- Partitions are added to a topic, so that the producers can write/store the data and the consumers can read that data.
- Partitions can be on multiple machines to achieve high performance with horizontal scaling. i.e as more resource is needed — more machines can be added to account for the load.
- Multiple producers can write to different partitions of the same/replicated topic. Or in other words a producer can write to multiple replicated topics.
- For example see fig 2. We have one producer, who is writing to different partitions of the same replicated topic. Notice how the offsets are different? This is because the data is being load balanced between the different partitions. This greatly reduces load and can exponentially improve performance / provide high throughput.
- The example above doesn’t have custom partitioning. Meaning that data/events can’t be guaranteed across partitions. as producers “write at their own speed”. Meaning that the order that the data is added can’t be guaranteed.
- Messages can be sent to the same partition, continuously. Such as “Bike” products could always go to the first partition.
- More commonly a “Round Robin” strategy could be used, so that data is always evenly distributed across the partitions.
No locking concept in Kafka
- It’s great that multiple producers can write to the same topic/partition and that data can be load balanced, but there isn’t a locking concept.
- If producer A was writing a patch of messages to a topic. Producer B wouldn’t wait until Producer A finished writing those messages. Meaning that messages sent to a topic/partition won’t append in the order they were sent.
- In terms of speed and ensuring high throughput, this is great though! As producers are able to asynchronously send batches of messages to one or many topics.
- Consists of one or many servers/nodes
- Kafka brokers run Kafka on one or many nodes/servers.
- Topics are contained within the broker in the cluster
- Recommended to run many brokers/nodes inside a cluster to benefit from Kafka’s ability to replicate data, as noted earlier.
- Each node in a cluster is called a broker.
- Partitions can be replicated across multiple brokers (nodes).
- Out of several nodes in a cluster, one node is chosen as the leader by the controller.
- One of the brokers/nodes will serve as the active controller.
- Responsible for managing the partitions, replicating data, electing a leader broker etc
- Can only be one controller within a cluster.