KAFKA

Pranav Kumar
Oct 25, 2022

It’s been a long time, hasn’t it?


Kafka is an event streaming platform that collects, processes, stores, and integrates data at scale. It is also often described as a message queue.

People also call it a data subscription platform: once Kafka has the data, any consumer can subscribe to it.

Kafka is really good at moving data quickly, smoothly, and at scale.

It was created at LinkedIn and went open source under Apache in 2011. Commercial distributions are available from Cloudera, Confluent, and IBM.

You can integrate data from Flume as well. By default, Kafka retains data for 7 days. It is distributed, resilient, and fault tolerant by design.

It supports horizontal scalability: a cluster can scale up to 100 brokers and handle millions of messages per second, with latency under 10 milliseconds.

Kafka runs as its own cluster. It’s used for messaging, activity tracking, and stream processing. Netflix uses it to apply recommendations, Uber uses it to gather real-time trip data, and LinkedIn uses it to prevent spam.

#) TOPIC — Particular Stream of Data

We can have as many topics as we want. A topic is identified by its name. It supports any kind of message format, such as JSON, Avro, or plain text. A sequence of messages is called a DATA STREAM. We cannot query topics directly; instead, we use Kafka PRODUCERS to send data and Kafka CONSUMERS to read the data.

#) PARTITIONS AND OFFSETS

Topics are split into partitions (e.g., 100 partitions). Messages within each partition are ordered.


As messages arrive, they are appended to a partition and the ID keeps increasing. Each message inside a partition gets an incremental ID called an OFFSET.

Kafka topics are immutable: once data is written to a partition, it cannot be changed. An offset only has meaning within its specific partition, and offsets are not re-used. Order is guaranteed only within a partition. Data is assigned randomly to a partition unless a key is provided.
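The idea of partitions and per-partition offsets can be sketched in plain Python. This is a conceptual illustration only, not the real Kafka API; the `Topic` class and its methods are made up for this example.

```python
# Minimal in-memory sketch of a topic with partitions and offsets.
# NOT the real Kafka API -- a hypothetical model of the concept.

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        self.partitions[partition].append(message)
        # The offset is simply the message's position in that partition's log.
        return len(self.partitions[partition]) - 1

topic = Topic("cars_gps", num_partitions=3)
print(topic.append(0, "msg-a"))  # offset 0 in partition 0
print(topic.append(0, "msg-b"))  # offset 1 in partition 0
print(topic.append(1, "msg-c"))  # offset 0 in partition 1 (offsets are per partition)
```

Note how the same offset value (0) appears in two different partitions: an offset is only meaningful within its own partition.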

#) TOPIC EXAMPLE — CARS_GPS

  1. Each car reports its GPS position to Kafka.
  2. Each car sends a message to Kafka every 20 seconds.
  3. Each message contains the car ID and the car's position (latitude & longitude).
  4. The Kafka topic contains the positions of all cars.
  5. We can create the topic with 10 partitions.
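The steps above can be sketched as follows. This is a hedged illustration: the `partition_for` function stands in for the Kafka client's key-hashing (the real Java client uses murmur2, not MD5), and the car ID and coordinates are made up.

```python
import hashlib

# Hypothetical sketch: each car sends a keyed GPS message; the key
# (the car ID) determines the partition, so one car's positions stay ordered.
NUM_PARTITIONS = 10

def partition_for(car_id: str) -> int:
    # Stand-in for Kafka's key hashing (illustrative only; not murmur2).
    digest = hashlib.md5(car_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

message = {"car_id": "car-42", "lat": 28.61, "lon": 77.20}
p1 = partition_for(message["car_id"])
p2 = partition_for("car-42")
assert p1 == p2  # same key -> same partition, every time
```

Because every message from `car-42` hashes to the same partition, that car's position updates are read back in the order they were sent.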

#) KAFKA PRODUCERS

  1. Producers write data to topics.
  2. Producers know in advance which partition to write to and which Kafka broker has it.
  3. In case of Kafka broker failures, Producers will automatically recover.

#) PRODUCER MESSAGE KEYS

1. Producers can choose to send a key with the message (string, number, binary, etc.).

2. If key=null, data is sent round robin across partitions.

3. If key!=null, all messages with the same key go to the same partition (via hashing).

4. A key is typically sent when you need message ordering for a particular field.

The key is the heart of a Kafka message.
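The two partitioning behaviours, round robin for null keys and hashing for non-null keys, can be sketched like this. Again a conceptual model, not the real producer's partitioner (which uses murmur2 hashing and, in newer clients, sticky batching for null keys).

```python
import itertools
import hashlib

NUM_PARTITIONS = 4
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key):
    if key is None:
        # key=null -> round robin across partitions
        return next(_round_robin)
    # key!=null -> hash the key so equal keys land on the same partition
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# No key: messages spread across partitions in turn.
print([choose_partition(None) for _ in range(4)])  # [0, 1, 2, 3]
# Same key: always the same partition, preserving per-key ordering.
assert choose_partition("car-7") == choose_partition("car-7")
```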

#) KAFKA MESSAGES ANATOMY


#) KAFKA MESSAGE SERIALIZER

  1. Kafka only accepts bytes as input from producers and sends bytes out as output to consumers.
  2. Message serialization means transforming objects/data into bytes.
  3. Serializers are used on both the value and the key.
  4. Kafka producers come with common serializers, e.g. String (incl. JSON), Int, Float, Avro, and Protobuf.
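Serialization in this sense is just "turn the key and value into bytes before they hit the wire". A minimal sketch, using a string serializer for the key and a JSON serializer for the value (the function names here are made up for illustration):

```python
import json

# Sketch of serialization: the producer turns key and value into bytes.
def serialize_string(key: str) -> bytes:
    return key.encode("utf-8")

def serialize_json(value: dict) -> bytes:
    return json.dumps(value).encode("utf-8")

key_bytes = serialize_string("car-42")
value_bytes = serialize_json({"lat": 28.61, "lon": 77.20})
# Only bytes ever reach Kafka:
assert isinstance(key_bytes, bytes) and isinstance(value_bytes, bytes)
```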

#) KAFKA CONSUMERS

  1. Consumers read data from a topic (pull model).
  2. Consumers automatically know which broker to read from.

#) CONSUMER DESERIALIZER

  1. Deserialization indicates how to transform bytes back into objects/data.
  2. They are used on the value and the key of the message.
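Deserialization is the mirror image of the producer-side serializers: the consumer turns the raw bytes back into usable values. A small sketch with made-up function names:

```python
import json

# Sketch of deserialization: the consumer turns bytes back into data.
def deserialize_string(data: bytes) -> str:
    return data.decode("utf-8")

def deserialize_json(data: bytes) -> dict:
    return json.loads(data.decode("utf-8"))

assert deserialize_string(b"car-42") == "car-42"
assert deserialize_json(b'{"lat": 28.61}') == {"lat": 28.61}
```

The consumer must use deserializers that match the serializers the producer used, otherwise it will misread the bytes.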

#) CONSUMER GROUP

  1. All the consumers in an application read data as a consumer group.
  2. Each consumer within a group reads from exclusive partitions.
  3. If you have more consumers than partitions, some consumers will be inactive.
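The partition-to-consumer assignment can be sketched as a simple round-robin split (the real Kafka assignors, such as range or cooperative-sticky, are more elaborate, so treat this as a conceptual model):

```python
def assign_partitions(partitions, consumers):
    # Each partition goes to exactly one consumer in the group;
    # with more consumers than partitions, the extras get nothing.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions shared by 2 consumers:
print(assign_partitions([0, 1, 2], ["c1", "c2"]))   # {'c1': [0, 2], 'c2': [1]}
# 2 partitions, 3 consumers: c3 sits idle.
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
```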

#) CONSUMER OFFSETS

  1. Kafka stores the offsets at which a consumer group has been reading.
  2. The committed offsets are stored in a Kafka topic named __consumer_offsets.
  3. A consumer commits its offsets periodically.
  4. When offsets are committed, a consumer can resume reading from where it left off after a failure or restart.
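The commit-and-resume mechanism can be sketched with a plain dictionary standing in for the __consumer_offsets topic (the group and topic names here are hypothetical):

```python
# Sketch: a consumer commits offsets so it can resume after a restart.
committed = {}  # stand-in for the __consumer_offsets topic

def commit(group, topic, partition, offset):
    committed[(group, topic, partition)] = offset

def resume_from(group, topic, partition):
    # Resume just after the last committed offset (or from 0 if none).
    return committed.get((group, topic, partition), -1) + 1

commit("fleet-app", "cars_gps", 0, 41)
assert resume_from("fleet-app", "cars_gps", 0) == 42  # picks up after offset 41
assert resume_from("fleet-app", "cars_gps", 1) == 0   # nothing committed yet
```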

#) KAFKA BROKERS

  1. A Kafka cluster is composed of multiple brokers (servers).
  2. Each broker is identified by its ID.
  3. Each broker contains certain topic partitions.
  4. After connecting to any broker, you'll be connected to the entire cluster.
  5. A good number to get started is 3 brokers, but some big clusters have over 100 brokers.
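How a topic's partitions end up spread across brokers can be sketched with a simple round-robin placement (real Kafka placement also accounts for replication and rack awareness; the broker IDs below are made up):

```python
# Sketch: a topic's partitions are spread across the cluster's brokers.
brokers = [101, 102, 103]  # hypothetical broker IDs in a 3-broker cluster

def place_partitions(num_partitions):
    # Round-robin placement: one leader broker per partition.
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}

print(place_partitions(5))  # {0: 101, 1: 102, 2: 103, 3: 101, 4: 102}
```

No single broker holds the whole topic; the load is shared across the cluster.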

#) KAFKA BROKER DISCOVERY

  1. Every Kafka broker is also called a BOOTSTRAP SERVER.
  2. That means you only need to connect to one broker, and the Kafka client will know how to connect to the entire cluster.
  3. The Kafka client makes a connection to a bootstrap broker.
  4. The broker returns a list of all brokers in the cluster.
  5. The client then knows which topic and which partition live on which broker.
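The discovery flow above can be modelled in a few lines: the client asks any one broker for metadata, and the response describes the whole cluster. The metadata shape and IDs here are invented for illustration, not Kafka's real wire protocol.

```python
# Sketch of broker discovery: connect to one bootstrap broker,
# receive metadata describing the entire cluster.
CLUSTER_METADATA = {
    "brokers": [101, 102, 103],
    "partition_leaders": {("cars_gps", 0): 101, ("cars_gps", 1): 102},
}

def bootstrap(any_broker_id):
    # Any broker can serve cluster metadata, so one connection is enough.
    assert any_broker_id in CLUSTER_METADATA["brokers"]
    return CLUSTER_METADATA

meta = bootstrap(102)
print(meta["brokers"])  # the client now knows every broker in the cluster
```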
