Beginner’s Guide to Understand Kafka

1 Aug 2020 · CPOL · 8 min read
A guide to learning about Kafka, with the setup and test of a data pipeline in a Windows environment.
It’s a digital age. Wherever there is data, we hear about Kafka these days. One of the projects I work on involves an entire data system (with a Java backend) that leverages Kafka to move large volumes of data through various channels and departments. While working on it, I thought of exploring the setup on Windows. This guide introduces Kafka and walks through the setup and test of a data pipeline in a Windows environment.

kafka-logo

An OpenSource Project in Java & Scala

Introduction

Apache Kafka is a distributed streaming platform with three key capabilities:

  • Messaging system – Publish and subscribe to streams of records
  • Availability & Reliability – Store streams of records in a fault-tolerant, durable way
  • Scalable & Real time – Process streams of records as they occur

Data System Components

Kafka is generally used to stream data into applications, data lakes and real-time stream analytics systems.

<kafka-highlevel-architecture>

An application publishes messages to the Kafka server. A message can carry any information the application is designed to capture. Kafka passes it reliably (thanks to its distributed architecture) to another application or service, which can process or re-process it.

Internally, Kafka manages its messages in a log data structure, and the retention policy is applied per segment of that log. Retention is configurable and can be time based or size based. By default, data is retained for 168 hours (7 days).
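To make the two retention modes concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not Kafka's implementation: the `Segment` class, the `apply_retention` function and their field names are hypothetical, and the sketch assumes that whole segments are removed oldest first while the newest (active) segment is always kept.

```python
from dataclasses import dataclass

SEVEN_DAYS = 168 * 3600  # the default retention of 168 hours, in seconds


@dataclass
class Segment:
    base_offset: int       # offset of the first record in the segment
    size_bytes: int        # size of the segment on disk
    last_timestamp: float  # timestamp (seconds) of the newest record


def apply_retention(segments, now, retention_seconds=SEVEN_DAYS, retention_bytes=None):
    """Return the segments that survive retention (hypothetical helper).

    `segments` is ordered oldest to newest; the last one is the active segment.
    """
    if not segments:
        return []
    *closed, active = segments
    # Time based: drop closed segments whose newest record is too old.
    kept = [s for s in closed if now - s.last_timestamp <= retention_seconds]
    # Size based: drop the oldest closed segments until the total size fits.
    if retention_bytes is not None:
        while kept and sum(s.size_bytes for s in kept) + active.size_bytes > retention_bytes:
            kept.pop(0)
    return kept + [active]
```

Calling `apply_retention(segments, now)` with no extra arguments applies only the 7-day rule; passing `retention_bytes` adds the size limit on top of it.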

Kafka Architecture

Typically, multiple producers, consumers and clusters are involved in moving messages. Horizontal scaling is done by adding more brokers. The diagram below depicts a sample architecture:

kafka-internals

Kafka clients and servers communicate over the TCP protocol. For more details, refer to the Kafka Protocol Guide.

The Kafka ecosystem also provides a REST proxy that allows easy integration via HTTP and JSON.

Primarily, it has four key APIs: Producer API, Consumer API, Streams API and Connector API.

Key Components & Related Terminology

  • Messages/Records – byte arrays of an object. Consists of a key, value & timestamp
  • Topic – feeds of messages in categories
  • Producer – processes that publish messages to a Kafka topic
  • Consumer – processes that subscribe to topics and process the feed of published messages
  • Broker – hosts topics. Also referred to as a Kafka Server or Kafka Node
  • Cluster – comprises one or more brokers
  • ZooKeeper – keeps the state of the cluster (brokers, topics, consumers)
  • Connector – connects topics to existing applications or data systems
  • Stream Processor – consumes an input stream from a topic and produces an output stream to an output topic
  • ISR (In-Sync Replica) – a replica that is caught up with the leader and can take over on failover
  • Controller – broker in a cluster responsible for maintaining the leader/follower relationship for all the partitions

ZooKeeper

Apache ZooKeeper is an open-source service that helps build distributed applications. It’s a centralized service for maintaining configuration information. Its responsibilities include:

  • Broker state – maintains list of active brokers and which cluster they are part of
  • Topics configured – maintains list of all topics, number of partitions for each topic, location of all replicas, who is the preferred leader, list of ISR for partitions
  • Controller election – selects a new controller whenever a node shuts down. Also, makes sure that there is only one controller at any given time
  • ACL info – maintains Access control lists (ACLs) for all the topics

Kafka Internals

Each broker in a cluster is identified by an ID, typically a unique number. Connecting to one broker bootstraps a client to the entire Kafka cluster. Brokers receive messages from producers and allow consumers to fetch messages by topic, partition and offset.

A Topic is spread across a Kafka cluster as a logical group of one or more partitions. A partition is an ordered sequence of messages, and the partitions of a topic are distributed across multiple brokers. The number of partitions per topic is configurable at creation time.

Producers write to Topics. Consumers read from Topics.

<kafka-partition>

Kafka uses a log data structure to manage its messages. A log is an ordered set of segments, each of which is a collection of messages. Each segment has files that help locate a message:

  1. Log file – stores message
  2. Index file – stores message offset and its starting position in the log file
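The relationship between the two files can be sketched in a few lines of Python. This is a simplified in-memory model, not Kafka's on-disk format: the `LogSegment` class and its methods are hypothetical, a `bytearray` stands in for the log file, a list of `(offset, position)` pairs stands in for the index file, and every message gets an index entry.

```python
import bisect


class LogSegment:
    """Toy segment: a byte log plus an index of (offset, byte position)."""

    def __init__(self, base_offset):
        self.base_offset = base_offset
        self.log = bytearray()  # stands in for the log file
        self.index = []         # stands in for the index file

    def append(self, payload: bytes) -> int:
        """Append a message and return the offset assigned to it."""
        offset = self.base_offset + len(self.index)
        self.index.append((offset, len(self.log)))
        # Store a 4-byte length prefix so the message can be read back.
        self.log += len(payload).to_bytes(4, "big") + payload
        return offset

    def read(self, offset: int) -> bytes:
        """Find the message's position through the index, then read the log."""
        i = bisect.bisect_left(self.index, (offset, 0))
        if i == len(self.index) or self.index[i][0] != offset:
            raise KeyError(offset)
        pos = self.index[i][1]
        size = int.from_bytes(self.log[pos:pos + 4], "big")
        return bytes(self.log[pos + 4:pos + 4 + size])
```

The point of the index is that `read` never scans the log: it looks up the byte position for the requested offset and jumps straight to it.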

Kafka appends records from a producer to the end of a topic log. Consumers can read from any committed offset of their choosing. A record is considered committed only when all ISRs for the partition have written it to their logs.
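The commit rule can be expressed as a small calculation. The sketch below is a simplified illustration: the function names and the `log_end_offsets` mapping (broker ID to the next offset that replica will write) are assumptions made for the example. The boundary it computes is what Kafka calls the high watermark.

```python
def high_watermark(log_end_offsets, isr):
    """Lowest log end offset among the in-sync replicas.

    Every offset below this value has been written by all ISRs.
    """
    return min(log_end_offsets[broker] for broker in isr)


def is_committed(offset, log_end_offsets, isr):
    """A record is committed once all ISRs have written it to their log."""
    return offset < high_watermark(log_end_offsets, isr)
```

For example, with `log_end_offsets = {1: 10, 2: 8, 3: 5}` and an ISR of `{1, 2}`, offsets 0 to 7 are committed. Broker 3 is lagging and is not in the ISR, so it does not hold the commit point back.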

Among the replicas of a partition, one is the leader and the rest are followers that serve as backup. If the leader fails, an ISR is chosen as the new leader. The leader performs all reads and writes for its topic partition, and followers passively replicate the leader. By default, consumers read only from the leader.

A leader and follower of a partition can never reside on the same node.

leader-follower2

Kafka also supports log compaction for records. With it, Kafka will keep the latest version of a record and delete the older versions. This leads to a granular retention mechanism where the last update for each key is kept.
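As a rough illustration of what compaction leaves behind, the following Python sketch keeps only the last record for each key. It is a toy model, assuming records are `(offset, key, value)` tuples; the `compact` function is hypothetical and ignores details such as when compaction actually runs.

```python
def compact(records):
    """Keep the latest record per key, preserving original offsets and order.

    `records` is a list of (offset, key, value) tuples in offset order.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    return sorted(latest.values(), key=lambda record: record[0])
```

Given updates to keys `a`, `b`, `a`, `c`, `b` at offsets 0 to 4, the compacted log holds offsets 2, 3 and 4: one record per key, each being that key's last update.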

The offset manager is responsible for storing, fetching and maintaining consumer offsets. Every live broker has one instance of an offset manager. By default, a consumer is configured to commit offsets automatically at a periodic interval. Alternatively, a consumer can use a commit API for manual offset management.

Kafka uses a special topic, __consumer_offsets, to save consumer offsets. Each offset records the read position of a consumer group in a partition. This lets a consumer resume from its last position when needed. Because offsets are committed to the broker, the consumer no longer depends on ZooKeeper.

Quote:

Older versions of Kafka (pre 0.9) stored offsets in ZooKeeper only, while newer versions of Kafka store offsets by default in an internal Kafka topic, __consumer_offsets.
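Conceptually, the committed offsets behave like a map keyed by consumer group, topic and partition. The sketch below models only that idea in memory: the `OffsetStore` class and its methods are hypothetical stand-ins, not the offset manager's real interface.

```python
class OffsetStore:
    """Toy stand-in for the offsets kept in __consumer_offsets."""

    def __init__(self):
        # (group, topic, partition) -> next offset the group should read
        self._offsets = {}

    def commit(self, group, topic, partition, offset):
        """Record how far a group has read in a partition."""
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition, default=0):
        """Return the committed position, or `default` if none was committed."""
        return self._offsets.get((group, topic, partition), default)
```

A consumer that restarts would call `fetch` for its group and partition and continue from the returned offset. Different groups reading the same partition keep separate positions.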

consumer-groups

Kafka allows consumer groups to read data in parallel from a topic. All the consumers in a group share the same group ID. To guarantee the order in which messages are read from a partition, only one consumer in a group can consume from that partition at a time. A consumer can read from more than one partition.
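The rule of one consumer per partition within a group can be illustrated with a simple assignment function. This sketch uses a plain round-robin scheme chosen for the example; it is an assumption for illustration, not a copy of Kafka's assignment strategies, and the `assign` function is hypothetical.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    if not consumers:
        return {}
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment
```

With three partitions and two consumers, one consumer reads two partitions and the other reads one. With more consumers than partitions, the extra consumers get nothing and stay idle.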

Kafka Setup On Windows

Pre-Requisite
Setup Files
  1. Install JRE – default settings should be fine
  2. Un-tar Kafka files at C:\Installs (could be any location by choice). All the required script files for Kafka data pipeline setup will be located at: C:\Installs\kafka_2.12-2.5.0\bin\windows
  3. Configuration changes needed on Windows
    • Setup for Kafka logs – Create a folder ‘logs’ at location C:\Installs\kafka_2.12-2.5.0
    • Set this logs folder location in the Kafka config file C:\Installs\kafka_2.12-2.5.0\config\server.properties as log.dirs=C:/Installs/kafka_2.12-2.5.0/logs (forward slashes are used because a backslash is treated as an escape character in Java properties files; doubled backslashes also work)
    • Setup for ZooKeeper data – Create a folder ‘data’ at location C:\Installs\kafka_2.12-2.5.0
    • Set this data folder location in the ZooKeeper config file C:\Installs\kafka_2.12-2.5.0\config\zookeeper.properties as dataDir=C:/Installs/kafka_2.12-2.5.0/data
Execute
  1. ZooKeeper – Get a quick-and-dirty single-node ZooKeeper instance using the convenience script already packaged along with Kafka files.
    • Open a command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • Execute script:
      zookeeper-server-start.bat C:\Installs\kafka_2.12-2.5.0\config\zookeeper.properties
    • ZooKeeper started at localhost:2181. Keep it running.

      demo-zookeeper

  2. Kafka Server – Get a single-node Kafka instance.
    • Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • ZooKeeper is already configured in the properties file as zookeeper.connect=localhost:2181
    • Execute script:
      kafka-server-start.bat C:\Installs\kafka_2.12-2.5.0\config\server.properties
    • Kafka server started at localhost:9092. Keep it running.

      demo-kafka

      Now, topics can be created and messages can be stored. We can produce and consume data from any client. We will use command prompt for now.

  3. Topic – Create a topic named ‘testkafka’
    • Use a replication factor of 1 and 1 partition, since we have a single-node instance
    • Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • Execute script:
      kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic testkafka
    • Execute script to see created topic:
      kafka-topics.bat --list --bootstrap-server localhost:9092

      demo-topic

    • Keep the command prompt open just in case.
  4. Producer – setup to send messages to the server
    • Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • Execute script:
      kafka-console-producer.bat --bootstrap-server localhost:9092 --topic testkafka
    • It will show a ‘>’ as a prompt to type a message. Type: “Kafka demo – Message from server”.

    • Keep the command prompt open. We will come back to it to push more messages.
  5. Consumer – setup to receive messages from the server
    • Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • Execute script:
      kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic testkafka --from-beginning
    • The message sent by the Producer appears in this command prompt window – “Kafka demo – Message from server”.

      demo-consumer

    • Go back to the Producer command prompt and type any other messages to see them appear in real time in the Consumer command prompt:

      kafka-demo

  6. Check/Observe – a few key changes behind the scenes
    • Files under topic created – they keep track of the messages pushed for a given topic:

      topic-files

    • Data inside the log file – All the messages that are pushed by producer are stored here:

      topic-log

    • Topics present in Kafka – once a consumer starts reading messages from topic, __consumer_offsets is automatically created as a topic:

      topic-present

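NOTE: kafka-topics.bat can also connect to ZooKeeper directly instead of going through the Kafka server. The --zookeeper option is deprecated since Kafka 2.2 and was removed in Kafka 3.0, so it works with the 2.5.0 setup used here but not with newer releases. The equivalent commands are: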

  • Topic create:
    kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
  • Topics view:
    kafka-topics.bat --list --zookeeper localhost:2181

With the above, we are able to see messages sent by Producer and received by Consumer using a Kafka setup.
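If one of the steps does not behave as described, a quick first check is whether both servers are actually listening on their ports. The Python sketch below is one way to do that using only the standard library. The function names are hypothetical, and the ports are the ones used in this setup (2181 for ZooKeeper, 9092 for Kafka). It only confirms that something accepts TCP connections on the port, not that it is a healthy Kafka or ZooKeeper instance.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_pipeline(host="localhost"):
    """Check the two ports used by this guide's single-node setup."""
    return {
        "ZooKeeper": is_listening(host, 2181),
        "Kafka": is_listening(host, 9092),
    }
```

Running `print(check_pipeline())` while both command prompts are still open should report both entries as `True`. A `False` value points to the server that needs to be restarted.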

When I tried to set up Kafka, I faced a few issues along the way. I have documented them for reference. This should also help others who face something similar: Troubleshoot: Kafka setup on Windows.

With the files and the steps/commands shared above, one should not encounter any issues. Feel free to post your comments/queries below or at my blog here.

History

  • 2nd August, 2020: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)