Welcome to Building Event Streaming Pipelines Using Kafka. After watching this video, you will be able to: describe the core components of Kafka; use Kafka to publish (write) and subscribe to (read) streams of events; use Kafka to consume events, either as they occur or retrospectively; and describe an end-to-end event streaming pipeline example.

A Kafka cluster contains one or many brokers. You may think of a Kafka broker as a dedicated server that receives, stores, processes, and distributes events. Brokers are synchronized and managed by another dedicated server called ZooKeeper. Each broker contains one or many topics. You can think of a topic as a database that stores specific types of events, such as logs, transactions, and metrics. For example, here we have a log topic and a transaction topic in broker 0, a payment topic and a GPS topic in broker 1, and a user click topic and a user search topic in broker 2. Brokers receive published events, save them into topics, and distribute them to subscribed consumers.

Like many other distributed systems, Kafka implements the concepts of partitioning and replication. It uses topic partitions and replication to increase fault tolerance and throughput, so that event publication and consumption can be done in parallel across multiple brokers. In addition, even if some brokers are down, Kafka clients can still work with the target topics through replicas stored on other working brokers. For example, a log topic has been split into two partitions, 0 and 1, and a user topic has also been split into two partitions, 0 and 1. Each topic partition is then duplicated into two replicas and stored in different brokers.

The Kafka CLI, or command-line interface, provides a collection of powerful scripts for users to build an event streaming pipeline. The kafka-topics script is the one you will probably use most often to manage topics in a Kafka cluster, and it is straightforward. Let's have a look at some common usages. The first one is creating a topic: here we create a topic called 'log_topic' with two partitions and a replication factor of two. One important note is that many Kafka commands, like kafka-topics, require you to point to a running Kafka cluster with a host and a port, such as localhost with port 9092. After you have created some topics, you can check all created topics in the cluster using the list option. If you want to check more details of a topic, such as its partitions and replications, you can use the describe option. And you can delete a topic using the delete option.
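As a reference, here is a minimal sketch of the kafka-topics commands just described, assuming a cluster reachable at localhost:9092 and an installation where the script is invoked as kafka-topics.sh (on some distributions it is simply kafka-topics; older Kafka versions used a --zookeeper argument instead of --bootstrap-server). The topic name, partition count, and replication factor mirror the example above.

  # create a topic named log_topic with two partitions and a replication factor of two
  kafka-topics.sh --bootstrap-server localhost:9092 --create --topic log_topic --partitions 2 --replication-factor 2

  # list all topics in the cluster
  kafka-topics.sh --bootstrap-server localhost:9092 --list

  # show partition and replication details for log_topic
  kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic log_topic

  # delete log_topic
  kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic log_topic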
Next, you will find out more about publishing events using Kafka producers. Kafka producers are client applications that publish events to topic partitions, and events are stored in each partition in the same order as they are published. When publishing an event, a producer can optionally associate it with a key. Events associated with the same key will be published to the same topic partition, while events not associated with any key will be published to topic partitions in rotation.

Let's see how you can publish events to topic partitions using the following example. Suppose you have an event source 1, which generates various log entries, and an event source 2, which generates user-activity tracking records. You can then create a log producer to publish log records to log topic partitions, and a user producer to publish user-activity events to user topic partitions, respectively. When you publish events in producers, you can choose to associate events with a key, for example an application name or a user ID.

Similar to the kafka-topics CLI, Kafka provides the kafka-console-producer CLI for users to manage producers. The most important usage is starting a producer to write, or publish, events to a topic. Here you start a producer and point it to log_topic, and then you can type some messages in the console to start publishing events, for example log1, log2, and log3. You can also provide keys with events to make sure that events with the same key go to the same partition. Here you start a producer on user_topic, with the parse.key option set to true, and you also specify the key.separator to be a comma. Then you can write messages as follows: key 'user1', value 'login website'; key 'user1', value 'click the top item'; and key 'user1', value 'logout website'. Accordingly, all events about user1 will be saved in the same partition, which makes reading easier for consumers.

Once events are published and properly stored in topic partitions, you can create consumers to read them. Consumers are client applications that subscribe to topics and read the stored events; event destinations can then read those events from the Kafka consumers. Consumers read data from topic partitions in the same order as the events are published. Consumers also store an offset for each topic partition as the last read position. With the offset, consumers are guaranteed to read new events as they occur. A playback is also possible by resetting the offset to zero, so that the consumer can read all events in the topic partition from the beginning. In Kafka, producers and consumers are fully decoupled: producers don't need to synchronize with consumers, and after events are stored in topics, consumers can consume them on independent schedules. To read the published log and user events from topic partitions, you need to create log and user consumers and make them subscribe to the corresponding topics. The subscribed consumers then read the events from Kafka and forward them to the event destinations.

Starting a consumer is also easy, using the kafka-console-consumer script. Let's read events from log_topic. You just need to run the kafka-console-consumer script and specify a Kafka cluster and the topic to subscribe to. Here, you subscribe to and read events from the topic log_topic. The started consumer will read only the new events, starting from the last partition offset. After those events are consumed, the partition offset for the consumer will be updated and committed back to Kafka. Very often a user wants to read all events from the beginning, as a playback of all historical events. To do so, you just need to add the from-beginning option; now you can read all events starting from offset 0.
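To recap the console producer and consumer commands just described, here is a minimal sketch, again assuming a cluster at localhost:9092 and the .sh script names; the topic names and messages mirror the examples in this section.

  # start a console producer on log_topic; at the producer prompt (>), type one event per line
  kafka-console-producer.sh --bootstrap-server localhost:9092 --topic log_topic
  > log1
  > log2
  > log3

  # start a key/value producer on user_topic, using a comma as the key separator
  kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_topic --property parse.key=true --property key.separator=,
  > user1,login website
  > user1,click the top item
  > user1,logout website

  # start a consumer that reads only new events from log_topic
  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic log_topic

  # replay all events from offset 0
  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic log_topic --from-beginning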
Let's have a look at a more concrete example to help you understand how to build an event streaming pipeline end to end. Suppose you want to collect and analyze weather and Twitter event streams, so that you can correlate how people talk about extreme weather on Twitter. Here you can use two event sources: the IBM Weather API, to obtain real-time and forecasted weather data in JSON format, and the Twitter API, to obtain real-time tweets and mentions, also in JSON format. To receive the weather and Twitter JSON data in Kafka, you then create a weather topic and a Twitter topic in a Kafka cluster, with some partitions and replications.

To publish the weather and Twitter JSON data to the two topics, you need to create a weather producer and a Twitter producer. The events' JSON data will be serialized into bytes and saved in the Kafka topics. To read events from the two topics, you need to create a weather consumer and a Twitter consumer. The bytes stored in the Kafka topics will be deserialized back into the events' JSON data. If you then want to transport the weather and Twitter JSON data from the consumers into a relational database, you can use a DB writer to parse those JSON records and create database records, and then write those records into the database using SQL insert statements. Finally, you can query the records from the relational database and visualize and analyze them in a dashboard to complete the end-to-end pipeline. You will find a minimal command-line sketch of this pipeline after the summary at the end of this section.

In this video, you learned that the core components of Kafka are: brokers, the dedicated servers that receive, store, process, and distribute events; topics, the containers or databases of events; partitions, which divide a topic across different brokers; replications, which duplicate partitions across different brokers; producers, the Kafka client applications that publish events to topics; and consumers, the Kafka client applications that subscribe to topics and read events from them. You also learned that the kafka-topics CLI manages topics, the kafka-console-producer CLI manages producers, and the kafka-console-consumer CLI manages consumers.
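To wrap up, here is a minimal command-line sketch of how the weather and Twitter pipeline described earlier could be wired together. It assumes the same localhost:9092 cluster; the topic names, the files of previously fetched JSON events, and the db_writer.py script with its --table option are hypothetical placeholders for the producers, consumers, and DB writer in the example, not part of the course material.

  # create one topic per event source
  kafka-topics.sh --bootstrap-server localhost:9092 --create --topic weather_topic --partitions 2 --replication-factor 2
  kafka-topics.sh --bootstrap-server localhost:9092 --create --topic twitter_topic --partitions 2 --replication-factor 2

  # publish previously fetched JSON events (one JSON document per line) to each topic
  kafka-console-producer.sh --bootstrap-server localhost:9092 --topic weather_topic < weather_events.json
  kafka-console-producer.sh --bootstrap-server localhost:9092 --topic twitter_topic < twitter_events.json

  # consume each topic from the beginning and pipe the JSON events to a hypothetical
  # DB-writer script that parses them and runs SQL INSERT statements against the database
  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic weather_topic --from-beginning | python db_writer.py --table weather
  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitter_topic --from-beginning | python db_writer.py --table twitter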