Real-Time Data Processing with Onyx and Kafka

Explore building real-time data processing pipelines using Onyx and Kafka for streaming data. Learn about Onyx's architecture, defining jobs, and leveraging Kafka as a data source and sink.

This lesson explains how real-time data processing with Onyx and Kafka fits into Clojure system design, where it helps, and which trade-offs matter in practice.

In the realm of data engineering, real-time data processing has become a critical component for modern applications. The ability to process data as it arrives allows businesses to make timely decisions, enhance user experiences, and maintain competitive advantages. In this section, we will explore how to build real-time data processing pipelines using Onyx and Kafka, two powerful tools in the Clojure ecosystem.

Introduction to Onyx

Onyx is a distributed, masterless, fault-tolerant data processing platform designed for high-throughput and low-latency workloads. It is built on top of Clojure and leverages its functional programming paradigms to provide a flexible and expressive way to define data processing workflows.

Onyx Architecture

Onyx’s architecture is designed to be scalable and resilient. It consists of several key components:

  • Peers: These are the worker nodes that execute tasks. Peers can be added or removed dynamically, allowing the system to scale horizontally.
  • ZooKeeper: Used for coordination and state management. It helps in leader election and maintaining the cluster’s metadata.
  • Task Lifecycle: Onyx tasks have a well-defined lifecycle, including stages like start, stop, and checkpointing, which aid in fault tolerance and state recovery.
  • Catalog: A data structure that describes each task in a job — its plugin, function, and configuration such as batch sizes.

    graph TD;
	    A["Data Source"] -->|Stream| B["Kafka"];
	    B --> C["Onyx Peers"];
	    C --> D["Data Sink"];
	    C --> E["ZooKeeper"];
	    E --> C;

Figure 1: Onyx Architecture with Kafka Integration

Using Kafka as a Data Source and Sink

Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. It is often used as a data source and sink in real-time processing systems due to its durability, scalability, and high throughput.

Kafka as a Data Source

Kafka topics can be used to ingest data into Onyx workflows. Each message in a Kafka topic can be processed as a separate event, allowing for fine-grained control over data processing.
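Kafka delivers each message as a byte array, and the onyx-kafka plugin expects a deserializer function to turn those bytes into segment maps before they enter the workflow. A minimal EDN-based deserializer might look like this sketch (the namespace and function name are our own, not part of the plugin):

```clojure
(ns my-onyx-job.serde
  (:require [clojure.edn :as edn]))

(defn deserialize-message
  "Turn the raw bytes of a Kafka message into a Clojure map (segment).
   Assumes producers write EDN-encoded UTF-8 strings."
  [^bytes bs]
  (edn/read-string (String. bs "UTF-8")))
```

Such a function is typically referenced from the input task's catalog entry as a fully qualified keyword.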

Kafka as a Data Sink

Processed data can be written back to Kafka topics, enabling downstream systems to consume the results. This is particularly useful in microservices architectures where different services may need access to processed data.
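Symmetrically, segments leaving the workflow must be serialized back to bytes before being produced to the output topic. A matching EDN serializer (again an illustrative sketch, with a name of our choosing):

```clojure
(defn serialize-segment
  "Encode a segment map as EDN in UTF-8 bytes for Kafka."
  [segment]
  (.getBytes (pr-str segment) "UTF-8"))
```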

Defining Onyx Jobs and Workflows

An Onyx job is a collection of tasks that define how data should be processed. Each task represents a unit of work, such as filtering, transforming, or aggregating data.

Example: Defining an Onyx Workflow

Let’s walk through an example of defining an Onyx workflow that reads data from a Kafka topic, processes it, and writes the results back to another Kafka topic.

    (ns my-onyx-job
      (:require [onyx.api :as onyx]
                [onyx.plugin.kafka :as kafka]))

    ;; Edges of the dataflow graph: data moves :in -> :process -> :out.
    (def workflow
      [[:in :process]
       [:process :out]])

    (def catalog
      [{:onyx/name :in
        :onyx/plugin :onyx.plugin.kafka/input
        :onyx/type :input
        :onyx/medium :kafka
        :kafka/topic "input-topic"
        :kafka/group-id "my-group"
        :onyx/batch-size 1000}

       {:onyx/name :process
        :onyx/fn ::process-data
        :onyx/type :function
        :onyx/batch-size 1000}

       {:onyx/name :out
        :onyx/plugin :onyx.plugin.kafka/output
        :onyx/type :output
        :onyx/medium :kafka
        :kafka/topic "output-topic"
        :onyx/batch-size 1000}])

    (defn process-data
      "Pure transformation applied to each segment."
      [segment]
      (assoc segment :processed true))

    (defn start-job
      "Submit the job to a running peer cluster. peer-config is the
       same configuration map used to start the peers."
      [peer-config]
      (onyx/submit-job peer-config
                       {:workflow workflow
                        :catalog catalog
                        :lifecycles []
                        :task-scheduler :onyx.task-scheduler/balanced}))
Code Example 1: Defining an Onyx Workflow with Kafka Integration

In this example, we define a simple Onyx workflow with three tasks: :in, :process, and :out. The :in task reads data from a Kafka topic, the :process task applies a transformation, and the :out task writes the processed data back to another Kafka topic.
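Because Onyx task functions are ordinary Clojure functions from segment to segment, the :process step can be exercised at the REPL without starting a cluster (the sample segment keys here are made up for illustration):

```clojure
(defn process-data
  "Pure transformation applied to each segment."
  [segment]
  (assoc segment :processed true))

(process-data {:order-id 17 :amount 250})
;; => {:order-id 17, :amount 250, :processed true}
```

Keeping task functions pure like this makes them trivially unit-testable, independent of Kafka or the Onyx runtime.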

Fault Tolerance and Scalability

Onyx is designed to be fault-tolerant and scalable. It achieves this through several mechanisms:

  • Checkpointing: Onyx periodically saves the state of each task, allowing it to recover from failures without data loss.
  • Dynamic Scaling: Peers can be added or removed from the cluster without downtime, enabling the system to handle varying workloads.
  • Backpressure Management: Onyx can adjust the rate of data ingestion based on the processing capacity of the system, preventing overloads.
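The task lifecycle stages mentioned earlier are exposed to user code through lifecycle maps. The sketch below hooks into task start and stop for the :in task; the hook bodies and the event-map key are illustrative assumptions, but the :lifecycle/* keys follow Onyx's lifecycle convention:

```clojure
;; A map of lifecycle hooks; each hook returns a map that is merged
;; into the task's event map ({} means "no additions").
(def in-calls
  {:lifecycle/before-task-start
   (fn [event lifecycle]
     (println "Task starting:" (:onyx.core/task event)) ; assumed key
     {})
   :lifecycle/after-task-stop
   (fn [event lifecycle]
     (println "Task stopped")
     {})})

;; Wire the hooks to the :in task; this vector would be passed as
;; the :lifecycles value when submitting the job.
(def lifecycles
  [{:lifecycle/task :in
    :lifecycle/calls ::in-calls}])
```

The same mechanism is commonly used to open and close per-task resources such as database connections.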

Benefits of Real-Time Processing

Real-time data processing offers several benefits:

  • Timely Insights: Businesses can make decisions based on the most current data, improving responsiveness and competitiveness.
  • Enhanced User Experiences: Applications can provide real-time feedback and updates to users, increasing engagement and satisfaction.
  • Operational Efficiency: Automated real-time processing reduces the need for manual intervention, lowering operational costs.

Practice Prompt

To get hands-on experience with Onyx and Kafka, try modifying the example workflow to include additional processing steps, such as filtering or aggregating data. Experiment with different batch sizes and observe how they affect performance.
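As a starting point for the first suggestion, a filtering step could be spliced between :in and :process. In this sketch the :filter task name and the predicate are our own additions, not part of the original example:

```clojure
;; Extended dataflow: :in -> :filter -> :process -> :out.
(def workflow
  [[:in :filter]
   [:filter :process]
   [:process :out]])

(defn keep-valid?
  "Onyx task functions may return one segment or a collection of
   segments; returning an empty vector drops the segment."
  [segment]
  (if (:id segment)
    segment
    []))

;; Catalog entry for the new task; conj this onto the existing catalog.
(def filter-task
  {:onyx/name :filter
   :onyx/fn ::keep-valid?
   :onyx/type :function
   :onyx/batch-size 1000})
```

Returning an empty collection is the idiomatic way to filter in Onyx, since the runtime flattens whatever a task function returns into the stream of downstream segments.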

Visualizing Real-Time Data Processing

    sequenceDiagram
	    participant Kafka as Kafka
	    participant Onyx as Onyx
	    participant ZooKeeper as ZooKeeper
	    participant DataSink as Data Sink
	
	    Kafka->>Onyx: Produce messages to topic
	    Onyx->>ZooKeeper: Coordinate task execution
	    Onyx->>DataSink: Write processed data
	    ZooKeeper-->>Onyx: Task state updates

Figure 2: Sequence Diagram of Real-Time Data Processing with Onyx and Kafka

Revised on Thursday, April 23, 2026