
Modern Data Architecture: From Batch to Real-Time Analytics

Article · 11 min read
#data-engineering #kafka #streaming #architecture #real-time

Most companies still move data the same way they did fifteen years ago: a scheduled job runs at midnight, extracts everything that changed since the last run, transforms it, loads it into a warehouse, and business teams query it in the morning. This works, until it doesn't.

Batch pipelines are predictable, operationally simple, and easy to reason about. They also introduce 8 to 24 hours of latency as a structural assumption, not a bug. As businesses began competing on decisions made in seconds (fraud detection, personalized recommendations, dynamic pricing), that assumption became untenable.

The shift to real-time analytics is not about chasing technology for its own sake. It is about making different categories of decision possible. This guide will help you understand the architecture behind that shift, the trade-offs involved, and how to choose the right approach for your specific context.

The limitations of batch processing

Batch ETL (Extract, Transform, Load) pipelines have served data teams well. Tools like Apache Spark, dbt, and Airflow form mature, well-understood stacks. The problems begin when your business requirements outrun what batch can provide.

The core constraint is freshness. In a batch world, your data is always stale by design. The window of staleness is determined by your pipeline schedule, and it cannot be smaller than the job's execution time.

Consider what this means in practice:

- A fraud model scoring transactions against data that is hours old misses the fraud happening right now.
- A recommendation engine retrained nightly cannot react to what a user did in the current session.
- A dynamic pricing system reading yesterday's snapshot cannot respond to a demand spike this afternoon.

These are not edge cases; they are the core use cases of the next generation of data products.

Event-driven architecture: the conceptual shift

The fundamental reorientation in real-time data is treating events as the unit of data, not records or tables. Instead of asking "what changed since midnight?", the question becomes "what just happened?".

An event is an immutable, timestamped record of something that occurred in your system: a user placed an order, a payment failed, a sensor published a temperature reading. Events happen continuously. An event-driven architecture captures each one the moment it occurs and makes it immediately available to any downstream system that needs it.
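The defining property of an event, immutability, can be made concrete with a frozen record type. A minimal sketch in Python (the field names here are illustrative, not a standard schema):

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Event:
    """Immutable, timestamped record of something that occurred."""
    event_type: str   # e.g. "order_placed", "payment_failed"
    ts_ms: int        # when it happened, epoch milliseconds
    payload: dict     # the facts of the occurrence


e = Event("order_placed", 1711737600000, {"order_id": 1001})

# Events are never updated in place; attempting to mutate one fails.
try:
    e.ts_ms = 0
except FrozenInstanceError:
    print("events are immutable")  # prints "events are immutable"
```

New facts become new events; nothing is ever rewritten, which is what makes the downstream log a trustworthy history.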

This changes the shape of your data infrastructure:

- Source systems publish events to a durable, append-only log the moment they occur.
- Downstream systems subscribe to that log and derive their own state from it.
- Producers and consumers are decoupled: adding a new consumer requires no change to the source.

The key insight is that the event log becomes the system of record. It is not merely a transport mechanism; it is immutable history. Every system downstream derives its state by reading from and processing that log.

Apache Kafka: the backbone of real-time data platforms

Apache Kafka is the most widely adopted distributed event streaming platform in the enterprise, and for good reason. Understanding its model is essential to understanding modern streaming architecture.

Core Kafka concepts

Topics are append-only, ordered logs. Producers write events to a topic; consumers read from it. Topics are partitioned for horizontal scalability: each partition is an independent ordered log, and multiple consumers can read from different partitions in parallel.

Consumer groups enable both parallelism and fault tolerance. Each consumer in a group reads from a distinct subset of partitions. If a consumer fails, Kafka reassigns its partitions to other members of the group.

Retention is configurable and independent of consumption. You can retain events for 7 days, 30 days, or indefinitely. This enables replay, the ability to re-process an entire event stream from the beginning, which is transformative for data pipelines.

Offsets are pointers to a consumer's position in a partition. Because offsets are explicit, a consumer can process events and commit its offset independently. Committing before processing gives at-most-once delivery; committing after processing gives at-least-once; exactly-once requires Kafka's transactional APIs on top of that.
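The partition/offset model above can be sketched in a few lines of plain Python. This is an in-memory toy, not a Kafka client: a list plays the role of one partition, and each consumer owns its offset, which also makes replay trivial.

```python
class Partition:
    """Toy append-only log standing in for one Kafka partition."""

    def __init__(self):
        self.log = []  # events, ordered; list index == offset

    def append(self, event):
        self.log.append(event)
        return len(self.log) - 1  # offset of the new event

    def read_from(self, offset):
        """Return all events at or after `offset`."""
        return self.log[offset:]


class Consumer:
    """Each consumer tracks its own offset, independent of all others."""

    def __init__(self, partition, offset=0):
        self.partition = partition
        self.offset = offset

    def poll(self):
        events = self.partition.read_from(self.offset)
        # "Commit" only after processing: at-least-once semantics.
        self.offset += len(events)
        return events


p = Partition()
for e in ["order_placed", "payment_failed", "order_placed"]:
    p.append(e)

live = Consumer(p)               # reads everything, then only new events
print(live.poll())               # all three events
print(live.poll())               # [] until something new arrives

replayer = Consumer(p, offset=0)  # a new consumer replays full history
print(len(replayer.poll()))       # 3
```

Because the log is never mutated, starting a consumer at offset 0 is all replay takes; retention is the only limit on how far back it can go.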

What Kafka makes possible

- Multiple independent consumers of the same events, each reading at its own pace
- Decoupling of producers from consumers, so new systems plug in without touching existing ones
- Replay of history to rebuild downstream state or recover from a processing bug
- Durable buffering that absorbs traffic spikes without losing data

Change Data Capture: bridging databases and streams

Not every source system is designed to emit events. Most production databases (Postgres, MySQL, SQL Server) store current state β€” rows, records, tables β€” not event history. Change Data Capture (CDC) is the technique that bridges this gap.

CDC reads the database's transaction log (the write-ahead log in Postgres, the binlog in MySQL) and emits a stream of change events (inserts, updates, and deletes) as they occur. The result: your relational database becomes an event source without any application-level changes.

Debezium: the standard for CDC

Debezium is the dominant open-source CDC platform. It connects to your database's log, captures every change, and publishes those changes as structured events to Kafka topics.

A Debezium event for an order update looks like this:

```json
{
  "before": { "order_id": 1001, "status": "pending", "amount": 249.99 },
  "after": { "order_id": 1001, "status": "confirmed", "amount": 249.99 },
  "op": "u",
  "ts_ms": 1711737600000,
  "source": { "table": "orders", "db": "ecommerce" }
}
```

This event contains the complete before and after state, the operation type, and a precise timestamp. Every downstream system that cares about order status changes (analytics, inventory, notifications) can subscribe to this topic and react in under a second.
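A consumer of these events typically dispatches on the op field to maintain its own copy of the table. A minimal sketch in plain Python over the event shape shown above (the function is illustrative, not a Debezium API):

```python
def apply_change(state, event):
    """Fold one CDC-style change event into an in-memory table.

    `state` maps order_id -> row dict. Ops follow Debezium's convention:
    'c' = create, 'u' = update, 'd' = delete.
    """
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]          # new row state wins
        state[row["order_id"]] = row
    elif op == "d":
        row = event["before"]         # only `before` exists on delete
        state.pop(row["order_id"], None)
    return state


state = {}
apply_change(state, {
    "before": None,
    "after": {"order_id": 1001, "status": "pending", "amount": 249.99},
    "op": "c", "ts_ms": 1711737000000,
    "source": {"table": "orders", "db": "ecommerce"},
})
apply_change(state, {
    "before": {"order_id": 1001, "status": "pending", "amount": 249.99},
    "after": {"order_id": 1001, "status": "confirmed", "amount": 249.99},
    "op": "u", "ts_ms": 1711737600000,
    "source": {"table": "orders", "db": "ecommerce"},
})
print(state[1001]["status"])  # confirmed
```

Replaying the full change stream through a fold like this rebuilds the table from scratch, which is exactly how CDC keeps replicas, caches, and search indexes in sync.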

CDC use cases in production

- Keeping a search index or cache in sync with the primary database
- Feeding real-time analytics without adding query load to the operational database
- Triggering notifications and downstream workflows when specific rows change
- Replicating data across services without fragile application-level dual writes

Stream processing: computing over events in motion

Capturing events is the first step. The value comes from computing over them. Stream processing frameworks execute continuous queries over infinite data streams, producing results that update as new events arrive.

Apache Flink and Kafka Streams

Apache Flink is the industry standard for stateful, exactly-once stream processing at scale. It supports:

- Event-time processing with watermarks, so out-of-order and late-arriving data is handled correctly
- Large keyed state with periodic checkpointing for fault tolerance
- Exactly-once processing guarantees, extendable end to end with transactional sinks
- Both a low-level DataStream API and a declarative SQL layer (Flink SQL)

Kafka Streams is a lighter-weight option for teams already invested in Kafka. It runs inside your application, requires no separate cluster, and handles most stream processing use cases with far less operational overhead than Flink.
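The shape of the computation these frameworks run continuously can be illustrated with a tumbling-window count in plain Python. A real engine does this incrementally, with managed state and event-time handling; this sketch only shows what the result looks like:

```python
from collections import Counter


def tumbling_window_counts(events, window_ms):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_ms, key) pairs.
    Returns {window_start_ms: Counter({key: count})}.
    """
    windows = {}
    for ts, key in events:
        window_start = (ts // window_ms) * window_ms  # bucket by window
        windows.setdefault(window_start, Counter())[key] += 1
    return windows


events = [
    (1_000, "page_view"), (1_500, "click"),
    (2_500, "page_view"), (61_000, "page_view"),
]
result = tumbling_window_counts(events, window_ms=60_000)
print(result[0]["page_view"])       # 2  (events in the first minute)
print(result[60_000]["page_view"])  # 1  (the event in the second minute)
```

In a streaming engine the input never ends, so windows are emitted as they close rather than computed over a finished collection; the bucketing logic, however, is the same.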

The materialized view pattern

One of the most powerful patterns in streaming is the materialized view: a continuously updated projection of your event stream that consumers can query synchronously.

Instead of querying a database and running aggregations on demand, you pre-compute the result as events flow through the system. A pre-aggregated dashboard can then read from state stored in Redis or updated in Postgres with negligible latency.
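A materialized view over an order stream can be sketched as a running aggregate that is updated on every event. In-memory here for clarity; in production the state would live in Redis or Postgres, as noted above:

```python
class RevenueByStatus:
    """Materialized view: running order count and revenue per status."""

    def __init__(self):
        self.view = {}  # status -> {"count": int, "revenue": float}

    def apply(self, event):
        """Fold one order event into the pre-computed view."""
        entry = self.view.setdefault(
            event["status"], {"count": 0, "revenue": 0.0}
        )
        entry["count"] += 1
        entry["revenue"] += event["amount"]

    def query(self, status):
        """Synchronous read: no aggregation happens at query time."""
        return self.view.get(status, {"count": 0, "revenue": 0.0})


view = RevenueByStatus()
for e in [
    {"order_id": 1, "status": "confirmed", "amount": 100.0},
    {"order_id": 2, "status": "pending", "amount": 50.0},
    {"order_id": 3, "status": "confirmed", "amount": 25.0},
]:
    view.apply(e)

print(view.query("confirmed"))  # {'count': 2, 'revenue': 125.0}
```

The cost of aggregation is paid once per event as it arrives, which is what makes dashboard reads effectively free.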

The real-time data stack in practice

A production real-time analytics platform typically combines these layers:

| Layer | Technology | Role |
|---|---|---|
| Ingestion | Debezium, custom producers | Capture changes from source systems |
| Transport | Apache Kafka | Durable, scalable message log |
| Processing | Flink, Kafka Streams | Stateful computation, enrichment |
| Serving | Redis, DynamoDB, Postgres | Low-latency query layer |
| Analytics | ClickHouse, Druid, BigQuery | OLAP queries over event history |
| Orchestration | Airflow, Temporal | Workflow coordination |

When to use batch, when to use streaming

This is the question that actually matters, and the honest answer is: most systems need both.

Use batch processing when:

- Freshness measured in hours is genuinely acceptable to the consumers of the data
- The workload is a large, complex transformation: historical backfills, model training, financial reconciliation
- Your team's operational capacity favors simple, restartable jobs over always-on pipelines

Use streaming when:

- The value of a decision decays in seconds or minutes: fraud detection, dynamic pricing, live inventory
- Users expect to see the effect of their actions immediately, as with personalization and notifications
- Downstream systems need to react to individual events rather than periodic snapshots

The Lambda and Kappa architectures

Two architectural patterns have emerged for systems that need both batch accuracy and streaming speed.

Lambda Architecture runs a batch layer and a streaming layer in parallel. The batch layer computes accurate, complete results on a slow schedule; the streaming layer computes approximate, near-real-time results quickly. A query-time merge combines both. The trade-off is significant operational complexity: you maintain two codebases computing essentially the same logic.

Kappa Architecture eliminates the batch layer entirely. All processing runs as a stream job. Historical reprocessing is handled by replaying the event log from Kafka. The trade-off is that you need large Kafka retention and must handle reprocessing carefully. This architecture has become increasingly viable as streaming frameworks have matured and Kafka's retention capabilities have grown.

Most modern platforms are moving toward Kappa-style architectures, supplemented by purpose-built OLAP engines (ClickHouse, Apache Druid) that can query historical event data efficiently without a separate batch layer.

Starting the migration: a practical path

Migrating from batch to streaming is a journey, not a big bang. The most successful migrations follow an incremental path:

  1. Identify your highest-value use case: pick the one place where real-time freshness would create immediate business value. Fraud detection. Live inventory. Personalization. Start there.
  2. Stand up Kafka first: before processing, focus on getting events flowing reliably. Use Debezium for existing databases and custom producers for application events.
  3. Run batch and streaming in parallel: during the transition, compute results in both systems and compare. Build confidence before switching production traffic.
  4. Expand incrementally: once the first use case is live and stable, apply the same pattern to the next one. Build organizational knowledge progressively.
  5. Invest in observability: streaming pipelines fail differently from batch jobs. Consumer lag, partition imbalance, and processing delays are your primary signals.
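Consumer lag, the first of those signals, is simply the distance between the head of the log and a group's committed offset, measured per partition. A minimal sketch of the computation (real monitoring would read both numbers from Kafka itself):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition: how far a consumer group is behind the log head.

    Both arguments map partition id -> offset. A partition with no
    committed offset yet is treated as fully behind (offset 0).
    """
    return {
        p: end - committed_offsets.get(p, 0)
        for p, end in log_end_offsets.items()
    }


lag = consumer_lag(
    log_end_offsets={0: 1_000, 1: 1_000, 2: 1_000},
    committed_offsets={0: 1_000, 1: 870, 2: 40},
)
print(lag)  # {0: 0, 1: 130, 2: 960}; partition 2 is far behind
```

A lag that grows without bound means consumers cannot keep up with producers; a lag concentrated in one partition, as here, usually points to key skew or a stuck consumer.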

Closing thought

The question is not whether to move toward real-time data architecture; the business pressures are clear. The question is where on the spectrum between batch and streaming your specific use cases actually need to live, and what the operational cost of that position is.

Batch is not wrong; it is often exactly right. Streaming is not always necessary; it is sometimes essential. The best data engineers hold both in their toolkit and make the choice deliberately, based on freshness requirements, operational capacity, and the value that real-time data will actually unlock.

The most expensive mistake is adopting streaming complexity for problems that batch would solve perfectly well.