Modern Data Architecture: From Batch to Real-Time Analytics
Most companies still move data the same way they did fifteen years ago: a scheduled job runs at midnight, extracts everything that changed since the last run, transforms it, loads it into a warehouse, and business teams query it in the morning. This works, until it doesn't.
Batch pipelines are predictable, operationally simple, and easy to reason about. They also introduce 8 to 24 hours of latency as a structural assumption, not a bug. As businesses began competing on decisions made in seconds (fraud detection, personalized recommendations, dynamic pricing), that assumption became untenable.
The shift to real-time analytics is not about chasing technology for its own sake. It is about making different categories of decision possible. This guide will help you understand the architecture behind that shift, the trade-offs involved, and how to choose the right approach for your specific context.
The limitations of batch processing
Batch ETL (Extract, Transform, Load) pipelines have served data teams well. Tools like Apache Spark, dbt, and Airflow form mature, well-understood stacks. The problems begin when your business requirements outrun what batch can provide.
The core constraint is freshness. In a batch world, your data is always stale by design. The window of staleness is determined by your pipeline schedule, and it cannot be smaller than the job's execution time.
Consider what this means in practice:
- A fraud detection model running on yesterday's transaction data will miss today's fraud patterns.
- A recommendation engine that updates daily cannot respond to what a user did five minutes ago.
- An operations dashboard showing last night's inventory cannot help a warehouse manager make real-time allocation decisions.
- An A/B test that reconciles results daily loses signal during the hours when traffic patterns reveal the most.
These are not edge cases; they are the core use cases of the next generation of data products.
Event-driven architecture: the conceptual shift
The fundamental reorientation in real-time data is treating events as the unit of data, not records or tables. Instead of asking "what changed since midnight?", the question becomes "what just happened?".
An event is an immutable, timestamped record of something that occurred in your system: a user placed an order, a payment failed, a sensor published a temperature reading. Events happen continuously. An event-driven architecture captures each one the moment it occurs and makes it immediately available to any downstream system that needs it.
This changes the shape of your data infrastructure:
- Sources (applications, databases, APIs) emit events when state changes occur.
- A message broker (Kafka, Pulsar, Kinesis) captures and durably stores those events in ordered, replayable logs.
- Consumers (stream processors, analytics engines, ML pipelines) read from those logs in near real-time and compute results.
The key insight is that the event log becomes the system of record. It is not merely a transport mechanism; it is immutable history. Every system downstream derives its state by reading from and processing that log.
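The log-as-system-of-record idea fits in a few lines of plain Python. This is a conceptual sketch, not a Kafka client; the `Event` and `EventLog` names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """An immutable, timestamped fact about something that happened."""
    ts_ms: int
    key: str
    payload: dict

class EventLog:
    """A minimal append-only log: events are only ever added, never mutated."""
    def __init__(self):
        self._events: list[Event] = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # the event's offset

    def replay(self, from_offset: int = 0):
        """Any consumer can re-derive its state by reading from the start."""
        yield from self._events[from_offset:]

# Downstream state is a pure function of the log.
log = EventLog()
log.append(Event(1711737600000, "order-1001", {"status": "pending"}))
log.append(Event(1711737601000, "order-1001", {"status": "confirmed"}))

current_state = {}
for ev in log.replay():
    current_state[ev.key] = ev.payload

print(current_state)  # {'order-1001': {'status': 'confirmed'}}
```

Because the log is never mutated, any number of consumers can rebuild their state at any time simply by replaying it.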
Apache Kafka: the backbone of real-time data platforms
Kafka is the most widely adopted distributed event streaming platform in the enterprise, and for good reason. Understanding its model is essential to understanding modern streaming architecture.
Core Kafka concepts
Topics are append-only logs of events. Producers write events to a topic; consumers read from it. Topics are split into partitions for horizontal scalability: each partition is an independent ordered log (ordering is guaranteed within a partition, not across the whole topic), and multiple consumers can read from different partitions in parallel.
Consumer groups enable both parallelism and fault tolerance. Each consumer in a group reads from a distinct subset of partitions. If a consumer fails, Kafka reassigns its partitions to other members of the group.
Retention is configurable and independent of consumption. You can retain events for 7 days, 30 days, or indefinitely. This enables replay: the ability to re-process an entire event stream from the beginning, which is transformative for data pipelines.
Offsets are pointers to a consumer's position in a partition. Because offsets are explicit, a consumer controls when it commits its position relative to processing an event. That ordering choice yields at-most-once or at-least-once delivery, and, combined with Kafka's transactions, effectively exactly-once processing, depending on your requirements.
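The link between commit ordering and delivery semantics can be made concrete with a toy in-memory consumer loop (illustrative names, not a Kafka API):

```python
def consume(events, handler, commit, commit_before_processing):
    """Illustrates how commit ordering determines delivery semantics.

    commit_before_processing=True  -> at-most-once: a crash after the commit
                                      but before processing loses the event.
    commit_before_processing=False -> at-least-once: a crash after processing
                                      but before the commit re-delivers it.
    """
    for offset, event in enumerate(events):
        if commit_before_processing:
            commit(offset)      # position saved first...
            handler(event)      # ...so a crash here would drop the event
        else:
            handler(event)      # work done first...
            commit(offset)      # ...so a crash here would cause a duplicate

processed, committed = [], []
consume(["e1", "e2", "e3"], processed.append, committed.append,
        commit_before_processing=False)
print(processed, committed)
```

Exactly-once is the at-least-once ordering plus a mechanism (transactions, or idempotent handlers) that makes the duplicate harmless.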
What Kafka makes possible
- Decoupled systems: Producers do not wait for consumers. Multiple consumer groups can independently read the same events at their own pace.
- Event replay: Backfill a new data warehouse, re-train an ML model on historical events, or debug a pipeline by replaying the input.
- Reprocessing without re-extraction: The source of truth is the event log. You do not need to query source databases again.
- Fan-out: A single event stream can feed fraud detection, analytics, personalization, and audit logging simultaneously.
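Fan-out falls out of the offset model: the log is shared, but each consumer group tracks its own position. A toy model of that mechanic (not the Kafka protocol; `Topic` and the group names are illustrative):

```python
class Topic:
    """A simplified topic: one shared log, one offset per consumer group."""
    def __init__(self):
        self.log: list[str] = []
        self.offsets: dict[str, int] = {}

    def publish(self, event: str) -> None:
        self.log.append(event)

    def poll(self, group: str) -> list[str]:
        """Each group reads from its own position; the log itself is shared."""
        start = self.offsets.get(group, 0)
        self.offsets[group] = len(self.log)
        return self.log[start:]

topic = Topic()
topic.publish("order_placed")
topic.publish("payment_failed")

# Both groups see every event, independently of each other.
fraud_batch = topic.poll("fraud-detection")
analytics_batch = topic.poll("analytics")
print(fraud_batch, analytics_batch)
```

Adding a fifth or tenth consumer group costs the producers nothing, which is why fan-out is essentially free in this model.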
Change Data Capture: bridging databases and streams
Not every source system is designed to emit events. Most production databases (Postgres, MySQL, SQL Server) store current state (rows, records, tables), not event history. Change Data Capture (CDC) is the technique that bridges this gap.
CDC reads the database's transaction log (the write-ahead log in Postgres, the binlog in MySQL) and emits a stream of change events (inserts, updates, and deletes) as they occur. The result: your relational database becomes an event source without any application-level changes.
Debezium: the standard for CDC
Debezium is the dominant open-source CDC platform. It connects to your database's log, captures every change, and publishes those changes as structured events to Kafka topics.
A Debezium event for an order update looks like this:
{
"before": { "order_id": 1001, "status": "pending", "amount": 249.99 },
"after": { "order_id": 1001, "status": "confirmed", "amount": 249.99 },
"op": "u",
"ts_ms": 1711737600000,
"source": { "table": "orders", "db": "ecommerce" }
}
This event contains the complete before and after state, the operation type, and a precise timestamp. Every downstream system that cares about order status changes (analytics, inventory, notifications) can subscribe to this topic and react in under a second.
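A consumer of these events mostly needs to dispatch on the `op` field. A minimal sketch in plain Python, assuming the event shape shown above (Debezium uses `c` for create, `u` for update, `d` for delete) and using `order_id` as the key:

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one Debezium-style change event to a local key-value replica.

    'c' and 'u' upsert the 'after' image; 'd' removes the row keyed by
    its 'before' image. The replica could equally be a cache or a
    warehouse staging table.
    """
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]
        replica[row["order_id"]] = row
    elif op == "d":
        replica.pop(event["before"]["order_id"], None)

replica = {}
apply_change(replica, {
    "before": {"order_id": 1001, "status": "pending", "amount": 249.99},
    "after": {"order_id": 1001, "status": "confirmed", "amount": 249.99},
    "op": "u",
})
print(replica[1001]["status"])  # confirmed
```

Because every event carries the full after-image, the replica converges to the source table no matter where in the stream a consumer starts.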
CDC use cases in production
- Cache invalidation: Clear a Redis cache entry the moment the underlying row changes, eliminating TTL-based stale data.
- Data warehouse sync: Push changes to Snowflake or BigQuery within seconds instead of waiting for nightly ETL.
- Microservice decoupling: Remove direct service-to-service calls by having services subscribe to each other's change events.
- Audit trails: Capture immutable records of every data change without instrumentation in application code.
Stream processing: computing over events in motion
Capturing events is the first step. The value comes from computing over them. Stream processing frameworks execute continuous queries over infinite data streams, producing results that update as new events arrive.
Apache Flink and Kafka Streams
Apache Flink is the industry standard for stateful, exactly-once stream processing at scale. It supports:
- Windowing: Compute aggregations over time windows (sliding, tumbling, session). A 5-minute sliding window over payment events can recompute the fraud rate every second.
- Stateful operations: Maintain state across events from the same user, session, or entity. Join an order event with the user's history without a database lookup.
- Exactly-once semantics: Even in the presence of failures, each event affects results exactly once. Critical for financial and compliance workloads.
- Event-time processing: Reason about when events actually occurred, not when they arrived, handling out-of-order data correctly.
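Flink's DataStream API handles windowing at scale; the core idea of event-time tumbling windows fits in a few lines of plain Python (an illustration of the concept, not Flink code):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Group timestamped events into fixed, non-overlapping time windows.

    Buckets by each event's own timestamp (event time), so a late or
    out-of-order event still lands in the window where it occurred.
    """
    counts = defaultdict(int)
    for ts_ms, _payload in events:
        window_start = (ts_ms // window_ms) * window_ms
        counts[window_start] += 1
    return dict(counts)

# The event at ts=1200 arrives after ts=2500 but is still counted
# in the 1000-1999 window.
events = [(1000, "a"), (1500, "b"), (2500, "c"), (1200, "late")]
print(tumbling_window_counts(events, window_ms=1000))  # {1000: 3, 2000: 1}
```

Real frameworks add watermarks to decide when a window can be finalized despite possible stragglers; the bucketing logic itself is this simple.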
Kafka Streams is a lighter-weight option for teams already invested in Kafka. It runs inside your application, requires no separate cluster, and handles most stream processing use cases with far less operational overhead than Flink.
The materialized view pattern
One of the most powerful patterns in streaming is the materialized view: a continuously updated projection of your event stream that consumers can query synchronously.
Instead of querying a database and running aggregations on demand, you pre-compute the result as events flow through the system. A pre-aggregated dashboard can then read from state stored in Redis or updated in Postgres with negligible latency.
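The pattern reduces to keeping an aggregate current as each event arrives. A minimal sketch, using a hypothetical revenue-by-status view in plain Python rather than a streaming framework:

```python
class RevenueByStatus:
    """A materialized view: a pre-aggregated result kept current per event.

    Each order event updates the running totals; readers get the answer
    with a dictionary lookup instead of an on-demand aggregation query.
    In production this state would live in Redis or Postgres.
    """
    def __init__(self):
        self.totals: dict[str, float] = {}

    def on_event(self, order: dict) -> None:
        status = order["status"]
        self.totals[status] = self.totals.get(status, 0.0) + order["amount"]

view = RevenueByStatus()
view.on_event({"status": "confirmed", "amount": 200.0})
view.on_event({"status": "confirmed", "amount": 100.0})
view.on_event({"status": "pending", "amount": 10.0})

print(view.totals["confirmed"])  # 300.0
```

The query cost is paid incrementally at write time, which is exactly the trade that makes sub-second dashboards affordable.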
The real-time data stack in practice
A production real-time analytics platform typically combines these layers:
| Layer | Technology | Role |
|---|---|---|
| Ingestion | Debezium, custom producers | Capture changes from source systems |
| Transport | Apache Kafka | Durable, scalable message log |
| Processing | Flink, Kafka Streams | Stateful computation, enrichment |
| Serving | Redis, DynamoDB, Postgres | Low-latency query layer |
| Analytics | ClickHouse, Druid, BigQuery | OLAP queries over event history |
| Orchestration | Airflow, Temporal | Workflow coordination |
When to use batch, when to use streaming
This is the question that actually matters, and the honest answer is: most systems need both.
Use batch processing when:
- Your downstream consumers can tolerate multi-hour latency. Nightly reports, historical analysis, model training pipelines, and regulatory reporting rarely need real-time freshness.
- Data volumes are massive but arrival is infrequent. Processing 10TB of log files once a day is far cheaper in a batch job than streaming it continuously.
- The computation is complex and stateful in ways that streaming frameworks handle poorly (massive joins across billions of records, complex aggregations over long history windows).
- Operational simplicity is a priority. Batch pipelines are easier to debug, test, and monitor than streaming pipelines.
Use streaming when:
- User-facing features require sub-second or sub-minute data freshness. Fraud detection, personalized recommendations, live inventory.
- Events need to trigger actions, not just update state. A payment failure should immediately trigger a retry workflow, not wait for a batch job.
- Multiple downstream systems need access to the same change event simultaneously. Fan-out is where streaming's value is highest.
- You need to react to patterns across events, not just individual records. High-frequency transactions from the same card, sudden traffic spikes, sequential user behavior.
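The last point is what most distinguishes streaming: state across events. A sketch of the "high-frequency transactions from the same card" pattern, using an in-memory sliding window (the threshold and window size are illustrative, not a production fraud rule):

```python
from collections import defaultdict, deque

def flag_rapid_transactions(events, window_ms=60_000, threshold=3):
    """Flag a card that produces `threshold` or more transactions within
    a sliding `window_ms` window.

    This is a pattern across events, not a property of any single record,
    so a per-record batch lookup cannot catch it in time.
    """
    recent = defaultdict(deque)   # card_id -> timestamps inside the window
    flagged = set()
    for ts_ms, card_id in events:
        q = recent[card_id]
        q.append(ts_ms)
        while q and ts_ms - q[0] > window_ms:
            q.popleft()           # drop timestamps that fell out of the window
        if len(q) >= threshold:
            flagged.add(card_id)
    return flagged

# card-A fires three times in 20 seconds; card-B's two events are
# 90 seconds apart and never overlap inside one window.
events = [(0, "card-A"), (10_000, "card-A"), (20_000, "card-A"),
          (0, "card-B"), (90_000, "card-B")]
print(flag_rapid_transactions(events))  # {'card-A'}
```

A Flink or Kafka Streams job would keep the same per-card state, but partitioned by card and checkpointed for fault tolerance.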
The Lambda and Kappa architectures
Two architectural patterns have emerged for systems that need both batch accuracy and streaming speed.
Lambda Architecture runs a batch layer and a streaming layer in parallel. The batch layer computes accurate, complete results on a slow schedule; the streaming layer computes approximate, near-real-time results quickly. A query-time merge combines both. The trade-off is significant operational complexity: you maintain two codebases computing essentially the same logic.
Kappa Architecture eliminates the batch layer entirely. All processing runs as a stream job. Historical reprocessing is handled by replaying the event log from Kafka. The trade-off is that you need large Kafka retention and must handle reprocessing carefully. This architecture has become increasingly viable as streaming frameworks have matured and Kafka's retention capabilities have grown.
Most modern platforms are moving toward Kappa-style architectures, supplemented by purpose-built OLAP engines (ClickHouse, Apache Druid) that can query historical event data efficiently without a separate batch layer.
Starting the migration: a practical path
Migrating from batch to streaming is a journey, not a big bang. The most successful migrations follow an incremental path:
- Identify your highest-value use case: Pick the one place where real-time freshness would create immediate business value. Fraud detection. Live inventory. Personalization. Start there.
- Stand up Kafka first: Before processing, focus on getting events flowing reliably. Debezium for existing databases, custom producers for application events.
- Run batch and streaming in parallel: During the transition, compute results in both systems and compare. Build confidence before switching production traffic.
- Expand incrementally: Once the first use case is live and stable, apply the same pattern to the next one. Build organizational knowledge progressively.
- Invest in observability: Streaming pipelines fail differently from batch jobs. Consumer lag, partition imbalance, and processing delays are your primary signals.
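Consumer lag is the simplest of those signals to compute: the distance between the newest offset in each partition and the group's committed offset. A sketch with hypothetical partition names (in practice these offsets come from the broker's admin API or a tool like Burrow):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition consumer lag: how far behind the group is.

    lag = latest offset in the partition - the group's committed offset.
    A steadily growing lag is the primary sign that a streaming consumer
    is falling behind its producers.
    """
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

lag = consumer_lag(
    end_offsets={"orders-0": 5200, "orders-1": 4800},
    committed_offsets={"orders-0": 5100, "orders-1": 4800},
)
print(lag)  # {'orders-0': 100, 'orders-1': 0}
```

Alert on the trend, not the absolute number: a constant lag of 100 is usually healthy, while a lag that doubles every few minutes means the consumer can no longer keep up.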
Closing thought
The question is not whether to move toward real-time data architecture; the business pressures are clear. The question is where on the spectrum between batch and streaming your specific use cases actually need to live, and what the operational cost of that position is.
Batch is not wrong; it is often exactly right. Streaming is not always necessary; it is sometimes essential. The best data engineers hold both in their toolkit and make the choice deliberately, based on freshness requirements, operational capacity, and the value that real-time data will actually unlock.
The most expensive mistake is adopting streaming complexity for problems that batch would solve perfectly well.