Darhost

2026-05-04 14:36:24

From Scheduled Batch to Micro-Batch Streaming: Hard-Earned Lessons in Delta Index Pipelines

Lessons from migrating a delta-index pipeline from scheduled batch to micro-batch streaming: rejecting record-level streaming, replacing completion markers with partition watermarks, ensuring overlap-window correctness, and adopting restart-as-design for reliability.

This article explores the migration of a production delta-index pipeline from traditional scheduled batch processing to a micro-batch Spark Structured Streaming architecture. It delves into the reasoning behind rejecting record-level streaming, the replacement of fragile S3 completion markers with partition-based watermarks, techniques for ensuring overlap-window correctness, and the adoption of restart-as-design strategies to achieve better predictability in object-store-based ingestion systems. These lessons, learned through real-world challenges, provide a practical guide for teams considering similar transitions.

Why was record-level streaming rejected in favor of micro-batch streaming?

Record-level streaming processes each incoming data point individually, which introduces significant overhead in object-store environments such as Amazon S3: treating every record as its own object means a flood of small files and per-request charges, driving up both latency and cost. Record-level approaches also make it difficult to control checkpointing precisely, risking data loss or duplicate processing during failures. The team found that micro-batch streaming, which groups records into small batches (e.g., everything that arrived in the last 10 seconds or the last 1,000 records), hits a sweet spot: it reduces file-system interactions while still offering near-real-time freshness. Micro-batches also allow the use of Spark's structured APIs for fault tolerance and exactly-once semantics, guarantees that record-level streaming struggles to provide against object-store backends. By accepting a few seconds of latency, the pipeline gained reliability, lower operational costs, and simpler code maintenance.
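
The article itself does not include code, but a minimal PySpark sketch of such a micro-batch read/write loop could look like the following. The bucket paths, schema, and 10-second trigger are illustrative rather than taken from the pipeline, and the Delta Lake connector is assumed to be available.

    # Minimal micro-batch ingestion sketch; paths, schema, and trigger interval are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructField, StructType, StringType, TimestampType

    spark = SparkSession.builder.appName("delta-index-ingest").getOrCreate()

    event_schema = StructType([
        StructField("id", StringType()),
        StructField("payload", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # New files under the landing prefix are discovered and grouped into micro-batches.
    events = (
        spark.readStream
        .format("parquet")
        .schema(event_schema)                  # explicit schema avoids inference on every listing
        .load("s3a://example-bucket/events/")  # hypothetical landing prefix
    )

    query = (
        events.writeStream
        .format("delta")
        .option("checkpointLocation", "s3a://example-bucket/checkpoints/ingest/")
        .trigger(processingTime="10 seconds")  # micro-batch cadence instead of per-record handling
        .start("s3a://example-bucket/delta/index/")
    )
    query.awaitTermination()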


What were the problems with S3 completion markers and how were they replaced?

Originally, the pipeline relied on writing explicit S3 completion markers after each batch to signal that data was ready for downstream consumers. However, markers introduced a single point of failure: if the marker write failed (due to transient network issues or S3 eventual consistency), downstream processes would either stall or incorrectly reprocess data. Moreover, markers created race conditions when multiple consumers tried to detect new data. The team replaced completion markers with partition-based watermarks, which use the metadata of directory structures (e.g., partition_date=2025-03-17) to determine data availability. Instead of monitoring specific marker files, the streaming job tracks partition creation timestamps, allowing it to safely advance processing windows. This change eliminated the fragility of markers while naturally handling S3's eventual consistency because partitions become visible atomically via listing operations.
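
As a rough illustration of the approach, the availability check reduces to listing partition prefixes; the sketch below uses boto3 against a hypothetical bucket and partition_date layout and is not the team's actual code.

    # Detect data availability by listing partition prefixes instead of marker files.
    from datetime import datetime

    import boto3

    s3 = boto3.client("s3")

    def list_partitions(bucket, prefix):
        """Return the partition dates visible under prefix, e.g. events/partition_date=2025-03-17/."""
        dates = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
            for common in page.get("CommonPrefixes", []):
                # common["Prefix"] looks like "events/partition_date=2025-03-17/"
                value = common["Prefix"].rstrip("/").split("partition_date=")[-1]
                dates.append(datetime.strptime(value, "%Y-%m-%d"))
        return sorted(dates)

    def new_partitions(bucket, prefix, watermark):
        """Partitions newer than the current watermark are candidates for the next micro-batch."""
        return [d for d in list_partitions(bucket, prefix) if d > watermark]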

How does partition-based watermarking improve reliability?

Partition-based watermarking leverages the fact that object stores group files under logical partitions (e.g., by date/hour). The streaming job maintains a watermark—a threshold timestamp—that represents the latest partition fully ingested. As new partitions are detected via S3 list operations, the watermark advances only when a partition is guaranteed to be complete (e.g., after a grace period for late-arriving data). This approach naturally handles straggling data and eventual consistency without needing ad-hoc markers. The reliability gain comes from decoupling data availability from explicit completion signals: partitions are immutable once written, so listing them gives a definitive view. The team implemented a grace period (e.g., 5 minutes) to account for any late-arriving records, after which the watermark can safely advance. This prevents scenarios where incomplete partitions are processed, reducing correctness issues.
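
The advancement rule itself is small; a sketch under the assumptions of hourly partitions and the five-minute grace period mentioned above might look like this (the helper names are made up for illustration).

    # Advance the watermark only once a partition's grace period has elapsed.
    from datetime import datetime, timedelta

    PARTITION_SIZE = timedelta(hours=1)   # hourly partitions assumed
    GRACE_PERIOD = timedelta(minutes=5)   # grace period for late-arriving data

    def advance_watermark(current_watermark, visible_partitions, now):
        new_watermark = current_watermark
        for partition_start in sorted(visible_partitions):
            if partition_start <= current_watermark:
                continue                                    # already covered by the watermark
            if now - (partition_start + PARTITION_SIZE) < GRACE_PERIOD:
                break                                       # grace period not over; late data may still arrive
            new_watermark = partition_start                 # partition can be treated as complete
        return new_watermark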

How did the team handle overlap-window correctness?

Overlap windows occur when micro-batches aren't perfectly aligned with partition boundaries, causing potential double-counting or missed data. For example, a batch might span the end of one hour and the start of the next. The team solved this by using idempotent writes and upsert logic in the target delta index. In the streaming pipeline, each record carries a unique identifier (e.g., a hash of its content and timestamp). Downstream, the system uses a merge-on-write pattern: when the same record appears in multiple micro-batches, the delta index either updates the existing entry (if newer) or ignores duplicates. Additionally, the watermark ensures that once a partition is closed, no further records from that partition are processed. This combination guarantees overlap windows do not introduce errors—duplicates are safely deduplicated, and late-arriving data within the grace period is correctly integrated.
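
Expressed against Delta Lake's merge API, the idempotent upsert step is roughly the following foreachBatch handler; the table path and column names are hypothetical, and the delta-spark package is assumed.

    # Idempotent merge-on-write upsert, wired in via foreachBatch.
    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame
    from pyspark.sql.functions import col, concat_ws, sha2

    def upsert_batch(batch_df: DataFrame, batch_id: int) -> None:
        # A stable record id lets overlapping micro-batches re-deliver the same record safely.
        deduped = (
            batch_df
            .withColumn("record_id",
                        sha2(concat_ws("|", col("payload"), col("event_time").cast("string")), 256))
            .dropDuplicates(["record_id"])
        )
        target = DeltaTable.forPath(deduped.sparkSession, "s3a://example-bucket/delta/index/")
        (
            target.alias("t")
            .merge(deduped.alias("s"), "t.record_id = s.record_id")
            .whenMatchedUpdateAll(condition="s.event_time > t.event_time")  # keep the newer version
            .whenNotMatchedInsertAll()
            .execute()
        )

    # Attached to the stream: events.writeStream.foreachBatch(upsert_batch).start(...)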


What is the "restart-as-design" strategy and why is it important?

The restart-as-design strategy acknowledges that failures are inevitable in distributed systems, especially when dealing with object stores. Instead of aiming for perfect resilience, the team built the pipeline to recover cleanly from any restart—whether planned (e.g., deployments) or unplanned (e.g., crashes). Key elements include storing the watermark state in a durable checkpoint (e.g., HDFS), ensuring that the streaming source respects checkpointed offsets, and making all output operations idempotent. During a restart, the pipeline reads the last committed watermark and resumes from there, avoiding data loss or duplication. This approach simplifies operational complexity because operators don't need to manually reconcile state. The importance lies in predictability: teams can confidently upgrade code or scale clusters without worrying about silent data corruption. It shifts the mindset from preventing failures to designing systems that handle them gracefully.
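
A sketch of the state-handling side is below: the watermark lives in durable external storage (shown here against S3, though the article mentions HDFS for the checkpoint), the bucket and key are hypothetical, and the streaming offsets themselves are recovered from Spark's checkpoint directory as described above.

    # Durable watermark state so a restart simply reloads the last committed value.
    import json
    from datetime import datetime

    import boto3

    s3 = boto3.client("s3")
    STATE_BUCKET = "example-bucket"
    STATE_KEY = "state/delta-index-watermark.json"

    def load_watermark():
        try:
            body = s3.get_object(Bucket=STATE_BUCKET, Key=STATE_KEY)["Body"].read()
            return datetime.fromisoformat(json.loads(body)["watermark"])
        except s3.exceptions.NoSuchKey:
            return datetime.min                  # first run: nothing ingested yet

    def commit_watermark(watermark):
        # Written only after the corresponding output has committed; a crash in between
        # just replays work that the idempotent sink absorbs without duplicates.
        payload = json.dumps({"watermark": watermark.isoformat()}).encode("utf-8")
        s3.put_object(Bucket=STATE_BUCKET, Key=STATE_KEY, Body=payload)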

What are the key takeaways for object-store ingestion systems?

Several lessons from this migration apply broadly to object-store ingestion pipelines. First, avoid record-level streaming when using S3 or similar; micro-batching aligns better with file semantics. Second, replace completion markers with partition-based watermarks to eliminate a common source of flakiness. Third, design for overlap windows using idempotent writes and upserts to maintain correctness. Fourth, adopt restart-as-design to ensure predictable recoverability. Finally, instrument heavily: monitor watermark progression, batch sizes, and file listing latencies to catch issues early. These practices help balance freshness, cost, and reliability, making streaming ingestion viable in environments where batch processing was previously the default.
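
As a small example of that instrumentation, a streaming query listener can surface per-batch metrics; the sketch assumes PySpark 3.4 or later (where the Python listener API exists), and the logger name and metric selection are illustrative.

    # Lightweight instrumentation via a streaming query listener.
    import logging

    from pyspark.sql import SparkSession
    from pyspark.sql.streaming import StreamingQueryListener

    log = logging.getLogger("delta-index-metrics")

    class IngestMetricsListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            log.info("query started: %s", event.id)

        def onQueryProgress(self, event):
            progress = event.progress
            # Batch size and per-stage durations; watermark progression is logged by the job itself.
            log.info("batch=%s rows=%s durationMs=%s",
                     progress.batchId, progress.numInputRows, progress.durationMs)

        def onQueryTerminated(self, event):
            log.info("query terminated: %s", event.id)

    spark = SparkSession.builder.getOrCreate()
    spark.streams.addListener(IngestMetricsListener())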