Building Data Pipelines That Actually Scale

Most data pipelines are built for today's data volume. The ones that survive growth are designed with scale constraints in mind from the start — here's what that looks like in practice.

At Verizon, we ran telemetry pipelines that processed data from 200 million devices. Not streaming exactly — more like a continuous high-volume flow of device state, usage, and diagnostic data that had to be ingested, transformed, and made available for analysis and operational response in near-real time. The engineering problems at that scale are different from the ones you hit with 10,000 devices, but the patterns that work are visible earlier if you know what to look for.

The data pipelines that survive scale share common design decisions. The ones that get rebuilt after growth usually trace back to a handful of specific choices that made sense early and became expensive later.

The Bottleneck Is Almost Never Compute

The most persistent misconception about data pipeline scaling: when it gets slow, add more compute. In practice, the bottleneck is almost always one of three things: the database write layer, uncontrolled data serialization, or I/O at the transformation stage. Adding compute to a pipeline bottlenecked at the database doesn’t help; it makes more requests pile up behind the bottleneck.

The diagnostic before any scaling effort: instrument the pipeline to show latency at each stage. P50, P95, and P99 latency for each transformation step, the queue depth at each buffer point, and the throughput (records/second) through each stage. This data makes the bottleneck obvious. Without it, optimization is guesswork.

If you don’t have per-stage instrumentation in your current pipelines, that’s the first infrastructure work to do. The instrumentation code is not complex; the discipline to add it to every pipeline is the real investment.
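A minimal in-process sketch of what that instrumentation looks like — the stage names and the nearest-rank percentile method are illustrative; in production you would emit these to your metrics backend rather than hold them in memory:

```python
import time
from collections import defaultdict

class StageMetrics:
    """Per-stage latency and throughput recording; a minimal in-process sketch."""
    def __init__(self):
        self.latencies = defaultdict(list)   # stage name -> list of seconds
        self.counts = defaultdict(int)       # stage name -> records processed

    def record(self, stage, seconds):
        self.latencies[stage].append(seconds)
        self.counts[stage] += 1

    def percentile(self, stage, p):
        """Nearest-rank percentile (P50/P95/P99) over recorded latencies."""
        samples = sorted(self.latencies[stage])
        if not samples:
            raise ValueError(f"no samples recorded for stage {stage!r}")
        idx = min(len(samples) - 1, int(p / 100 * len(samples)))
        return samples[idx]

metrics = StageMetrics()

def timed_stage(name, fn, record):
    """Wrap one transformation step so its latency is recorded per invocation."""
    start = time.perf_counter()
    out = fn(record)
    metrics.record(name, time.perf_counter() - start)
    return out
```

Wrapping each transformation in `timed_stage` is the whole discipline: once every stage reports its own latency distribution and throughput, the bottleneck stage stands out immediately.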

Choose the Right Abstraction for the Data Flow

The wrong abstraction for a data pipeline is almost impossible to fix without rewriting. The choices are:

Batch processing — run a job periodically, process all new records, write outputs. Simple to reason about, easy to debug, tolerates failure with retry semantics. Works until the batch window (the time between runs) is too long for the use case, or until the batch size exceeds what can be processed in the time available.

Micro-batch processing — Spark Structured Streaming is the canonical example — processes small batches continuously, with configurable latency. The sweet spot for use cases that need “near real-time” (seconds to minutes) but don’t need the complexity of true streaming.

Stream processing — true event-by-event processing, with strict ordering guarantees and low latency. Necessary for use cases that genuinely require sub-second processing. Significantly more complex to operate correctly, especially around failure handling, exactly-once semantics, and stateful aggregation.

The mistake I see constantly: teams choose stream processing because it sounds more modern, even when their latency requirements would be fully satisfied by micro-batch. Stream processing is not better than batch — it’s more appropriate for specific latency requirements and more expensive to operate correctly. Choose it when you need it, not as a default.
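The micro-batch semantics above can be sketched in plain Python without any framework — poll a bounded batch, transform it, commit, sleep until the next trigger. The `poll_batch`, `process`, and `commit` callables are hypothetical stand-ins for your source, transformation, and sink:

```python
import time

def run_micro_batches(poll_batch, process, commit, interval_s=30.0, max_batches=None):
    """Fixed-interval micro-batch loop: poll a bounded batch of new records,
    transform it, commit the outputs, then sleep out the rest of the window.
    poll_batch/process/commit are hypothetical callables for source/transform/sink."""
    batches_done = 0
    while max_batches is None or batches_done < max_batches:
        started = time.monotonic()
        records = poll_batch()               # bounded batch of new records
        if records:
            outputs = [process(r) for r in records]
            commit(outputs)                  # write outputs and advance offsets together
        batches_done += 1
        # fixed-interval trigger: the batch window, i.e. the latency floor
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval_s - elapsed))
```

The `interval_s` parameter is the latency knob: it is exactly the "seconds to minutes" freshness the use case tolerates, and shrinking it toward zero is what pushes you into true streaming territory.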

Kafka as the Backbone

For high-volume pipelines with multiple consumers and variable processing rates, a message queue decouples producers from consumers and provides the buffer that absorbs throughput spikes. Kafka is the standard for this use case for good reason: durable message retention, consumer group semantics, and excellent throughput characteristics at scale.

The Kafka design decisions that matter early:

Partition count determines parallelism. Each partition is consumed by one consumer in a consumer group at a time. Too few partitions means consumers can’t parallelize; repartitioning later is operationally painful. Start with more partitions than you currently need (20-30 for a mid-volume topic is not too many) because adding partitions later is not trivial.

Retention policy determines how much history Kafka holds. For operational pipelines where data is consumed quickly, short retention (24-72 hours) is fine. For pipelines where consumers may be behind (batch jobs, analytics), longer retention (7-30 days) lets consumers catch up after downtime without data loss.

Message serialization — use Avro with a schema registry, not JSON. JSON is convenient; Avro with a schema registry gives you schema evolution, compact serialization, and explicit versioning of message formats. The serialization decision is hard to change later because every producer and consumer in the system has to change simultaneously.
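What such an Avro schema looks like, expressed as the Python dict form that libraries like fastavro accept — the record and field names here are illustrative, not from any real topic:

```python
# Illustrative Avro schema for a device telemetry event. The nullable field
# carries an explicit default, which is what makes later evolution safe.
DEVICE_EVENT_SCHEMA_V1 = {
    "type": "record",
    "name": "DeviceEvent",
    "namespace": "telemetry",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "event_ts", "type": "long"},                          # epoch millis
        {"name": "battery_pct", "type": ["null", "int"], "default": None},
    ],
}
```

With a schema registry, producers register this schema under a subject and messages carry only a schema ID plus the compact binary payload — the field names never travel on the wire, which is where the size win over JSON comes from.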

Idempotent Processing Is Non-Negotiable

Any pipeline that writes to a database or external system must handle the case where a message is processed more than once. Network failures, consumer restarts, and Kafka’s at-least-once delivery semantics mean duplicate processing is not exceptional — it’s an expected operational mode.

Idempotent writes mean processing the same message twice produces the same result as processing it once. The mechanisms:

Natural idempotency — upserts based on a stable unique key. If the message has a stable ID (device ID + event timestamp is common), an upsert with ON CONFLICT DO UPDATE produces the same result regardless of how many times it runs.

Deduplication at the processing layer — before writing to the database, check whether the message has been processed. An in-memory cache with a TTL handles high-volume deduplication with bounded memory. A Redis sorted set handles deduplication with persistence across consumer restarts.

Outbox pattern for transactional consistency — write to a local outbox table in the same database transaction as the business operation. A separate process reads the outbox and produces to Kafka. This eliminates the dual-write problem where the database write succeeds and the Kafka produce fails (or vice versa).

Schema Evolution Without Downtime

The schema of your data changes over time. Fields get added. Fields get renamed. Field types change. The pipeline has to handle schema changes without requiring simultaneous deployment of all producers and consumers — coordinating synchronized deploys across multiple services is operationally expensive and error-prone.

Backward and forward compatibility rules for schema evolution:

  • Adding optional fields with defaults — backward compatible. Old consumers ignore the new field; new consumers use it. Safe.
  • Removing optional fields — forward compatible. New producers don’t send the field; old consumers don’t notice. Safe if consumers handle missing fields gracefully.
  • Renaming fields or changing types — breaking changes. Requires explicit versioning and a migration path.

Avro and Protobuf enforce compatibility rules when you register new schemas. JSON Schema can do this with a registry. Plain JSON has no enforcement — you’re relying on discipline, which doesn’t scale.
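The resolution behavior behind those rules can be shown in a few lines — a toy model of reader-schema-driven decoding (real Avro resolution is richer; field names and the `...` no-default sentinel are illustrative):

```python
def decode_with_reader_schema(message, reader_fields):
    """Toy model of schema resolution: the reader's schema drives the result.
    Unknown fields are dropped; missing fields take the reader's declared
    default; missing with no default is a hard error. `...` marks no-default."""
    out = {}
    for name, default in reader_fields.items():
        if name in message:
            out[name] = message[name]
        elif default is not ...:
            out[name] = default
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

OLD_READER = {"device_id": ..., "event_ts": ...}                       # v1 consumer
NEW_READER = {"device_id": ..., "event_ts": ..., "battery_pct": None}  # v2 adds optional field

new_msg = {"device_id": "d1", "event_ts": 1, "battery_pct": 87}   # written by v2 producer
old_msg = {"device_id": "d1", "event_ts": 1}                      # written by v1 producer
```

The two safe cases from the list fall out directly: the old consumer silently drops the new field, and the new consumer fills the missing field from its default — no coordinated deploy required.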

Monitoring the Right Things

Pipeline monitoring that surfaces the metrics that actually matter:

  • Consumer lag (Kafka) — how far behind is the consumer from the latest message. Lag that’s growing indicates the consumer can’t keep up. This is the early warning signal for pipeline capacity problems.
  • Processing throughput (records/second) — the rate at which the pipeline is making progress. A throughput drop without a corresponding upstream volume drop indicates a problem inside the pipeline.
  • Error rate — the percentage of messages that fail processing. A non-zero error rate requires investigation; errors in pipelines tend to be systematic rather than random.
  • Data freshness — how old is the newest data available in the downstream system. This is the metric your stakeholders actually care about and is easy to instrument.

Alert on consumer lag, error rates above threshold, and data freshness exceeding the SLA. The operational response for each is different — lag suggests scaling consumers or optimizing processing, error rates require examining failure causes, freshness issues may indicate upstream producer problems.
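Two of those checks reduce to one-line computations once the inputs are in hand — a sketch where the offset dicts and timestamps are illustrative (in production the offsets come from the broker via your Kafka client's consumer or admin APIs):

```python
import time

def consumer_lag(log_end_offsets, committed_offsets):
    """Total consumer lag across partitions: log-end offset minus committed
    offset, summed. A growing value is the early capacity warning."""
    return sum(log_end_offsets[p] - committed_offsets.get(p, 0)
               for p in log_end_offsets)

def freshness_alert(newest_record_ts, sla_seconds, now=None):
    """True when the newest data in the downstream system is older than the
    freshness SLA — the stakeholder-facing metric."""
    now = now if now is not None else time.time()
    return (now - newest_record_ts) > sla_seconds
```

Trending these over time matters more than any single reading: lag that oscillates with load is normal, lag that grows monotonically is a capacity problem.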

Our data engineering and analytics practice designs and builds pipelines across this full spectrum — from simple batch jobs to high-volume Kafka-based streaming systems. Related: the DevOps and automation tooling for pipeline deployment and monitoring is closely tied to the pipeline design itself, especially when your pipeline data ends up in observability infrastructure.