ARTENIS ALIJA
Data Pipelines · Python · Architecture · 6 min read · 10 December 2024

Data Pipeline Design Patterns That Hold Under Pressure

The patterns that separate pipelines that work from pipelines that work reliably — idempotency, observability, graceful degradation, and schema evolution.

Most data pipelines work on day one. The interesting question is how they behave on day 90, when the source changed its API, when a batch run was interrupted halfway through, when someone accidentally ran the pipeline twice. The patterns that matter are the boring ones.

Idempotency first. Every pipeline step should produce the same output given the same input, regardless of how many times it runs. In practice that means deduplicating on content (hash the record, store the hash) rather than on timestamps. Timestamps are fragile: they change, they're timezone-ambiguous, and they lead to subtle double-ingestion bugs.
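A minimal sketch of content-based deduplication, assuming records are JSON-serialisable dicts. The names `record_hash` and `ingest` are hypothetical, and the in-memory set stands in for whatever durable hash store the pipeline actually uses.

```python
import hashlib
import json
from typing import Iterable, List, Set


def record_hash(record: dict) -> str:
    """Hash a record's content, independent of key order."""
    # Canonical form: sorted keys, no whitespace, so equal content
    # always serialises to the same bytes.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(records: Iterable[dict], seen: Set[str]) -> List[dict]:
    """Return only records whose hash is not yet in `seen`.

    `seen` is an illustrative stand-in for a persistent hash store.
    """
    fresh = []
    for record in records:
        digest = record_hash(record)
        if digest in seen:
            continue  # already ingested: skipping makes a re-run a no-op
        seen.add(digest)
        fresh.append(record)
    return fresh
```

Running `ingest` twice over the same batch with the same `seen` store returns nothing the second time, which is the property that makes an accidental double run harmless.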

Observability is the second pillar. A pipeline with no metrics is an unmaintainable pipeline. I instrument every step with: records in, records out, records skipped (with reason), duration, and last successful run time. This goes to a Postgres table first, then to a dashboard if the client wants one.
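The per-step metrics above could be recorded roughly like this. It is a sketch under assumptions: `sqlite3` stands in for Postgres so the example is self-contained, and the table name `pipeline_step_runs`, its columns, and the `run_step` helper are all hypothetical.

```python
import json
import sqlite3
import time
from collections import Counter
from datetime import datetime, timezone

# Hypothetical metrics table; in Postgres the column types would differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pipeline_step_runs (
    step              TEXT    NOT NULL,
    records_in        INTEGER NOT NULL,
    records_out       INTEGER NOT NULL,
    skipped_by_reason TEXT    NOT NULL,
    duration_seconds  REAL    NOT NULL,
    finished_at       TEXT    NOT NULL
)
"""


def run_step(conn, step, records, transform):
    """Run `transform` over `records` and write one metrics row.

    `transform(record)` is assumed to return (output, None) on success
    or (None, reason) when the record should be skipped.
    """
    started = time.perf_counter()
    out, skipped, records_in = [], Counter(), 0
    for record in records:
        records_in += 1
        result, reason = transform(record)
        if reason is not None:
            skipped[reason] += 1
        else:
            out.append(result)
    # The row is written only if the loop finished without raising, so
    # MAX(finished_at) per step is the last successful run time.
    conn.execute(
        "INSERT INTO pipeline_step_runs VALUES (?, ?, ?, ?, ?, ?)",
        (
            step,
            records_in,
            len(out),
            json.dumps(dict(skipped), sort_keys=True),
            time.perf_counter() - started,
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return out
```

Keeping skip reasons as a small JSON map means a dashboard can later break down why records were dropped without a schema change to the metrics table.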

Schema evolution is where pipelines go brittle. I use Pydantic models for all inter-stage data contracts and version them explicitly. When the upstream schema changes, the model migration is a PR: reviewable, testable, rollbackable. Never parse raw JSON beyond the ingestion boundary; downstream stages should only ever see validated models.
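One way explicitly versioned contracts might look, as a sketch rather than the exact setup described here. The `OrderV1`/`OrderV2` models, their fields, the `schema_version` key, and the `parse_order` helper are all invented for illustration. It assumes Pydantic is installed and uses only constructor-based validation, which behaves the same in Pydantic 1.x and 2.x.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class OrderV1(BaseModel):
    # Hypothetical original contract: amount as a float in major units.
    schema_version: Literal[1] = 1
    order_id: str
    amount: float


class OrderV2(BaseModel):
    # Hypothetical revised contract: integer cents plus a currency code.
    schema_version: Literal[2] = 2
    order_id: str
    amount_cents: int
    currency: str = "USD"


def upgrade_v1_to_v2(old: OrderV1) -> OrderV2:
    """The migration lives in code, so it can be reviewed and tested."""
    return OrderV2(order_id=old.order_id, amount_cents=round(old.amount * 100))


def parse_order(raw: dict) -> OrderV2:
    """Ingestion boundary: raw dicts go in, validated current models come out."""
    version = raw.get("schema_version", 1)  # assume unversioned payloads are v1
    if version == 1:
        return upgrade_v1_to_v2(OrderV1(**raw))
    if version == 2:
        return OrderV2(**raw)
    raise ValueError(f"unsupported schema_version: {version!r}")
```

Later stages accept only `OrderV2`, so an upstream change shows up as a `ValidationError` at ingestion instead of a `KeyError` three stages downstream.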