Your pipeline runs perfectly on day one. By month six, it’s taking 40 minutes to load what used to take 2. Sound familiar?
This is one of the most common growing pains in data engineering — and it almost always traces back to a single early decision: how you ingest your data.
In this post, we’ll break down the two fundamental ingestion patterns — Full Load and Incremental Load — and give you a clear framework for choosing between them.
## What Is a Full Load?
A full load is exactly what it sounds like: every pipeline run replaces the entire destination dataset with a fresh copy from the source.
```python
# PySpark — Full Load pattern
input_data = spark.read.json("s3://source/devices/")

(input_data
    .write
    .format("delta")
    .mode("overwrite")
    .save("s3://destination/devices/"))
```
### When it makes sense
- Small, stable datasets — dimension tables, reference lists, config files
- No reliable change indicator — if your source has no timestamp or incrementing ID, you may have no choice
- Bootstrap scenarios — spinning up a brand new data store for the first time
### Where it breaks down
| Scenario | Why Full Load Struggles |
|---|---|
| Dataset grows to millions of rows | Runtime increases linearly with volume |
| Data changes every hour | You reprocess 100% of data for 1% of changes |
| Downstream reads happen during load | Consumers may see incomplete or empty data mid-run |
| No time travel / backup | A bad run can permanently overwrite clean data |
One underrated risk: data consistency during the load window. If a downstream job reads your table while you’re mid-overwrite, it may process partial data. A clean solution is to expose data through a view that always points to the last complete version:
```sql
-- Swap the view atomically after load completes
CREATE OR REPLACE VIEW devices AS
SELECT * FROM devices_v20240310;
```
## What Is an Incremental Load?
Instead of reprocessing everything, an incremental load only picks up records that changed since the last run.
There are two main ways to detect what’s new:
### 1. Delta Column
A column like `last_updated_at` or `created_at` that lets you filter for recent changes.
```python
# PySpark — Incremental Load with delta column
last_run_date = "2024-03-09"

new_records = (
    spark.read.table("source.events")
    .filter(f"last_updated_at > '{last_run_date}'")
)

new_records.write.mode("append").saveAsTable("destination.events")
```
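One practical detail the snippet above glosses over: where does `last_run_date` come from? A common approach is to persist a watermark after each successful run and read it back at the start of the next one. Here is a minimal sketch in plain Python, assuming a local JSON state file (the path and schema are hypothetical; production pipelines usually keep this in a metadata table or orchestrator state instead):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical location for pipeline state
WATERMARK_PATH = Path("state/events_watermark.json")

def read_watermark(default: str = "1970-01-01T00:00:00") -> str:
    """Return the timestamp of the last successful run, or a default for the first run."""
    if WATERMARK_PATH.exists():
        return json.loads(WATERMARK_PATH.read_text())["last_run"]
    return default

def write_watermark(ts: str) -> None:
    """Persist the watermark only AFTER the load succeeds, so a failed run is retried."""
    WATERMARK_PATH.parent.mkdir(parents=True, exist_ok=True)
    WATERMARK_PATH.write_text(json.dumps({"last_run": ts}))

# Usage: filter with the old watermark, then advance it on success
last_run_date = read_watermark()
# ... run the incremental load filtered on last_run_date ...
write_watermark(datetime.now(timezone.utc).isoformat())
```

The key design choice is ordering: advance the watermark only after the write commits, otherwise a mid-run failure silently drops records.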
### 2. Time-Partitioned Datasets
If your source is already partitioned by date (e.g., `date=2024-03-10`), you only need to load the relevant partition — no filtering required.
```bash
# Shell — Sync a single date partition between S3 buckets
aws s3 sync s3://source/events/date=2024-03-10 \
    s3://destination/events/date=2024-03-10
```
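In practice the partition date is computed at runtime rather than hardcoded. A small sketch of how a daily job might build those S3 prefixes (bucket and dataset names are illustrative, matching the shell example above):

```python
from datetime import date

def partition_path(bucket: str, dataset: str, run_date: date) -> str:
    """Build the S3 prefix for a single Hive-style date partition."""
    return f"s3://{bucket}/{dataset}/date={run_date.isoformat()}"

# A daily job typically syncs yesterday's partition;
# here a fixed date stands in for date.today() minus one day.
run_date = date(2024, 3, 10)
src = partition_path("source", "events", run_date)
dst = partition_path("destination", "events", run_date)
# src → "s3://source/events/date=2024-03-10"
```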
## Side-by-Side Comparison
| Criteria | Full Load | Incremental Load |
|---|---|---|
| Implementation complexity | Low | Medium |
| Runtime at scale | Grows with total volume | Grows with change volume only |
| Handles hard deletes? | ✅ Naturally | ❌ Requires extra logic |
| Late-arriving data | Not an issue | Risk of missing records |
| Backfill simplicity | Just rerun | Requires boundary management |
| Best for | Small/stable datasets | Large or frequently changing datasets |
## The One Catch With Incremental Loads
Incremental load has a blind spot: hard deletes.
If a record is physically removed from the source, your delta column won’t capture that — because the row no longer exists. Two common workarounds:
- Soft deletes: The source marks records as `is_deleted = true` instead of removing them
- Change Data Capture (CDC): A separate pattern that listens to the database commit log and captures all operations, including deletes — but that’s a topic for another post
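If neither soft deletes nor CDC is available, a third fallback is periodic key reconciliation: compare the primary keys in your destination against the source and purge the orphans. A minimal sketch with plain Python sets (in-memory key lists stand in for the key columns you would pull from each table):

```python
def find_hard_deletes(source_keys, destination_keys):
    """Keys present in the destination but missing from the source
    were hard-deleted upstream and should be purged downstream."""
    return set(destination_keys) - set(source_keys)

# Example: device 3 was physically removed from the source
source_keys = [1, 2, 4]
destination_keys = [1, 2, 3, 4]
find_hard_deletes(source_keys, destination_keys)  # → {3}
```

Scanning all keys is cheaper than a full load (you move only the key column, not every row), so some teams run this reconciliation nightly alongside hourly incremental loads.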
Also watch out for late-arriving data: if events take a few hours to land in your source after they occur, a tight time window will miss them. Adding a buffer (e.g., reprocessing the last 2 hours) is a simple defensive measure.
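That buffer can be implemented by shifting the incremental cutoff back before filtering. A sketch, assuming a two-hour buffer; note that re-reading a window means the downstream write must be idempotent (a MERGE/upsert rather than a blind append):

```python
from datetime import datetime, timedelta

def effective_cutoff(last_run: datetime, buffer: timedelta = timedelta(hours=2)) -> datetime:
    """Shift the incremental cutoff back so late-arriving events are re-read.
    Requires idempotent downstream writes, since the buffer window overlaps runs."""
    return last_run - buffer

last_run = datetime(2024, 3, 10, 12, 0)
effective_cutoff(last_run)  # → 2024-03-10 10:00
```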
## How to Decide: A Simple Decision Tree
```text
Does your dataset fit in a reasonable load window?
├── No  → Use Incremental Load
└── Yes → Does your source have reliable change indicators?
    ├── No  → Use Full Load
    └── Yes → Is data volume growing fast?
        ├── Yes → Use Incremental Load
        └── No  → Full Load is fine for now
```
## Key Takeaway
Full load is the right starting point for most projects. It’s simple, predictable, and easy to reason about. But as your data grows, sticking with it means accepting increasingly long runtimes and rising costs.
The skill isn’t choosing one over the other — it’s knowing when to make the switch.
Start full, instrument your pipeline runtimes, and plan the migration to incremental before the pain becomes production-critical.
## Try This Today
Add a `last_updated_at` filter to one of your existing pipelines:

```python
.filter(f"last_updated_at > '{previous_run_date}'")
```
That single line is the foundation of every incremental load. Run it in parallel with your full load for a week, compare the row counts, and you’ll have everything you need to make the case for switching.
Have you ever had to migrate a pipeline from full load to incremental mid-project? What was the tipping point — data volume, cost, or a Monday morning alert? Drop a comment below.