Your pipeline runs perfectly on day one. By month six, it’s taking 40 minutes to load what used to take 2. Sound familiar?

This is one of the most common growing pains in data engineering — and it almost always traces back to a single early decision: how you ingest your data.

In this post, we’ll break down the two fundamental ingestion patterns — Full Load and Incremental Load — and give you a clear framework for choosing between them.

What Is a Full Load?

A full load is exactly what it sounds like: every pipeline run replaces the entire destination dataset with a fresh copy from the source.

# PySpark — Full Load pattern
input_data = spark.read.json("s3://source/devices/")

(input_data
    .write
    .format("delta")
    .mode("overwrite")
    .save("s3://destination/devices/"))

When it makes sense

  • Small, stable datasets — dimension tables, reference lists, config files
  • No reliable change indicator — if your source has no timestamp or incrementing ID, you may have no choice
  • Bootstrap scenarios — spinning up a brand new data store for the first time

Where it breaks down

Scenario Why Full Load Struggles
Dataset grows to millions of rows Runtime increases linearly with volume
Data changes every hour You reprocess 100% of data for 1% of changes
Downstream reads happen during load Consumers may see incomplete or empty data mid-run
No time travel / backup A bad run can permanently overwrite clean data

One underrated risk: data consistency during the load window. If a downstream job reads your table while you’re mid-overwrite, it may process partial data. A clean solution is to expose data through a view that always points to the last complete version:

-- Swap the view atomically after load completes
CREATE OR REPLACE VIEW devices AS
SELECT * FROM devices_v20240310;

What Is an Incremental Load?

Instead of reprocessing everything, an incremental load only picks up records that changed since the last run.

There are two main ways to detect what’s new:

1. Delta Column

A column like last_updated_at or created_at that lets you filter for recent changes.

# PySpark — Incremental Load with delta column
last_run_date = "2024-03-09"

new_records = (
    spark.read.table("source.events")
    .filter(f"last_updated_at > '{last_run_date}'")
)

new_records.write.mode("append").saveAsTable("destination.events")

2. Time-Partitioned Datasets

If your source is already partitioned by date (e.g., date=2024-03-10), you only need to load the relevant partition — no filtering required.

# Shell — Sync a single date partition between S3 buckets
aws s3 sync s3://source/events/date=2024-03-10 \
            s3://destination/events/date=2024-03-10

Side-by-Side Comparison

Criteria Full Load Incremental Load
Implementation complexity Low Medium
Runtime at scale Grows with total volume Grows with change volume only
Handles hard deletes? ✅ Naturally ❌ Requires extra logic
Late-arriving data Not an issue Risk of missing records
Backfill simplicity Just rerun Requires boundary management
Best for Small/stable datasets Large or frequently changing datasets

The One Catch With Incremental Loads

Incremental load has a blind spot: hard deletes.

If a record is physically removed from the source, your delta column won’t capture that — because the row no longer exists. Two common workarounds:

  • Soft deletes: The source marks records as is_deleted = true instead of removing them
  • Change Data Capture (CDC): A separate pattern that listens to the database commit log and captures all operations, including deletes — but that’s a topic for another post

Also watch out for late-arriving data: if events take a few hours to land in your source after they occur, a tight time window will miss them. Adding a buffer (e.g., reprocessing the last 2 hours) is a simple defensive measure.


How to Decide: A Simple Decision Tree

Does your dataset fit in a reasonable load window?
├── No → Use Incremental Load
└── Yes → Does your source have reliable change indicators?
           ├── No → Use Full Load
           └── Yes → Is data volume growing fast?
                      ├── Yes → Use Incremental Load
                      └── No → Full Load is fine for now

Key Takeaway

Full load is the right starting point for most projects. It’s simple, predictable, and easy to reason about. But as your data grows, sticking with it means accepting increasingly long runtimes and rising costs.

The skill isn’t choosing one over the other — it’s knowing when to make the switch.

Start full, instrument your pipeline runtimes, and plan the migration to incremental before the pain becomes production-critical.


Try This Today

Add a last_updated_at filter to one of your existing pipelines:

.filter(f"last_updated_at > '{previous_run_date}'")

That single line is the foundation of every incremental load. Run it in parallel with your full load for a week, compare the row counts, and you’ll have everything you need to make the case for switching.


Have you ever had to migrate a pipeline from full load to incremental mid-project? What was the tipping point — data volume, cost, or a Monday morning alert? Drop a comment below.