Building Efficient Data Pipelines with SQL Window Functions

Window functions let you perform complex aggregations and rankings across ordered datasets without expensive joins or subqueries, making your data pipelines faster and more readable.

❌ The Problem: Complex Subqueries and Joins

You’re calculating month-over-month growth rates using three nested subqueries and a self-join. There’s a better way.

Consider this SQL to compute month-over-month sales growth:

SELECT
    curr.month,
    curr.total_sales,
    prev.total_sales AS prev_month_sales,
    ((curr.total_sales - prev.total_sales) / prev.total_sales) * 100 AS mom_growth
FROM
    (SELECT month, SUM(sales) AS total_sales FROM sales_data GROUP BY month) AS curr
JOIN
    (SELECT month, SUM(sales) AS total_sales FROM sales_data GROUP BY month) AS prev
ON curr.month = DATEADD(MONTH, 1, prev.month);

This query is hard to read and inefficient due to multiple aggregations and joins.

✅ The Solution: SQL Window Functions

Using window functions, you can simplify the above query significantly:

SELECT
    month,
    SUM(sales) AS total_sales,
    LAG(SUM(sales)) OVER (ORDER BY month) AS prev_month_sales,
    ((SUM(sales) - LAG(SUM(sales)) OVER (ORDER BY month)) / LAG(SUM(sales)) OVER (ORDER BY month)) * 100 AS mom_growth
FROM sales_data
GROUP BY month;

You can ever improve readability by using a CTE:

WITH monthly_sales AS (
    SELECT
        month,
        SUM(sales) AS total_sales
    FROM sales_data
    GROUP BY month
)
SELECT
    month,
    total_sales,
    LAG(total_sales) OVER (ORDER BY month) AS prev_month_sales,
    ((total_sales - LAG(total_sales) OVER (ORDER BY month)) / LAG(total_sales) OVER (ORDER BY month)) * 100 AS mom_growth
FROM monthly_sales;

For this precise example it can be over kill to use CTE, but in more complex queries it can help with readability and maintainability.

Key Window Functions

LAG() and LEAD(): Access data from previous or next rows without self-joins.
ROW_NUMBER(), RANK(), DENSE_RANK(): Assign unique or ranked numbers to rows within partitions.
SUM() OVER(), AVG() OVER(): Perform cumulative or moving aggregates.

Improving Performance

Always Define Frame Bounds for Aggregations

 -- ❌ SLOW: Unbounded (recalculates from start every row)
 SUM(revenue) OVER (ORDER BY date)

 -- ✅ FAST: Bounded window
 SUM(revenue) OVER (
     ORDER BY date 
     ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)

Partition Large Datasets

 -- Splits dataset into smaller chunks for parallel processing
 SELECT 
     date,
     product_category,
     revenue,
     SUM(revenue) OVER (
         PARTITION BY product_category 
         ORDER BY date
     ) as category_running_total
 FROM daily_sales;
 -- Processes each category independently ✅

Use ROWS vs RANGE Carefully

-- ROWS: Physical row offset (faster)
SUM(revenue) OVER (
    ORDER BY date 
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)

-- RANGE: Logical value range (slower, handles ties)
SUM(revenue) OVER (
    ORDER BY date 
    RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
)

Reuse Window Definitions

 -- ❌ INEFFICIENT: Recalculates ordering multiple times
 SELECT 
     date,
     LAG(revenue) OVER (ORDER BY date),
     LEAD(revenue) OVER (ORDER BY date),
     SUM(revenue) OVER (ORDER BY date)
 FROM daily_sales;

 -- ✅ EFFICIENT: Define window once
 SELECT 
     date,
     LAG(revenue) OVER w,
     LEAD(revenue) OVER w,
     SUM(revenue) OVER w
 FROM daily_sales
 WINDOW w AS (ORDER BY date);

Real World Performance Comparisons

Scenario: Calculate 30-day metrics for 5 years of daily data (~1,825 rows)

Query A: Using LAG() only

SELECT 
  date,
  revenue,
  LAG(revenue, 30) OVER (ORDER BY date) as revenue_30d_ago,
  revenue - LAG(revenue, 30) OVER (ORDER BY date) as change_30d
FROM daily_sales;
-- Execution time: ~50ms ✅

Query B: Using SUM OVER (unbounded)

SELECT 
  date,
  revenue,
  SUM(revenue) OVER (ORDER BY date) as running_total
FROM daily_sales;
-- Execution time: ~180ms ⚠️

Query C: Using SUM OVER (bounded 30-day window)

SELECT 
  date,
  revenue,
  SUM(revenue) OVER (
      ORDER BY date 
      ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ) as rolling_30d_total
FROM daily_sales;
-- Execution time: ~75ms ✅

When to use each approach

Use LAG/LEAD When:

✅ Comparing specific rows (yesterday, last week, last month)
✅ Calculating period-over-period changes
✅ Finding next/previous non-null values
✅ Performance is critical ```sql -- Perfect use cases LAG(revenue, 1) OVER (ORDER BY date) as yesterday LAG(revenue, 7) OVER (ORDER BY date) as last_week LAG(revenue, 30) OVER (ORDER BY date) as last_month ```

Use SUM/AVG OVER When:

✅ Calculating running totals
✅ Computing moving averages
✅ Aggregating within sliding windows
✅ Need cumulative metrics

-- Perfect use cases
SUM(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- running total
AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) -- 7-day MA
SUM(orders) OVER (PARTITION BY region ORDER BY date) -- regional running totals

ADVANCED: Hybrid Approach (Best Performance)

-- Combine techniques for optimal performance
WITH daily_metrics AS (
    SELECT 
        date,
        revenue,
        -- Fast LAG operations
        LAG(revenue, 1) OVER w as prev_day,
        LAG(revenue, 7) OVER w as prev_week,
        
        -- Bounded aggregations
        AVG(revenue) OVER (
            ORDER BY date 
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) as ma_7day,
        
        -- Row number (very fast)
        ROW_NUMBER() OVER w as day_number
    FROM daily_sales
    WINDOW w AS (ORDER BY date)
)
SELECT 
    *,
    revenue - prev_day as daily_change,
    revenue - prev_week as weekly_change,
    revenue - ma_7day as vs_7day_avg
FROM daily_metrics;
-- Execution time: ~60ms ✅

Bottom Line (Conclusion)

Performance Ranking (Fast → Slow):

🥇 ROW_NUMBER / RANK - Simple enumeration
🥈 LAG / LEAD - Direct row access (O(1))
🥉 SUM/AVG with bounded ROWS - Fixed window size
⚠️ SUM/AVG with RANGE - Value-based frames
🐌 SUM/AVG unbounded - Recalculates everything (O(N²))

Choose based on:

Need point comparisons? → LAG/LEAD
Need cumulative values? → SUM OVER with bounds
Need both? → Combine them efficiently with WINDOW clause