Multi-Tenant Data Warehouse ETL

MWAA Airflow pipeline moving 40+ tables across 10+ tenants into Aurora PostgreSQL - 85% failure reduction, 99% data accuracy

85% Failure Reduction

Problem

Data from 10+ enterprise tenants was siloed across S3 CSV exports and Pendo analytics with inconsistent schemas. Cross-tenant analytics was impossible. Large CSVs (100MB+) caused memory failures. Pipeline failures had no isolation - one bad table took down the whole run.

Solution

Built Apache Airflow DAGs on AWS MWAA with config-driven transformations parameterized per tenant.
Sub-DAG triggers via TriggerDagRunOperator chain dependent workflows across tables and tenants.
AWK splits 100MB+ CSVs into ~80MB chunks that are processed as parallel streams, with 30 concurrent batch operations feeding the Pandas ETL.
Per-table failure isolation ensures one bad table never takes down the entire pipeline; retry logic with dead-letter handling eliminates cascading failures.
Row validation, backed by S3-based audit logging and email alerts on failure, ensures 99% data accuracy.
Containerized Airflow workers on K8s for elastic scaling during peak ETL windows; Aurora PostgreSQL with staging and reporting schemas as the target warehouse.
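
The chunking step can be illustrated with a minimal sketch. The pipeline itself uses AWK for this; the Python below is only a stand-in that shows the same idea (line-oriented splitting, with the header repeated in every chunk so each chunk can be parsed on its own). The function name `chunk_csv_lines` and the `max_bytes` parameter are hypothetical, and the sketch assumes no quoted field contains an embedded newline.

```python
from typing import Iterable, Iterator, List


def chunk_csv_lines(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield chunks of CSV lines; every chunk starts with the header row.

    Hypothetical helper: a Python stand-in for the AWK split, not the
    pipeline's actual implementation.
    """
    it = iter(lines)
    try:
        header = next(it)
    except StopIteration:
        return  # empty input produces no chunks

    header_size = len(header.encode("utf-8"))
    chunk, size = [header], header_size

    for line in it:
        line_size = len(line.encode("utf-8"))
        # Close the current chunk when this row would push it past the limit,
        # but never emit a chunk that holds only the header.
        if size + line_size > max_bytes and len(chunk) > 1:
            yield chunk
            chunk, size = [header], header_size
        chunk.append(line)
        size += line_size

    if len(chunk) > 1:  # flush the last partial chunk
        yield chunk
```

For the ~80MB target described above, a caller would pass something like `max_bytes=80 * 1024 * 1024` and hand each yielded chunk to its own worker.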

System Flow

Sources: S3 CSVs, Pendo API, Config CSVs

Orchestration: MWAA DAGs, Sub-DAG Triggers

Processing: AWK Chunking, Pandas ETL

Quality: Row Validation, S3 Audit Logs

Warehouse: Aurora PostgreSQL, K8s Pods
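
As a rough illustration of how the Config CSVs source could drive per-tenant parameterization, the sketch below expands a config file into one job per tenant and table. The column names (`tenant`, `table`, `source_key`), the `TableJob` type, and the sample keys are all assumptions made for the example; only the staging schema comes from the description above.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TableJob:
    tenant: str
    source_key: str    # S3 key of the CSV export (hypothetical layout)
    target_table: str  # schema-qualified staging table


def load_jobs(config_text: str) -> List[TableJob]:
    """Expand a config CSV into one job per (tenant, table) row."""
    reader = csv.DictReader(io.StringIO(config_text))
    jobs = []
    for row in reader:
        jobs.append(
            TableJob(
                tenant=row["tenant"].strip(),
                source_key=row["source_key"].strip(),
                target_table="staging." + row["table"].strip(),
            )
        )
    return jobs


# Made-up sample config; real tenants and keys would differ.
CONFIG = """tenant,table,source_key
acme,users,exports/acme/users.csv
acme,events,exports/acme/events.csv
globex,users,exports/globex/users.csv
"""

jobs = load_jobs(CONFIG)
```

Keeping tenant differences in data rather than in code means onboarding a tenant is a config change, with the same DAG logic reused for every row.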

Architecture

01 Apache Airflow DAGs on AWS MWAA - config-driven per-tenant parameterization
02 Sub-DAG triggers via TriggerDagRunOperator for dependent workflow chaining
03 AWK chunking for 100MB+ CSVs (~80MB chunks) - 30 concurrent parallel batch operations
04 Pandas ETL with per-table failure isolation, retry logic, and dead-letter handling
05 Row validation with S3-based audit logging and email alerts on failure
06 Containerized workers on K8s for elastic peak scaling
07 Aurora PostgreSQL with staging + reporting schemas
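
Per-table isolation with retries and dead-letter handling (item 04) might look roughly like the sketch below. It is written in plain Python rather than as Airflow tasks, and `run_isolated`, `load_table`, and `max_attempts` are hypothetical names introduced for illustration. A production version would likely add a backoff delay between attempts, which is omitted here.

```python
from typing import Callable, Dict, List, Tuple


def run_isolated(
    tables: List[str],
    load_table: Callable[[str], None],
    max_attempts: int = 3,
) -> Tuple[List[str], Dict[str, str]]:
    """Load each table independently; return (succeeded, dead_letter).

    Hypothetical sketch: a failing table is retried, then parked in the
    dead-letter map so the remaining tables still run.
    """
    succeeded: List[str] = []
    dead_letter: Dict[str, str] = {}  # table name -> last error message

    for table in tables:
        last_error = "not attempted"
        for _attempt in range(max_attempts):
            try:
                load_table(table)
                succeeded.append(table)
                break
            except Exception as exc:  # contain any per-table failure
                last_error = f"{type(exc).__name__}: {exc}"
        else:
            # Every attempt failed: record the table and move on.
            dead_letter[table] = last_error

    return succeeded, dead_letter
```

The dead-letter map gives the alerting step a concrete list of tables to report, while the successful tables land in the warehouse regardless.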

Impact

40+ production tables across 10+ tenants processed daily
85% pipeline failure reduction via chunking and per-table isolation
99% data accuracy across all customer instances in production
Cross-tenant analytics unlocked for the first time

Tech Stack

Python, Apache Airflow, AWS MWAA, Aurora PostgreSQL, S3, Pendo, Docker, Kubernetes, Pandas