Multi-Tenant Data Warehouse ETL
MWAA Airflow pipeline moving 40+ tables across 10+ tenants into Aurora PostgreSQL - 85% failure reduction, 99% data accuracy
85%Failure Reduction
Problem
Data from 10+ enterprise tenants was siloed across S3 CSV exports and Pendo analytics with inconsistent schemas. Cross-tenant analytics was impossible. Large CSVs (100MB+) caused memory failures. Pipeline failures had no isolation - one bad table took down the whole run.
Solution
- Built Apache Airflow DAGs on AWS MWAA with config-driven transformations parameterized per tenant.
- Sub-DAG triggers via TriggerDagRunOperator chain dependent workflows across tables and tenants.
- AWK splits 100MB+ CSVs into ~80MB chunks for parallel streams — 30 concurrent batch operations with Pandas ETL.
- Per-table failure isolation ensures one bad table never takes down the entire pipeline; retry logic with dead-letter handling eliminates cascading failures.
- Row validation with S3-based audit logging and email alerts on failure ensure 99% data accuracy.
- Containerized Airflow workers on K8s for elastic scaling during peak ETL windows; Aurora PostgreSQL with staging and reporting schemas as the target warehouse.
System Flow
Sources
S3 CSVs
Pendo API
Config CSVs
Orchestration
MWAA DAGs
Sub-DAG Triggers
Processing
AWK Chunking
Pandas ETL
Quality
Row Validation
S3 Audit Logs
Warehouse
Aurora PostgreSQL
K8s Pods
Architecture
- 01Apache Airflow DAGs on AWS MWAA - config-driven per-tenant parameterization
- 02Sub-DAG triggers via TriggerDagRunOperator for dependent workflow chaining
- 03AWK chunking for 100MB+ CSVs (~80MB chunks) - 30 concurrent parallel batch operations
- 04Pandas ETL with per-table failure isolation, retry logic, and dead-letter handling
- 05Row validation with S3-based audit logging and email alerts on failure
- 06Containerized workers on K8s for elastic peak scaling
- 07Aurora PostgreSQL with staging + reporting schemas
Impact
- 40+ production tables across 10+ tenants processed daily
- 85% pipeline failure reduction via chunking and per-table isolation
- 99% data accuracy across all customer instances in production
- Cross-tenant analytics unlocked for the first time
Tech Stack
PythonApache AirflowAWS MWAAAurora PostgreSQLS3PendoDockerKubernetesPandas