Async Document Ingestion Pipeline

SQS-based ETL ingesting 1000+ docs/day via Docling OCR with PyMuPDF fallback - 99.5% success rate with adaptive memory scaling

1000+Docs/Day

Problem

The RAG chatbot needed a reliable way to ingest large volumes of enterprise documents (PDFs, scanned files, contracts) daily without blocking real-time queries, dropping documents under memory pressure, or creating stale chunks when documents were updated.

Solution

Built an SQS-based async pipeline where S3 document events are queued and processed independently from query serving.
Docling OCR handles scanned documents and images with a circuit breaker that falls back to PyMuPDF on failure.
ThreadPoolExecutor with 12 parallel workers maximizes throughput; adaptive memory scaling monitors heap usage and throttles batch sizes dynamically to prevent OOM.
AWS Bedrock Titan generates 1024-dim embeddings stored as halfvec(1024) FP16 for efficient storage.
Chunks upserted into Aurora pgvector with tenant isolation via JWT+DynamoDB for per-tenant access control and freshness tracking.
Dead-letter queues capture failed documents for inspection and replay.

System Flow

Documents

S3 Download

Docling OCR

Queue

SQS Queues

Dead Letter

Processing

Chunk Splitter

ThreadPool 12

Embedding

Titan 1024-dim

Batch Embedder

Index

Aurora pgvector

Tenant Isolation

Architecture

01SQS-based async queue - S3 events decoupled from real-time query serving
02Docling OCR with circuit breaker fallback to PyMuPDF for native PDFs
03ThreadPoolExecutor with 12 parallel workers for throughput
04Adaptive memory scaling - dynamic batch throttling under load
05AWS Bedrock Titan 1024-dim embeddings → halfvec(1024) FP16 storage in Aurora pgvector
06Tenant isolation via JWT+DynamoDB for per-tenant access control
07Dead-letter queues for failed document inspection and replay

Impact

1000+ documents ingested daily at 99.5% success rate
Zero impact on real-time query latency due to async architecture
Adaptive memory scaling prevents OOM under document load spikes
Full tenant isolation and freshness tracking on all chunks

Tech Stack

PythonAWS SQSPyMuPDFDocling OCRAWS BedrockTitan EmbeddingspgvectorAuroraPostgreSQLDynamoDB