Async Document Ingestion Pipeline
SQS-based ETL ingesting 1000+ docs/day via Docling OCR with PyMuPDF fallback - 99.5% success rate with adaptive memory scaling
1000+Docs/Day
Problem
The RAG chatbot needed a reliable way to ingest large volumes of enterprise documents (PDFs, scanned files, contracts) daily without blocking real-time queries, dropping documents under memory pressure, or creating stale chunks when documents were updated.
Solution
- Built an SQS-based async pipeline where S3 document events are queued and processed independently from query serving.
- Docling OCR handles scanned documents and images with a circuit breaker that falls back to PyMuPDF on failure.
- ThreadPoolExecutor with 12 parallel workers maximizes throughput; adaptive memory scaling monitors heap usage and throttles batch sizes dynamically to prevent OOM.
- AWS Bedrock Titan generates 1024-dim embeddings stored as halfvec(1024) FP16 for efficient storage.
- Chunks upserted into Aurora pgvector with tenant isolation via JWT+DynamoDB for per-tenant access control and freshness tracking.
- Dead-letter queues capture failed documents for inspection and replay.
System Flow
Documents
S3 Download
Docling OCR
Queue
SQS Queues
Dead Letter
Processing
Chunk Splitter
ThreadPool 12
Embedding
Titan 1024-dim
Batch Embedder
Index
Aurora pgvector
Tenant Isolation
Architecture
- 01SQS-based async queue - S3 events decoupled from real-time query serving
- 02Docling OCR with circuit breaker fallback to PyMuPDF for native PDFs
- 03ThreadPoolExecutor with 12 parallel workers for throughput
- 04Adaptive memory scaling - dynamic batch throttling under load
- 05AWS Bedrock Titan 1024-dim embeddings → halfvec(1024) FP16 storage in Aurora pgvector
- 06Tenant isolation via JWT+DynamoDB for per-tenant access control
- 07Dead-letter queues for failed document inspection and replay
Impact
- 1000+ documents ingested daily at 99.5% success rate
- Zero impact on real-time query latency due to async architecture
- Adaptive memory scaling prevents OOM under document load spikes
- Full tenant isolation and freshness tracking on all chunks
Tech Stack
PythonAWS SQSPyMuPDFDocling OCRAWS BedrockTitan EmbeddingspgvectorAuroraPostgreSQLDynamoDB