
Async Document Ingestion Pipeline

SQS-based ETL pipeline ingesting 1000+ docs/day via Docling OCR with a PyMuPDF fallback, achieving a 99.5% success rate with adaptive memory scaling

1000+ Docs/Day

Problem

The RAG chatbot needed a reliable way to ingest large volumes of enterprise documents (PDFs, scanned files, contracts) daily without blocking real-time queries, dropping documents under memory pressure, or creating stale chunks when documents were updated.

Solution

  • Built an SQS-based async pipeline where S3 document events are queued and processed independently from query serving.
  • Docling OCR handles scanned documents and images with a circuit breaker that falls back to PyMuPDF on failure.
  • A ThreadPoolExecutor with 12 parallel workers raises throughput. Adaptive memory scaling monitors heap usage and dynamically throttles batch sizes to prevent out-of-memory (OOM) failures.
  • AWS Bedrock Titan generates 1024-dim embeddings, stored as pgvector halfvec(1024) (FP16), which halves per-vector storage compared with FP32.
  • Chunks are upserted into Aurora pgvector with tenant isolation via JWT + DynamoDB, providing per-tenant access control, and with freshness tracking so that updated documents do not leave stale chunks behind.
  • Dead-letter queues capture failed documents for inspection and replay.
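The Docling-to-PyMuPDF fallback can be sketched as a small circuit breaker. This is a minimal sketch, not the project's code. `CircuitBreaker`, `extract_text`, and the `failure_threshold`/`reset_timeout` parameters are hypothetical names. The two extractors are passed in as plain callables, for example a Docling-based OCR function and a PyMuPDF text extractor, so the sketch never calls either library directly. After a run of consecutive failures the breaker opens and documents go straight to the fallback. Once the timeout expires, a single trial call to the primary is allowed again.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical minimal circuit breaker: after `failure_threshold`
    consecutive failures the primary path is skipped ("open") until
    `reset_timeout` seconds have passed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        # Closed: always allow. Open: allow one trial call once the timeout expires.
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        # A successful call closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        # Open (or re-open) the circuit once the threshold is reached.
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def extract_text(path: str, primary: Callable[[str], str],
                 fallback: Callable[[str], str], breaker: CircuitBreaker) -> str:
    """Hypothetical helper: try the primary extractor (e.g. Docling OCR) unless
    the breaker is open; on an error or an open circuit, use the fallback
    (e.g. PyMuPDF)."""
    if breaker.allow():
        try:
            text = primary(path)
            breaker.record_success()
            return text
        except Exception:
            breaker.record_failure()
    return fallback(path)
```

While the circuit is open, the primary extractor is skipped entirely. This keeps a struggling OCR path from adding its failure latency to every document in the queue.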

System Flow

  • Documents: S3 Download, Docling OCR
  • Queue: SQS Queues, Dead Letter
  • Processing: Chunk Splitter, ThreadPool (12 workers)
  • Embedding: Titan 1024-dim, Batch Embedder
  • Index: Aurora pgvector, Tenant Isolation
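At the Documents and Queue stages, a worker that reads S3 event notifications from SQS first has to extract the bucket and object key from each message body. The sketch below assumes S3 publishes notifications directly to the queue. Routing them through SNS would wrap the payload in a second envelope. `s3_objects_from_sqs_body` is a hypothetical helper name, and the code uses only the standard library. Object keys in S3 notifications are URL-encoded, so they are decoded with `unquote_plus` before the file is downloaded.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: return (event_name, bucket, key) for each record in
    an S3 event notification delivered directly to SQS."""
    event = json.loads(body)
    objects = []
    # Messages without "Records" (e.g. S3's one-off test event) yield nothing.
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded: spaces as '+', special characters as %XX.
        key = unquote_plus(s3["object"]["key"])
        objects.append((record["eventName"], s3["bucket"]["name"], key))
    return objects
```

The helper returns the event name so the caller can treat `ObjectCreated:*` and `ObjectRemoved:*` records differently. One possible approach is to re-chunk a document when it is overwritten and delete its chunks when it is removed, which would support the freshness tracking described above.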

Architecture

  • 01 SQS-based async queue: S3 events decoupled from real-time query serving
  • 02 Docling OCR with circuit breaker fallback to PyMuPDF for native PDFs
  • 03 ThreadPoolExecutor with 12 parallel workers for throughput
  • 04 Adaptive memory scaling: dynamic batch throttling under load
  • 05 AWS Bedrock Titan 1024-dim embeddings → halfvec(1024) FP16 storage in Aurora pgvector
  • 06 Tenant isolation via JWT+DynamoDB for per-tenant access control
  • 07 Dead-letter queues for failed document inspection and replay
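The 12-worker pool and adaptive batch throttling could be combined as shown below. This is a sketch under stated assumptions, not the pipeline's implementation. The thresholds, the names `next_batch_size` and `process_adaptively`, and the control rule are all hypothetical. The rule halves the batch under memory pressure and grows it by one otherwise. It is a standard additive-increase, multiplicative-decrease heuristic, swapped in here for illustration. Memory is read through an injected callable that returns a fraction between 0 and 1. In practice, that value might come from `psutil.virtual_memory().percent / 100` or from a container memory limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def next_batch_size(current: int, mem_fraction: float, *, high: float = 0.80,
                    low: float = 0.60, min_size: int = 1, max_size: int = 64) -> int:
    """Hypothetical AIMD-style rule: halve the batch above `high` memory use,
    grow it by one below `low`, otherwise keep it unchanged."""
    if mem_fraction >= high:
        return max(min_size, current // 2)
    if mem_fraction < low:
        return min(max_size, current + 1)
    return current


def process_adaptively(items: List[T], work: Callable[[T], R],
                       mem_fraction: Callable[[], float], *, workers: int = 12,
                       initial_batch: int = 16) -> List[R]:
    """Process items in batches on one shared thread pool, resizing each batch
    from the latest memory reading. Results keep input order."""
    results: List[R] = []
    batch_size = initial_batch
    i = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while i < len(items):
            batch = items[i:i + batch_size]
            # pool.map returns results in input order; extend() drains the whole
            # batch before the next one is submitted.
            results.extend(pool.map(work, batch))
            i += len(batch)
            # Re-check memory after each batch and adjust the next batch size.
            batch_size = next_batch_size(batch_size, mem_fraction())
    return results
```

Each batch is fully drained before the next one is submitted. As a result, the batch size bounds how many documents are queued or in progress at once, while the pool size separately caps concurrency.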

Impact

  • 1000+ documents ingested daily at 99.5% success rate
  • Zero impact on real-time query latency, since ingestion runs asynchronously outside the query-serving path
  • Adaptive memory scaling prevents OOM under document load spikes
  • Full tenant isolation and freshness tracking on all chunks

Tech Stack

Python, AWS SQS, PyMuPDF, Docling OCR, AWS Bedrock, Titan Embeddings, pgvector, Aurora, PostgreSQL, DynamoDB