Enterprise Multi-Tenant GenAI Chatbot
Production AWS Bedrock chatbot serving 50+ enterprise tenants - 250+ concurrent WebSocket connections at 99.9% uptime
250+ Concurrent Connections
Problem
Enterprise teams needed domain-specific AI assistants that could answer from their own document corpus in real time, with strict tenant isolation, sub-second response latency, and 99.9% availability - at a cost sustainable for 50+ concurrent deployments.
Solution
- Built a FastAPI backend with WebSocket streaming to Claude Sonnet 4.5 on AWS Bedrock for real-time token delivery.
- A step router selects the model for each of 20+ LLM steps; a circuit breaker with automatic Sonnet-to-Haiku failover keeps the service available when the primary model degrades.
- 2-layer cache (Valkey as L1 for exact matches, semantic similarity as L2 for near-duplicate queries) cuts latency by 60% and API costs by 40%.
- Multi-tenant isolation at every layer: per-tenant JWT auth (JWE+JWS), RBAC, rate limiting, PII filtering at the LLM gateway, and isolated vector stores.
- Bedrock tool calling with custom tool schemas for structured data access across tenant-specific datasets.
- Full observability: structlog JSON logging, token usage tracking per model family, and per-tenant cost dashboards.
- Session state managed via DynamoDB with S3 for file and incident storage.
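The circuit breaker with Sonnet-to-Haiku failover described above can be sketched roughly as follows. This is a minimal illustration, not the production code: `CircuitBreaker`, `generate`, the thresholds, and the injected `primary`/`fallback` callables are all hypothetical names, and the actual Bedrock invocation is left out so the logic stays self-contained.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors, then allows a
    trial call once `reset_after` seconds have passed (half-open)."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if the primary model may be tried."""
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call after the cool-down.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def generate(prompt: str,
             primary: Callable[[str], str],
             fallback: Callable[[str], str],
             breaker: CircuitBreaker) -> str:
    """Try the primary model (assumed: Sonnet); on error, or while the
    breaker is open, answer with the fallback model (assumed: Haiku)."""
    if breaker.allow():
        try:
            result = primary(prompt)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return result
    return fallback(prompt)
```

While the breaker is open, requests skip the primary model entirely, so a degraded model is not hit with further traffic and callers do not pay its timeout on every request.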
System Flow
- Client: WebSocket Stream, FastAPI
- Auth: JWT JWE+JWS, RBAC
- LLM: Bedrock Claude, Step Router
- Cache: Valkey L1, Semantic L2
- Storage: PostgreSQL, S3 + DynamoDB
- Observability: Structlog JSON, Token Tracking
Architecture
- 01 Python FastAPI + WebSocket for real-time Claude Sonnet 4.5 token streaming
- 02 AWS Bedrock with circuit breaker + Sonnet-to-Haiku automatic failover
- 03 Step router for per-step model selection across 20+ LLM orchestration steps
- 04 2-layer cache: Valkey (L1 exact) + semantic similarity (L2) - 60% latency reduction
- 05 Per-tenant isolation: JWT (JWE+JWS) auth, RBAC, rate limiting, PII filtering at LLM gateway
- 06 Bedrock tool calling with custom tool schemas for structured data access
- 07 Session management with DynamoDB, S3 for files/incidents, multi-turn context
- 08 Observability: structlog JSON logging, token tracking per model family, per-tenant cost dashboards
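To make the 2-layer cache lookup concrete, here is a minimal sketch of an exact-match layer backed by a similarity layer, with every entry scoped to a tenant. It rests on several assumptions: the `TwoLayerCache` class, the 0.92 threshold, and the injected `embed` function are hypothetical, and plain in-memory dictionaries stand in for Valkey and for whatever vector index the real system uses.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoLayerCache:
    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = 0.92) -> None:
        self.embed = embed          # assumed embedding function
        self.threshold = threshold  # assumed similarity cut-off
        # L1: exact matches, keyed by (tenant, normalized query).
        self.exact: Dict[Tuple[str, str], str] = {}
        # L2: per-tenant list of (embedding, answer) pairs.
        self.semantic: Dict[str, List[Tuple[List[float], str]]] = {}

    @staticmethod
    def _normalize(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants hit L1.
        return " ".join(query.lower().split())

    def get(self, tenant_id: str, query: str) -> Optional[str]:
        hit = self.exact.get((tenant_id, self._normalize(query)))
        if hit is not None:
            return hit  # L1 hit: no embedding call needed
        vec = self.embed(query)
        best, best_score = None, self.threshold
        # L2: only this tenant's entries are ever compared.
        for cached_vec, answer in self.semantic.get(tenant_id, []):
            score = cosine(vec, cached_vec)
            if score >= best_score:
                best, best_score = answer, score
        return best

    def put(self, tenant_id: str, query: str, answer: str) -> None:
        self.exact[(tenant_id, self._normalize(query))] = answer
        self.semantic.setdefault(tenant_id, []).append(
            (list(self.embed(query)), answer))
```

Checking L1 first means an exact repeat never pays for an embedding call; the L2 scan is linear here for clarity, where a real deployment would more likely use an approximate nearest-neighbour index.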
Impact
- 250+ concurrent WebSocket connections across 50+ enterprise tenants
- 99.9% production uptime with zero cross-tenant data leaks
- 60% latency reduction and 40% API cost reduction via 2-layer caching
- Sub-second first-token latency with full response streaming
Tech Stack
Python, FastAPI, WebSocket, AWS Bedrock, Claude Sonnet 4.5, Valkey, PostgreSQL, DynamoDB, S3, JWT, Docker, Kubernetes, Bedrock Tools, Structlog