Enterprise Multi-Tenant GenAI Chatbot

Production AWS Bedrock chatbot serving 50+ enterprise tenants - 250+ concurrent WebSocket connections at 99.9% uptime

250+ Concurrent Connections

Problem

Enterprise teams needed domain-specific AI assistants that could answer from their own document corpus in real time, with strict tenant isolation, sub-second response latency, and 99.9% availability - at a cost sustainable for 50+ concurrent deployments.

Solution

Built a FastAPI backend with WebSocket streaming to Claude Sonnet 4.5 on AWS Bedrock for real-time token delivery.
A step router selects the model for each of 20+ LLM steps; a circuit breaker with automatic Sonnet-to-Haiku failover keeps the service available when the primary model degrades.
A 2-layer cache, with Valkey (L1) for exact matches and semantic similarity (L2) for near-duplicate queries, cuts latency by 60% and API costs by 40%.
Multi-tenant isolation at every layer: per-tenant JWT auth (JWE+JWS), RBAC, rate limiting, PII filtering at the LLM gateway, and isolated vector stores.
Bedrock tool calling with custom tool schemas for structured data access across tenant-specific datasets.
Full observability: structlog JSON logging, token usage tracking per model family, and per-tenant cost dashboards.
Session state managed via DynamoDB with S3 for file and incident storage.
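
The circuit breaker with model failover described above can be sketched roughly as follows. This is a minimal illustration, not the production implementation: the class name `CircuitBreaker`, the helper `invoke_with_failover`, the failure threshold, and the cool-down period are all assumptions, and `primary` / `fallback` are plain callables standing in for the actual Bedrock calls to Sonnet and Haiku.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive errors
    and permits a trial call once `reset_after` seconds have elapsed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if the primary model may be called right now."""
        if self.opened_at is None:
            return True  # closed: traffic flows to the primary
        # Open: allow a single trial only after the cool-down (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open and restart the cool-down


def invoke_with_failover(prompt: str,
                         primary: Callable[[str], str],
                         fallback: Callable[[str], str],
                         breaker: CircuitBreaker) -> str:
    """Call the primary model while the breaker permits it; on an error,
    or while the breaker is open, answer from the fallback model instead."""
    if breaker.allow():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(prompt)
```

In a sketch like this, one breaker instance would typically be kept per primary model, so that a degraded model is skipped for every request until its cool-down expires rather than being retried on each call.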

System Flow

Client: WebSocket Stream, FastAPI

Auth: JWT (JWE+JWS), RBAC

LLM: Bedrock Claude, Step Router

Cache: Valkey L1, Semantic L2

Storage: PostgreSQL, S3 + DynamoDB

Observability: Structlog JSON, Token Tracking

Architecture

01 Python FastAPI + WebSocket for real-time Claude Sonnet 4.5 token streaming
02 AWS Bedrock with circuit breaker + Sonnet-to-Haiku automatic failover
03 Step router for per-step model selection across 20+ LLM orchestration steps
04 2-layer cache: Valkey (L1 exact) + semantic similarity (L2) - 60% latency reduction
05 Per-tenant isolation: JWT (JWE+JWS) auth, RBAC, rate limiting, PII filtering at LLM gateway
06 Bedrock tool calling with custom tool schemas for structured data access
07 Session management with DynamoDB, S3 for files/incidents, multi-turn context
08 Observability: structlog JSON logging, token tracking per model family, per-tenant cost dashboards
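
The 2-layer cache lookup in item 04 can be illustrated with a small sketch. Everything here is an assumption made for illustration: an in-memory dict stands in for Valkey, `toy_embed` is a bag-of-words placeholder for a real embedding model, and the `TwoLayerCache` name and the 0.8 similarity threshold are hypothetical. Entries are keyed by tenant so that neither layer can return another tenant's answer.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Placeholder embedding (word counts); a real system would call a model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TwoLayerCache:
    """Hypothetical cache: L1 exact match, then L2 semantic similarity."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed,
                 threshold: float = 0.8) -> None:
        self.embed = embed
        self.threshold = threshold
        # L1: (tenant, normalized query) -> answer. Stand-in for Valkey.
        self.exact: Dict[Tuple[str, str], str] = {}
        # L2: tenant -> list of (query embedding, answer).
        self.semantic: Dict[str, List[Tuple[Vector, str]]] = {}

    def get(self, tenant_id: str, query: str) -> Optional[Tuple[str, str]]:
        """Return (layer, answer) on a hit, or None on a miss."""
        key = (tenant_id, query.strip().lower())
        if key in self.exact:
            return "L1", self.exact[key]
        query_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        # Only this tenant's entries are scanned, preserving isolation.
        for vec, answer in self.semantic.get(tenant_id, []):
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return "L2", best_answer
        return None

    def put(self, tenant_id: str, query: str, answer: str) -> None:
        """Store an answer in both layers under the given tenant."""
        self.exact[(tenant_id, query.strip().lower())] = answer
        self.semantic.setdefault(tenant_id, []).append((self.embed(query), answer))
```

Checking the exact layer first keeps the common case to a single key lookup; the similarity scan only runs on an L1 miss. The linear scan shown here is for clarity only, and a vector index would be the usual choice at scale.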

Impact

250+ concurrent WebSocket connections across 50+ enterprise tenants
99.9% production uptime with zero cross-tenant data leaks
60% latency reduction and 40% API cost reduction via 2-layer caching
Sub-second first-token latency with full response streaming

Tech Stack

Python, FastAPI, WebSocket, AWS Bedrock, Claude Sonnet 4.5, Valkey, PostgreSQL, DynamoDB, S3, JWT, Docker, Kubernetes, Bedrock Tools, Structlog