
Enterprise Multi-Tenant GenAI Chatbot

Production AWS Bedrock chatbot serving 50+ enterprise tenants, sustaining 250+ concurrent WebSocket connections at 99.9% uptime

250+ Concurrent Connections

Problem

Enterprise teams needed domain-specific AI assistants that could answer from their own document corpus in real time, with strict tenant isolation, sub-second response latency, and 99.9% availability - at a cost sustainable for 50+ concurrent deployments.

Solution

  • Built a FastAPI backend with WebSocket streaming to Claude Sonnet 4.5 on AWS Bedrock for real-time token delivery.
  • A step router selects the model for each of 20+ LLM steps; a circuit breaker with automatic Sonnet-to-Haiku failover keeps the service available when the primary model degrades.
  • 2-layer cache: Valkey (L1) serves exact matches and a semantic-similarity layer (L2) serves near-duplicate queries, cutting latency by 60% and API costs by 40%.
  • Multi-tenant isolation at every layer: per-tenant JWT auth (JWE+JWS), RBAC, rate limiting, PII filtering at the LLM gateway, and isolated vector stores.
  • Bedrock tool calling with custom tool schemas for structured data access across tenant-specific datasets.
  • Full observability: structlog JSON logging, token usage tracking per model family, and per-tenant cost dashboards.
  • Session state managed via DynamoDB with S3 for file and incident storage.
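
The circuit-breaker failover from the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production implementation: `CircuitBreaker`, `invoke_with_failover`, the failure threshold, and the cool-down are hypothetical, and `primary` / `fallback` are plain callables standing in for the Sonnet and Haiku calls on Bedrock.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; permits a probe after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the cool-down can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        # Closed circuit: the primary model may be called.
        if self.opened_at is None:
            return True
        # Open circuit: permit a probe only once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open and restart the cool-down


def invoke_with_failover(prompt: str,
                         primary: Callable[[str], str],
                         fallback: Callable[[str], str],
                         breaker: CircuitBreaker) -> str:
    """Call the primary model while the breaker allows it; otherwise, or on error, use the fallback."""
    if breaker.allow():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(prompt)
```

While the circuit is open, requests skip the primary model entirely, so a degraded primary does not add its timeout to every request before the fallback answers.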

System Flow

  • Client: WebSocket Stream, FastAPI
  • Auth: JWT (JWE+JWS), RBAC
  • LLM: Bedrock Claude, Step Router
  • Cache: Valkey L1, Semantic L2
  • Storage: PostgreSQL, S3 + DynamoDB
  • Observability: Structlog JSON, Token Tracking
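
The Cache stage in the flow above (exact-match L1, then semantic L2) can be sketched as a two-step lookup. This is an illustration under stated assumptions: a plain dict stands in for Valkey, `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the `TwoLayerCache` name, the 0.8 similarity threshold, and the tenant-prefixed key format are not taken from the project.

```python
import hashlib
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: lowercase bag-of-words counts."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TwoLayerCache:
    def __init__(self, embed: Callable[[str], Vector] = toy_embed, threshold: float = 0.8) -> None:
        self.embed = embed
        self.threshold = threshold
        self.l1: Dict[str, str] = {}                        # stand-in for Valkey (exact match)
        self.l2: Dict[str, List[Tuple[Vector, str]]] = {}   # per-tenant semantic entries

    @staticmethod
    def _key(tenant_id: str, query: str) -> str:
        # Tenant-prefixed key so tenants never share L1 entries.
        digest = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
        return f"{tenant_id}:{digest}"

    def get(self, tenant_id: str, query: str) -> Optional[Tuple[str, str]]:
        """Return (layer, answer) on a hit, or None on a miss."""
        exact = self.l1.get(self._key(tenant_id, query))
        if exact is not None:
            return ("l1", exact)
        # L2: linear scan over this tenant's entries only; a vector index would replace this.
        vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for cached_vec, answer in self.l2.get(tenant_id, []):
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return ("l2", best_answer)
        return None

    def put(self, tenant_id: str, query: str, answer: str) -> None:
        self.l1[self._key(tenant_id, query)] = answer
        self.l2.setdefault(tenant_id, []).append((self.embed(query), answer))
```

Checking L1 first keeps the common case to a single key lookup; the similarity scan runs only on an exact-match miss, and both layers are scoped by tenant so a hit can never return another tenant's answer.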

Architecture

  • 01 Python FastAPI + WebSocket for real-time Claude Sonnet 4.5 token streaming
  • 02 AWS Bedrock with circuit breaker + Sonnet-to-Haiku automatic failover
  • 03 Step router for per-step model selection across 20+ LLM orchestration steps
  • 04 2-layer cache: Valkey (L1 exact) + semantic similarity (L2) - 60% latency reduction
  • 05 Per-tenant isolation: JWT (JWE+JWS) auth, RBAC, rate limiting, PII filtering at LLM gateway
  • 06 Bedrock tool calling with custom tool schemas for structured data access
  • 07 Session management with DynamoDB, S3 for files/incidents, multi-turn context
  • 08 Observability: structlog JSON logging, token tracking per model family, per-tenant cost dashboards
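
Per-step model selection and per-family token tracking from the list above can be sketched together. Everything named here is an assumption for illustration: `StepRouter`, `TokenLedger`, the step names, and the `"haiku"` / `"sonnet"` labels are hypothetical placeholders, and real Bedrock model IDs and pricing are deliberately left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class StepRouter:
    """Maps an orchestration step name to a model family, with a default for unlisted steps."""
    routes: Dict[str, str]
    default_model: str = "sonnet"

    def model_for(self, step: str) -> str:
        return self.routes.get(step, self.default_model)


class TokenLedger:
    """Accumulates input/output token counts per (tenant, model family)."""

    def __init__(self) -> None:
        self.totals: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(
            lambda: {"input": 0, "output": 0}
        )

    def record(self, tenant_id: str, model_family: str,
               input_tokens: int, output_tokens: int) -> None:
        entry = self.totals[(tenant_id, model_family)]
        entry["input"] += input_tokens
        entry["output"] += output_tokens

    def tenant_total(self, tenant_id: str) -> int:
        # Sum across all model families for one tenant.
        return sum(
            entry["input"] + entry["output"]
            for (tenant, _family), entry in self.totals.items()
            if tenant == tenant_id
        )


# Hypothetical routing table: cheap steps go to the smaller model.
router = StepRouter(routes={"classify_intent": "haiku", "final_answer": "sonnet"})
ledger = TokenLedger()
ledger.record("acme", router.model_for("classify_intent"), input_tokens=120, output_tokens=8)
ledger.record("acme", router.model_for("final_answer"), input_tokens=900, output_tokens=300)
print(ledger.tenant_total("acme"))  # 1328
```

Keying the ledger by tenant and model family is what makes a per-tenant cost view possible: multiplying each entry by that family's token price gives the cost breakdown.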

Impact

  • 250+ concurrent WebSocket connections across 50+ enterprise tenants
  • 99.9% production uptime with zero cross-tenant data leaks
  • 60% latency reduction and 40% API cost reduction via 2-layer caching
  • Sub-second first-token latency with full response streaming

Tech Stack

Python, FastAPI, WebSocket, AWS Bedrock, Claude Sonnet 4.5, Valkey, PostgreSQL, DynamoDB, S3, JWT, Docker, Kubernetes, Bedrock Tools, Structlog