Architectural Brief: Multi-Agent RAG Platform
One user testing a prototype hit $47 in API charges before lunch. No cost tracking, no model routing, no budget enforcement. The production system had to do two things at once: deliver accurate, grounded answers from uploaded documents, and prevent the LLM bill from outrunning the value. Every decision in this brief traces back to cost. Route queries to the cheapest model that can handle them. Track every cent per user. Cut access at $10/day. The architecture wraps 200+ models behind a single API key, combines three retrieval signals across three databases, and runs guardrails on every request.
System Topology
Infrastructure Decisions
- LLM Gateway: OpenRouter. Chose over direct OpenAI and Anthropic SDKs because one API key unlocks 200+ models. Switching from GPT-4o-mini to Claude 3.5 Sonnet is a routing table change, not a code change. The client is OpenAI-compatible, so dropping OpenRouter later requires changing one base URL.
- Vector Storage: PostgreSQL 16 with pgvector. Chose over Pinecone and dedicated vector databases because the document metadata, conversation history, cost logs, and embeddings all live in one database. One connection pool, one backup strategy, one migration tool. pgvector's IVFFlat index handles the current scale without a second service.
- Knowledge Graph: Neo4j 5 Community. Chose over property graphs in PostgreSQL because Cypher queries for multi-hop entity traversal are readable and fast. The original design treated Neo4j as a required service. First deploy to the 1GB VPS proved that wrong: the JVM alone needs 512MB, leaving nothing for the application, PostgreSQL, and Redis to share. Making graph search return empty results instead of crashing was a 15-line change that kept the architecture without the runtime cost.
- Cache Layer: Redis 7 with hiredis. Chose over application-level caching because rate limiting and session state need sub-millisecond reads shared across workers. The semantic cache itself lives in PostgreSQL (pgvector cosine similarity at 0.95 threshold), but Redis handles the rate-limit counters with INCR + EXPIRE.
- Framework: FastAPI with async SQLAlchemy. Chose over Django because every route handler is async; a synchronous framework would tie up a worker for the full duration of an LLM call (120-second timeout). Pydantic v2 handles request validation and config management in the same type system.
- Agent Framework: Pure Python ReAct loop, 90 lines. Chose over LangGraph and LangChain because the agent needs exactly one thing: call an LLM, check for tool calls, execute them, repeat up to 5 times. A short executor does that without 40+ transitive dependencies.
- Injection Detection: 8 weighted regex patterns. Chose over LLM-as-judge because regex runs in microseconds. LLM-as-judge adds 500ms+ latency and costs money on every request. The injection threshold (0.8) is tunable without retraining.
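The gateway decision above reduces model selection to a lookup. A minimal sketch of that routing table, assuming OpenRouter's `provider/model` slug convention (the exact slugs and the `pick_model` helper are illustrative, not the production code):

```python
# Task-based routing table: each task maps to the cheapest capable model.
# Prices in comments are the figures from the decision log below.
MODEL_ROUTES = {
    "chat": "openai/gpt-4o-mini",                # $0.15/M input tokens
    "summarization": "google/gemini-flash-1.5",  # $0.075/M input tokens
    "evaluation": "openai/gpt-4o-mini",
}

def pick_model(task: str) -> str:
    """Unknown tasks fall back to the chat model rather than failing."""
    return MODEL_ROUTES.get(task, MODEL_ROUTES["chat"])
```

Because OpenRouter exposes an OpenAI-compatible API, the chosen slug is passed as `model=` to a standard OpenAI client whose base URL points at `https://openrouter.ai/api/v1`; switching models, as the brief notes, is a change to this table, not to code.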
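The "one while loop with a counter" that replaced LangGraph can be sketched as follows. The `call_llm` callable and the tool-call dict shape are assumptions for illustration; the real 90-line executor is not reproduced here.

```python
import json
from typing import Any, Callable

MAX_ITERATIONS = 5  # from the brief: up to 5 tool-call rounds

def react_loop(
    call_llm: Callable[[list[dict]], dict],   # returns {"content": str, "tool_calls": [...]}
    tools: dict[str, Callable[..., Any]],
    messages: list[dict],
) -> str:
    """Minimal ReAct executor: call the LLM, run any tool calls, repeat."""
    reply: dict = {}
    for _ in range(MAX_ITERATIONS):
        reply = call_llm(messages)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["content"]           # model answered directly
        messages.append({"role": "assistant", "tool_calls": calls})
        for call in calls:
            result = tools[call["name"]](**call["arguments"])
            messages.append({
                "role": "tool",
                "name": call["name"],
                "content": json.dumps(result),
            })
    return reply.get("content", "")           # iteration budget exhausted
```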
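The weighted-regex detector admits a very small sketch. The patterns and weights below are illustrative stand-ins, not the production set of 8; only the 0.8 threshold comes from the brief.

```python
import re

# Illustrative weighted patterns; matched weights sum into a score.
INJECTION_PATTERNS = [
    (re.compile(r"ignore (all )?previous instructions", re.I), 0.9),
    (re.compile(r"system prompt", re.I), 0.5),
    (re.compile(r"you are now", re.I), 0.4),
]
THRESHOLD = 0.8  # tunable without retraining, per the brief

def injection_score(text: str) -> float:
    return sum(w for pattern, w in INJECTION_PATTERNS if pattern.search(text))

def is_injection(text: str) -> bool:
    return injection_score(text) >= THRESHOLD
```

The whole check is a handful of compiled-regex scans, which is where the microseconds-versus-500ms argument against LLM-as-judge comes from.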
Constraints That Shaped the Design
- Input: Documents (PDF, TXT, Markdown, URLs) uploaded via REST API, or natural language queries sent to the chat endpoint. Documents are deduplicated by SHA-256 content hash before embedding.
- Output: Grounded AI responses with source attribution, relevance scores, and per-request cost tracking. Streaming available via SSE.
- Scale Handled: Single-tenant deployment on a 1GB DigitalOcean VPS. With 512-character chunks and 1536-dimensional embeddings, the pgvector IVFFlat index stays in memory up to ~50,000 chunks. Beyond that, the index needs HNSW and likely a dedicated machine.
- Hard Constraints: $10/day LLM budget per user, enforced at the cost tracker before every call. OpenRouter rate limits handled with 3-attempt exponential backoff (1s min, 8s max). Neo4j dropped from production because 1GB RAM cannot run the app, PostgreSQL, Redis, and a JVM-based graph database simultaneously. Rate limiting uses fixed-window Redis counters (INCR + EXPIRE per api_key:path:minute) with fail-open on Redis errors.
- Operational Boundary: Deployed with Traefik reverse proxy. TLS via Let's Encrypt. No auto-scaling: if the VPS is saturated, responses slow down rather than spawn new instances.
Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| OpenRouter as single LLM gateway | Direct OpenAI + Anthropic SDK integration | Task-based routing saves 40-60% on LLM spend: chat queries go to GPT-4o-mini ($0.15/M input tokens), summarization to Gemini Flash ($0.075/M), evaluation to GPT-4o-mini. Direct SDK integration would mean two API clients, two billing dashboards, two sets of rate limit handling, and adding a third provider would be a code change instead of a config row. |
| Pure Python ReAct loop (90 lines) | LangGraph / LangChain agent framework | pip install langgraph pulls 43 transitive dependencies including langchain-core, pydantic v1 compatibility shims, and tenacity. Debugging a 90-line executor takes minutes. Debugging LangGraph's graph state serialization and checkpoint deserialization takes hours, for a feature that is one while loop with a counter. |
| pgvector in PostgreSQL | Pinecone / Weaviate / dedicated vector DB | Pinecone's free tier caps a single index at 100K vectors, with no metadata filtering on the Starter plan. Splitting vectors out to Pinecone means two connection pools, two backup strategies, and Pinecone's Standard tier at $70/month for a single-VPS project that already stores documents, conversations, cost logs, evaluations, and cache in one transactional PostgreSQL instance. |
| Neo4j made optional (graceful degradation) | Required Neo4j as hard dependency | Evaluation harness showed 0.83 average relevance with Neo4j, 0.81 without. Two hundredths of a point for 512MB of RAM. On a 1GB VPS, that memory is worth more allocated to PostgreSQL's shared buffers than to a graph database serving a 10%-weighted signal on a sub-1,000 document corpus. |
| Redis INCR fixed-window rate limiting | Token bucket / sliding window | Token bucket requires a Lua script or atomic compare-and-swap to maintain bucket state per key. Fixed-window is two native Redis commands. Boundary imprecision (a burst spanning two windows) allows at most 2-3 extra requests per minute, which is noise at single-VPS scale where the real bottleneck is LLM response latency, not request volume. |
| Weighted hybrid retrieval (0.7 / 0.2 / 0.1) | Pure vector search / equal-weight signals | Tested all three configurations across 50 queries against the LLM-as-judge evaluation harness. Pure vector: 0.71 average relevance. Equal weights (0.33 each): 0.74. Hybrid 0.7/0.2/0.1: 0.83. The keyword signal alone rescued 6 queries where vector search returned topically adjacent but semantically wrong chunks. |
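The 0.7/0.2/0.1 combination in the last row can be sketched as a weighted merge of per-signal score maps. The function shape is illustrative; only the weights and the graceful-degradation behavior come from the brief.

```python
# Signal weights from the decision log: vector 0.7, keyword 0.2, graph 0.1.
WEIGHTS = {"vector": 0.7, "keyword": 0.2, "graph": 0.1}

def hybrid_score(scores_by_signal: dict[str, dict[str, float]]) -> dict[str, float]:
    """Combine per-signal chunk scores (each normalized to [0, 1]) into one ranking.

    A chunk absent from a signal contributes 0 for it, and an absent signal
    contributes nothing at all -- which is how the optional Neo4j signal
    degrades to empty results without touching the scoring code."""
    combined: dict[str, float] = {}
    for signal, weight in WEIGHTS.items():
        for chunk_id, score in scores_by_signal.get(signal, {}).items():
            combined[chunk_id] = combined.get(chunk_id, 0.0) + weight * score
    return combined
```

When Neo4j is disabled, the graph map is empty and every chunk simply forfeits the 10%-weighted signal, matching the 0.83-versus-0.81 relevance trade described above.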