The AI Backend That Every Document-Heavy Team Rebuilds From Scratch
The Situation
Every organization that tries to build AI-powered search or chat on their own documents runs into the same wall. The prototype works in a weekend: embed some text, search with cosine similarity, send the results to GPT-4. Then reality sets in. The search returns confident but wrong results because cosine similarity measures vector proximity, not semantic correctness. The LLM hallucinates details that aren't in any source document. A single user runs up $47 in API charges before lunch because nobody built cost tracking into the prototype. There's no audit trail for what the system told a customer last Tuesday.
The infrastructure underneath "chat with your docs" is deceptively complex. It requires document ingestion with deduplication, chunk-level vector storage, multi-signal retrieval that catches the cases vector search misses, model routing that picks the cheapest model that can handle each task, cost tracking with per-user daily budgets, input safety checks, output verification, conversation memory, and caching. Most teams either ship the weekend prototype and spend the next year patching it, or pay for an enterprise RAG vendor that locks them into a proprietary stack.
The Cost of Doing Nothing
The build-or-buy decision for RAG infrastructure has a hidden third option: build it badly and maintain it forever. Teams that ship the weekend prototype end up with an engineer spending 8-12 hours a week patching retrieval quality, debugging hallucinated responses, and manually auditing LLM costs. At a mid-level engineering salary ($70,000-85,000/year depending on region), that maintenance burden costs roughly $16,000-20,000/year, spent on a system that was never designed for production.
The enterprise vendor route eliminates the maintenance work but introduces lock-in. Switching RAG vendors means migrating embeddings (which are model-specific and non-portable), rebuilding the ingestion pipeline, and re-testing retrieval quality from scratch. At $50,000-100,000/year for enterprise tiers, the switching cost compounds every year the system stays in place.
What I Built
An open-source AI backend that handles the full pipeline: document ingestion, hybrid retrieval, multi-model routing, agent tool-calling, guardrails, conversation memory, semantic caching, and quality evaluation. One API, twelve endpoints, no vendor lock-in.
The system accepts documents in four formats (PDF, plain text, Markdown, URLs), extracts content, splits it into 512-character chunks with 50-character overlap, generates 1536-dimensional embeddings via OpenRouter, and stores everything in PostgreSQL with pgvector. Duplicate documents are caught by SHA-256 content hashing before any embedding work happens.
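The chunking and dedup step above is simple enough to sketch. This is an illustrative reconstruction, not the project's actual code: the function names are mine, and only the stated parameters (512-character chunks, 50-character overlap, SHA-256 content hashing) come from the description.

```python
import hashlib

CHUNK_SIZE = 512  # characters per chunk
OVERLAP = 50      # characters shared between consecutive chunks

def content_hash(text: str) -> str:
    """SHA-256 over raw content, checked before any embedding work."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunk(text: str, size: int = CHUNK_SIZE, overlap: int = OVERLAP) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Hashing before chunking means a re-uploaded document is rejected with one cheap comparison instead of hundreds of redundant embedding calls.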
Retrieval combines three signals: vector similarity (70% weight), keyword overlap (20%), and knowledge graph relationships (10%). The knowledge graph is optional: the system runs without it in production, and the scoring formula self-corrects when the graph signal is absent. I built it, validated it, and chose not to deploy it because the production VPS didn't have enough RAM for Neo4j's JVM.
Every request runs through a guardrails pipeline. Input guardrails check for prompt injection (8 weighted regex patterns, blocking above a 0.8 threshold), PII (SSNs, credit card numbers, email addresses, phone numbers), and denied topics. Output guardrails check responses for hallucination, unsafe content, and missing source attribution. Every LLM call is logged by model, tokens, and cost, and a $10/day per-user budget is enforced before each API call.
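The weighted-regex injection check might look like the following. The patterns and weights here are invented for illustration (the real pipeline's 8 patterns aren't listed in the text); only the 0.8 blocking threshold and the weighted-pattern approach come from the description.

```python
import re

# Illustrative patterns and weights; the production pipeline uses 8.
INJECTION_PATTERNS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 0.9),
    (re.compile(r"you are now (in )?developer mode", re.I), 0.8),
    (re.compile(r"reveal your system prompt", re.I), 0.7),
    (re.compile(r"disregard .* guidelines", re.I), 0.5),
]
BLOCK_THRESHOLD = 0.8  # configurable, no model retraining needed

def injection_score(text: str) -> float:
    """Highest weight among matching patterns; 0.0 if none match."""
    return max((w for p, w in INJECTION_PATTERNS if p.search(text)), default=0.0)

def is_blocked(text: str) -> bool:
    return injection_score(text) >= BLOCK_THRESHOLD
```

Because the check is a handful of compiled regexes, it adds microseconds per request, which is the trade-off the decision log below makes against LLM-as-judge.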
System flow, data model, and architecture-layer diagrams omitted.
The Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| OpenRouter as LLM gateway | Direct OpenAI + Anthropic SDKs | One API key, 200+ models. Task-based routing (chat to GPT-4o-mini, summarization to Gemini Flash) is a config table change. If OpenRouter has an outage, swap one base URL. |
| PostgreSQL + pgvector for vectors | Pinecone / Weaviate | Documents, embeddings, conversations, cost logs, and cache entries share one database. One backup, one migration tool. Pinecone would mean a second database and a second bill. |
| Pure Python ReAct agent (90 lines) | LangGraph / LangChain | The agent calls tools in a loop with a 5-iteration cap. LangGraph pulls in graph state management, checkpoint persistence, and dozens of transitive dependencies for exactly the same loop. |
| Neo4j as optional dependency | Required graph database | Production VPS has 1GB RAM. Neo4j's JVM needs 512MB minimum. Making graph search gracefully degrade to empty results kept the production stack lean without losing the architecture for future scaling. |
| Regex-based injection detection | LLM-as-judge for prompt injection | Regex runs in microseconds across 8 weighted patterns. LLM-as-judge adds 500ms+ latency and costs money on every request. The threshold (0.8) is configurable without retraining a model. |
| Fixed-window Redis rate limiting | Token bucket / sliding window | Two Redis commands: INCR and EXPIRE. Token bucket requires Lua scripts. The precision difference at window boundaries is irrelevant at single-VPS scale. |
Results
The platform ships with 8 database tables, 12 REST API endpoints, an MCP server for external AI agent integration (Claude Desktop, Cursor), and 460+ passing tests. It deploys with docker compose up on any VPS with 1GB of RAM (without the knowledge graph) or 2GB (with Neo4j).
Instead of building document ingestion, retrieval, guardrails, and cost tracking from scratch for each engagement, the next client deployment starts from a working API with proven safety and budget controls. Ingest documents, wire a frontend, measure results. Backend work that normally takes 3-4 months of greenfield development becomes 2-3 weeks of configuration and customization.
Corpora up to ~50,000 chunks fit on a single PostgreSQL instance. Beyond that, the vector search index needs to be rebuilt for large-scale retrieval, and the VPS needs dedicated resources for the vector workload. Retrieval pipeline, model routing, and guardrails stay unchanged. The scaling constraint is storage density, not application logic.