The vector search returned seven chunks about "database indexing strategies" for a query about "machine learning model training." All seven had cosine similarity scores above 0.72. All seven were confidently, precisely wrong.
This is the failure mode that nobody warns you about when you build a RAG system on pure vector search. Embeddings capture semantic proximity, not semantic correctness. "Database indexing" and "model training" both live in the same neighborhood of the embedding space because they co-occur in the same documents, the same blog posts, the same technical discussions. The vectors are close. The meanings are not.
I had three options. Fine-tune the embedding model (expensive, slow, and the problem would resurface with every new document domain). Raise the similarity threshold from 0.7 to 0.85 (which would kill recall on legitimate queries). Or add a second retrieval signal that doesn't rely on vector proximity at all.
I chose the third option, and then I added a third signal on top of that. The final retrieval pipeline combines three independent scores: vector similarity at 70%, keyword overlap at 20%, and knowledge graph relevance at 10%. Each signal catches failures the other two miss.
Why Three Signals
Vector search is good at finding semantically similar content. It fails when two topics share vocabulary without sharing meaning. "Python performance optimization" and "Python snake habitat" both contain "Python." A keyword system surfaces both; a vector system might correctly separate them, or it might not, depending on the embedding model's training data.
Keyword overlap is the brute-force check. If the user typed "machine learning" and the chunk contains those exact words, that is a direct signal that no embedding model can argue with. It catches the cases where vector similarity drifts into adjacent topics. The scorer calculates term overlap as a ratio:
keyword_score = len(query_terms & candidate_terms) / len(query_terms)
I chose this over BM25 because the keyword signal is a correction factor, not the primary ranker. At 20% weight, the precision difference between bag-of-words and BM25 vanishes into the final score. Adding BM25 would have meant maintaining a separate text search index alongside pgvector. One more index, one more failure mode, for a signal that carries a fifth of the weight.
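Concretely, the whole scorer is a few lines. A sketch of that bag-of-words approach (the function names and the stopword list are mine, not the actual module's):

```python
import re

# Illustrative subset; a real list would be longer.
STOPWORDS = {"the", "a", "an", "in", "of", "for", "to", "and", "or"}

def tokenize(text: str) -> set[str]:
    """Lowercase, split on non-word characters, drop stopwords."""
    return {t for t in re.split(r"\W+", text.lower()) if t and t not in STOPWORDS}

def keyword_score(query: str, candidate: str) -> float:
    """Fraction of query terms that appear verbatim in the candidate chunk."""
    query_terms = tokenize(query)
    if not query_terms:  # guard against stopword-only queries
        return 0.0
    return len(query_terms & tokenize(candidate)) / len(query_terms)
```

Because the score is a ratio over query terms only, long chunks aren't penalized for containing extra vocabulary; the signal only asks "did the chunk echo what the user typed?"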
The knowledge graph was supposed to be the sophistication layer. During document ingestion, the pipeline extracts named entities (people, organizations, concepts) from each chunk and stores them as nodes in Neo4j, linked by EXTRACTED_FROM relationships back to their source documents. When a query arrives, the system identifies entity-like terms (capitalized words longer than one character) and queries Neo4j for documents connected to those entities. If the user asks about "PostgreSQL indexing," the graph can surface chunks that mention PostgreSQL even if the embedding vectors didn't rank them high enough.
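A minimal sketch of that lookup, assuming the schema described above; the labels (`Entity`, `Document`), property names, and helper names are illustrative, and the `driver` is whatever `neo4j.GraphDatabase.driver(...)` returns:

```python
def extract_entity_terms(query: str) -> list[str]:
    """Crude entity heuristic: capitalized words longer than one character."""
    return [w for w in query.split() if w[0].isupper() and len(w) > 1]

def graph_candidates(driver, query: str) -> list[str]:
    """Return ids of documents linked to entities mentioned in the query.

    Assumes entity nodes point at their sources via EXTRACTED_FROM, as in
    the ingestion pipeline described above.
    """
    terms = extract_entity_terms(query)
    if not terms:
        return []
    cypher = (
        "MATCH (e:Entity)-[:EXTRACTED_FROM]->(d:Document) "
        "WHERE e.name IN $terms "
        "RETURN DISTINCT d.id AS doc_id"
    )
    with driver.session() as session:
        return [record["doc_id"] for record in session.run(cypher, terms=terms)]
```

The entity heuristic is deliberately dumb: it will miss lowercase entity mentions and flag sentence-initial words, but as a 10%-weight signal it only needs to be right often enough to break ties.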
The Weight Problem
Picking the weights was the part I got wrong twice.
The first version used equal weights: 0.33 for each signal. Retrieval quality dropped. Vector similarity is genuinely the best single signal for semantic search, and diluting it to 33% meant that two mediocre keyword matches could outrank a strong semantic hit. A query about "async database sessions in Python" returned chunks that happened to contain all four words scattered across an unrelated paragraph, beating a chunk that discussed the exact concept using "asynchronous" instead of "async."
The second version over-corrected: 0.9 vector, 0.05 keyword, 0.05 graph. Barely different from pure vector search. The correction signals were so quiet they couldn't override a bad vector match.
The final weights came from testing against the evaluation harness. The platform runs three LLM-as-judge scorers on every response: relevance (do the chunks match the query?), faithfulness (is the response grounded in the chunks?), and correctness (did it actually answer the question?). I ran the same 50 queries through all three weight configurations and compared the average evaluation scores.
final_score = 0.7 * vector_score + 0.2 * keyword_score + 0.1 * graph_score
At 0.7/0.2/0.1, the average relevance score improved from 0.71 (pure vector) to 0.83 (hybrid). The keyword signal caught the vocabulary drift cases. The graph signal helped occasionally but never moved the needle by more than a few points on its own.
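Spelled out as code, the reranker is one weighted sum per candidate (a sketch; the dict keys and function name are mine):

```python
WEIGHTS = {"vector": 0.7, "keyword": 0.2, "graph": 0.1}

def rerank(candidates: list[dict]) -> list[dict]:
    """Sort candidates by the weighted sum of the three retrieval signals.

    Each candidate carries per-signal scores in [0, 1]; a missing signal
    defaults to 0.0, so a candidate with no graph hit still ranks.
    """
    for c in candidates:
        c["final_score"] = sum(
            weight * c.get(f"{name}_score", 0.0)
            for name, weight in WEIGHTS.items()
        )
    return sorted(candidates, key=lambda c: c["final_score"], reverse=True)
```

This is also where the weight trade-off shows up numerically: a 0.9 vector score with nothing else totals 0.63, so a 0.6 vector match with full keyword and graph support (0.72) can still overtake it, which is exactly the correction behavior the 0.33/0.33/0.33 version overdid and the 0.9/0.05/0.05 version lost.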
Which is exactly why I dropped it in production.
The Production Constraint
The platform runs on a DigitalOcean VPS with 1GB of RAM. The FastAPI application, PostgreSQL with pgvector, and Redis all need to fit in that envelope. Neo4j is a JVM application. The JVM alone wants 512MB of heap space before you store a single node. Running four services on 1GB means none of them have room to breathe.
I had a choice: upgrade to a 2GB VPS for the graph, or architect the system so the graph layer is optional.
I chose optional.
The graph search module catches all connection errors and returns an empty result list instead of crashing. The reranker doesn't care whether the graph score is zero because a query didn't match any entities, or zero because Neo4j isn't running. It computes the weighted sum either way:
# When Neo4j is available:
final_score = 0.7 * vector_score + 0.2 * keyword_score + 0.1 * graph_score
# When Neo4j is down:
final_score = 0.7 * vector_score + 0.2 * keyword_score + 0.1 * 0.0
The weights don't re-normalize. They don't need to. Ranking order stays the same whether the graph contributes meaningful signal or zeros. The top result is still the top result. Absolute scores drop, but since I only use the combined score to order candidates and never threshold on it, it doesn't matter.
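The fail-soft behavior is a small wrapper around the graph call. A sketch (function names are mine, and the broad exception handling is the point, not an oversight):

```python
import logging

logger = logging.getLogger(__name__)

def safe_graph_scores(graph_search, query: str) -> dict[str, float]:
    """Query the graph layer, degrading to 'no signal' on any failure.

    Returns a mapping of doc_id -> graph_score. An empty dict means every
    candidate's graph contribution is 0.1 * 0.0, and ranking proceeds on
    the remaining two signals.
    """
    try:
        return graph_search(query)
    except Exception as exc:  # e.g. a connection error when Neo4j is down
        logger.warning("graph search unavailable, continuing without it: %s", exc)
        return {}
```

The caller can't tell the difference between "Neo4j is down" and "no entities matched", which is the property that makes the graph layer genuinely optional.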
What surprised me was that the evaluation scores barely moved. With Neo4j running locally, the average relevance across 50 test queries was 0.83. Without Neo4j, it dropped to 0.81. Two hundredths of a point. For a document corpus under 1,000 items, the knowledge graph is architecture for the future, not value for today. The entity relationships just aren't dense enough to meaningfully reshape the rankings.
This would change at scale. At 50,000 documents, the embedding space gets crowded. Vector similarity starts returning more near-misses because the neighborhood density increases. That's when the graph becomes the tiebreaker: not "does this chunk mention the topic?" but "does this chunk discuss the topic in the context of entities the user has been asking about?" The architecture supports it. The current deployment doesn't need it.
The Design Principle
I've started treating optional components as a first-class architectural pattern. Not every service in the topology needs to be a hard dependency. The question isn't "does this component add value?" The question is "does the system still function without it?"
For the knowledge graph, the answer was yes. For the semantic cache, also yes: if Redis goes down, the cache misses and every query hits the LLM directly. More expensive, but correct. The rate limiter also fails open: if Redis is unavailable, requests pass through without rate checks. Worse for cost, but the system stays up.
The guardrails pipeline is the one place I didn't apply this pattern. If the input guardrails can't run (injection detection, PII scan, topic policy check), the request fails. Silently processing a potentially injected prompt because the guardrail service was down is worse than returning an error. Safety is a hard dependency. Performance optimization is not.
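The contrast between the two failure policies fits in two small functions (a sketch; the names are mine, not the platform's):

```python
def cached_answer(cache_get, llm_call, query: str) -> str:
    """Fail open: a broken cache means a slower, costlier, but correct answer."""
    try:
        hit = cache_get(query)
        if hit is not None:
            return hit
    except ConnectionError:
        pass  # Redis down: treat it as a cache miss and fall through
    return llm_call(query)

def guarded_request(run_guardrails, handle, query: str) -> str:
    """Fail closed: if the safety checks cannot run, the request does not run."""
    try:
        run_guardrails(query)  # injection detection, PII scan, topic policy
    except Exception as exc:
        raise RuntimeError("guardrails unavailable; refusing request") from exc
    return handle(query)
```

Same outage, opposite defaults: the cache degrades to the expensive path, the guardrails degrade to an error. Which default a component gets is the whole design decision.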
That distinction ended up shaping the entire system's resilience model. And it started with a knowledge graph that was worth 10% of a retrieval score and zero percent of the production memory budget.