Vector Databases for RAG in Production
Vector databases are central to modern RAG systems, but many implementations fail because teams treat them as storage instead of retrieval engines.
A production-ready vector system must balance:
- retrieval quality (recall/precision)
- latency SLOs
- filtering correctness
- cost and operational stability
This article covers architecture decisions, index trade-offs, and practical operating patterns.
What a Vector Database Actually Does
At retrieval time, you:
- convert query to embedding vector
- search nearest vectors in index
- apply metadata filters and ranking
- return top-k context for downstream generation
The key challenge is approximate nearest neighbor (ANN) search under strict latency.
Exact Search vs Approximate Search
Exact nearest-neighbor gives perfect recall but is expensive at scale. ANN gives large speed gains with small recall trade-off.
For most production RAG workloads, ANN is required. The goal is not max speed or max recall alone, but best quality under latency budget.
Core ANN Index Families
HNSW (Hierarchical Navigable Small World)
Strengths:
- strong recall-latency performance
- good default for many medium/large corpora
Trade-offs:
- memory-heavy
- build time can be significant
IVF (Inverted File)
Strengths:
- scalable for very large datasets
- tunable search breadth
Trade-offs:
- requires good clustering configuration
- recall sensitive to probe settings
PQ / OPQ (Product Quantization)
Strengths:
- reduces memory footprint substantially
Trade-offs:
- compression can degrade similarity precision
Use when memory pressure is critical and slight recall drop is acceptable.
Metadata Filtering Is Not Optional
Enterprise retrieval needs filters:
- tenant_id
- access policy
- language
- date range
- document type
If filtering is weak or applied incorrectly, you risk:
- unauthorized retrieval
- low relevance due to mixed domains
- compliance incidents
Validate filter behavior with integration tests, not assumptions.
Hybrid Retrieval Pattern
Vector search alone is often insufficient. For production, hybrid retrieval is usually stronger:
- dense ANN retrieval for semantic matches
- BM25/sparse retrieval for exact terms and identifiers
- score fusion and reranking
Hybrid improves robustness on enterprise queries containing product codes, policy IDs, and rare terms.
Reranking Layer
Initial retrieval can return 50-200 candidates. A reranker reorders for precision.
Benefits:
- higher top-k relevance
- reduced context noise
- better grounded generation quality
Trade-off is additional latency. Measure marginal gain carefully.
Operational SLO Design
Define retrieval SLOs explicitly:
- P50/P95/P99 query latency
- recall@k target on gold evaluation set
- index freshness lag
- filter correctness rate
A retrieval stack with good average latency but poor tail latency can still break user experience.
Index Update Strategies
Common update modes:
- full rebuild (simple, costly)
- incremental append and periodic compaction
- streaming updates for freshness-sensitive corpora
Also design deletion behavior:
- hard delete for sensitive content
- tombstone + periodic cleanup for large pipelines
Stale or deleted documents in index are high-risk failure modes.
Capacity and Cost Planning
Plan for:
- vector dimension and memory footprint
- index replication for availability
- query throughput bursts
- embedding refresh waves
Cost usually grows faster from replication and memory than from compute. Instrument per-query cost and cache hit rates.
Evaluation Framework
Offline retrieval metrics:
- recall@k
- precision@k
- MRR/NDCG
Online metrics:
- answer acceptance rate
- citation usefulness
- escalation rate
- latency by query segment
Evaluate by query class (faq, troubleshooting, policy, long-tail) for realistic diagnosis.
End-to-End Code Example (FAISS + Metadata Filter + Hybrid Rerank)
import numpy as np
import faiss
from rank_bm25 import BM25Okapi
# Assume you already have document chunks + embeddings
chunks = [
{"id": "c1", "text": "Refunds are allowed within 30 days", "tenant": "a", "tokens": ["refund", "30", "days"]},
{"id": "c2", "text": "Proration applies on upgrade", "tenant": "a", "tokens": ["proration", "upgrade"]},
{"id": "c3", "text": "Enterprise SLA is 99.9%", "tenant": "b", "tokens": ["enterprise", "sla"]},
]
emb = np.array([
[0.12, 0.42, 0.91],
[0.10, 0.35, 0.84],
[0.95, 0.07, 0.11],
], dtype="float32")
# Normalize for cosine similarity via inner product
faiss.normalize_L2(emb)
index = faiss.IndexHNSWFlat(emb.shape[1], 32)
index.hnsw.efConstruction = 80
index.hnsw.efSearch = 64
index.add(emb)
bm25 = BM25Okapi([c["tokens"] for c in chunks])
# Mock query embedding + sparse tokens
query_vec = np.array([[0.11, 0.40, 0.89]], dtype="float32")
faiss.normalize_L2(query_vec)
query_tokens = ["refund", "days"]
# Dense retrieval
D, I = index.search(query_vec, k=5)
dense_candidates = [chunks[i] for i in I[0] if i != -1]
# Metadata filter (tenant-aware)
dense_candidates = [c for c in dense_candidates if c["tenant"] == "a"]
# Sparse scores for hybrid fusion
bm25_scores = bm25.get_scores(query_tokens)
score_map = {chunks[i]["id"]: float(bm25_scores[i]) for i in range(len(chunks))}
# Simple hybrid rerank: dense rank score + sparse score
reranked = []
for rank, c in enumerate(dense_candidates):
dense_score = 1.0 / (rank + 1)
sparse_score = score_map.get(c["id"], 0.0)
hybrid_score = 0.7 * dense_score + 0.3 * sparse_score
reranked.append((hybrid_score, c))
reranked.sort(key=lambda x: x[0], reverse=True)
top_context = [x[1] for x in reranked[:3]]
print("Top context:")
for c in top_context:
print(c["id"], c["text"])
This is a minimal skeleton. In production, add robust embedding generation, persistent storage, ACL enforcement, and evaluation harness.
Common Mistakes
- using vector-only retrieval for all query types
- no metadata filtering or weak ACL enforcement
- tuning ANN for speed only and ignoring recall
- no reranking for noisy corpora
- no index freshness and deletion policy
Key Takeaways
- Vector databases are retrieval engines, not simple storage.
- ANN index configuration and metadata filtering drive production quality.
- Hybrid retrieval + reranking is often best default for enterprise RAG.
- Evaluate with both retrieval metrics and downstream answer outcomes.
- Operate vector indexes with explicit SLOs, refresh policy, and access controls.