RAG Architecture: Retrieval-Augmented Generation
RAG is one of the most practical ways to make LLM answers more factual, auditable, and domain-aware. Instead of relying only on model memory, RAG injects retrieved evidence at inference time.
Many teams treat RAG as “add embeddings + query a vector DB.” That is not enough. High-quality RAG is a full system with data contracts, a retrieval strategy, a grounding policy, evaluation, and operations.
When RAG Is the Right Pattern
RAG is especially useful when:
- knowledge changes frequently
- answers must cite current internal sources
- hallucination risk must be reduced
- model fine-tuning is expensive or slow
- multiple tenants/data boundaries exist
If your problem is pure creative generation with no external knowledge dependency, RAG may add unnecessary complexity.
End-to-End RAG Pipeline
A production pipeline typically includes:
- content ingestion
- normalization and cleaning
- chunking and metadata extraction
- embedding generation
- indexing
- retrieval
- reranking
- prompt assembly with context
- grounded generation
- output validation and logging
A weakness in any layer can dominate final answer quality.
Ingestion and Data Contracts
RAG quality starts before embeddings. Define ingestion contracts:
- source type (docs, wiki, tickets, PDFs, code)
- update frequency
- access control tags
- deprecation and deletion behavior
- schema for metadata fields
If stale or unauthorized documents enter the index, downstream quality and compliance fail together.
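A contract like this can be enforced as a small validation step at ingestion time. The following is a minimal sketch; the field names and allowed source types are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

ALLOWED_SOURCE_TYPES = {"doc", "wiki", "ticket", "pdf", "code"}

@dataclass
class IngestionRecord:
    doc_id: str
    source_type: str          # must be one of ALLOWED_SOURCE_TYPES
    acl_tags: list            # access-control tags checked at retrieval time
    updated_at: datetime      # drives freshness checks and re-indexing
    deprecated: bool = False  # deprecated docs must never reach the index
    metadata: dict = field(default_factory=dict)

def validate(record: IngestionRecord) -> list:
    """Return a list of contract violations; an empty list means accepted."""
    errors = []
    if record.source_type not in ALLOWED_SOURCE_TYPES:
        errors.append(f"unknown source_type: {record.source_type}")
    if not record.acl_tags:
        errors.append("missing acl_tags: document would be unsafe to serve")
    if record.deprecated:
        errors.append("deprecated document must not enter the index")
    return errors
```

Rejecting at ingestion time is the cheapest place to stop stale or unauthorized content; once it is embedded and indexed, removal is much more expensive.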
Chunking Strategy: One of the Highest-Leverage Decisions
Chunking determines retrieval granularity.
Too small:
- high lexical recall but low semantic completeness
- model sees fragments without enough context
Too large:
- noisy retrieval
- irrelevant tokens consume context window
Practical chunking rules:
- preserve semantic boundaries (heading/section/paragraph)
- keep moderate overlap where cross-boundary meaning matters
- include source metadata with each chunk
- version chunking policy because it impacts retrieval metrics
For manuals and policy documents, section-aware chunking usually beats fixed-size split.
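Section-aware splitting can be sketched in a few lines. This version assumes markdown-style `#` headings as section boundaries and falls back to fixed-size splits only inside oversized sections; adapt the boundary regex to your corpus.

```python
import re

def section_chunks(doc_id: str, text: str, max_chars: int = 500) -> list:
    """Split on heading boundaries first; fixed-size split only inside
    sections that exceed max_chars. Heading metadata travels with each chunk."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        heading = sec.splitlines()[0]
        if len(sec) <= max_chars:
            parts = [sec]
        else:
            parts = [sec[i:i + max_chars] for i in range(0, len(sec), max_chars)]
        for part in parts:
            chunks.append({
                "chunk_id": f"{doc_id}-{len(chunks)}",
                "heading": heading,   # source metadata kept with the chunk
                "text": part,
            })
    return chunks
```

Keeping the heading on every chunk pays off twice: it improves retrieval matching and gives the model local context even when a section was split.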
Embeddings and Index Design
Embedding model choice should match the domain language. General-purpose embeddings can miss domain-specific terminology.
Index design considerations:
- ANN index type and recall-latency trade-off
- metadata filters (tenant, date, document type)
- incremental index updates
- handling hard deletions and tombstones
A fast index with poor recall is still a low-quality RAG system.
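The filter, upsert, and deletion mechanics can be shown with a tiny in-memory index. A brute-force word-overlap scorer stands in for the ANN structure here; real systems swap the scoring loop for HNSW/IVF, but the metadata-filter and tombstone behavior is the same.

```python
class MiniIndex:
    """Brute-force stand-in for an ANN index, for illustration only."""

    def __init__(self):
        self.rows = {}           # chunk_id -> {"text": ..., "meta": ...}
        self.tombstones = set()  # soft deletes, compacted later

    def upsert(self, chunk_id, text, meta):
        self.rows[chunk_id] = {"text": text, "meta": meta}
        self.tombstones.discard(chunk_id)  # re-adding revives a deleted id

    def delete(self, chunk_id):
        self.tombstones.add(chunk_id)      # hard removal deferred to compaction

    def search(self, query_terms, meta_filter, top_k=3):
        terms = {t.lower() for t in query_terms}
        hits = []
        for cid, row in self.rows.items():
            if cid in self.tombstones:
                continue
            # Metadata filter (tenant, date, document type) applied pre-scoring.
            if any(row["meta"].get(k) != v for k, v in meta_filter.items()):
                continue
            score = len(terms & set(row["text"].lower().split()))
            hits.append((score, cid))
        hits.sort(reverse=True)
        return [cid for _, cid in hits[:top_k]]
```

Tombstones matter operationally: deleted documents must stop appearing in results immediately, even if the physical index rebuild happens later.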
Retrieval Strategy: Dense, Sparse, or Hybrid
Dense Retrieval
Captures semantic similarity. Works well for paraphrases and conceptual matches.
Sparse Retrieval (BM25)
Strong for exact terms, IDs, rare keywords.
Hybrid Retrieval
Combines both and often improves robustness. Especially useful for enterprise content with jargon and structured identifiers.
In practice, hybrid + reranking is often the best baseline.
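One common way to combine dense and sparse results is reciprocal rank fusion (RRF), which merges rankings without needing comparable scores. A minimal sketch, assuming each retriever returns an ordered list of document ids:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked id lists by summed reciprocal rank.
    k=60 is the constant from the original RRF paper; it damps
    the impact of small rank differences near the top."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# The two retrievers disagree; the doc both of them rank highly wins.
bm25  = ["d3", "d1", "d2"]
dense = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([bm25, dense])
```

RRF is attractive as a baseline because it needs no score calibration between BM25 and cosine similarity, which live on different scales.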
Reranking for Precision
Initial retrieval may return 50-200 candidates. Rerankers (cross-encoders or other high-precision scorers) narrow them to the final context.
Benefits:
- better context relevance
- reduced token waste
- improved answer grounding
Reranking adds latency, so measure quality gain versus budget.
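The narrowing step itself is simple; the expensive part is the pairwise scorer. In this sketch a toy word-overlap function stands in for a cross-encoder so the control flow is visible:

```python
def rerank(query, candidates, scorer, top_n=8):
    """Score each (query, text) pair with a precise-but-slow scorer and
    keep only the top_n; `scorer` would be a cross-encoder in production."""
    scored = [(scorer(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]

def toy_overlap_scorer(query, text):
    # Placeholder for a cross-encoder: fraction of query words in the chunk.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)
```

Because the scorer runs once per candidate, latency grows linearly with candidate count, which is why the first-stage retriever's cut-off directly trades quality against budget.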
Prompt Grounding Policy
Prompt policy should explicitly constrain behavior:
- answer from provided context only
- cite sources used
- abstain when evidence is insufficient
- separate answer from speculation
Without explicit grounding instructions, the model may mix retrieved evidence with its parametric priors.
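The policy can also be enforced after generation by checking that every citation in the answer points at a retrieved chunk. A minimal validator sketch; the `[chunk_id]` citation format and the abstention phrase are assumptions matching the prompt policy above:

```python
import re

def validate_citations(answer: str, allowed_chunk_ids: set) -> dict:
    """Flag citations that reference no retrieved chunk, and answers
    that cite nothing at all despite not abstaining."""
    cited = set(re.findall(r"\[([\w\-]+)\]", answer))
    unknown = cited - allowed_chunk_ids
    abstained = "enough evidence" in answer.lower()
    return {
        "cited": cited,
        "unknown_citations": unknown,
        "grounded": bool(cited) and not unknown,
        "abstained": abstained,
    }
```

Answers failing this check can be blocked, regenerated, or escalated rather than shown to the user.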
Context Window Management
Context is a scarce budget. Key controls:
- max chunks per answer
- metadata-prioritized ordering
- duplicate/similar chunk collapse
- query rewriting for better retrieval focus
A common failure is overstuffing context and degrading model attention.
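The budget controls above can be combined into one greedy packing pass: collapse near-duplicate chunks, then stop at the token budget. A sketch using Jaccard word overlap as a cheap similarity proxy and whitespace word count as a stand-in for real tokenization:

```python
def pack_context(chunks, max_tokens=300, dedup_threshold=0.8):
    """Greedily keep chunks, skipping any chunk whose word-level Jaccard
    similarity to an already-kept chunk exceeds dedup_threshold, and
    stopping once the (approximate) token budget is exhausted."""
    kept, used = [], 0
    for text in chunks:
        words = set(text.lower().split())
        duplicate = any(
            len(words & set(k.lower().split()))
            / max(len(words | set(k.lower().split())), 1) >= dedup_threshold
            for k in kept
        )
        if duplicate:
            continue
        cost = len(text.split())   # word count as a token-count proxy
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept
```

In production the same shape holds, but the cost function uses the model's real tokenizer and the similarity check typically uses embeddings or MinHash.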
Access Control and Multi-Tenant Safety
RAG must enforce retrieval-time authorization.
Required controls:
- tenant-aware filtering before retrieval output
- document-level ACL metadata
- audit logs of retrieved sources
- strict isolation tests
A single unauthorized retrieval event can be a severe security incident.
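The key structural point is that the ACL filter runs before scoring, so unauthorized text never enters the candidate list, and that every served source is logged. A sketch with an illustrative row schema (`acl_tags`, `doc_id`, `text` are assumptions) and a pluggable search function:

```python
import logging

logger = logging.getLogger("rag.audit")

def authorized_retrieve(query, index_rows, user_tags, search_fn, top_k=5):
    """Filter to documents the user may see BEFORE scoring, then log
    each retrieved source for audit."""
    visible = [row for row in index_rows
               if set(row["acl_tags"]) & set(user_tags)]
    results = search_fn(query, visible)[:top_k]
    for row in results:  # audit trail of retrieved sources
        logger.info("retrieved %s for tags=%s", row["doc_id"], sorted(user_tags))
    return results

def naive_search(query, rows):
    # Toy stand-in for the real retriever: rank by query-word overlap.
    q = set(query.lower().split())
    return sorted(rows, key=lambda r: -len(q & set(r["text"].lower().split())))
```

Filtering after scoring is a common bug: a ranking step that sees unauthorized text can still leak it through logs, caches, or error paths.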
Evaluation: Separate Retrieval from Generation
Do not evaluate only the final answer score. Split evaluation by stage.
Retrieval evaluation:
- recall@k on gold evidence sets
- precision@k
- MRR/NDCG where relevant
Generation evaluation:
- factual correctness given retrieved context
- citation faithfulness
- abstention quality
- hallucination rate
End-to-end failure analysis becomes actionable only when metrics are layer-specific.
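The retrieval-side metrics are straightforward to compute against gold evidence sets. A minimal sketch of recall@k and MRR, assuming each query comes with a list of retrieved chunk ids and a set of gold ids:

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold evidence chunks found in the top-k results."""
    hits = set(retrieved_ids[:k]) & set(gold_ids)
    return len(hits) / len(gold_ids)

def mrr(queries):
    """Mean reciprocal rank over (retrieved_ids, gold_ids) pairs:
    1/rank of the first gold hit, 0 if no gold chunk was retrieved."""
    total = 0.0
    for retrieved_ids, gold_ids in queries:
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in gold_ids:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(queries)
```

Running these per retriever configuration (chunking policy, embedding model, hybrid weights) is what turns "answers got worse" into a diagnosable regression.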
Offline and Online Validation
Offline benchmarks are necessary but insufficient. Also test under production-like conditions:
- real user queries
- long-tail and ambiguous queries
- adversarial prompts
- noisy OCR or malformed documents
Track online KPIs:
- answer acceptance rate
- citation click-through
- human escalation rate
- time-to-resolution in support workflows
Common Failure Modes
- stale index due to weak ingestion sync
- retrieval returning topically related but decision-irrelevant chunks
- prompt not forcing abstain behavior
- no reranking in noisy corpora
- poor tenant filter enforcement
- no monitoring for retrieval drift over time
Most failures are system-design failures, not “LLM intelligence” failures.
Operational Playbook
For production RAG, define routines:
- daily index freshness check
- weekly retrieval quality sampling
- monthly prompt and chunking regression tests
- incident runbook for data leak or hallucination spike
- controlled rollout for embedding/model changes
RAG quality decays without active maintenance.
Example Architecture (Support Assistant)
For an internal support bot:
- ingestion from product docs + incident runbooks + policy wiki
- hybrid retrieval (BM25 + dense)
- reranking to top 8 chunks
- strict citation requirement
- abstain if no evidence above confidence threshold
- escalate to human support with retrieved context attached
This design usually outperforms “single prompt + base LLM” approaches on factual reliability.
End-to-End Code Example (Minimal RAG Pipeline in Python)
The following example shows a complete minimal flow:
- ingest documents
- chunk + metadata
- TF-IDF retrieval
- prompt assembly with citations
You can swap retrieval and generation components later (vector DB, reranker, hosted LLM).
import numpy as np
from dataclasses import dataclass
from typing import Dict, List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


@dataclass
class Chunk:
    chunk_id: str
    source: str
    text: str
    acl: str  # simple tenant/access marker


def chunk_docs(raw_docs: List[Dict], max_chars: int = 450) -> List[Chunk]:
    chunks = []
    for doc in raw_docs:
        content = doc["content"].strip()
        parts = [content[i:i + max_chars] for i in range(0, len(content), max_chars)]
        for idx, part in enumerate(parts):
            chunks.append(
                Chunk(
                    chunk_id=f'{doc["doc_id"]}-{idx}',
                    source=doc["source"],
                    text=part,
                    acl=doc["acl"],
                )
            )
    return chunks


def retrieve(query: str, chunks: List[Chunk], user_acl: str, top_k: int = 5) -> List[Chunk]:
    # Authorization filter runs before any scoring.
    visible = [c for c in chunks if c.acl == user_acl]
    if not visible:
        return []
    corpus = [c.text for c in visible]
    vec = TfidfVectorizer(stop_words="english")
    mat = vec.fit_transform(corpus)
    q = vec.transform([query])
    scores = cosine_similarity(q, mat).ravel()
    top_idx = np.argsort(scores)[::-1][:top_k]
    # Drop zero-score chunks so irrelevant text never reaches the prompt.
    return [visible[i] for i in top_idx if scores[i] > 0]


def build_prompt(query: str, retrieved: List[Chunk]) -> str:
    context_blocks = [f"[{c.chunk_id}] ({c.source}) {c.text}" for c in retrieved]
    context = "\n\n".join(context_blocks)
    return f"""
You are a grounded assistant.
Answer only from provided CONTEXT.
If insufficient evidence, say: "I don't have enough evidence in the provided documents."
Include citations as [chunk_id].

QUESTION:
{query}

CONTEXT:
{context}
""".strip()


# Placeholder: replace with your LLM provider call
def llm_generate(prompt: str) -> str:
    return "Example grounded answer with citation [doc-1-0]."


if __name__ == "__main__":
    raw_docs = [
        {
            "doc_id": "doc-1",
            "source": "refund_policy.md",
            "acl": "tenant_a",
            "content": "Refunds are allowed within 30 days for annual plans...",
        },
        {
            "doc_id": "doc-2",
            "source": "billing_faq.md",
            "acl": "tenant_a",
            "content": "Proration is applied when upgrading plans mid-cycle...",
        },
    ]
    chunks = chunk_docs(raw_docs)
    query = "Can I get a refund after 45 days on annual plan?"
    retrieved = retrieve(query, chunks, user_acl="tenant_a", top_k=4)
    prompt = build_prompt(query, retrieved)
    answer = llm_generate(prompt)
    print("Prompt:")
    print(prompt)
    print("\nAnswer:")
    print(answer)
Use this as a base skeleton, then upgrade in order: hybrid retrieval -> reranking -> evaluation harness -> monitoring.
Key Takeaways
- RAG is a system architecture, not a vector search feature.
- Chunking, retrieval, reranking, and grounding policy are the highest leverage quality levers.
- Secure, permission-aware retrieval is mandatory in enterprise settings.
- Evaluate retrieval and generation separately for actionable debugging.
- Treat RAG as a continuously operated platform, not a one-time integration.