RAG Architecture: Retrieval-Augmented Generation
RAG is one of the most practical ways to make LLM answers more factual, auditable, and domain-aware. Instead of relying only on model memory, RAG injects retrieved evidence at inference time.
Many teams treat RAG as “add embeddings + query a vector DB.” That is not enough. High-quality RAG is a full system with data contracts, a retrieval strategy, a grounding policy, evaluation, and operations.
When RAG Is the Right Pattern
RAG is especially useful when:
- knowledge changes frequently
- answers must cite current internal sources
- hallucination risk must be reduced
- model fine-tuning is expensive or slow
- multiple tenants/data boundaries exist
If your problem is pure creative generation with no external knowledge dependency, RAG may add unnecessary complexity.
End-to-End RAG Pipeline
A production pipeline typically includes:
- content ingestion
- normalization and cleaning
- chunking and metadata extraction
- embedding generation
- indexing
- retrieval
- reranking
- prompt assembly with context
- grounded generation
- output validation and logging
A weakness in any layer can dominate final answer quality.
Ingestion and Data Contracts
RAG quality starts before embeddings. Define ingestion contracts:
- source type (docs, wiki, tickets, PDFs, code)
- update frequency
- access control tags
- deprecation and deletion behavior
- schema for metadata fields
If stale or unauthorized documents enter the index, downstream quality and compliance fail together.
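A contract like this can be enforced as a small validation step at ingestion time. The following is a minimal sketch; the field names and allowed source types are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

ALLOWED_SOURCE_TYPES = {"doc", "wiki", "ticket", "pdf", "code"}

@dataclass
class IngestionRecord:
    doc_id: str
    source_type: str          # must be one of ALLOWED_SOURCE_TYPES
    acl_tags: list            # access-control tags checked at retrieval time
    updated_at: datetime      # drives freshness checks and re-indexing
    deprecated: bool = False  # deprecated docs must never reach the index
    metadata: dict = field(default_factory=dict)

def validate(record: IngestionRecord) -> list:
    """Return a list of contract violations; an empty list means accepted."""
    errors = []
    if record.source_type not in ALLOWED_SOURCE_TYPES:
        errors.append(f"unknown source_type: {record.source_type}")
    if not record.acl_tags:
        errors.append("missing acl_tags: document would be unsafe to serve")
    if record.deprecated:
        errors.append("deprecated document must not enter the index")
    return errors
```

Rejecting at ingestion time is the cheapest place to stop stale or unauthorized content; once it is embedded and indexed, removal is much more expensive.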
Chunking Strategy: One of the Highest-Leverage Decisions
Chunking determines retrieval granularity.
Too small:
- high lexical recall but low semantic completeness
- model sees fragments without enough context
Too large:
- noisy retrieval
- irrelevant tokens consume context window
Practical chunking rules:
- preserve semantic boundaries (heading/section/paragraph)
- keep moderate overlap where cross-boundary meaning matters
- include source metadata with each chunk
- version chunking policy because it impacts retrieval metrics
For manuals and policy documents, section-aware chunking usually beats fixed-size split.
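Section-aware splitting can be sketched in a few lines. This version assumes markdown-style `#` headings as section boundaries and falls back to fixed-size splits only inside oversized sections; adapt the boundary regex to your corpus.

```python
import re

def section_chunks(doc_id: str, text: str, max_chars: int = 500) -> list:
    """Split on heading boundaries first; fixed-size split only inside
    sections that exceed max_chars. Heading metadata travels with each chunk."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        heading = sec.splitlines()[0]
        if len(sec) <= max_chars:
            parts = [sec]
        else:
            parts = [sec[i:i + max_chars] for i in range(0, len(sec), max_chars)]
        for part in parts:
            chunks.append({
                "chunk_id": f"{doc_id}-{len(chunks)}",
                "heading": heading,   # source metadata kept with the chunk
                "text": part,
            })
    return chunks
```

Keeping the heading on every chunk pays off twice: it improves retrieval matching and gives the model local context even when a section was split.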
Embeddings and Index Design
Embedding model choice should match the domain language. General-purpose embeddings can miss domain-specific terminology.
Index design considerations:
- ANN index type and recall-latency trade-off
- metadata filters (tenant, date, document type)
- incremental index updates
- handling hard deletions and tombstones
A fast index with poor recall is still a low-quality RAG system.
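The filter, upsert, and deletion mechanics can be shown with a tiny in-memory index. A brute-force word-overlap scorer stands in for the ANN structure here; real systems swap the scoring loop for HNSW/IVF, but the metadata-filter and tombstone behavior is the same.

```python
class MiniIndex:
    """Brute-force stand-in for an ANN index, for illustration only."""

    def __init__(self):
        self.rows = {}           # chunk_id -> {"text": ..., "meta": ...}
        self.tombstones = set()  # soft deletes, compacted later

    def upsert(self, chunk_id, text, meta):
        self.rows[chunk_id] = {"text": text, "meta": meta}
        self.tombstones.discard(chunk_id)  # re-adding revives a deleted id

    def delete(self, chunk_id):
        self.tombstones.add(chunk_id)      # hard removal deferred to compaction

    def search(self, query_terms, meta_filter, top_k=3):
        terms = {t.lower() for t in query_terms}
        hits = []
        for cid, row in self.rows.items():
            if cid in self.tombstones:
                continue
            # Metadata filter (tenant, date, document type) applied pre-scoring.
            if any(row["meta"].get(k) != v for k, v in meta_filter.items()):
                continue
            score = len(terms & set(row["text"].lower().split()))
            hits.append((score, cid))
        hits.sort(reverse=True)
        return [cid for _, cid in hits[:top_k]]
```

Tombstones matter operationally: deleted documents must stop appearing in results immediately, even if the physical index rebuild happens later.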
Retrieval Strategy: Dense, Sparse, or Hybrid
Dense Retrieval
Captures semantic similarity. Works well for paraphrases and conceptual matches.
Sparse Retrieval (BM25)
Strong for exact terms, IDs, rare keywords.
Hybrid Retrieval
Combines both and often improves robustness. Especially useful for enterprise content with jargon and structured identifiers.
In practice, hybrid + reranking is often the best baseline.
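One common way to combine dense and sparse results is reciprocal rank fusion (RRF), which merges rankings without needing comparable scores. A minimal sketch, assuming each retriever returns an ordered list of document ids:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked id lists by summed reciprocal rank.
    k=60 is the constant from the original RRF paper; it damps
    the impact of small rank differences near the top."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# The two retrievers disagree; the doc both of them rank highly wins.
bm25  = ["d3", "d1", "d2"]
dense = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([bm25, dense])
```

RRF is attractive as a baseline because it needs no score calibration between BM25 and cosine similarity, which live on different scales.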
Reranking for Precision
Initial retrieval may return 50-200 candidates. Rerankers (cross-encoders or other high-precision scorers) narrow them to the final context.
Benefits:
- better context relevance
- reduced token waste
- improved answer grounding
Reranking adds latency, so measure quality gain versus budget.
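The narrowing step itself is simple; the expensive part is the pairwise scorer. In this sketch a toy word-overlap function stands in for a cross-encoder so the control flow is visible:

```python
def rerank(query, candidates, scorer, top_n=8):
    """Score each (query, text) pair with a precise-but-slow scorer and
    keep only the top_n; `scorer` would be a cross-encoder in production."""
    scored = [(scorer(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]

def toy_overlap_scorer(query, text):
    # Placeholder for a cross-encoder: fraction of query words in the chunk.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)
```

Because the scorer runs once per candidate, latency grows linearly with candidate count, which is why the first-stage retriever's cut-off directly trades quality against budget.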
Prompt Grounding Policy
Prompt policy should explicitly constrain behavior:
- answer from provided context only
- cite sources used
- abstain when evidence is insufficient
- separate answer from speculation
Without explicit grounding instructions, the model may mix retrieved evidence with its parametric priors.
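The policy can also be enforced after generation by checking that every citation in the answer points at a retrieved chunk. A minimal validator sketch; the `[chunk_id]` citation format and the abstention phrase are assumptions matching the prompt policy above:

```python
import re

def validate_citations(answer: str, allowed_chunk_ids: set) -> dict:
    """Flag citations that reference no retrieved chunk, and answers
    that cite nothing at all despite not abstaining."""
    cited = set(re.findall(r"\[([\w\-]+)\]", answer))
    unknown = cited - allowed_chunk_ids
    abstained = "enough evidence" in answer.lower()
    return {
        "cited": cited,
        "unknown_citations": unknown,
        "grounded": bool(cited) and not unknown,
        "abstained": abstained,
    }
```

Answers failing this check can be blocked, regenerated, or escalated rather than shown to the user.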
Context Window Management
Context is a scarce budget. Key controls:
- max chunks per answer
- metadata-prioritized ordering
- duplicate/similar chunk collapse
- query rewriting for better retrieval focus
A common failure is overstuffing context and degrading model attention.
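The budget controls above can be combined into one greedy packing pass: collapse near-duplicate chunks, then stop at the token budget. A sketch using Jaccard word overlap as a cheap similarity proxy and whitespace word count as a stand-in for real tokenization:

```python
def pack_context(chunks, max_tokens=300, dedup_threshold=0.8):
    """Greedily keep chunks, skipping any chunk whose word-level Jaccard
    similarity to an already-kept chunk exceeds dedup_threshold, and
    stopping once the (approximate) token budget is exhausted."""
    kept, used = [], 0
    for text in chunks:
        words = set(text.lower().split())
        duplicate = any(
            len(words & set(k.lower().split()))
            / max(len(words | set(k.lower().split())), 1) >= dedup_threshold
            for k in kept
        )
        if duplicate:
            continue
        cost = len(text.split())   # word count as a token-count proxy
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept
```

In production the same shape holds, but the cost function uses the model's real tokenizer and the similarity check typically uses embeddings or MinHash.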
Access Control and Multi-Tenant Safety
RAG must enforce retrieval-time authorization.
Required controls:
- tenant-aware filtering before retrieval output
- document-level ACL metadata
- audit logs of retrieved sources
- strict isolation tests
A single unauthorized retrieval event can be a severe security incident.
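The key structural point is that the ACL filter runs before scoring, so unauthorized text never enters the candidate list, and that every served source is logged. A sketch with an illustrative row schema (`acl_tags`, `doc_id`, `text` are assumptions) and a pluggable search function:

```python
import logging

logger = logging.getLogger("rag.audit")

def authorized_retrieve(query, index_rows, user_tags, search_fn, top_k=5):
    """Filter to documents the user may see BEFORE scoring, then log
    each retrieved source for audit."""
    visible = [row for row in index_rows
               if set(row["acl_tags"]) & set(user_tags)]
    results = search_fn(query, visible)[:top_k]
    for row in results:  # audit trail of retrieved sources
        logger.info("retrieved %s for tags=%s", row["doc_id"], sorted(user_tags))
    return results

def naive_search(query, rows):
    # Toy stand-in for the real retriever: rank by query-word overlap.
    q = set(query.lower().split())
    return sorted(rows, key=lambda r: -len(q & set(r["text"].lower().split())))
```

Filtering after scoring is a common bug: a ranking step that sees unauthorized text can still leak it through logs, caches, or error paths.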
Evaluation: Separate Retrieval from Generation
Do not evaluate only the final answer score. Split evaluation by stage.
Retrieval evaluation:
- recall@k on gold evidence sets
- precision@k
- MRR/NDCG where relevant
Generation evaluation:
- factual correctness given retrieved context
- citation faithfulness
- abstention quality
- hallucination rate
End-to-end failure analysis becomes actionable only when metrics are layer-specific.
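The retrieval-side metrics are straightforward to compute against gold evidence sets. A minimal sketch of recall@k and MRR, assuming each query comes with a list of retrieved chunk ids and a set of gold ids:

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold evidence chunks found in the top-k results."""
    hits = set(retrieved_ids[:k]) & set(gold_ids)
    return len(hits) / len(gold_ids)

def mrr(queries):
    """Mean reciprocal rank over (retrieved_ids, gold_ids) pairs:
    1/rank of the first gold hit, 0 if no gold chunk was retrieved."""
    total = 0.0
    for retrieved_ids, gold_ids in queries:
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in gold_ids:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(queries)
```

Running these per retriever configuration (chunking policy, embedding model, hybrid weights) is what turns "answers got worse" into a diagnosable regression.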
Offline and Online Validation
Offline benchmarks are necessary but insufficient. Also test under production-like conditions:
- real user queries
- long-tail and ambiguous queries
- adversarial prompts
- noisy OCR or malformed documents
Track online KPIs:
- answer acceptance rate
- citation click-through
- human escalation rate
- time-to-resolution in support workflows
Common Failure Modes
- stale index due to weak ingestion sync
- retrieval returning topically related but decision-irrelevant chunks
- prompt not forcing abstain behavior
- no reranking in noisy corpora
- poor tenant filter enforcement
- no monitoring for retrieval drift over time
Most failures are system-design failures, not “LLM intelligence” failures.
Operational Playbook
For production RAG, define routines:
- daily index freshness check
- weekly retrieval quality sampling
- monthly prompt and chunking regression tests
- incident runbook for data leak or hallucination spike
- controlled rollout for embedding/model changes
RAG quality decays without active maintenance.
Example Architecture (Support Assistant)
For an internal support bot:
- ingestion from product docs + incident runbooks + policy wiki
- hybrid retrieval (BM25 + dense)
- reranking to top 8 chunks
- strict citation requirement
- abstain if no evidence above confidence threshold
- escalate to human support with retrieved context attached
This design usually outperforms “single prompt + base LLM” approaches on factual reliability.
End-to-End Code Example (Minimal RAG Pipeline in Python)
The following example shows a complete minimal flow:
- ingest documents
- chunk + metadata
- TF-IDF retrieval
- prompt assembly with citations
You can swap retrieval and generation components later (vector DB, reranker, hosted LLM).
import numpy as np
from dataclasses import dataclass
from typing import Dict, List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


@dataclass
class Chunk:
    chunk_id: str
    source: str
    text: str
    acl: str  # simple tenant/access marker


def chunk_docs(raw_docs: List[Dict], max_chars: int = 450) -> List[Chunk]:
    chunks = []
    for doc in raw_docs:
        content = doc["content"].strip()
        parts = [content[i:i + max_chars] for i in range(0, len(content), max_chars)]
        for idx, part in enumerate(parts):
            chunks.append(
                Chunk(
                    chunk_id=f'{doc["doc_id"]}-{idx}',
                    source=doc["source"],
                    text=part,
                    acl=doc["acl"],
                )
            )
    return chunks


def retrieve(query: str, chunks: List[Chunk], user_acl: str, top_k: int = 5) -> List[Chunk]:
    # Authorization filter runs before any scoring.
    visible = [c for c in chunks if c.acl == user_acl]
    if not visible:
        return []
    corpus = [c.text for c in visible]
    vec = TfidfVectorizer(stop_words="english")
    mat = vec.fit_transform(corpus)
    q = vec.transform([query])
    scores = cosine_similarity(q, mat).ravel()
    top_idx = np.argsort(scores)[::-1][:top_k]
    # Drop zero-score chunks so irrelevant text never reaches the prompt.
    return [visible[i] for i in top_idx if scores[i] > 0]


def build_prompt(query: str, retrieved: List[Chunk]) -> str:
    context_blocks = [f"[{c.chunk_id}] ({c.source}) {c.text}" for c in retrieved]
    context = "\n\n".join(context_blocks)
    return f"""
You are a grounded assistant.
Answer only from provided CONTEXT.
If insufficient evidence, say: "I don't have enough evidence in the provided documents."
Include citations as [chunk_id].

QUESTION:
{query}

CONTEXT:
{context}
""".strip()


# Placeholder: replace with your LLM provider call
def llm_generate(prompt: str) -> str:
    return "Example grounded answer with citation [doc-1-0]."


if __name__ == "__main__":
    raw_docs = [
        {
            "doc_id": "doc-1",
            "source": "refund_policy.md",
            "acl": "tenant_a",
            "content": "Refunds are allowed within 30 days for annual plans...",
        },
        {
            "doc_id": "doc-2",
            "source": "billing_faq.md",
            "acl": "tenant_a",
            "content": "Proration is applied when upgrading plans mid-cycle...",
        },
    ]
    chunks = chunk_docs(raw_docs)
    query = "Can I get a refund after 45 days on annual plan?"
    retrieved = retrieve(query, chunks, user_acl="tenant_a", top_k=4)
    prompt = build_prompt(query, retrieved)
    answer = llm_generate(prompt)
    print("Prompt:")
    print(prompt)
    print("\nAnswer:")
    print(answer)
Use this as a base skeleton, then upgrade in order: hybrid retrieval -> reranking -> evaluation harness -> monitoring.
Key Takeaways
- RAG is a system architecture, not a vector search feature.
- Chunking, retrieval, reranking, and grounding policy are the highest leverage quality levers.
- Secure, permission-aware retrieval is mandatory in enterprise settings.
- Evaluate retrieval and generation separately for actionable debugging.
- Treat RAG as a continuously operated platform, not a one-time integration.