RAG in the Real World

What RAG Actually Solves

Retrieval-Augmented Generation (RAG) lets models ground answers in your content. In practice, the quality depends far more on ingestion, chunking, and ranking than on the LLM itself. Below are the patterns that ship reliably.

Ingestion & Normalization

Sources

Docs, sites, wikis, tickets, DB exports, PDFs, slides, spreadsheets.

  • Normalize to clean HTML/Markdown or plaintext + structural hints.
  • Strip nav/boilerplate; keep titles, headings, tables, and captions.
  • Track source_id, url, version, updated_at.
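
For concreteness, here is a minimal sketch (Python) of the record a normalizer might emit; only source_id, url, version, and updated_at come from the list above, and the remaining fields are illustrative.

# Sketch of a normalized document record produced by ingestion.
# source_id, url, version, updated_at come from the checklist above;
# the other fields are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class NormalizedDoc:
    source_id: str           # stable ID in the source system
    url: str                 # canonical link back to the source
    version: str             # document revision or export version
    updated_at: str          # ISO-8601 timestamp of the last change
    title: str = ""          # kept from the original page
    body_markdown: str = ""  # boilerplate-stripped Markdown or plaintext
    headings: list[str] = field(default_factory=list)  # structural hints

doc = NormalizedDoc(
    source_id="kb-1042",
    url="https://wiki.example.com/policies/refunds",
    version="v7",
    updated_at="2024-03-01T12:00:00Z",
    title="Refund Policy",
)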

Chunking

Coherent, overlapping units improve recall without fragmenting meaning.

  • Semantic, heading-aware chunks (not just fixed 512-token windows).
  • Overlap 10–20% to preserve context across boundaries.
  • Attach breadcrumbs (doc title → section → sub-section).
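
A minimal sketch of the heading-aware chunking described above, with roughly 15% overlap and breadcrumbs; the Markdown-heading split and word-count windowing stand in for whatever parser and tokenizer you actually use.

# Sketch: heading-aware chunking with overlap and breadcrumbs.
# Splits on Markdown headings, then windows long sections with ~15% overlap.
import re

def chunk_markdown(doc_title: str, text: str, max_words: int = 300, overlap: float = 0.15):
    chunks = []
    # Split into sections at headings; text before the first heading falls
    # under the document title itself.
    sections = re.split(r"\n(?=#{1,6} )", text)
    for section in sections:
        lines = section.splitlines()
        has_heading = bool(lines) and lines[0].startswith("#")
        heading = lines[0].lstrip("# ").strip() if has_heading else doc_title
        words = " ".join(lines[1:] if has_heading else lines).split()
        step = max(1, int(max_words * (1 - overlap)))  # window stride
        for start in range(0, max(len(words), 1), step):
            window = words[start:start + max_words]
            if not window:
                break
            chunks.append({
                "breadcrumb": f"{doc_title} → {heading}",
                "text": " ".join(window),
            })
    return chunks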

Metadata

Filtering beats over-retrieving.

  • tags, product, version, locale, audience, confidentiality
  • “is_policy”, “is_faq”, “is_release_notes”, etc.
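
For illustration, a small sketch of deriving those flags at ingestion time and filtering on them before retrieval; the rules and field values are placeholders.

# Sketch: derive boolean flags at ingestion, then filter before retrieval.
def derive_flags(path: str, tags: list[str]) -> dict:
    return {
        "is_policy": "policy" in path or "policy" in tags,
        "is_faq": "faq" in path or "faq" in tags,
        "is_release_notes": "release-notes" in path,
    }

def matches(meta: dict, filters: dict) -> bool:
    # Keep only chunks whose metadata satisfies every filter clause.
    return all(meta.get(k) == v for k, v in filters.items())

chunks = [
    {"text": "...", "meta": {"locale": "en", **derive_flags("/policies/refunds", ["policy"])}},
    {"text": "...", "meta": {"locale": "de", **derive_flags("/faq/shipping", ["faq"])}},
]
candidates = [c for c in chunks if matches(c["meta"], {"locale": "en", "is_policy": True})]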

Embeddings & Stores

Pick an embedding model with strong multilingual capability if you need it. Store vectors alongside the raw text and metadata. Most real systems keep both BM25 keyword and vector indexes and use hybrid retrieval.

# Pseudo-config: hybrid retrieval defaults
retrieval:
  top_k_vector: 20
  top_k_keyword: 30
  fuse: reciprocal_rank      # or weighted union
  filters:
    - product: { in: ["alpha", "beta"] }
    - locale: "en"
  rerank:
    model: "cross-encoder-mini"
    keep_top_k: 8
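
A minimal sketch of the reciprocal_rank fusion the config refers to; k = 60 is the commonly used constant from the original RRF formulation, and keep mirrors keep_top_k above.

# Sketch: reciprocal-rank fusion of BM25 and vector result lists.
# Each input is a ranked list of passage IDs, best first.
def rrf(keyword_hits: list[str], vector_hits: list[str], k: int = 60, keep: int = 8) -> list[str]:
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, pid in enumerate(hits, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:keep]

fused = rrf(["p3", "p9", "p1"], ["p3", "p7", "p1"])  # p3 ranks first: near the top of both lists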

Latency Budget (Keep UX Snappy)

Decide your target round-trip first, then budget each stage.

Target p95: 1200 ms

• Query pre-processing / rewrite      60–120 ms
• Dual retrieval (BM25 + vector)     120–250 ms
• Light rerank (cross-encoder)       120–220 ms
• LLM generation (8–16k context)     450–700 ms
• Post-processing (citations)         40–80 ms
---------------------------------------------
Total (typical)                      900–1,300 ms

If you're over budget:
  • Shorten the prompt context (only the top 6–8 passages).
  • Use smaller, faster models for intermediate steps (rewrite, rerank); reserve the larger model for the final generation only if needed.
  • Enable streaming so users see tokens quickly.

Quality Patterns

Query Rewrite

Expand acronyms, add synonyms, or clarify intent before hitting the index.
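
Even a lookup table goes a long way before you reach for an LLM rewriter; a minimal sketch with placeholder acronym and synonym entries.

# Sketch: expand acronyms and append synonyms before hitting the index.
ACRONYMS = {"sso": "single sign-on", "rbac": "role-based access control"}  # placeholders
SYNONYMS = {"login": ["sign-in", "authentication"]}                         # placeholders

def rewrite(query: str) -> str:
    terms = []
    for token in query.lower().split():
        terms.append(ACRONYMS.get(token, token))
        terms.extend(SYNONYMS.get(token, []))
    return " ".join(terms)

rewrite("SSO login error")  # -> "single sign-on login sign-in authentication error"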

Hybrid Retrieval

Combine vector semantic matches with BM25 keyword hits; fuse results.

Reranking

Cross-encoders (or LLM rankers) boost relevance → higher precision with fewer chunks in the prompt.
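
With the sentence-transformers library this is a few lines; the checkpoint below is a public example model standing in for whatever the earlier config calls cross-encoder-mini.

# Sketch: cross-encoder reranking of fused candidates (assumes sentence-transformers).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

def rerank(query: str, passages: list[str], keep_top_k: int = 8) -> list[str]:
    scores = reranker.predict([(query, p) for p in passages])  # one score per (query, passage) pair
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:keep_top_k]]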

Answer Planning

Ask the model to outline answer sections before writing → fewer omissions.
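
A hedged sketch of the plan-then-write pattern; complete() is a stand-in for whichever LLM client you use, not a real API.

# Sketch: plan-then-write prompting. complete() is a hypothetical LLM call.
def answer_with_plan(question: str, passages: list[str], complete) -> str:
    context = "\n\n".join(passages)
    outline = complete(
        "List the sections a complete answer to this question should cover, "
        f"based only on the context.\n\nQuestion: {question}\n\nContext:\n{context}"
    )
    return complete(
        "Write the answer, covering every section in the outline and citing the context.\n\n"
        f"Question: {question}\n\nOutline:\n{outline}\n\nContext:\n{context}"
    )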

Citations

Return sources with titles + URLs + line numbers when possible.

Fallbacks

No good hits? Ask for clarification or route to a human with a suggested reply.

Freshness & Sync

  • Event-driven updates (webhooks) for wikis/KBs; daily backfills for slower sources.
  • Use doc versioning + tombstones so stale chunks stop matching.
  • Prefer delta re-embeddings over full re-indexing.
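
A minimal sketch of delta re-embedding with tombstones: hash each chunk, re-embed only what changed, and tombstone chunks that disappeared. embed() and the index interface are placeholders.

# Sketch: delta re-embedding with tombstones. embed() and index are placeholders.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync(new_chunks: dict[str, str], old_hashes: dict[str, str], embed, index) -> dict[str, str]:
    """new_chunks maps chunk_id -> text; old_hashes maps chunk_id -> previous hash."""
    fresh_hashes = {}
    for cid, text in new_chunks.items():
        h = content_hash(text)
        fresh_hashes[cid] = h
        if old_hashes.get(cid) != h:      # new or changed -> re-embed just this chunk
            index.upsert(cid, embed(text), text)
    for cid in old_hashes.keys() - new_chunks.keys():
        index.tombstone(cid)              # deleted upstream -> stop matching
    return fresh_hashes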

Caching That Actually Helps

  • Passage cache: memoize retrieval for popular queries (with TTL + version keys).
  • Rerank cache: cache top-k results keyed by (query, filters, index_version).
  • Output cache: store final answer when inputs are identical and still fresh.
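
A minimal sketch of the passage cache: the key covers (query, filters, index_version), so a re-index invalidates old entries automatically, and a TTL bounds staleness. The retrieve callable is assumed to be your hybrid retrieval.

# Sketch: passage cache keyed by (query, filters, index_version) with a TTL.
import hashlib, json, time

CACHE: dict[str, tuple[float, list[str]]] = {}
TTL_SECONDS = 15 * 60

def cache_key(query: str, filters: dict, index_version: str) -> str:
    raw = json.dumps({"q": query, "f": filters, "v": index_version}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_retrieve(query: str, filters: dict, index_version: str, retrieve) -> list[str]:
    key = cache_key(query, filters, index_version)
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                            # fresh cache hit
    passages = retrieve(query, filters)          # the real hybrid retrieval call
    CACHE[key] = (time.time(), passages)
    return passages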

Safety & Governance

  • PII/PHI filters and policy allow/deny lists before the final answer.
  • Respect ACLs: filter retrieval by user/project/role.
  • Red-team prompts; log citations and confidence for review.
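
One way to respect ACLs is to fold them into mandatory retrieval filters rather than relying on the prompt; a minimal sketch with an illustrative user/clearance model.

# Sketch: fold the caller's ACLs into the retrieval filters (illustrative user model).
def with_acl(filters: dict, user: dict) -> dict:
    return {
        **filters,
        "project": {"in": user["projects"]},           # only projects the user can see
        "confidentiality": {"in": user["clearance"]},  # only allowed sensitivity levels
    }

filters = with_acl({"locale": "en"}, {"projects": ["alpha"], "clearance": ["public", "internal"]})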

Evals You Can Trust

  • Golden sets: 50–200 real queries with accepted answers.
  • Track: groundedness, faithfulness, completeness, helpfulness, latency, cost.
  • Regression gate: don’t deploy if scorecards drop beyond thresholds.

# Pseudo-config: regression-gate thresholds
scorecard:
  groundedness: >= 0.85
  faithfulness: >= 0.9
  completeness: >= 0.8
  p95_latency_ms: <= 1300
  cost_delta: <= +10%
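
A minimal sketch of the regression gate itself, checking a scorecard against the thresholds above (cost_delta is omitted here because it needs the previous release's numbers); how the scores are produced is up to your eval harness.

# Sketch: regression gate over the scorecard thresholds above.
THRESHOLDS = {
    "groundedness":   (">=", 0.85),
    "faithfulness":   (">=", 0.90),
    "completeness":   (">=", 0.80),
    "p95_latency_ms": ("<=", 1300),
}

def gate(scorecard: dict) -> list[str]:
    failures = []
    for metric, (op, limit) in THRESHOLDS.items():
        value = scorecard[metric]
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failures.append(f"{metric}: {value} violates {op} {limit}")
    return failures  # empty list means the deploy may proceed

print(gate({"groundedness": 0.88, "faithfulness": 0.91, "completeness": 0.79, "p95_latency_ms": 1240}))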

Common Anti-Patterns

  • Indexing whole PDFs w/ fixed 512-token chunks (no structure, poor recall).
  • Retrieving 30+ passages into the prompt (latency ↑, quality ↓).
  • No metadata filters → irrelevant hits drown out the relevant ones.
  • No citations → users can’t trust or audit the answer.
  • Never re-evaluating after content updates.

Need help tuning your RAG?

I design ingestion pipelines, hybrid retrieval, reranking, evals, and dashboards—so answers are fast, grounded, and trustworthy.

Design a custom GPT →  Train your team →