Retrieval-Augmented Generation: An Evidence Review of Archit

Scope and Motivation

RAG works in demos. Nobody disputes that anymore. The harder question: does it hold together when the stakes are real, when retrieved context is noisy, when evaluation criteria fragment across three different teams with three different dashboards, and when nobody has agreed on what “production-ready” actually means for a system that can hallucinate with citations attached?

In my experience building retrieval pipelines, the failure mode is almost never “the model cannot generate an answer.” It is subtler. Worse, really. The model generates a confident answer from the wrong retrieved context, and nobody catches it until a user complains. Sometimes not even then. The papers reviewed here expose exactly where that gap originates; the answer is not a single point of failure but a constellation of them: retrieval noise sensitivity, evaluation fragmentation, domain-specific safety requirements, and the open question of whether retrieval alone suffices or must be fused with fine-tuning.

One question drives this review: If you are building a RAG system for production use, what should your retrieval pipeline look like, how should you evaluate it, where will it likely fail, and what does the current evidence actually support?

Critical caveat: RAG reduces certain categories of hallucination by grounding generation in retrieved evidence, but it does not eliminate them. A RAG system can still return wrong, incomplete, or misleading answers when retrieval misses the best evidence, when retrieved documents contradict each other, or when the generator overstates confidence in thin context. Citation presence in a generated response does not guarantee truthfulness: the cited output may misrepresent its sources when retrieved context is ambiguous or sparse.

Method

Each paper was read as an engineering input, not a theoretical endpoint. Four dimensions were extracted from every source: what the authors assert (core claim), the underlying technical architecture or algorithmic change (supporting mechanism), how robust the evaluation framework and datasets are (evidence quality), and what this means for production system architecture (implementation implication).

Papers were then compared along shared axes: retrieval method, augmentation strategy, evaluation approach, and deployment readiness. Contradictions received special attention. Why? Because when a survey recommends one approach and an empirical paper demonstrates its failure under controlled conditions, that disagreement is more informative than either paper alone.

Working Definitions

Retrieval-Augmented Generation (RAG): A system architecture that supplements the parametric knowledge of a language model with information retrieved from an external knowledge base at inference time, with the aim of reducing hallucination and improving factual accuracy.
Dense Retrieval: A retrieval method that uses neural network encoders (e.g., embedding models) to map queries and documents into a shared vector space, enabling semantic similarity matching beyond literal keyword overlap.
Sparse Retrieval: A retrieval method based on term frequency statistics (e.g., BM25), which matches documents to queries through exact lexical overlap rather than semantic similarity.
Distracting Document: A retrieved document that is semantically similar to the query (often scoring highly in vector space) but does not contain the correct answer. Empirically shown to degrade LLM accuracy more than completely random noise.
Generative Information Retrieval (GenIR): An emerging IR paradigm where models directly generate document identifiers or user-centric responses from internal parameters (e.g., Differentiable Search Indices) rather than searching an external, discrete index.
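
To make the sparse retrieval definition concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. The parameter values (k1=1.5, b=0.75) are common defaults rather than anything prescribed by the reviewed papers, and the three toy documents and the function name are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # only exact lexical matches contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenised by whitespace.
docs = [
    "insulin dosage for type 2 diabetes".split(),
    "blood sugar management overview".split(),
    "insulin pump maintenance guide".split(),
]
scores = bm25_scores("insulin dosage".split(), docs)
```

The second document scores zero because it shares no terms with the query, even though it is topically related. That is the lexical gap dense retrieval is meant to cover.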

What Each Paper Contributes in Practice

Kimothi (2025): The Architectural Primer

Kimothi’s practitioner guide splits RAG into two distinct workloads: an offline indexing pipeline (source connection, extraction, chunking, embedding, storage) and a real-time generation pipeline (query processing, retrieval, augmentation, LLM response) [1]. Clean separation. The two-pipeline model is pedagogically effective and maps directly to how software engineering teams actually organise themselves.
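
As an illustration of one step in the offline indexing pipeline, the sketch below splits a document into overlapping word windows. The window size, the overlap, and the function name are illustrative assumptions, not values taken from Kimothi's guide; real systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Hypothetical 100-word document: w0 w1 ... w99
document = " ".join(f"w{i}" for i in range(100))
chunks = chunk_words(document)
```

Each chunk would then be embedded and stored by the indexing pipeline; the generation pipeline never needs to know how the chunks were produced.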

The book introduces a useful RAG maturity progression:

Naïve RAG → Advanced RAG → Modular RAG

This progression helps engineering teams calibrate architectural investments against measured evaluation outcomes rather than over-engineering prematurely.

Study limitations: This is a practitioner guide, not peer-reviewed empirical research. No experiments. No benchmarks. No measured datasets support the recommendations. The production deployment discussion is conceptual, not validated against measured outcomes, and failure modes, adversarial retrieval, and noise sensitivity go entirely unaddressed.

For engineering teams: treat this as orientation material. The architectural vocabulary is useful; the deployment advice is not.

Zhao et al. (2026): The Augmentation Taxonomy

Zhao et al. deliver a comprehensive RAG survey covering text, code, audio, images, video, 3D, and scientific applications [2]. Their key taxonomic contribution is a four-paradigm classification of how retrieved results interact with the generator:

Augmentation Paradigm	Mechanism	Typical Use Case
Input Augmentation	Retrieved content is prepended/appended to generator text input	Standard question answering
Latent-Representation Fusion	Retrieved embeddings are merged at intermediate hidden layers	Cross-modal generation (text-to-image)
Logits-Level Augmentation	Retrieval scores directly influence output token probability distributions	kNN-LM-style approaches
Step-Skipping Augmentation	Retrieval results completely replace or bypass specific generation steps	Template-based deterministic generation

Table 1. Four augmentation paradigms from Zhao et al. (2026), classified by how retrieval results interact with the generator.

Study limitations: This is a taxonomic survey, not an experimental study. The breadth across modalities (text, code, audio, images, video, 3D) comes at the cost of depth: individual techniques receive brief treatment. Text-specific nuances such as distractor sensitivity are not explored. Some cited works are recent preprints with limited independent validation.

The taxonomy is most useful during architectural review: classifying the paradigm of your existing system helps identify whether alternative augmentation approaches are worth prototyping.
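
The logits-level paradigm is the least intuitive of the four, so a small sketch may help. In the kNN-LM style, the model's next-token distribution is interpolated with a distribution built from retrieved neighbours. The interpolation weight and the toy distributions below are hypothetical, and the sketch works on probabilities rather than raw logits.

```python
def interpolate(p_lm, p_knn, lam=0.25):
    """Mix two next-token distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        tok: lam * p_knn.get(tok, 0.0) + (1 - lam) * p_lm.get(tok, 0.0)
        for tok in vocab
    }

# Hypothetical distributions over three candidate next tokens.
p_lm = {"Paris": 0.5, "Lyon": 0.3, "Rome": 0.2}   # parametric model
p_knn = {"Paris": 0.9, "Rome": 0.1}               # derived from retrieved neighbours
mixed = interpolate(p_lm, p_knn, lam=0.25)
```

Contrast this with input augmentation, where retrieval changes the prompt and the output distribution is left entirely to the generator.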

Huang and Huang (2026): The IR-Centric Pipeline Guide

Published in ACM Computing Surveys, Huang and Huang organise RAG into four processing phases from an information retrieval perspective [3]. This phase decomposition is highly actionable because it maps directly to discrete microservices or pipeline components:

Pre-retrieval: Query expansion, hypothetical document embeddings (HyDE), reformulation, and index routing.
Retrieval: Execution of sparse (BM25), dense (DPR, Contriever), or hybrid methods.
Post-retrieval: Re-ranking (via Cross-Encoders), metadata filtering, and context compression/summarization.
Generation: Prompt construction, iterative generation, and output verification/guardrailing.

Key Finding: Hybrid retrieval (sparse + dense) coupled with a re-ranking step consistently outperforms either method alone across most public benchmarks. The cost and accuracy implications for pipeline design are substantial.

Study limitations: This is a survey paper, not a primary experiment. The hybrid retrieval superiority claim is synthesised from others’ reported benchmarks, not independently replicated. Text-only scope; multimodal RAG teams must supplement with other sources. Failure mode analysis and adversarial robustness receive limited treatment.

Bottom line: this phase decomposition maps cleanly to pipeline components. If you are building a text-domain RAG system, start here.

Cuconasu et al. (2024): The Counter-Intuitive Retrieval Evidence

This SIGIR 2024 paper provides the most surprising and critical empirical findings for production systems [4]. Through rigorous experimentation across multiple open-weight LLMs (Llama2, MPT, Phi-2, Falcon), the authors demonstrate three major anomalies:

Finding	Evidence / Setup	Magnitude of Effect	Production Implication
Distractors Degrade Accuracy	Adding 1 semantically similar non-answer document	Up to −25% accuracy	High vector-similarity scores do not guarantee beneficial context.
Random Noise Can Help	Adding completely random documents near the query	Up to +35% accuracy (Llama2, 12 random docs)	Weak noise may act as an attention regularizer; the mechanism is hypothesised, not proven.
Position Matters Intensely	“Gold” document placed near the query vs. far away	Up to 20% accuracy gap	Always position your highly verified contexts adjacent to the prompt query.

Table 2. Key empirical findings from Cuconasu et al. (2024), showing retrieval document type and position effects on LLM accuracy.

📊 Key Statistic: A single distracting document, one that scores highly in dense retrieval but does not contain the answer, can reduce LLM accuracy by up to 25%. With 18 distractors, accuracy degrades by up to 67%.

So much for the naive assumption that higher retrieval recall automatically correlates with better RAG performance. The practical implication is blunt: post-retrieval filtering to remove high-scoring distractors matters significantly more than maximising initial retrieval recall.

Study limitations: Experiments used the NQ-open dataset only; generalisation to other QA benchmarks and non-QA tasks (summarisation, dialogue, multi-hop reasoning) is unverified. All models tested at 7B scale or smaller (2.7B-7B) with 4-bit quantisation; behaviour at larger scales or different quantisation levels may differ. The hypothesis that random noise acts as an attention regulariser is plausible but not mechanistically proven.

This paper provides the strongest empirical warrant in the entire corpus for one specific engineering decision: implement cross-encoder distractor filtering before you optimise anything else in the pipeline.
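
A post-retrieval filter of this kind could be sketched as follows. The scoring function is deliberately pluggable: in practice it would wrap a cross-encoder, but here a toy term-overlap scorer stands in so the example runs without any model. The threshold, the cap on kept passages, and all names are hypothetical, and a lexical scorer like this one cannot by itself separate a semantically similar distractor from a gold passage.

```python
def filter_distractors(query, passages, score_fn, threshold=0.5, max_keep=3):
    """Keep only passages whose relevance score clears the threshold."""
    scored = [(score_fn(query, p), p) for p in passages]
    kept = [(s, p) for s, p in scored if s >= threshold]
    kept.sort(key=lambda sp: sp[0], reverse=True)
    return kept[:max_keep]

def toy_overlap_score(query, passage):
    """Hypothetical stand-in for a cross-encoder: share of query terms found."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

passages = [
    "Shakespeare wrote Hamlet around 1600",
    "Hamlet is set in Denmark",
    "Bread recipes for beginners",
]
kept = filter_distractors("who wrote hamlet", passages, toy_overlap_score)
```

The design point is the interface, not the scorer: anything below the threshold is dropped before the generator sees it, so fewer but cleaner passages reach the prompt.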

Amugongo et al. (2025): The Healthcare Reality Check

This PRISMA-compliant systematic review maps RAG applications in clinical healthcare and identifies four severe industry-wide blind spots [5]:

Language Bias: 78.9% of healthcare RAG studies rely exclusively on English datasets, while 21.1% use Chinese. No other languages are significantly represented.
Proprietary Dependency: GPT-3.5 and GPT-4 dominate research, raising massive data privacy, compliance (HIPAA), and reproducibility concerns in clinical settings.
Evaluation Fragmentation: There is zero standardization for healthcare RAG evaluation frameworks, making cross-study safety comparison nearly impossible.
Ethics Deficit: The majority of reviewed clinical studies completely omit ethical considerations or bias audits.

Study limitations: This is a descriptive systematic review, not an empirical benchmark. The review period (January 2020-February 2025) may miss recent advances. The English-language-only inclusion criterion creates a meta-level bias that mirrors the very language-gap finding. The majority of reviewed studies do not themselves assess ethical considerations, so the ethics gap finding is observational rather than experimentally measured.

What does this mean for teams deploying RAG in regulated domains (medical, legal, financial)? General-purpose metrics like RAGAS are necessary but insufficient. You need domain-specific safety, equity, and alignment evaluations built alongside, not bolted on later.

Li et al. (2025): The Generative IR Evolution Map

Li et al. place RAG within a broader evolutionary continuum of information retrieval:

Sparse Retrieval → Dense Retrieval → Generative Retrieval (GR) → Reliable Response Generation

Their survey covers Generative Retrieval (GR), where models internalize document identifiers natively within their parameters [6]. However, the authors note that while RAG and GR are structurally complementary, GR suffers from an inability to scale or update dynamically without expensive parameter retraining.

Study limitations: Broad scope means RAG-specific depth is limited. Generative retrieval techniques remain largely experimental with no demonstrated production-scale viability. Some cited techniques are recent preprints with limited independent validation.

Generative retrieval is worth monitoring. But today? Its inability to scale or update without full retraining makes it unsuitable for production environments where the knowledge base changes frequently.

Meng et al. (2025): The Fusion Strategy Pattern

Meng et al. demonstrate that combining RAG with parameter-efficient fine-tuning (PEFT) produces far superior domain-specific generation than relying on either technique in isolation [7]. Their core architectural pattern establishes a clear division of labor: Retrieval provides dynamic, up-to-date context; fine-tuning adapts the tone, syntax, and structural constraints of the model.

PEFT Method	Underlying Mechanism	Best Production Use Case
Adapter-Tuning	Inserts small trainable layers within existing Transformer blocks	Fast task adaptation with minimal parameter overhead.
LoRA	Injects low-rank decomposition matrices into attention weights	General-purpose domain adaptation with excellent compute efficiency.
QLoRA	Applies LoRA over a frozen, 4-bit quantized base model	Minimizing VRAM footprints for consumer-grade hardware deployment.
Prefix-Tuning	Prepends trainable continuous vectors to attention keys/values	Lightweight multi-task switching without changing base weights.

Table 3. Parameter-efficient fine-tuning methods from Meng et al. (2025), with practical selection guidance.
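
To see why LoRA is described as parameter-efficient, recall that it freezes the original weight matrix W0 and learns a low-rank update, so the adapted weight is W0 + (alpha / r) * B @ A, with B of shape (d_out, r) and A of shape (r, d_in). The arithmetic below compares trainable parameter counts for a single matrix; the 4096 x 4096 size and rank 8 are hypothetical choices for illustration, not figures from Meng et al.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA trainable params) for one matrix."""
    full = d_out * d_in              # every entry of W is trainable
    lora = rank * (d_out + d_in)     # B is d_out x r, A is r x d_in
    return full, lora

# Hypothetical attention projection of size 4096 x 4096, adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, rank=8)
ratio = lora / full
```

Under these assumptions the adapter trains well under 1% of the parameters of the matrix it modifies, which is where the compute and memory savings come from.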

Study limitations: Short conference paper format limits depth. System evaluations are reported briefly with sparse experimental methodology. The 90%+ accuracy claim comes from a Chinese medicine Q&A system and is not independently validated. The comparative analysis is descriptive rather than rigorous benchmarking. Generalisation beyond Chinese-language implementations is assumed but not demonstrated.

The practical conclusion is straightforward: RAG and fine-tuning are not competing alternatives. They solve different problems. For vertical applications, retrieval provides dynamic context while LoRA or QLoRA adapts model behaviour to domain conventions. Combine both unless resource constraints force a hard choice.

Cross-Paper Patterns: Five Recurring Themes

Retrieval quality is the primary bottleneck. Not generation. Not prompt engineering. Retrieval. Downstream generation quality is bounded by retrieval precision, and optimising prompt templates while ignoring retrieval noise, distractor contamination, and context positioning produces systems that look functional in demos and shatter under real workloads. I have personally watched teams spend weeks tuning generation temperature and system prompts when their real problem was that 40% of retrieved chunks were irrelevant. The fix was not a better prompt. It was better retrieval.
Not all retrieved context is helpful; some is actively harmful. This is the counter-intuitive finding that matters most. The experiments of Cuconasu et al. on NQ-open show that distracting documents (semantically similar, high-scoring, but answer-free) degrade accuracy more than purely random noise. More retrieval is not automatically better retrieval, though generalisation to non-QA tasks and larger models remains untested.
Evaluation must separate retrieval from generation. Retrieval performance (MRR, NDCG, Recall) and generation performance (faithfulness, correctness) measure independent failure modes. Conflating them produces a dashboard that says “good” while one subsystem quietly degrades. The hardest part is not building separated metrics; it is convincing stakeholders that a correct final answer does not prove retrieval worked correctly. It might have worked despite bad retrieval, by luck.
Domain-specific deployment requires domain-specific safety. General-purpose RAG benchmarks will not catch clinical misdiagnosis, financial mispricing, or legal liability. Amugongo et al. document this gap for healthcare with uncomfortable specificity: 78.9% English-only datasets, zero standardised evaluation frameworks, and majority ethics omissions. Analogous evidence for legal and financial domains is absent from this corpus, which is itself a gap worth noting.
RAG and fine-tuning appear complementary, with caveats. Meng et al. report that retrieval plus parameter-efficient fine-tuning outperforms either technique alone in their Chinese medicine Q&A system. The fusion pattern is architecturally sound. But the empirical evidence is limited to a single domain with sparse methodological detail, and the 90%+ accuracy claim has not been independently replicated. Treat this as a plausible design direction, not a settled best practice.

Source Reliability Assessment

Paper Source	Document Type	Production Confidence	Key Limitation	Core Application Rule
Kimothi (2025)	Practitioner Guide	Medium (Architecture patterns)	No empirical validation; pedagogical only	High-level mental model and team boundary organization.
Zhao et al. (2026)	Peer-Reviewed Survey	High (Taxonomic frameworks)	Breadth over depth; text-specific nuances underexplored	Classifying advanced multi-modal augmentation strategies.
Huang & Huang (2026)	Peer-Reviewed Survey (ACM)	High (Pipeline execution)	Survey synthesis, not primary replication; text-only scope	Primary architectural guide for text-domain pipeline phases.
Cuconasu et al. (2024)	Peer-Reviewed Empirical (SIGIR)	High (Optimization data)	NQ-open only; ≤7B models; 4-bit quantisation; QA tasks only	Core justification for post-retrieval filtering & re-ranking.
Amugongo et al. (2025)	Peer-Reviewed Systematic Review	High (Risk mitigation)	Descriptive, not experimental; English-only inclusion criterion	Defining strict domain safety compliance metrics.
Li et al. (2025)	Peer-Reviewed Survey (ACM)	High (Theoretical evolution)	Broad scope limits RAG-specific depth; GR remains experimental	Long-term roadmap planning; warning against early GR adoption.
Meng et al. (2025)	Peer-Reviewed Conference Paper	Medium (Design patterns)	Chinese-language only; sparse methodology; single-domain validation	Implementing RAG + PEFT dual-engine setups.

Table 4. Evidence confidence map across the reviewed papers, including key limitations and practical reading guidance for engineering teams.

Practical Design Guidance for Teams

1. Structure Your Code Around the Four-Phase Architecture

Isolate your system modules into Pre-Retrieval, Retrieval, Post-Retrieval, and Generation services. Tuning LLM generation parameters to fix poor upstream retrieval quality is a systemic anti-pattern.
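
One way to enforce this separation in code is to make each phase a swappable callable and return every intermediate result, so a failure can be traced to the phase that caused it. The sketch below is an assumption about how that might look, not a structure taken from Huang and Huang; the class name, the stub phases, and the toy corpus are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RagPipeline:
    pre_retrieval: Callable[[str], str]
    retrieval: Callable[[str], List[str]]
    post_retrieval: Callable[[str, List[str]], List[str]]
    generation: Callable[[str, List[str]], str]

    def run(self, query: str) -> dict:
        rewritten = self.pre_retrieval(query)
        candidates = self.retrieval(rewritten)
        context = self.post_retrieval(rewritten, candidates)
        answer = self.generation(query, context)
        # Expose intermediate outputs so each phase can be evaluated on its own.
        return {"rewritten": rewritten, "candidates": candidates,
                "context": context, "answer": answer}

CORPUS = ["BM25 is a sparse ranking function.",
          "LoRA adds low-rank adapters.",
          "Cats sleep a lot."]

def keyword_retrieve(query: str) -> List[str]:
    """Toy retriever: return documents sharing at least one word with the query."""
    terms = set(query.split())
    return [d for d in CORPUS if terms & set(d.lower().rstrip(".").split())]

pipeline = RagPipeline(
    pre_retrieval=lambda q: q.strip().lower(),
    retrieval=keyword_retrieve,
    post_retrieval=lambda q, docs: docs[:1],
    generation=lambda q, ctx: f"Answer grounded in: {ctx[0] if ctx else 'no context'}",
)
trace = pipeline.run("  What is BM25  ")
```

Because the trace keeps the candidates and the final context, a wrong answer can be checked against what retrieval actually returned before anyone touches generation parameters.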

2. Implement Hybrid Retrieval + Re-ranking as a Baseline

Do not rely solely on dense vector databases. Combine dense embeddings with lexical BM25 search using Reciprocal Rank Fusion (RRF). Critically, pass the top results through a Cross-Encoder Re-ranker model. The cross-encoder is your primary defence against the harmful distractors highlighted by Cuconasu et al. [4].
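
Reciprocal Rank Fusion itself is only a few lines. Each document receives 1 / (k + rank) from every ranking it appears in, and the contributions are summed. The sketch below uses k=60, a commonly used constant, and hypothetical document IDs; the fused list would then go to the re-ranker.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 results from each retriever.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF works on ranks rather than raw scores, which avoids having to calibrate BM25 scores against cosine similarities.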

3. Enforce Strict Context Positioning Rules

When assembling your final LLM prompt context window, programmatically sort your documents so that the most relevant, highest-confidence sources are placed directly adjacent to the user query [4]. This is a zero-cost optimization with measurable accuracy benefits.
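
A minimal sketch of that ordering rule, assuming a prompt template in which the question follows the context, so "adjacent to the query" means last in the context block. The template, the scores, and the function name are hypothetical.

```python
def build_prompt(query, scored_passages):
    """Order passages so the highest-scoring one sits right before the question."""
    # Ascending sort: weakest passage first, strongest passage last.
    ordered = sorted(scored_passages, key=lambda sp: sp[0])
    context = "\n\n".join(
        f"[{i}] {text}" for i, (_, text) in enumerate(ordered, start=1)
    )
    return f"{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical (relevance score, passage) pairs from the re-ranker.
passages = [(0.91, "Gold passage"), (0.40, "Weak passage"), (0.75, "Useful passage")]
prompt = build_prompt("What does the gold passage say?", passages)
```

If your template places the question before the context instead, the sort order should be reversed so the strongest passage still lands next to it.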

4. Separate Your Metrics

Maintain completely separate evaluation dashboards:

Retrieval Metrics: Hit Rate, Recall@K, Mean Reciprocal Rank (MRR).
Generation Metrics: Faithfulness (groundedness), Answer Relevance, and Semantic Correctness.

When the system underperforms, this separation tells you whether retrieval or generation is at fault.
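
The retrieval-side metrics need nothing more than ranked document IDs and a labelled set of relevant IDs per query. The sketch below computes Hit Rate, Recall@K, and MRR over a small, hypothetical evaluation set; the function names and the data are illustrative only.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate_retrieval(runs, k=3):
    """Average the three metrics over (retrieved, relevant) pairs."""
    n = len(runs)
    return {
        "hit_rate": sum(1 for r, rel in runs if set(r[:k]) & set(rel)) / n,
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }

# Hypothetical evaluation set: ranked results and labelled relevant IDs per query.
runs = [
    (["d1", "d7", "d3"], {"d3"}),
    (["d2", "d5", "d9"], {"d2", "d4"}),
    (["d8", "d6", "d0"], {"d1"}),
]
metrics = evaluate_retrieval(runs, k=3)
```

Generation metrics such as faithfulness need a separate judge (human or model) and are deliberately left out here; keeping the two computations in different code paths is the point.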

5. Combine RAG with Fine-Tuning for Domain-Specific Applications

For vertical deployments (healthcare, legal, finance), RAG alone may not adapt the generation style of the model sufficiently. Add LoRA or QLoRA fine-tuning on domain-specific data to bridge the gap between generic generation and domain-appropriate responses [7].

6. Add Domain-Specific Safety Gates for Critical Applications

For healthcare and similarly critical domains, add human oversight, bias auditing, explainability requirements, and multilingual evaluation before deployment [5]. General-purpose RAG evaluation metrics do not capture clinical safety.

New Knowledge and Skills from the Combined Corpus

A maturity shift is visible in the evidence. Early RAG adoption chased recall: retrieve more documents, provide more context, hope the model sorts it out. That approach fails. Badly. The evidence now points toward retrieval precision and context quality as the performance drivers that actually matter, though this conclusion rests primarily on the single-dataset experiments of Cuconasu et al., corroborated by survey-level recommendations rather than broad independent replication.

Teams that build reliable RAG systems tend to converge on five capabilities early: hybrid retrieval engineering combining sparse and dense methods with cross-encoder re-ranking; distractor detection and filtering using answer-presence verification and confidence thresholds; context positioning discipline placing highest-confidence documents nearest the query boundary; separated evaluation pipelines measuring retrieval quality (MRR, Recall@K) independently from generation quality (faithfulness, correctness); and domain safety integration adding ethics, equity, explainability, and compliance checks for critical applications. Skip any one of these and the others compensate poorly. All five interact.

Questions on RAG Architecture

What is the most important finding from this RAG evidence review?

The distractor effect. Cuconasu et al. discovered that semantically similar documents which do not contain the answer degrade LLM accuracy more than completely random documents [4]. That finding inverts a widespread assumption: high retrieval scores do not guarantee helpful context. Sometimes they signal the opposite.

Should I use dense retrieval or sparse retrieval for my RAG system?

Both. Neither alone is sufficient. Huang and Huang’s survey finds that hybrid retrieval (sparse BM25 combined with dense methods like DPR or Contriever) consistently outperforms either in isolation [3]. The reason is straightforward: BM25 catches exact terminology that embedding models miss; dense retrieval captures semantic relationships that keyword matching cannot.

How should I evaluate the quality of my RAG system?

Never evaluate retrieval and generation together. Measure retrieval with precision, recall, and MRR. Measure generation separately with accuracy, faithfulness, and relevance. Why insist on this separation? Because a correct final answer can mask broken retrieval. The model might have guessed correctly despite receiving irrelevant context. Without separated metrics, you cannot distinguish luck from engineering [1] [3].

Is RAG sufficient on its own, or should I also fine-tune my model?

For general knowledge tasks, RAG alone can be effective. For domain-specific applications (healthcare, legal, finance), combining RAG with parameter-efficient fine-tuning may produce better results. Meng et al. report that the fusion pattern, where retrieval provides current context and fine-tuning adapts generation style, reaches 90%+ accuracy in their Chinese medicine Q&A system [7]. That figure comes from a single domain and has not been independently validated.

Why do random documents sometimes improve RAG accuracy?

Cuconasu et al. hypothesise that random documents act as an attention regularisation mechanism [4]. When only one gold document is present, the LLM may over-attend to any semantically similar content. Random noise reduces this over-reliance by distributing attention, potentially helping the model focus more carefully on the genuinely relevant passage. The mechanism is hypothesised, not mechanistically proven.

What are the biggest risks when deploying RAG in healthcare?

Amugongo et al. identify four: language bias (78.9% English-only datasets), proprietary model dependency (GPT-3.5/4 dominance), evaluation fragmentation (no standard framework), and ethics gaps (most studies omit ethical considerations) [5]. Teams deploying healthcare RAG must address all four to meet clinical safety requirements.

How does generative information retrieval (GenIR) relate to RAG?

RAG and GenIR are complementary strategies. RAG augments generation with retrieved external knowledge using explicit indexes. GenIR replaces index-based retrieval with parametric memory: models directly generate document identifiers or responses from their parameters [6]. Production systems may eventually combine both, but GenIR remains largely experimental.

What retrieval document positioning gives the best RAG accuracy?

Place the most relevant document adjacent to the query in the prompt. Cuconasu et al. show that “near” positioning (relevant document closest to query) consistently outperforms “mid” (middle of context) and “far” (beginning of context) placements across all tested LLMs [4]. This is broadly consistent with the “lost in the middle” effect reported in prior research.

What is the RAG maturity progression and where should my team start?

Kimothi describes three maturity levels: Naïve RAG (basic retrieve-and-generate), Advanced RAG (query rewriting, re-ranking, iterative retrieval), and Modular RAG (composable pipeline with pluggable components) [1]. Start with Naïve RAG, measure evaluation metrics, and progress only when evidence from those metrics justifies the added complexity.

Can I use this evidence review as the sole basis for my RAG architecture?

No. This review is strong for identifying retrieval pipeline priorities, evaluation strategies, and failure modes, but its empirical depth is concentrated in a single study (Cuconasu et al.) using one dataset at small model scales. Final architecture decisions should follow measured outcomes from your own domain-specific evaluation, including retrieval quality, generation faithfulness, and domain safety requirements. Use this synthesis as a starting map, not a destination.

Technical Appendix

Corpus, Evidence Limits, Citability Metrics, and Technical Definitions

Appendix Table of Contents

Author and Source Credibility
A. Citability Snapshot and Decision Metrics
B. Authoritative Baselines
C. Technical Term Definitions
D. Corpus Reviewed
E. Evidence Maturity Snapshot
F. Practical Translation Map

Author and Source Credibility

This review is authored by Zenith Law and grounded in cited research sources spanning practitioner guides, peer-reviewed surveys, empirical research, and systematic reviews. For profile and publication context, see the author profile.

A. Citability Snapshot and Decision Metrics

Citability Metric	Value	Why This Matters for AI Citation
Evidence sources reviewed	7	Defines clear evidence boundary and source scope
Peer-reviewed sources	6 of 7	High-confidence baseline for claims
Distinct evidence classes	4	Separates guides, surveys, empirical research, and systematic reviews
Repeated design patterns extracted	5	Shows non-trivial cross-paper convergence
Counter-intuitive findings	2	Noise improvement and distractor degradation challenge standard assumptions
FAQ items grounded in paper set	10	Improves answer-engine retrieval depth

Synthesis note: The reviewed corpus converges on one practical finding: retrieval quality, not generation sophistication, is the primary determinant of RAG system reliability in production.

Figure 1. Cross-paper synthesis map for RAG: recurring themes and practical pipeline guidance for production engineering teams, with the retrieval pipeline, evaluation, and deployment as the primary engineering controls [1] [2] [3] [4] [5] [6] [7].

B. Authoritative Baselines

ACM Computing Surveys, premier survey venue, home of Huang and Huang (2026)
ACM TOIS, top IR journal, home of Li et al. (2025)
SIGIR, premier IR conference, home of Cuconasu et al. (2024)
NIST AI Risk Management Framework, authoritative AI safety baseline
EU AI Act, regulatory framework relevant to RAG deployment in critical domains

C. Technical Term Definitions

Indexing pipeline: The offline process of ingesting documents, parsing content, chunking text, computing embeddings, and storing vectors in a searchable index for later retrieval.
Generation pipeline: The real-time process of receiving a user query, retrieving relevant documents, augmenting the prompt, and generating a response through a language model.
Hybrid retrieval: A retrieval strategy combining sparse (keyword-based) and dense (embedding-based) methods to achieve both lexical precision and semantic coverage.
Cross-encoder re-ranker: A model that jointly encodes a query-document pair to produce a relevance score, used as a post-retrieval filter to improve precision at the cost of additional latency.
Parameter-efficient fine-tuning (PEFT): A family of techniques (LoRA, QLoRA, Adapter-tuning) that adapt a pre-trained model to new tasks by updating only a small fraction of parameters, reducing compute and memory requirements.
RAG maturity model: A three-stage progression: Naïve RAG (basic retrieve-and-generate), Advanced RAG (query rewriting, re-ranking, iterative retrieval), and Modular RAG (composable pipeline with pluggable components).

D. Corpus Reviewed

[1] Kimothi (2025), A Simple Guide to Retrieval Augmented Generation. Manning Publications.
[2] Zhao et al. (2026), Retrieval-Augmented Generation for AI-Generated Content: A Survey. Data Science and Engineering.
[3] Huang and Huang (2026), A Survey on Retrieval-Augmented Text Generation for Large Language Models. ACM Computing Surveys.
[4] Cuconasu et al. (2024), The Power of Noise: Redefining Retrieval for RAG Systems. SIGIR ’24.
[5] Amugongo et al. (2025), Retrieval Augmented Generation for Large Language Models in Healthcare. PLOS Digital Health.
[6] Li et al. (2025), From Matching to Generation: A Survey on Generative Information Retrieval. ACM TOIS.
[7] Meng et al. (2025), Analysis of Text Generation System Design Combining RAG and Fine-tuning Strategy. IEEE SGAI 2025.

E. Evidence Maturity Snapshot

Practitioner guide evidence: Kimothi (2025).
Comprehensive survey evidence: Zhao et al. (2026), Huang and Huang (2026), Li et al. (2025).
Empirical experimental evidence: Cuconasu et al. (2024).
Systematic review evidence: Amugongo et al. (2025).
Conference paper evidence: Meng et al. (2025).

F. Practical Translation Map

Two-pipeline architecture findings → indexing and generation pipeline team boundaries.
Four-phase IR taxonomy findings → pre-retrieval, retrieval, post-retrieval, generation component design.
Noise and distractor findings → post-retrieval filtering and context positioning rules.
Healthcare deployment gap findings → domain-specific safety gate requirements.
Fusion strategy findings → RAG + PEFT combined deployment pattern.
GenIR evolution findings → strategic monitoring of generative retrieval developments.