Where RAG Fails: Common Failure Cases and Quick Mitigations


Web developer working with the MERN stack (including Next.js) and with newer generative AI and agentic AI technology. Available for hire.

Introduction & Why RAG Fails

Introduction

  • Retrieval-Augmented Generation (RAG) improves AI answers by combining:

    • Retrieval: Searching a knowledge base (such as a collection of PDF documents) for relevant information.

    • Generation: Producing a response based on retrieved data.

  • While RAG is powerful, it can fail if the input data, query, or retrieval process is flawed.

Common RAG Failure Cases

  1. Poor Source Files

    • Known as GIGO (Garbage In, Garbage Out).

    • If the knowledge base contains irrelevant, outdated, or poorly structured documents:

      • The retrieved chunks may not answer the query.

      • Final output quality decreases significantly.

  2. Weak User Prompts

    • Users may not know how to phrase questions effectively:

      • Missing technical terms or keywords.

      • Typos, spelling errors, or confusing phrasing.

    • Results:

      • Retrieval returns irrelevant or incomplete chunks.

      • Generated answers may drift off-topic.

  3. Query Drift

    • Occurs when the AI misinterprets the user’s intent.

    • Even small ambiguities in phrasing can lead to retrieval of unrelated chunks.

    • Causes:

      • Vague queries.

      • Missing context.

  4. Outdated Indexes

    • Knowledge base indexes may not include the latest information.

    • Even with a perfect query, the AI may retrieve obsolete or incomplete data.

  5. Hallucinations from Weak Context

    • If retrieved chunks don’t fully cover the query, the LLM may hallucinate:

      • Produces confident but incorrect answers.

      • Often happens when the chunks are too small, fragmented, or lack overlap.
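
One way to reduce fragmentation is to split documents into chunks that share some text with their neighbours. The sketch below is a minimal, word-based illustration; the function name `chunk_with_overlap` and the default sizes are assumptions for this example, not values from any particular library.

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; neighbouring chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demonstration with 10 "words", windows of 4, and an overlap of 2.
demo = chunk_with_overlap(" ".join(f"w{i}" for i in range(10)), chunk_size=4, overlap=2)
```

Real pipelines often split on sentences or tokens rather than whitespace-separated words; the overlap idea stays the same.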

Why Understanding Failures Matters

  • Identifying why RAG fails is the first step toward improving its reliability.

  • Once failure causes are clear, we can design multi-step pipelines to:

    • Enhance queries.

    • Retrieve better chunks.

    • Filter and rank results.

    • Generate more accurate and trustworthy answers.

Mitigation Pipeline

Step 1: Enhancing the User Query

Why Enhance the Query?

  • User queries are often imperfect:

    • Typos, missing context, or vague phrasing.

    • Lack of technical terms or key keywords.

  • Directly using such queries can reduce retrieval accuracy and lead to irrelevant chunks.

  • Enhancing the query makes it more likely that the AI retrieves the most relevant information.

Query Enhancement Process

  1. Query Translation / Rewriting

    • The original query (Original_Query) is rewritten by an LLM to:

      • Fix typos and grammatical errors.

      • Add missing context for better retrieval.

      • Clarify ambiguous terms.

    • Example:

      • Original_Query: "cardio heart tips"

      • Rewritten_Query: "Best practices for cardiovascular health, including exercise and diet tips"

  2. Generating Vector Embeddings

    • Both the rewritten query and document chunks are converted into vector embeddings.

    • Vectors allow the system to measure semantic similarity rather than exact word matches.

    • Benefit:

      • Queries like “heart health” can find chunks mentioning “cardiovascular wellness” because their vectors are similar.

  3. Retrieving Relevant Chunks

    • The embedding of Rewritten_Query is compared against the chunk embeddings stored in the vector database.

    • Retrieves the most semantically similar chunks for further processing.

    • Output: Relevant_Chunks

  4. Quality Check with Judge LLMs

    • One or more separate "judge" LLMs evaluate the relevance and completeness of the retrieved chunks:

      • Do chunks fully answer the query?

      • Are chunks semantically aligned with user intent?

    • If quality is poor:

      • Optionally, external sources (such as a web search) can provide additional context.

  5. Redo Query Translation if Needed

    • Based on judge LLM feedback, the query can be enhanced again:

      • Produces Enhanced_Query.

      • Retrieves Enhanced_Relevant_Chunks for higher quality.
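
The embedding and retrieval steps above can be sketched with plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity`, `retrieve_top_k`, and `toy_index` are hypothetical names. A production system would use an embedding model and a vector database instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real chunk embeddings (assumed values).
toy_index = {
    "cardio_wellness": [0.9, 0.1, 0.0],
    "tax_filing": [0.0, 0.1, 0.9],
    "heart_diet": [0.8, 0.3, 0.1],
}

# Toy vector standing in for the embedding of Rewritten_Query.
relevant_chunks = retrieve_top_k([1.0, 0.2, 0.0], toy_index, k=2)
```

With these toy values, the two heart-related chunks rank above the unrelated one, which is the behaviour the semantic-similarity step relies on.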

Outcome of Step 1

  • The AI now has a well-formed query and a first set of high-quality chunks.

  • This forms a strong foundation for the next step: generating multiple query variants to capture different perspectives.

Step 2: Generating Query Variants

Why Create Query Variants?

  • A single rewritten query may not cover all possible ways information is phrased in the knowledge base.

  • Different users or sources may describe the same concept differently.

  • Generating multiple query variants increases recall and reduces the chance of missing relevant chunks.

Process of Generating Variants

  1. Creating Multiple Variants

    • The Enhanced_Query from Step 1 is rewritten into 3 different variants using LLMs:

      • Example: Enhanced_Query = “Best practices for cardiovascular health”

        • Variant 1: “Tips for maintaining a healthy heart”

        • Variant 2: “How to improve cardiovascular wellness through diet and exercise”

        • Variant 3: “Lifestyle habits for heart health and prevention of heart disease”

    • Purpose: Capture different phrasings and perspectives.

  2. Retrieving Chunks for Each Variant

    • Each of the 3 variants is converted into vector embeddings.

    • The retrieval system finds the most semantically relevant chunks for each variant:

      • Output: 3_Variants_Chunks (one set per variant)

    • Benefit: Reduces the chance that relevant information is overlooked when phrasing differs.

  3. Combining Variant Chunks with Original Enhanced Chunks

    • Together with Enhanced_Relevant_Chunks from Step 1, you now have 4 sets of chunks:

      • 1 set from Enhanced_Query

      • 3 sets from query variants

    • These chunks form the candidate pool for ranking and filtering in the next step.
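
Combining the four result sets into one candidate pool could look like the sketch below. It assumes each retrieval returns a list of chunk IDs; `build_candidate_pool` is a hypothetical helper name. Besides de-duplicating, it counts how many result sets each chunk appeared in, since that count is useful as a reliability signal during ranking.

```python
from collections import Counter


def build_candidate_pool(result_sets: list[list[str]]):
    """Merge per-query result lists into one pool.

    Returns (pool, hits):
      pool - chunk IDs without duplicates, in first-seen order
      hits - Counter mapping chunk ID -> number of result sets containing it
    """
    hits = Counter()
    pool = []
    for results in result_sets:
        # dict.fromkeys drops repeats inside one result set while keeping order,
        # so a chunk is counted at most once per query.
        for chunk_id in dict.fromkeys(results):
            if chunk_id not in hits:
                pool.append(chunk_id)
            hits[chunk_id] += 1
    return pool, hits


# One set from Enhanced_Query plus three sets from the variants (made-up IDs).
pool, hits = build_candidate_pool([
    ["c1", "c2"],
    ["c2", "c3"],
    ["c2", "c4", "c4"],
    ["c1", "c5"],
])
```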

Outcome of Step 2

  • The AI now has a diverse, comprehensive set of chunks that cover multiple ways the information could be described.

  • This improves retrieval coverage and lays the foundation for ranking and filtering to reduce noise and prevent hallucinations.

Step 3: Ranking and Filtering Chunks

Why Ranking and Filtering Are Important

  • After generating multiple query variants, you now have 4 sets of chunks (1 from Enhanced_Query + 3 from variants).

  • Not all chunks are equally relevant; some may be repetitive, low-quality, or noisy.

  • Feeding too many chunks to an LLM can:

    • Overwhelm the model

    • Reduce answer accuracy

    • Increase hallucinations

  • Ranking and filtering prioritizes the most relevant, consistent information.

How Ranking and Filtering Works

  1. Ranking by Relevance

    • Each chunk is scored based on semantic similarity to the Enhanced_Query or original question.

    • Methods:

      • Cosine similarity between query embedding and chunk embeddings

      • Re-ranking models (if available) to score chunks for relevance

  2. Identifying Repeated / Consistent Chunks

    • Chunks that appear in multiple variant retrievals are considered more reliable.

    • Repetition acts as a signal of importance.

  3. Filtering Out Noise

    • Remove:

      • Irrelevant chunks (low similarity scores)

      • Duplicate chunks

      • Chunks with conflicting or weak context

    • Result: Filtered_Chunks, a clean, high-quality set of information for the LLM.

  4. Benefits of This Step

    • Reduces context overload for the LLM

    • Minimizes hallucinations from weak or irrelevant sources

    • Ensures the final answer is grounded in the most trustworthy chunks
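
A minimal sketch of ranking and filtering follows. It assumes each candidate already carries a similarity `score` and a `hits` count (how many retrievals returned it); the function name, the `min_score` threshold, and the `repeat_bonus` weight are illustrative assumptions that would need tuning on real data. Detecting conflicting context is not covered here, as that usually needs an LLM or a dedicated model.

```python
def rank_and_filter(candidates, min_score=0.5, repeat_bonus=0.05, top_n=5):
    """Drop low-similarity chunks, boost repeated ones, remove duplicates, keep top_n.

    candidates: list of dicts with keys 'text', 'score', and 'hits'.
    """
    # 1. Filter out low-similarity chunks and compute a final score that
    #    rewards chunks returned by more than one query variant.
    scored = [
        {**c, "final_score": c["score"] + repeat_bonus * (c["hits"] - 1)}
        for c in candidates
        if c["score"] >= min_score
    ]
    # 2. Rank best-first so that de-duplication keeps the strongest copy.
    scored.sort(key=lambda c: c["final_score"], reverse=True)

    # 3. Remove duplicates, comparing text with case and whitespace normalized.
    seen, filtered = set(), []
    for c in scored:
        key = " ".join(c["text"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        filtered.append(c)
    return filtered[:top_n]


filtered_chunks = rank_and_filter([
    {"text": "Exercise helps the heart.", "score": 0.80, "hits": 3},
    {"text": "exercise  helps the heart.", "score": 0.78, "hits": 1},
    {"text": "Eat more fiber.", "score": 0.82, "hits": 1},
    {"text": "Tax deadlines are in April.", "score": 0.10, "hits": 1},
])
```

In this toy run the repeated chunk overtakes a slightly higher-scoring single hit, its near-duplicate is dropped, and the unrelated chunk falls below the threshold.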

Outcome of Step 3

  • You now have a refined, ranked set of chunks ready to feed into the LLM.

  • The system is prepared to generate an accurate, contextually rich response.

Step 4: Final Output with Filtered Chunks

Feeding the LLM

  • The Enhanced_Query along with Filtered_Chunks is passed to the LLM.

  • The LLM now has:

    • A clear, well-written question

    • Highly relevant context

  • This combination allows the model to generate a precise and grounded answer.
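
Passing the query and chunks to the LLM usually comes down to assembling a prompt. The sketch below shows one possible layout; the wording of the instruction and the name `build_grounded_prompt` are assumptions, and the actual call to an LLM is left out because it depends on the provider.

```python
def build_grounded_prompt(enhanced_query: str, filtered_chunks: list[str]) -> str:
    """Assemble a prompt that asks the model to answer only from the given chunks."""
    # Number the chunks so the model (and a human reviewer) can refer to them.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(filtered_chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below. "
        "If the context is insufficient, say so instead of guessing.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {enhanced_query}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "Best practices for cardiovascular health",
    ["Regular aerobic exercise supports heart health.", "A diet rich in fiber helps."],
)
```

Telling the model explicitly that it may decline when the context is insufficient is a common way to discourage guessing.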

Why This Step Matters

  • By now, most irrelevant or weak chunks have been removed.

  • The LLM is less likely to guess or hallucinate, because it no longer depends on weak context.

  • Ensures output is:

    • Accurate

    • Domain-specific

    • Up-to-date (if source files are recent)

Optional Enhancements

  • LLM Judges: Run the generated answer through one or more additional LLMs to check correctness.

  • Fallback Retrieval: If context is insufficient, query external sources (like Google or internal databases) and update chunks.

  • Iterative Refinement: Repeat query translation, retrieval, and filtering for particularly complex questions.
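
The refinement loop and the fallback can be combined in one control flow. The sketch below is an assumption about how this might be wired together: `rewrite`, `retrieve`, `judge`, and `fallback` are hypothetical callables that you would back with your own LLM, vector store, and external search.

```python
from typing import Callable


def retrieve_with_refinement(
    query: str,
    rewrite: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
    judge: Callable[[str, list[str]], bool],
    fallback: Callable[[str], list[str]],
    max_rounds: int = 2,
) -> tuple[str, list[str]]:
    """Rewrite and retrieve until the judge accepts the chunks or rounds run out."""
    current = query
    chunks: list[str] = []
    for _ in range(max_rounds):
        current = rewrite(current)    # query translation / enhancement
        chunks = retrieve(current)    # vector search with the rewritten query
        if judge(current, chunks):    # judge decides whether context is sufficient
            return current, chunks
    # Context is still judged insufficient: add chunks from the fallback source.
    return current, chunks + fallback(current)


# Stub callables, used only to show the control flow.
final_query, final_chunks = retrieve_with_refinement(
    "heart tips",
    rewrite=lambda q: q + " (clarified)",
    retrieve=lambda q: [f"chunk for: {q}"],
    judge=lambda q, chunks: False,  # always unsatisfied, so the fallback is used
    fallback=lambda q: ["external result"],
)
```

Capping the number of rounds keeps latency and cost bounded, since each round means additional LLM and retrieval calls.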

Summary Table of the Pipeline

| Step | Action | Purpose | Tools / Techniques |
| --- | --- | --- | --- |
| Step 1 – Enhance Query | Rewrite user query, fix errors, add context | Improve retrieval accuracy | LLM-based query rewriting, embeddings, judge LLMs |
| Step 2 – Generate Variants | Create 3 rewritten versions of query | Capture different phrasings & perspectives | Query rewriting + embeddings |
| Step 3 – Rank & Filter | Rank all chunks, remove duplicates | Reduce noise & prevent hallucinations | Semantic similarity ranking, rerankers, deduplication |
| Step 4 – Final Output | Feed best chunks + enhanced query to LLM | Produce accurate, reliable answer | LLM + filtered context |

Key Takeaways

  • RAG can fail due to poor sources, weak prompts, query drift, outdated indexes, or hallucinations from weak context.

  • Multi-step pipelines improve accuracy, reliability, and trustworthiness.

  • Steps like query enhancement, generating variants, ranking, and filtering help reduce noise and hallucinations.

  • Properly implemented, RAG can become a powerful, domain-specific, and up-to-date AI assistant.