Practical Techniques to Boost Your RAG Performance

 

Introduction

In our previous articles, we covered the theoretical foundations of RAG (Retrieval-Augmented Generation). Understanding the basics is essential, and we have already discussed two very popular techniques for boosting performance: Reranking and Hybrid Retrieval.

However, in the real world, you often need to go a step further. Building a RAG system is a bit like “craftsmanship”: it’s a creative process where you can choose or combine different techniques to reach high performance. In this article, we’ll explore several methods to make your RAG more effective, explaining when to use them and showing real examples.

[Image: RAG as craftsmanship]

 

Technique 1: Query Transformation

Even the best search systems can fail if a user’s request is poorly phrased. To fix this, we use Query Transformation, which involves rewriting or enriching the initial request.

The idea is simple: we generate multiple versions of the same question to cover different angles and wording. This helps bridge the “lexical gap” between the words the user chooses and the words actually used in your documents.

For example, let’s ask for alternatives to the following question (where we intentionally included typos): “are there any gd techniques to improve rags”

 

Python
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

query_transformation_prompt = PromptTemplate(
    input_variables=["question"],
    template="""Given the user question: {question}
Generate three alternative versions that express the same information need but with different wording.
1.
"""
)
llm = ChatOpenAI(model="gpt-4.1-2025-04-14", temperature=0.7)
qt_chain = query_transformation_prompt | llm | StrOutputParser()


answer = qt_chain.invoke({"question": "are there any gd techniques to improve rags"})
answer
>> 1. What are some effective ways to enhance RAGs?
2. How can RAG models be improved using good techniques?
3. What are the best strategies for improving RAG performance?

As you can see, the system now generates several versions of the question that are more complete and relevant.

What can we do with these new queries?

Once you have these variations, you can ask your vector store to find the closest documents for each version. Then, you have a few options:

  • Merge the documents: Combine all results to answer the original question, as shown in the sketch after this list.
  • Score filtering: If your vector store provides a similarity score, keep only the top k documents across all queries.
  • Multiple answers: Answer each version of the question separately and then select the best overall response.
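
For example, here is a minimal sketch of the “merge” option. It assumes you already have a LangChain vector store (named vector_store below, built elsewhere) and reuses the answer produced by the chain above.

Python
# Parse the numbered alternatives out of the chain output above.
alternatives = [line.split(".", 1)[-1].strip()
                for line in answer.splitlines() if line.strip()]

# Retrieve for the original question plus each alternative,
# deduplicating results across queries.
seen, merged_docs = set(), []
for query in ["are there any gd techniques to improve rags"] + alternatives:
    for doc in vector_store.similarity_search(query, k=3):
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged_docs.append(doc)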

 

Technique 2: Hypothetical Document Embeddings (HyDE)

The concept of HyDE is to use an LLM to generate a “fake” or hypothetical answer based on the user’s query. Instead of searching for documents that look like the question, the system searches for documents that look like this fictional answer.

The workflow: Query → LLM → Fictional Document → Embedding (Vectorization) → Search.

The big advantage: It is much easier to match two statements (a fake answer vs. a real document) than it is to match a question to its answer.

Strong point: This technique is incredibly powerful for complex queries where there is a large semantic gap (differences in vocabulary or structure) between the user’s question and the source material.

Let’s look at an example:

Python
hyde_prompt = PromptTemplate(
    input_variables=["question"],
    template="""Given the user question: {question}
Write a passage that could contain the answer to this question. Write only the passage, without any preamble or commentary.
"""
)
hyde_chain = hyde_prompt | llm | StrOutputParser()

answer = hyde_chain.invoke({"question": "are there any gd techniques to improve rags"})
answer
Yes, there are several good techniques to improve Retrieval-Augmented Generation (RAG) systems. First, enhancing the retriever component by using dense retrieval models like DPR, ColBERT, or embedding fine-tuned on your domain can boost the relevance of retrieved passages. Second, employing better query reformulation—such as query expansion or iterative retrieval—helps the system find more pertinent documents. Third, filtering or reranking retrieved passages using cross-encoders or re-ranking models

As you can see, the hypothetical answer already contains many keywords, like “re-ranking” and “query expansion”, that the retriever can use to find good documents.
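
To finish the HyDE workflow, we embed this hypothetical answer and search with it instead of the original question. A minimal sketch, assuming the same vector_store as in the previous technique:

Python
# Search with the hypothetical answer rather than the question itself:
# the vector store embeds it and returns the closest real documents.
hypothetical_answer = hyde_chain.invoke(
    {"question": "are there any gd techniques to improve rags"}
)
docs = vector_store.similarity_search(hypothetical_answer, k=3)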

 

Technique 3: Context Compression

Once documents are retrieved, context compression techniques help distill and organize the information to maximize its value in the generation phase. Context Compression extracts only the most relevant parts of the retrieved documents, removing “noise” that might distract the LLM.

Suppose our retriever returned these two documents for our previous question:

  1. “There are many techniques that can enhance RAG performance like HyDE (hypothetical document embeddings), query transformation, and also some advanced chunking methods…”

  2. “RAGs are so important.”

Notice that the first document contains part of the answer, but also extra info we don’t need right now. The second document doesn’t really contain useful information at all.

Python
context_compression_prompt = PromptTemplate(
    input_variables=["question", "context"],
    template="""Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return "context is not good".

Remember, *DO NOT* edit the extracted parts of the context.

> Question: {question}
> Context:
>>>
{context}
>>>
Extracted relevant parts:"""
)

documents_to_compress = [
    """There are many techniques that can enhance RAG performance like HyDE (hypothetical document embeddings) and query transformation.
    Researchers have made big improvements to LLMs in recent years.
    """,
    "RAGs are so important"
]
cc_chain = context_compression_prompt | llm | StrOutputParser()
good_documents = []
for doc in documents_to_compress:
    compressed_doc = cc_chain.invoke({"question": "are there any gd techniques to improve rags", "context": doc})
    if compressed_doc.strip() != "context is not good":  # keep only documents with relevant content
        good_documents.append(compressed_doc)

good_documents
['There are many techniques that can enhance RAG performance like HyDE (hypothetical document embeddings) and query transformation.']

=> By using compression, we extract only the specific answer from the first document and ignore the second document entirely, which helps your RAG focus on good context.
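
Note that LangChain also ships this pattern as a built-in retriever; here is a sketch, again assuming the same vector_store (import paths may vary between LangChain versions):

Python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Wrap the base retriever so every retrieved document is compressed
# by the LLM before being returned.
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(),
)
compressed_docs = compression_retriever.invoke("are there any gd techniques to improve rags")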

 

Technique 4: Self-consistency Checking

Self-consistency Checking is a vital validation step. It ensures that the answer generated by the AI is actually supported by the source documents. Think of it as an automatic, internal “fact-checker.”

In a production environment, a smooth-sounding but false answer (a hallucination) is more dangerous than no answer at all. This check creates a feedback loop: if an inconsistency is found, the system can either reject the response or try generating a more accurate one (see the sketch at the end of this section).

Example: We provide our pipeline with the original question, the generated answer, and the source documents. We then ask the system to perform a full analysis and explain its reasoning.

In this test, we will claim in the answer that “context compressing is great for RAG” (even if that specific phrase isn’t in our source documents). The self-consistency check will flag this as unsupported, keeping our system honest!

Python
from typing import List
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate

class Claim(BaseModel):
  claim: str = Field(description="The factual claim")
  status: str = Field(description="The status of the claim: partially_supported|contradicted|not_mentioned")
  explanation: str = Field(description="The explanation of why you chose this status")

class AnswerAnalysis(BaseModel):
  fully_grounded: bool = Field(description="Whether the answer is fully grounded in the context")
  not_fully_grounded_claims: List[Claim] = Field(description="The claims that are not fully supported; this list is empty when all claims are fully grounded")


self_consistency_checking_template = ChatPromptTemplate.from_messages([
    ("system", """You are a fact-checking assistant who verifies whether answers are fully supported by context or not.
    The user will give you a question, an answer and a context, and your role is to identify any statements in the answer
     that are not supported by or contradict the context"""),
    ("human", """Please do fact checking for this:
    - question: {question}
    - answer: {answer}
    - context: {context}""")])

scc_chain = self_consistency_checking_template | llm.with_structured_output(AnswerAnalysis)

answer = scc_chain.invoke({
    "question": "Which techniques are good to improve RAG ?",
    "answer": "Techniques like self consistency checking, Hypothetical Document Embeddings and context compressing are good to improve rag performances",
    "context": """
    - self-consistency checking is proven to be a very good technique to improve rags
    - Hypothetical Document Embeddings was used by some researchers to get better results for their Rags
    """
})
answer
AnswerAnalysis(fully_grounded=False, not_fully_grounded_claims=[Claim(claim='context compressing are good to improve rag performances', status='not_mentioned', explanation="The context provided does not mention 'context compressing' as a technique to improve RAG performance.")])

🎉 The LLM has detected the ungrounded claim!
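
To close the feedback loop mentioned earlier, you can branch on the analysis and reject (or regenerate) any answer that is not fully grounded. A minimal sketch:

Python
# Reject (or regenerate) whenever the answer is not fully grounded.
if not answer.fully_grounded:
    flagged = "; ".join(c.claim for c in answer.not_fully_grounded_claims)
    print(f"Answer rejected, unsupported claims: {flagged}")
else:
    print("Answer is grounded, safe to return to the user.")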

 

Conclusion

As we have seen, there isn’t just one way to build a RAG system. There is a vast number of techniques available, from transforming queries to checking for consistency.

At the end of the day, building a high-performing RAG is truly artisanal work. It requires patience, testing, and a bit of creativity to find the right combination of tools that fits your specific needs. Don’t be afraid to experiment with these methods to see which ones bring the most value to your project!