Evaluating and Improving RAG Retriever With Llama-index

 

Introduction

In this article, we’re diving into one of the core pieces of any RAG pipeline: the retrieval step. Retrieval is the part where we pick out the chunks that have the highest chance of helping the LLM answer a user’s query.

In this article, we’re going to:

  • Define a retriever using LlamaIndex (we’ll use ChromaDB as an in-memory vector store)
  • Define a few metrics to measure how good a retriever is

  • Define the evaluation process for a retriever

  • Switch the embedding model (use a weaker one and see what happens)

  • Add reranking on top of the retriever and see how it affects the metrics

You can run all the code yourself in this Google Colab.

 

Prerequisites

You’ll need an OpenAI API token. If you don’t have one yet, just create an OpenAI account and generate an API key.
PS: I’ll keep OpenAI usage to a minimum, so this tutorial shouldn’t cost you a fortune (under $0.10).

1. Install dependencies

To run this tutorial, we need to install the following dependencies:

Bash
!pip install -q llama-index==0.14.8 llama-index-embeddings-huggingface==0.6.1 openai==2.8.1 llama-index-vector-stores-chroma==0.5.3

2. Download dataset

We’ll start by downloading the well-known LlamaIndex dataset: PaulGrahamEssayDataset.

Bash
!llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data

We can now load it into memory.

Python
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/source_files").load_data()
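
As a quick optional check that the dataset loaded correctly (not part of the original steps), you can peek at what was read:

Python
# number of loaded documents and the beginning of the first one
print("documents loaded:", len(documents))
print(documents[0].text[:200], "...")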

3. Setup LlamaIndex

Now let’s set up our LlamaIndex environment: we’ll configure the text splitter and the embedding model.

For the embedding model, we’re using a free and efficient Hugging Face model that’s supposed to give solid performance (embedding size = 384). If you want more details, check out the model page.

Python
import os
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# configure the default node parser (chunking settings)
Settings.chunk_size = 512
Settings.chunk_overlap = 64

# set a free, small Hugging Face embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
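
If you want to verify the embedding size yourself, here is a quick optional check (not part of the original setup):

Python
# bge-small-en should produce 384-dimensional vectors
sample_embedding = Settings.embed_model.get_text_embedding("hello world")
print("embedding size:", len(sample_embedding))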

Next, we set up the storage layer: we’ll use in-memory ChromaDB to store the embeddings.

Python
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("retriever_evaluation")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

At this stage, documents are split into chunks, embedded, and stored in ChromaDB.
Let’s take a closer look:

Python
all_nodes = vector_store.get_nodes(node_ids=None)
print("nodes number: ", len(all_nodes))
print("node example:",  all_nodes[1])
nodes number:  40
node example: Node ID: 300402ca-42d0-4958-8905-491aabc6667c
Text: With microcomputers, everything changed. Now you could have a
computer sitting right in front of you, on a desk, that could respond
to your keystrokes as it was running instead of just churning through
a stack of punch cards and then stopping. [1]  The first of my friends
to get a microcomputer built it himself. It was sold as a kit by
Heathkit....

We can now create the retriever.

Python
retriever = index.as_retriever(similarity_top_k=2)
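
Before evaluating it systematically, a quick manual sanity check (optional; the query below is just an example) shows what the retriever returns:

Python
# retrieve the top-2 nodes for a sample query and inspect their similarity scores
results = retriever.retrieve("What did the author work on before college?")
for node_with_score in results:
    print(round(node_with_score.score, 3), "-", node_with_score.text[:120], "...")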

4. Evaluating the retriever

Generally, a good retriever is one that returns the most relevant chunks to help answer a user’s question. If the retriever does a good job, the LLM gets the right context and we minimize hallucinations.

4.1 Evaluation process

To evaluate whether a retriever returns the right nodes for a given query, we need to generate a dataset (ground truth) of (question, relevant_node) pairs.

The idea is:

  • Generate a question for each node, based solely on the node’s content

  • Send each question to the retriever

  • Compare the nodes it returns with the ground truth nodes

To generate the dataset, we ask an LLM to create one or several questions for each node. LlamaIndex provides a helper function for this: generate_question_context_pairs.
Under the hood, it uses the following prompt to generate n questions per node:

Python
DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
"""

Python
import os
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import generate_question_context_pairs

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

generate_questions_llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
qa_dataset = generate_question_context_pairs(
    all_nodes, num_questions_per_chunk=2, llm=generate_questions_llm)
    
print("Number of generated queries:", len(qa_dataset.queries))
print("Here is an example of a query and its associated relevant node")

query_id_example = list(qa_dataset.queries.keys())[10]
query_example = qa_dataset.queries[query_id_example]
print("Query: ", query_example)
relevant_docs_ids = qa_dataset.relevant_docs[query_id_example]
relevant_docs = vector_store.get_nodes(node_ids=relevant_docs_ids)
print("Relevant node: ", relevant_docs[0].text[0:1000], "...")
Number of generated queries: 80
Here is an example of a query and its associated relevant node
Query: What was the author's initial perception of the possibility of making art, and how did this perception change over time?
Relevant node: And as an artist you could be truly independent. You wouldn't have a boss, or even need to get research funding. I had always liked looking at paintings. Could I make them? I had no idea. I'd never imagined it was even possible. I knew intellectually that people made art — that it didn't just appear spontaneously — but it was as if the people who made it were a different species. They either lived long ago or were mysterious geniuses doing strange ...

4.2 Metrics

Now let’s look at the important technical part. LlamaIndex provides several built-in metrics to evaluate how good a retriever is.

  • Hit-rate: percentage of times the retriever returns at least one relevant node.
    Higher top-k → higher hit-rate (but at the cost of more tokens and a higher chance of misleading the LLM).

  • MRR (Mean Reciprocal Rank): measures how high the first relevant node appears in the ranking; it averages 1/rank of the first relevant result over all queries. It captures two things (see the sketch after this list):

    • Did you retrieve a relevant document?

    • How high did you place it?

  • Precision: percentage of returned nodes that are relevant. In our setup it is capped at 1/top_k, since each generated question has exactly one relevant node.

  • Recall: percentage of the relevant nodes that were retrieved. With a single relevant node per query, recall is identical to hit-rate, so we drop it.

  • AP (Average Precision): averages the precision measured at the rank of each relevant node in the returned list. With a single relevant node per query, it reduces to MRR.
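
To make hit-rate and MRR concrete, here is a minimal plain-Python sketch of how they are computed for a single query (an illustration, not the LlamaIndex implementation):

Python
def hit_and_reciprocal_rank(retrieved_ids, expected_ids):
    """Compute (hit, reciprocal rank) for one query."""
    hit = float(any(node_id in expected_ids for node_id in retrieved_ids))
    reciprocal_rank = 0.0
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected_ids:
            reciprocal_rank = 1.0 / rank
            break
    return hit, reciprocal_rank

# the relevant node comes back in second position: hit = 1.0, reciprocal rank = 0.5
print(hit_and_reciprocal_rank(["node_4", "node_7"], ["node_7"]))

The dataset-level hit-rate and MRR are simply the averages of these per-query values.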

Now that we have a dataset (ground truth) and the retriever’s answers, we can run the evaluation.
LlamaIndex provides the RetrieverEvaluator class for that.

Python
from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr"]
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

For displaying the metrics, we copied a function from the LlamaIndex docs (with a few tweaks).

Python
import pandas as pd


def display_results(evaluation_results):
    """Average each metric over all evaluated queries."""

    metric_dicts = []
    for eval_result in evaluation_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    # one column per metric, containing its mean over all queries
    metric_df = pd.DataFrame({k: [full_df[k].mean()] for k in metrics})

    return metric_df


display_results(eval_results)

PS: a hit rate of 0.8 doesn’t mean the retriever finds the relevant node 80% of the time; it means it returns at least one relevant node for 80% of the queries (a query can have more than one relevant node).

Let’s now check how increasing the number of returned nodes affects the metrics (it should boost hit-rate, but at the cost of token usage and a higher risk of misleading the LLM later).

Python
retrieve_top_3 = index.as_retriever(similarity_top_k=3)
retriever_top_3_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retrieve_top_3)
eval_top_3_results = await retriever_top_3_evaluator.aevaluate_dataset(qa_dataset)
display_results(eval_top_3_results)
hit_rate: 0.85 
mrr: 0.77
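
Since display_results returns a pandas DataFrame, you can also stack the two runs to compare them side by side (a small optional helper using the variables defined above):

Python
import pandas as pd

# label each run and stack the results into a single comparison table
comparison = pd.concat(
    [
        display_results(eval_results).assign(retriever="top_2"),
        display_results(eval_top_3_results).assign(retriever="top_3"),
    ]
).set_index("retriever")
print(comparison)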

5. Switching the embedding model

In this section, we look at the impact of switching to a different embedding model.
This time we’ll use a paid one: OpenAI’s text-embedding-3-small.

Before running the evaluation, we need to change two things:

  • swap the embedding model and build a new ChromaDB collection and index with it (embeddings from different models live in different vector spaces, so we can’t reuse the old collection)

Python
from llama_index.embeddings.openai import OpenAIEmbedding


Settings.embed_model = OpenAIEmbedding(mode="text_search", model="text-embedding-3-small")

chroma_client_openai = chromadb.EphemeralClient()
chroma_collection_openai = chroma_client_openai.create_collection("nex_rag_eval")
vector_store_openai_embed = ChromaVectorStore(chroma_collection=chroma_collection_openai)

storage_context_openai = StorageContext.from_defaults(vector_store=vector_store_openai_embed)
index_openai = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context_openai
)
retriever_openai = index_openai.as_retriever(similarity_top_k=2)

  • to avoid regenerating the questions (and spending more OpenAI API calls), we reuse the existing (question, relevant_node) pairs; since the new index assigns new node IDs, we remap the old IDs to the new ones

Python
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

openai_nodes = vector_store_openai_embed.get_nodes(node_ids=None)

# both stores contain the same chunks in the same order, so map old ids to new ids by position
nodes_mapping = {all_nodes[i].node_id: openai_nodes[i].node_id for i in range(len(all_nodes))}

# rebuild the ground-truth dataset with the new node ids
openai_relevant_docs = {query_id: [nodes_mapping[node_id] for node_id in node_ids] for query_id, node_ids in qa_dataset.relevant_docs.items()}
openai_corpus = {nodes_mapping[node_id]: text for node_id, text in qa_dataset.corpus.items()}
openai_qa_dataset = EmbeddingQAFinetuneDataset(queries=qa_dataset.queries, relevant_docs=openai_relevant_docs, corpus=openai_corpus)
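
The positional mapping above assumes both vector stores return the chunks in the same order. A quick optional sanity check makes that assumption explicit:

Python
# fail fast if the two stores don't contain the same chunks in the same order
assert len(all_nodes) == len(openai_nodes)
for old_node, new_node in zip(all_nodes, openai_nodes):
    assert old_node.get_content() == new_node.get_content()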

Everything’s ready — let’s run the evaluation!

Python
retriever_openai_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever_openai)
openai_eval_results = await retriever_openai_evaluator.aevaluate_dataset(openai_qa_dataset)
display_results(openai_eval_results)
hit_rate: 0.8 
mrr: 0.72

Despite using a stronger embedding model, we get results very close to those of the free model.
This doesn’t mean the free model is always just as good; it only means that on this dataset the two perform similarly.

6. Reranking

To improve our retriever, we now add a reranking step. Reranking is a post-processing step: the retriever returns a list of nodes, and then an LLM re-orders them so we can keep a smaller, more relevant subset.

This usually improves retriever performance — but remember, calling an LLM has a cost, so pick a cheap reranking model to avoid blowing up your budget.

 


Under the hood, here’s the prompt we send to the LLM to score the relevance of each node returned by the retriever:

Python
DEFAULT_CHOICE_SELECT_PROMPT_TMPL = (
    "A list of documents is shown below. Each document has a number next to it along "
    "with a summary of the document. A question is also provided. \n"
    "Respond with the numbers of the documents "
    "you should consult to answer the question, in order of relevance, as well \n"
    "as the relevance score. The relevance score is a number from 1-10 based on "
    "how relevant you think the document is to the question.\n"
    "Do not include any documents that are not relevant to the question. \n"
    "Example format: \n"
    "Document 1:\n<summary of document 1>\n\n"
    "Document 2:\n<summary of document 2>\n\n"
    "...\n\n"
    "Document 10:\n<summary of document 10>\n\n"
    "Question: <question>\n"
    "Answer:\n"
    "Doc: 9, Relevance: 7\n"
    "Doc: 3, Relevance: 4\n"
    "Doc: 7, Relevance: 3\n\n"
    "Let's try this now: \n\n"
    "{context_str}\n"
    "Question: {query_str}\n"
    "Answer:\n"
)

Python
from llama_index.core.postprocessor.llm_rerank import LLMRerank

# define the LLM reranker: keep the 2 most relevant nodes out of 6 retrieved candidates
llm_rerank = LLMRerank(top_n=2, llm=OpenAI(model="gpt-3.5-turbo", temperature=0))
retriever_top_6 = index.as_retriever(similarity_top_k=6)

# evaluate: the evaluator applies the post-processors to the retrieved nodes before computing metrics
rerank_retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever_top_6, node_postprocessors=[llm_rerank]
)
eval_results = await rerank_retriever_evaluator.aevaluate_dataset(qa_dataset)

display_results(eval_results)
hit_rate: 0.95 
mrr: 0.81

We can clearly see the impact of reranking — but again, it comes with a cost, since we’re using an LLM call.
Usually, a small/cheap model works well enough for this step.
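
Evaluation aside, here is how you would typically plug the same reranker into an actual query pipeline: retrieve a larger candidate set, let the reranker keep the best nodes, then synthesize the answer. A minimal sketch reusing the index and llm_rerank defined above (the query is just an example):

Python
# retrieve 6 candidates, keep the 2 most relevant after reranking, then answer
query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[llm_rerank],
)
response = query_engine.query("What did the author work on before college?")
print(response)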

Conclusion

In this tutorial, we explored a central component of RAG, the retriever: we built one with LlamaIndex, evaluated it with hit-rate and MRR, compared two embedding models, and saw how reranking affects its performance.