Evaluation of RAG Pipelines for more reliable LLM applications

Abhinav Kimothi
8 min read · Jan 3, 2024

Building a PoC RAG pipeline is not overly complex. LangChain and LlamaIndex have made it quite simple. Highly impressive Large Language Model (LLM) applications can be put together with a little experimentation and verification on a limited set of examples. However, to make them robust, thorough testing on a dataset that accurately mirrors the production distribution is imperative.

RAG is a great tool to address hallucinations in LLMs, but even RAG pipelines can suffer from hallucinations.

This can happen because:

  • The retriever fails to retrieve relevant context, or retrieves irrelevant context
  • The LLM, despite being provided the context, does not consider it
  • The LLM, instead of answering the query, picks irrelevant information from the context

Evaluating Processes

There are, therefore, two processes to focus on from an evaluation perspective:

Search & Retrieval

Q. How good is the retrieval of the context from the Vector Database?

Q. Is it relevant to the query?

Q. How much noise (irrelevant information) is present?

Generation

Q. How good is the generated response?

Q. Is the response grounded in the provided context?

Q. Is the response relevant to the query?

Ragas (RAG Assessment)

In 2023, Jithin James and Shahul ES of Exploding Gradients developed the Ragas framework to address these questions.

Evaluation Data

To evaluate RAG pipelines, the following four data points are recommended:

  1. A set of Queries or Prompts for evaluation
  2. Retrieved Context for each prompt
  3. Corresponding Response or Answer from the LLM
  4. Ground Truth or known correct response

Datapoints required for evaluating RAG pipelines
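As a concrete illustration, below is a minimal sketch of how such an evaluation dataset can be assembled and scored with Ragas. It assumes the ragas and datasets packages are installed and an OpenAI API key is configured; the column names (question, contexts, answer, ground_truth) follow the Ragas documentation at the time of writing and may differ across versions.

```python
# Minimal sketch of a Ragas evaluation run. Column names may vary by version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

eval_data = {
    "question": ["Who won the 2023 ODI Cricket World Cup and when?"],
    "contexts": [[
        "The 2023 ODI Cricket World Cup concluded on 19 November 2023, "
        "with Australia winning the tournament."
    ]],
    "answer": ["Australia won on 19 November 2023."],
    "ground_truth": ["Australia won the world cup on 19 November 2023."],
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # dict of metric names to scores, each in the 0-1 range
```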

Evaluation Metrics

Ragas Metrics

A. Faithfulness

Faithfulness is the measure of the extent to which the response is factually grounded in the retrieved context

Problem addressed : The LLM, despite being provided the context, does not consider it

Evaluated Process : Generation

Score Range : (0,1) Higher score is better

Methodology : Faithfulness identifies the number of “claims” made in the response and calculates the proportion of those “claims” present in the context.
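In formula terms, the methodology above amounts to: Faithfulness = (number of claims in the response that can be inferred from the retrieved context) / (total number of claims in the response).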

Illustrative Example

Query : Who won the 2023 ODI Cricket World Cup and when?

Context : The 2023 ODI Cricket World Cup concluded on 19 November 2023, with Australia winning the tournament.

Response 1 : High Faithfulness

Australia won on 19 November 2023

Response 2 : Low Faithfulness

Australia won on 15 October 2023

High vs Low Faithfulness

B. Answer Relevance

Answer Relevance is the measure of the extent to which the response is relevant to the query or the prompt

Problem addressed : The LLM, instead of answering the query, responds with irrelevant information

Evaluated Process : Generation

Score Range : (0,1) Higher score is better

Methodology : For this metric, a response is generated for the initial query or prompt. To compute the score, the LLM is then prompted to generate questions for the generated response several times. The mean cosine similarity between these questions and the original one is then calculated. The concept is that if the answer correctly addresses the initial question, the LLM should generate questions from it that match the original question.
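In formula terms: Answer Relevance ≈ (1/N) × Σ cosine_similarity(E(q_i), E(q)), where q is the original query, q_i are the N questions generated from the response, and E(·) is an embedding function.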

Illustrative Example

Query : Who won the 2023 ODI Cricket World Cup and when?

Response 1 : High Answer Relevance

India won on 19 November 2023

Response 2 : Low Answer Relevance

Cricket world cup is held once every four years

High vs Low Answer Relevance

C. Context Relevance

Context Relevance is the measure of the extent to which the retrieved context is relevant to the query or the prompt

Problem addressed : The retriever fails to retrieve relevant context

Evaluated Process : Retrieval

Score Range : (0,1) Higher score is better

Methodology : The retrieved context should contain information only relevant to the query or the prompt. For context relevance, a metric ‘S’ is estimated. ‘S’ is the number of sentences in the retrieved context that are relevant for responding to the query or the prompt.
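In formula terms, the methodology above amounts to: Context Relevance = S / (total number of sentences in the retrieved context).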

Illustrative Example

Query : Who won the 2023 ODI Cricket World Cup and when?

Context 1 : High Context Relevance

The 2023 Cricket World Cup concluded on 19 November 2023, with Australia winning the tournament. The tournament took place in ten different stadiums, in ten cities across the country. The final took place between India and Australia at Narendra Modi Stadium.

Context 2 : Low Context Relevance

The 2023 Cricket World Cup was the 13th edition of the Cricket World Cup. It was the first Cricket World Cup which India hosted solely. The tournament took place in ten different stadiums. In the first semi-final, India beat New Zealand, and in the second semi-final, Australia beat South Africa.

High vs Low Context Relevance

D. Context Recall

Context recall measures the extent to which the retrieved context aligns with the “provided” answer or Ground Truth

Problem addressed : The retriever fails to retrieve accurate context

Evaluated Process : Retrieval

Score Range : (0,1) Higher score is better

Methodology : To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. Ideally, all sentences in the ground truth answer should be attributable to the retrieved context.
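In formula terms: Context Recall = (number of sentences in the ground truth answer attributable to the retrieved context) / (total number of sentences in the ground truth answer).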

Illustrative Example

Query : Who won the 2023 ODI Cricket World Cup and when?

Ground Truth : Australia won the world cup on 19 November, 2023.

Context 1 : High Context Recall

The 2023 Cricket World Cup, concluded on 19 November 2023, with Australia winning the tournament.

Context 2 : Low Context Recall

The 2023 Cricket World Cup was the 13th edition of the Cricket World Cup. It was the first Cricket World Cup which India hosted solely.

E. Context Precision

Context Precision is a metric that evaluates whether all of the ground-truth-relevant items present in the retrieved contexts are ranked at the top, ahead of irrelevant chunks.

Problem addressed : The retriever fails to rank the retrieved context correctly

Evaluated Process : Retrieval

Score Range : (0,1) Higher score is better

Methodology : Precision@k is a metric used in information retrieval and recommendation systems to evaluate the accuracy of the top k items retrieved or recommended. It measures the proportion of relevant items among the top k items.
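Ragas aggregates this idea over the ranks of the retrieved chunks. As documented at the time of writing (worth re-checking against the current docs): Context Precision@K = Σ_k (Precision@k × v_k) / (number of relevant chunks in the top K), where v_k is 1 if the chunk at rank k is relevant and 0 otherwise. Relevant chunks that appear earlier in the ranking therefore contribute more to the score.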

F. Answer semantic similarity

Answer semantic similarity evaluates whether the generated response is similar to the “provided” response or Ground Truth.

Problem addressed : The generated response is incorrect

Evaluated Process : Retrieval & Generation

Score Range : (0,1) Higher score is better

Methodology : Answer semantic similarity score is calculated by measuring the semantic similarity between the generated response and the ground truth response.

Answer Semantic Similarity = Similarity (Generated Response, Ground Truth Response)
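Conceptually (not the exact Ragas implementation), the score reduces to a cosine similarity between embeddings of the two answers. A minimal sketch, where the choice of sentence-transformers model is an arbitrary assumption:

```python
# Conceptual sketch only: cosine similarity between answer embeddings.
# The embedding model below is an arbitrary choice, not the Ragas default.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

generated = model.encode("Australia won on 19 November 2023")
ground_truth = model.encode("Australia won the world cup on 19 November 2023.")

score = float(
    np.dot(generated, ground_truth)
    / (np.linalg.norm(generated) * np.linalg.norm(ground_truth))
)
print(score)  # close to 1.0 for semantically similar answers
```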

G. Answer Correctness

Answer correctness evaluates whether the generated response is semantically and factually similar to the “provided” response or Ground Truth.

Problem addressed : The generated response is incorrect, i.e., does the pipeline generate the right response?

Evaluated Process : Retrieval & Generation

Score Range : (0,1) Higher score is better

Methodology : Answer correctness score is calculated by measuring the semantic and the factual similarity between the generated response and the ground truth response.
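In the Ragas implementation at the time of writing (worth re-checking against the current docs), the score combines two components: a factual-overlap score, computed as an F1 over the claims in the generated response that are true positives, false positives and false negatives with respect to the ground truth, and the answer semantic similarity score described above, with the factual component carrying the larger weight by default.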

Synthetic Test Data Generation

Generating hundreds of QA (Question-Context-Answer) samples from documents manually is time-consuming and labor-intensive. Moreover, questions created by humans may not reach the level of complexity needed for a comprehensive evaluation, which affects the overall quality of the assessment.

Synthetic Data Generation uses Large Language Models to generate a variety of Questions/Prompts and Responses/Answers from the Documents (Context). It can greatly reduce developer time.

Synthetic Data Generation Pipeline

Ragas Documentation
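For reference, below is a hedged sketch of synthetic test-set generation with Ragas. The import paths and method names (TestsetGenerator.with_openai, generate_with_langchain_docs, the simple/reasoning/multi_context evolutions) follow the 0.1-era API and have changed across versions, so check the current documentation; the "docs/" directory is a placeholder for your own document collection.

```python
# Sketch of synthetic test-set generation, assuming the Ragas 0.1-era API.
# Method and module names have changed across versions; verify against the docs.
from langchain_community.document_loaders import DirectoryLoader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

documents = DirectoryLoader("docs/").load()  # placeholder document collection

generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,  # number of Question-Context-Answer samples to generate
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

print(testset.to_pandas().head())  # question, contexts, ground_truth columns
```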

RAG pipelines stand as a promising solution to address hallucinations and enhance the reliability of Large Language Models (LLMs). Despite their potential, the evaluation of these pipelines becomes imperative to ensure their robustness and accuracy. By employing metrics such as faithfulness, answer relevance, context relevance, context recall, context precision, answer semantic similarity, and answer correctness, developers and researchers can gauge the effectiveness of RAG pipelines. The utilization of synthetic data generation techniques further streamlines the evaluation process, reducing manual effort and enabling the creation of diverse QA samples for a more comprehensive assessment. As the demand for reliable language models continues to grow, the Ragas framework serves as a valuable asset in ensuring the robustness and efficacy of RAG pipelines in handling diverse queries and contexts, contributing to their continued advancement and application in real-world scenarios.

Abhinav Kimothi

Co-founder and Head of AI @ Yarnit.app || Data Science, Analytics & AIML since 2007 || BITS-Pilani, ISB-Hyderabad || Ex-HSBC, Ex-Genpact, Ex-LTI || Theatre