Evaluating a RAG system

Evaluating generative AI solutions, particularly those using Retrieval Augmented Generation (RAG), requires a tailored approach. Unlike traditional AI systems that rely on clear statistical metrics for evaluation, generative AI lacks a definitive measure of success. Instead, evaluation relies on scenario-based testing, where testers interact with the system to assess its performance.

Challenges in Evaluating Generative AI

  1. Lack of Statistical Metrics:
    • Traditional AI systems often use metrics such as accuracy, precision, recall, or F1-score to measure performance. These metrics are unsuitable for generative AI because the outputs are open-ended and context-dependent.
    • For example, a response to the query “What are the transparency requirements in the EU AI Act?” could have multiple valid answers depending on the retrieved context (see the sketch after this list).
  2. Subjectivity in Responses:
    • Generative AI outputs are evaluated based on their relevance, accuracy, and completeness, which can vary depending on the user’s perspective or the scenario in question.
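
To make the first challenge concrete, the sketch below scores two equally valid answers against a reference with a token-overlap F1 metric, the kind of score used for extractive QA benchmarks. The reference and candidate answers are made up for illustration; the point is that the paraphrased answer is penalised heavily even though a reader would accept it, which is why such metrics are a poor fit for open-ended RAG output.

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between a reference answer and a candidate answer."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference and two equally valid, differently worded answers.
reference = "Providers must disclose that content was generated by an AI system."
answer_a = "Providers must disclose that content was generated by an AI system."
answer_b = "Users have to be told when material they see was produced by AI."

print(token_f1(reference, answer_a))  # 1.0 -- rewarded only for matching the wording
print(token_f1(reference, answer_b))  # roughly 0.17, despite being an acceptable answer
```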

Key components of RAG Evaluation

  1. Scenarios
    • Scenarios are specific, predefined tasks or questions that mimic real-world use cases.
  2. Users
    • End-users interact with the system to test its functionality and usability. Users do not necessarily verify the veracity of the content; they are better placed to test general functionality such as logging in, typing a query, voting on an answer, and so on.
    • Role:
      1. Provide feedback on whether the system meets needs/expectations
      2. Test the system's diverse functions
  3. Domain Experts
    • Subject-matter experts who evaluate the accuracy and relevance of the system’s outputs within their field of expertise.
    • Role:
      1. Validate content of answers provided by the system
      2. Identify errors or omissions by the system that general users might not catch or understand (one way to record both roles' input is sketched after this list)
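
One way to keep these three components together during a test run is a small record structure. The sketch below is a hypothetical Python layout (the class and field names are assumptions, not part of any standard): scenarios contain steps that users fill in while exercising functionality, with a separate field for the domain expert's verdict on the content of the answers.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TestStep:
    """One step of a scenario, mirroring the columns of the testing table below."""
    name: str
    expected_behavior: str
    real_behavior: str = ""
    success: Optional[bool] = None       # filled in by the user running the step
    feedback: str = ""                   # free-text evidence, e.g. screenshots

@dataclass
class Scenario:
    """A predefined, real-world-style task walked through by testers."""
    name: str
    functions_tested: str
    steps: List[TestStep] = field(default_factory=list)
    user_feedback: str = ""              # usability observations from end-users
    expert_verdict: str = ""             # content validation from a domain expert
```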

Testing framework:

The testing framework relies on a well-established process from software development known as User Testing. The only addition in a RAG scenario is the presence of domain experts, who are needed to check the substance of the answers, while standard users are still needed to test functionality. The framework can be captured in an Excel file or a similar spreadsheet, as illustrated below:

| Scenario Name | Function/s being tested | Step number/name | Expected Behavior | Real Behavior | Success Status | Feedback/evidence |
| --- | --- | --- | --- | --- | --- | --- |
| User is denied an answer after asking an irrelevant question after a relevant question | Rejection of content out of scope | Step 1: Enter a relevant question | GPT provides information on the relevant question | The GPT can provide information on a relevant question | Yes | |
| | | Step 2: Enter an irrelevant question | GPT can process a second query, the query being irrelevant | The user can enter a second query after the first | Yes | |
| | | Step 3: The GPT rejects the irrelevant question | The GPT notifies the user that the question is out of scope | The GPT does not reject the out-of-scope question and provides an answer | No | User feedback and screenshots |
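
If a spreadsheet is impractical, for example when results are collected from a web form or a test script, the same table can be written out as a CSV that Excel opens directly. The sketch below is a minimal example using Python's standard csv module; the file name is an arbitrary choice and only a subset of the rows from the illustration is shown.

```python
import csv

# Column headers match the illustration above.
COLUMNS = ["Scenario Name", "Function/s being tested", "Step number/name",
           "Expected Behavior", "Real Behavior", "Success Status", "Feedback/evidence"]

rows = [
    ["User is denied an answer after asking an irrelevant question after a relevant question",
     "Rejection of content out of scope",
     "Step 1: Enter a relevant question",
     "GPT provides information on the relevant question",
     "The GPT can provide information on a relevant question", "Yes", ""],
    ["", "",
     "Step 3: The GPT rejects the irrelevant question",
     "The GPT notifies the user that the question is out of scope",
     "The GPT does not reject the out-of-scope question and provides an answer",
     "No", "User feedback and screenshots"],
]

with open("rag_user_tests.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
```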

Evaluating generative AI systems, particularly RAG setups, requires a user-centered, scenario-based approach. By involving end-users and domain experts in structured testing, organizations can ensure their systems deliver accurate, relevant, and practical outputs. This iterative evaluation process is essential to align the capabilities of RAG systems with real-world needs, fostering trust and usability.

Key Learning Points:

  • Challenges in evaluating generative AI include:
    • Lack of statistical metrics – Open-ended responses make it difficult to measure performance numerically.
    • Subjectivity in responses – Evaluations depend on context, making assessment more qualitative than quantitative.
  • Key components of RAG evaluation:
    • Scenarios – Predefined tasks or queries that mimic real-world use cases.
    • Users – Test general functionality, such as querying, logging in, and providing feedback.
    • Domain Experts – Assess response accuracy and relevance, ensuring the system delivers high-quality answers.
  • Testing Framework:
    • Based on User Testing principles, with the addition of domain expert validation for response quality.
    • Evaluation can be structured in a table, tracking scenarios, expected vs. real behavior, success status, and user feedback.