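How to Evaluate and Improve a RAG Application

In this guide, you will learn how to use Ragas to evaluate and iteratively improve a RAG (retrieval-augmented generation) application.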

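What you will accomplish

Set up an evaluation dataset
Establish metrics that measure RAG performance
Build a reusable evaluation pipeline
Analyze errors and systematically improve your RAG application
Learn how to use Ragas for RAG evaluation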

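Set up and run the RAG system

We built a simple RAG system that retrieves relevant documents from the Hugging Face documentation dataset and uses an LLM to generate answers. The dataset contains documentation pages for many Hugging Face packages, stored as markdown, which provides a rich knowledge base for testing RAG capabilities.

The full implementation is available at: ragas_examples/improve_rag/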

flowchart LR
    A[User Query] --> B[Retrieve Documents<br/>BM25]
    B --> C[Generate Response<br/>OpenAI]
    C --> D[Return Answer]

To run it, install the dependencies:

uv pip install "ragas-examples[improverag]"

Then run the RAG application:

import os
import asyncio
from openai import AsyncOpenAI
from ragas_examples.improve_rag.rag import RAG, BM25Retriever

# Set up OpenAI client
os.environ["OPENAI_API_KEY"] = "<your_key>"
openai_client = AsyncOpenAI()

# Create retriever and RAG system
retriever = BM25Retriever()
rag = RAG(openai_client, retriever)

# Query the system
question = "What architecture is the `tokenizers-linux-x64-musl` binary designed for?"
result = await rag.query(question)
print(f"Answer: {result['answer']}")

Output

Answer: It's built for the x86_64 architecture (specifically the x86_64-unknown-linux-musl target — 64-bit Linux with musl libc).

Understanding the RAG implementation

The code above uses a simple RAG class that demonstrates the core RAG pattern. Here is how it works:

# examples/ragas_examples/improve_rag/rag.py
from typing import Any, Dict, Optional
from openai import AsyncOpenAI

class RAG:
    """Simple RAG system for document retrieval and answer generation."""

    def __init__(self, llm_client: AsyncOpenAI, retriever: BM25Retriever, system_prompt=None, model="gpt-4o-mini", default_k=3):
        self.llm_client = llm_client
        self.retriever = retriever
        self.model = model
        self.default_k = default_k
        self.system_prompt = system_prompt or "Answer only based on documents. Be concise.\n\nQuestion: {query}\nDocuments:\n{context}\nAnswer:"

    async def query(self, question: str, top_k: Optional[int] = None) -> Dict[str, Any]:
        """Query the RAG system."""
        if top_k is None:
            top_k = self.default_k

        return await self._naive_query(question, top_k)

    async def _naive_query(self, question: str, top_k: int) -> Dict[str, Any]:
        """Handle naive RAG: retrieve once, then generate."""
        # 1. Retrieve documents using BM25
        docs = self.retriever.retrieve(question, top_k)

        if not docs:
            return {"answer": "No relevant documents found.", "retrieved_documents": [], "num_retrieved": 0}

        # 2. Build context from retrieved documents
        context = "\n\n".join([f"Document {i}:\n{doc.page_content}" for i, doc in enumerate(docs, 1)])
        prompt = self.system_prompt.format(query=question, context=context)

        # 3. Generate response using OpenAI with retrieved context
        response = await self.llm_client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )

        return {
            "answer": response.choices[0].message.content.strip(),
            "retrieved_documents": [{"content": doc.page_content, "metadata": doc.metadata, "document_id": i} for i, doc in enumerate(docs)],
            "num_retrieved": len(docs)
        }

This shows the basic RAG pattern: retrieve relevant documents → inject them into the prompt → generate an answer.
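To make the retrieval step concrete, here is a minimal, self-contained sketch of BM25 ranking over a toy corpus. It illustrates the scoring idea only: the whitespace tokenizer, the `k1` and `b` values, and the `bm25_rank` helper are assumptions, not the implementation behind `BM25Retriever`.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score (best first)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

docs = [
    "create_repo creates a repository on the hub",
    "the tokenizers binary targets x86_64 linux musl",
    "datasets server exposes a healthcheck endpoint",
]
top = bm25_rank("tokenizers musl architecture", docs)
```

Because BM25 scores exact token overlap, a query phrased differently from the document that holds the answer can score zero against it. Keep this in mind when reading the failure analysis later in this guide.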

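Create the evaluation dataset

We will use huggingface_doc_qa_eval, a question-answering dataset about the Hugging Face documentation.

Here are a few sample rows from the dataset:

Question	Expected answer
What architecture is the `tokenizers-linux-x64-musl` binary designed for?	x86_64-unknown-linux-musl
What is the purpose of the BLIP-Diffusion model?	The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.
What is the purpose of the /healthcheck endpoint in the Datasets server API?	Ensure the app is running

The evaluation script downloads the dataset from GitHub and converts it to the Ragas dataset format: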

# examples/ragas_examples/improve_rag/evals.py
import urllib.request
from pathlib import Path
from ragas import Dataset
import pandas as pd

def download_and_save_dataset() -> Path:
    dataset_path = Path("datasets/hf_doc_qa_eval.csv")
    dataset_path.parent.mkdir(exist_ok=True)

    if not dataset_path.exists():
        github_url = "https://raw.githubusercontent.com/vibrantlabsai/ragas/main/examples/ragas_examples/improve_rag/datasets/hf_doc_qa_eval.csv"
        urllib.request.urlretrieve(github_url, dataset_path)

    return dataset_path

def create_ragas_dataset(dataset_path: Path) -> Dataset:
    dataset = Dataset(name="hf_doc_qa_eval", backend="local/csv", root_dir=".")
    df = pd.read_csv(dataset_path)

    for _, row in df.iterrows():
        dataset.append({"question": row["question"], "expected_answer": row["expected_answer"]})

    dataset.save()
    return dataset

Learn more about working with datasets in Core Concepts - Datasets.
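Before handing rows to the dataset, it can help to sanity-check the CSV. The sketch below uses only the standard library and an in-memory sample. The `load_qa_rows` helper is hypothetical; only the `question` and `expected_answer` column names come from the code above.

```python
import csv
import io

REQUIRED_COLUMNS = {"question", "expected_answer"}

def load_qa_rows(handle):
    """Read Q&A rows from a CSV file handle, skipping incomplete rows."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        question = (row["question"] or "").strip()
        expected = (row["expected_answer"] or "").strip()
        # Keep only rows where both fields are non-empty.
        if question and expected:
            rows.append({"question": question, "expected_answer": expected})
    return rows

# In-memory sample; a real run would open the downloaded CSV instead.
sample = io.StringIO(
    "question,expected_answer\n"
    "What is the default repository type for create_repo?,model\n"
    "An incomplete row,\n"
)
rows = load_qa_rows(sample)
```

Dropping incomplete rows up front keeps empty expected answers from reaching the judge and distorting the pass rate.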

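Set up RAG evaluation metrics

Now that the evaluation dataset is ready, we need metrics to measure RAG performance. Start with simple, focused metrics that directly measure your core use case. More information about metrics is available in Core Concepts - Metrics.

Here we use a discrete correctness metric that evaluates whether the RAG response contains the key information from the expected answer and is factually accurate based on the provided context.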

# examples/ragas_examples/improve_rag/evals.py
from ragas.metrics import DiscreteMetric

# Define correctness metric
correctness_metric = DiscreteMetric(
    name="correctness",
    prompt="""Compare the model response to the expected answer and determine if it's correct.

Consider the response correct if it:
1. Contains the key information from the expected answer
2. Is factually accurate based on the provided context
3. Adequately addresses the question asked

Return 'pass' if the response is correct, 'fail' if it's incorrect.

Question: {question}
Expected Answer: {expected_answer}
Model Response: {response}

Evaluation:""",
    allowed_values=["pass", "fail"],
)

Now that we have an evaluation metric, we need to run it systematically across the dataset. This is where Ragas experiments come in.
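While wiring up the pipeline, a cheap deterministic check can stand in for the LLM judge so that you can smoke-test without API calls. The sketch below is a hypothetical stand-in (`contains_expected` is not part of Ragas) and is far cruder than the `correctness` metric: it only tests whether the normalized expected answer appears in the response.

```python
import re

def normalize(text):
    """Lowercase, then keep runs of letters, digits, and underscores joined by spaces."""
    return " ".join(re.findall(r"[a-z0-9_]+", text.lower()))

def contains_expected(response, expected_answer):
    """Return 'pass' if the normalized expected answer occurs in the response."""
    expected = normalize(expected_answer)
    if not expected:
        return "fail"
    return "pass" if expected in normalize(response) else "fail"

verdict_hit = contains_expected(
    "It targets the x86_64-unknown-linux-musl triple.",
    "x86_64-unknown-linux-musl",
)
verdict_miss = contains_expected(
    "The provided documents do not state the default repository type.",
    "model",
)
```

A substring check like this cannot recognize paraphrases and can be fooled by partial matches, so use it only to verify plumbing, not to report results.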

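Create the evaluation experiment

The experiment function runs your RAG system on each data sample and evaluates the response with the correctness metric. More information about experiments is available in Core Concepts - Experiments.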

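The experiment function takes a dataset row containing a question and an expected answer, and then:

Queries the RAG system with the question
Evaluates the response with the correctness metric
Returns detailed results, including the score and the reason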

# examples/ragas_examples/improve_rag/evals.py
import asyncio
from typing import Dict, Any
from ragas import experiment

@experiment()
async def evaluate_rag(row: Dict[str, Any], rag: RAG, llm) -> Dict[str, Any]:
    """
    Run RAG evaluation on a single row.

    Args:
        row: Dictionary containing question and expected_answer
        rag: Pre-initialized RAG instance
        llm: Pre-initialized LLM client for evaluation

    Returns:
        Dictionary with evaluation results
    """
    question = row["question"]

    # Query the RAG system
    rag_response = await rag.query(question, top_k=4)
    model_response = rag_response.get("answer", "")

    # Evaluate correctness asynchronously
    score = await correctness_metric.ascore(
        question=question,
        expected_answer=row["expected_answer"],
        response=model_response,
        llm=llm
    )

    # Return evaluation results
    result = {
        **row,
        "model_response": model_response,
        "correctness_score": score.value,
        "correctness_reason": score.reason,
        "mlflow_trace_id": rag_response.get("mlflow_trace_id", "N/A"),  # MLflow trace ID for debugging (explained later)
        "retrieved_documents": [
            doc.get("content", "")[:200] + "..." if len(doc.get("content", "")) > 200 else doc.get("content", "")
            for doc in rag_response.get("retrieved_documents", [])
        ]
    }

    return result

With the dataset, metric, and experiment function in place, we can now evaluate the performance of our RAG system.
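Conceptually, running an experiment means applying an async function to every row and collecting the results. The sketch below shows that pattern with plain `asyncio`. It is not how Ragas implements `arun`; `fake_evaluate`, `run_over_rows`, and the concurrency limit are illustrative assumptions.

```python
import asyncio

async def fake_evaluate(row):
    # Hypothetical stand-in for evaluate_rag: no RAG or LLM calls are made.
    await asyncio.sleep(0)
    verdict = "pass" if row["expected_answer"] in row["response"] else "fail"
    return {**row, "correctness_score": verdict}

async def run_over_rows(rows, max_concurrency=4):
    """Evaluate all rows concurrently, capping the number of in-flight tasks."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(row):
        async with semaphore:
            return await fake_evaluate(row)

    # gather returns results in the same order as the input rows.
    return await asyncio.gather(*(guarded(row) for row in rows))

rows = [
    {"question": "q1", "expected_answer": "model", "response": "The default is model."},
    {"question": "q2", "expected_answer": "Skops", "response": "Not mentioned."},
]
results = asyncio.run(run_over_rows(rows))
```

Capping concurrency with a semaphore is one common way to stay under API rate limits when each row triggers several LLM calls.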

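Run the initial RAG experiment

Start the MLflow server

Before running the evaluation, you must start the MLflow server. The RAG system automatically logs traces to MLflow for debugging and analysis: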

# Start MLflow server (required - in a separate terminal)
uv run mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000

The MLflow UI will be available at http://127.0.0.1:5000.

Run the experiment

Now let's run the full evaluation pipeline to get baseline performance metrics for our RAG system:

# Import required components
import asyncio
from datetime import datetime
from ragas_examples.improve_rag.evals import (
    evaluate_rag,
    download_and_save_dataset,
    create_ragas_dataset,
    get_openai_client,
    get_llm_client
)
from ragas_examples.improve_rag.rag import RAG, BM25Retriever

async def run_evaluation():
    # Download and prepare dataset
    dataset_path = download_and_save_dataset()
    dataset = create_ragas_dataset(dataset_path)

    # Initialize RAG components
    openai_client = get_openai_client()
    retriever = BM25Retriever()
    rag = RAG(llm_client=openai_client, retriever=retriever, model="gpt-5-mini", mode="naive")
    llm = get_llm_client()

    # Run evaluation experiment
    exp_name = f"{datetime.now().strftime('%Y%m%d-%H%M%S')}_naiverag"
    results = await evaluate_rag.arun(
        dataset, 
        name=exp_name,
        rag=rag,
        llm=llm
    )

    # Print results
    if results:
        pass_count = sum(1 for result in results if result.get("correctness_score") == "pass")
        total_count = len(results)
        pass_rate = (pass_count / total_count) * 100 if total_count > 0 else 0
        print(f"Results: {pass_count}/{total_count} passed ({pass_rate:.1f}%)")

    return results

# Run the evaluation
results = await run_evaluation()
print(results)

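This downloads the dataset, initializes the BM25 retriever, runs the evaluation experiment on each sample, and saves the detailed results as a CSV file in the experiments/ directory for analysis.

Output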

Results: 43/66 passed (65.2%)
Evaluation completed successfully!

Detailed results:
Experiment(name=20250924-212541_naiverag,  len=66)

With a 65.2% pass rate, we now have a baseline. The detailed results CSV in the experiments/ directory contains the data we need for error analysis and systematic improvement.
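To start the error analysis, you can load the results CSV and pull out the failing rows. The sketch below parses an in-memory sample with the standard library. The sample rows are made up for illustration, and `summarize` is a hypothetical helper; only the `question`, `correctness_score`, and `correctness_reason` column names come from the experiment function above.

```python
import csv
import io

def summarize(handle):
    """Return (pass_count, total, failing_rows) for a results CSV handle."""
    rows = list(csv.DictReader(handle))
    failures = [r for r in rows if r["correctness_score"] != "pass"]
    return len(rows) - len(failures), len(rows), failures

# Illustrative data only; a real run writes its CSV under experiments/.
sample = io.StringIO(
    "question,correctness_score,correctness_reason\n"
    "q1,pass,matches expected answer\n"
    "q2,fail,response says documents lack the answer\n"
    "q3,pass,matches expected answer\n"
)
passed, total, failures = summarize(sample)
pass_rate = 100 * passed / total if total else 0.0
```

Reading the `correctness_reason` of each failing row is usually the fastest way to see whether failures share a cause.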

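View traces in MLflow

The experiment results CSV includes an mlflow_trace_id and an mlflow_trace_url for each evaluation, so you can analyze the detailed execution trace. These traces help you understand exactly where a failure occurred: in the retrieval, generation, or evaluation step.

The RAG system automatically logs traces to the MLflow server you started earlier, and you can view them at http://127.0.0.1:5000.

This lets you:

Analyze results in the CSV: review responses, metric scores, and reasons
Dig deeper with traces: click the mlflow_trace_url in a result to jump directly to the detailed execution trace for that evaluation in the MLflow UI

Pro tip: click the trace URL to debug

Every evaluation result includes an mlflow_trace_url, a clickable link to the trace in the MLflow UI. There is no need to navigate manually or copy trace IDs; one click takes you to the detailed execution trace.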

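Analyze errors and failure modes

After running the evaluation, inspect the results CSV in the experiments/ directory to identify patterns among the failures. Each row includes an mlflow_trace_id and an mlflow_trace_url for viewing the detailed execution trace in the MLflow UI. Annotate each failure case to understand the patterns so that we can improve the application.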
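Once each failure carries a hand-written label, a simple tally shows which category dominates. The labels and rows below are hypothetical examples of such annotations, not data from the experiment.

```python
from collections import Counter

# Hypothetical annotations added by a reviewer while reading failed rows.
annotated_failures = [
    {"question": "q1", "failure_mode": "retrieval_miss"},
    {"question": "q2", "failure_mode": "retrieval_miss"},
    {"question": "q3", "failure_mode": "incomplete_answer"},
    {"question": "q4", "failure_mode": "retrieval_miss"},
]

counts = Counter(row["failure_mode"] for row in annotated_failures)
# most_common sorts categories from most to least frequent.
for mode, count in counts.most_common():
    share = 100 * count / len(annotated_failures)
    print(f"{mode}: {count} ({share:.0f}%)")
top_mode = counts.most_common(1)[0][0]
```

Fixing the most frequent category first tends to give the largest gain per change.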

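Actual failure modes in our evaluation

In our example, the core problem is retrieval failure: the BM25 retriever did not find the documents that contain the answer. The model followed its instructions correctly and said so when the documents lacked the information, but the wrong documents had been retrieved.

Examples of poor document retrieval: the BM25 retriever failed to retrieve the relevant documents that contain the answer.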

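Question	Expected answer	Model response	Root cause
"What is the default repository type for create_repo?"	`model`	"The provided documents do not state the default repository type..."	BM25 missed the document containing the create_repo details
"What is the purpose of the BLIP-Diffusion model?"	"Controllable text-to-image generation and editing"	"The provided documents do not mention BLIP‑Diffusion..."	BM25 did not retrieve the BLIP-Diffusion documentation
"What is the name of the new Hugging Face library for hosting scikit-learn models?"	`Skops`	"The provided documents do not mention or name any new Hugging Face library..."	BM25 missed the Skops documentation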

Based on this analysis, retrieval is the main bottleneck. Let's implement targeted improvements.
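One way to check whether retrieval, rather than generation, is at fault is to test whether the expected answer appears anywhere in the retrieved text. The heuristic below is a sketch under stated assumptions: `classify_failure` is hypothetical, and plain substring matching misses paraphrases, so treat its output as a hint for manual review.

```python
def classify_failure(expected_answer, retrieved_documents):
    """Label a failed row by whether the answer text was retrieved at all."""
    needle = expected_answer.lower()
    found = any(needle in doc.lower() for doc in retrieved_documents)
    # If the answer was in the context, the generator had what it needed.
    return "generation" if found else "retrieval"

label_a = classify_failure(
    "Skops",
    ["Gradio lets you build demos.", "Transformers pipelines overview."],
)
label_b = classify_failure(
    "Skops",
    ["Skops is a library for hosting scikit-learn models on the Hub."],
)
```

Note that the experiment function above stores only the first 200 characters of each retrieved document, so running this check on the results CSV can mislabel rows as retrieval failures. The full documents in the MLflow trace are a more reliable input.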

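Improve the RAG application

Having identified retrieval as the main bottleneck, we can improve the system in two ways:

The traditional approach focuses on better chunking, hybrid search, or vector embeddings. However, because our BM25 retrieval consistently misses relevant documents when given a single query, we will explore an agentic approach.

Agentic RAG lets the AI iteratively refine its search strategy, trying multiple search terms and deciding when it has found enough context, rather than relying on one static query.

Agentic RAG implementation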

flowchart LR
    A[User Query] --> B[AI Agent<br/>OpenAI]
    B --> C[BM25 Tool]
    C --> B
    B --> D[Final Answer]

Run the agentic RAG application on a sample query:

# Switch to agentic mode
rag_agentic = RAG(openai_client, retriever, mode="agentic")

question = "What architecture is the `tokenizers-linux-x64-musl` binary designed for?"
result = await rag_agentic.query(question)
print(f"Answer: {result['answer']}")

Output

Answer: It targets x86_64 — i.e. the x86_64-unknown-linux-musl target triple.

Understanding the agentic RAG implementation

The agentic RAG mode uses the OpenAI Agents SDK to create an AI agent with a BM25 retrieval tool:

# Key components from the RAG class when mode="agentic"
from agents import Agent, Runner, function_tool

def _setup_agent(self):
    """Setup agent for agentic mode."""
    @function_tool
    def retrieve(query: str) -> str:
        """Search documents using BM25 retriever for a given query."""
        docs = self.retriever.retrieve(query, self.default_k)
        if not docs:
            return "No documents found."
        return "\n\n".join([f"Doc {i}: {doc.page_content}" for i, doc in enumerate(docs, 1)])

    self._agent = Agent(
        name="RAG Assistant",
        model=self.model,
        instructions="Use short keywords to search. Try 2-3 different searches. Only answer based on documents. Be concise.",
        tools=[retrieve]
    )

async def _agentic_query(self, question: str, top_k: int) -> Dict[str, Any]:
    """Handle agentic mode: agent controls retrieval strategy."""
    result = await Runner.run(self._agent, input=question)
    # The run result exposes the agent's final text as `final_output`.
    return {"answer": result.final_output}

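Unlike the single retrieval call in naive mode, the agent decides autonomously when and how to search, trying multiple keyword combinations until it finds enough context.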
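The loop the agent performs can be approximated without an LLM: try several search phrasings and stop once something relevant comes back. In the sketch below, the keyword retriever, the fixed list of reformulations, and `search_until_found` are all illustrative assumptions; a real agent generates its own queries and decides when to stop.

```python
def keyword_retrieve(query, docs):
    """Toy retriever: return docs sharing at least one token with the query."""
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d.lower().split())]

def search_until_found(queries, docs, max_attempts=3):
    """Try each query in order and return (hits, attempts_used)."""
    attempts = 0
    for query in queries[:max_attempts]:
        attempts += 1
        hits = keyword_retrieve(query, docs)
        if hits:
            return hits, attempts
    return [], attempts

docs = [
    "skops hosts scikit-learn models on the hub",
    "gradio builds machine learning demos",
]
# A single static query can miss; alternate phrasings give more chances.
queries = ["sklearn hosting library", "scikit-learn hub"]
hits, attempts = search_until_found(queries, docs)
```

Here the first phrasing shares no tokens with either document, while the second one does, which mirrors how a keyword-based retriever can succeed once the query wording matches the source text.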

Run the experiment again and compare results

Now let's evaluate the agentic RAG approach:

# Import required components
import asyncio
from datetime import datetime
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

from ragas_examples.improve_rag.evals import (
    evaluate_rag,
    download_and_save_dataset, 
    create_ragas_dataset,
    get_openai_client,
    get_llm_client
)
from ragas_examples.improve_rag.rag import RAG, BM25Retriever

async def run_agentic_evaluation():
    # Download and prepare dataset
    dataset_path = download_and_save_dataset()
    dataset = create_ragas_dataset(dataset_path)

    # Initialize RAG components with agentic mode
    openai_client = get_openai_client()
    retriever = BM25Retriever()
    rag = RAG(llm_client=openai_client, retriever=retriever, model="gpt-5-mini", mode="agentic")
    llm = get_llm_client()

    # Run evaluation experiment
    exp_name = f"{datetime.now().strftime('%Y%m%d-%H%M%S')}_agenticrag"
    results = await evaluate_rag.arun(
        dataset, 
        name=exp_name,
        rag=rag,
        llm=llm
    )

    # Print results
    if results:
        pass_count = sum(1 for result in results if result.get("correctness_score") == "pass")
        total_count = len(results)
        pass_rate = (pass_count / total_count) * 100 if total_count > 0 else 0
        print(f"Results: {pass_count}/{total_count} passed ({pass_rate:.1f}%)")

    return results

# Run the agentic evaluation
results = await run_agentic_evaluation()
print("\nDetailed results:")
print(results)

Agentic RAG evaluation output

Results: 58/66 passed (87.9%)

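That is a significant improvement: from 65.2% (naive) to 87.9% (agentic), a gain of 22.7 percentage points with the agentic RAG approach.

Performance comparison

The agentic RAG approach shows a large improvement over the naive RAG baseline:

Approach	Correctness	Improvement
Naive RAG	65.2%	-
Agentic RAG	87.9%	+22.7 percentage points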
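The improvement figure is a difference in percentage points, computed from the pass counts reported above. The per-question comparison in the second half of the sketch uses made-up verdicts to show how to spot regressions between two runs; `compare_runs` is a hypothetical helper.

```python
def pass_rate(passed, total):
    """Pass rate as a percentage."""
    return 100 * passed / total

naive = pass_rate(43, 66)      # pass count from the naive run
agentic = pass_rate(58, 66)    # pass count from the agentic run
gain_points = agentic - naive  # difference in percentage points

def compare_runs(baseline, candidate):
    """Return questions fixed and questions regressed between two runs."""
    fixed = [q for q in baseline if baseline[q] == "fail" and candidate.get(q) == "pass"]
    regressed = [q for q in baseline if baseline[q] == "pass" and candidate.get(q) == "fail"]
    return fixed, regressed

# Illustrative verdicts only, keyed by question.
baseline = {"q1": "pass", "q2": "fail", "q3": "fail"}
candidate = {"q1": "fail", "q2": "pass", "q3": "fail"}
fixed, regressed = compare_runs(baseline, candidate)
print(f"{naive:.1f}% -> {agentic:.1f}% (+{gain_points:.1f} points)")
```

An aggregate gain can hide individual regressions, so it is worth checking which questions flipped in each direction before accepting a change.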

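Apply this loop to your RAG system

Follow this systematic approach to improve any RAG system:

Create an evaluation dataset: use real queries from your system, or generate synthetic data with an LLM.
Define metrics: choose simple metrics that match your use case. Stay focused.
Run a baseline evaluation: measure current performance and analyze error patterns to identify systematic failures.
Implement targeted improvements: based on the error analysis, improve retrieval (chunking, hybrid search), generation (prompts, models), or try an agentic approach.
Compare and iterate: test each improvement against the baseline. Change one thing at a time until accuracy meets your business requirements.

The Ragas framework handles orchestration and result aggregation automatically, so you can focus on analysis and improvement instead of building evaluation infrastructure.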