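How to Evaluate and Improve Your Prompts

In this guide, you will learn how to evaluate a prompt with Ragas and improve it iteratively.

What you will accomplish

Iterate on and improve a prompt based on error analysis of evaluation results
Establish clear decision criteria for choosing between prompts
Build a reusable evaluation pipeline for your dataset
Learn how to use Ragas to build your evaluation pipeline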

Full code

The dataset and scripts are in the examples/iterate_prompt/ directory of the repository.
The full code is available on GitHub.

Task definition

In this example, we consider a customer support ticket classification task.

Labels (multi-label): Billing, Account, ProductIssue, HowTo, Feature, RefundCancel
Priority (exactly one): P0, P1, or P2

Dataset

We created a synthetic dataset for this use case. Each row has id, text, labels, and priority. Sample rows from the dataset:

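id	text	labels	priority
1	Upgraded to Plus… bank shows two charges on the same day; want the duplicate charge reversed.	Billing;RefundCancel	P1
2	SSO via Okta succeeds, then bounces back to /login; colleagues can sign in; state mismatch; cannot access boards.	Account;ProductIssue	P0
3	Need to export a board with comments and page numbers to PDF for an audit; deadline is next week.	HowTo	P2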

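To customize the dataset for your use case, create a datasets/ directory and add your own CSV file. You can also connect to a different backend. See Core Concepts - Evaluation Dataset for more information.

Ideally, build the dataset by sampling real data from your application. If real data is not available, you can generate synthetic data with an LLM. We recommend a reasoning model such as gpt-5 high-reasoning, which can generate more accurate and more complex data. Always review and verify the data manually before you use it.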
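Part of the manual review can be automated before any prompt is run. The following is a minimal sketch, assuming the CSV uses the id, text, labels, priority columns and the semicolon-separated label format described above; `validate_rows` is a hypothetical helper, not part of Ragas or of the example repository.

```python
import csv
import io

# Allowed values copied from the task definition in this guide.
ALLOWED_LABELS = {"Billing", "Account", "ProductIssue", "HowTo", "Feature", "RefundCancel"}
ALLOWED_PRIORITIES = {"P0", "P1", "P2"}
REQUIRED_COLUMNS = ("id", "text", "labels", "priority")


def validate_rows(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for row in reader:
        labels = [label for label in row["labels"].split(";") if label]
        unknown = sorted(set(labels) - ALLOWED_LABELS)
        if not labels:
            problems.append(f"row {row['id']}: no labels")
        if unknown:
            problems.append(f"row {row['id']}: unknown labels {unknown}")
        if row["priority"] not in ALLOWED_PRIORITIES:
            problems.append(f"row {row['id']}: bad priority {row['priority']!r}")
    return problems


# Toy input: the second row has a misspelled label and an invalid priority.
sample = (
    "id,text,labels,priority\n"
    "1,Duplicate charge,Billing;RefundCancel,P1\n"
    "2,Login loop,Acount,P3\n"
)
print(validate_rows(sample))
```

A check like this only catches schema problems; whether a label or priority is actually right for a ticket still needs a human reviewer.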

Evaluate your prompt on the dataset

Prompt runner

First, we run the prompt on a single case to check that everything works.

Full prompt v1:

You categorize a short customer support ticket into (a) one or more labels and (b) a single priority.

Allowed labels (multi-label):
- Billing: charges, taxes (GST/VAT), invoices, plans, credits.
- Account: login/SSO, password reset, identity/email/account merges.
- ProductIssue: malfunction (crash, error code, won't load, data loss, loops, outages).
- HowTo: usage questions ("where/how do I…", "where to find…").
- Feature: new capability or improvement request.
- RefundCancel: cancel/terminate and/or refund requests.
- AbuseSpam: insults/profanity/spam (not mild frustration).

Priority (exactly one):
- P0 (High): blocked from core action or money/data at risk.
- P1 (Normal): degraded/needs timely help, not fully blocked.
- P2 (Low): minor/info/how-to/feature.

Return exactly in JSON:
{"labels":[<labels>], "priority":"P0"|"P1"|"P2"}

cd examples/iterate_prompt
export OPENAI_API_KEY=your_openai_api_key
uv run run_prompt.py

This runs the prompt on the sample case and prints the result.

Sample output

$ uv run run_prompt.py                      

Test ticket:
"SSO via Okta succeeds then bounces me back to /login with no session. Colleagues can sign in. I tried clearing cookies; same result. Error in devtools: state mismatch. I'm blocked from our boards."

Response:
{"labels":["Account","ProductIssue"], "priority":"P0"}
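run_prompt.py itself is not reproduced in this guide. As a rough sketch of the shape such a runner could take, the function below reads the prompt file, passes it together with the ticket text to a model call, and returns the raw reply. The `call_model` parameter and `fake_model` are hypothetical stand-ins for a real LLM client call so that the sketch runs offline; the actual script may be structured differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_prompt(ticket_text: str, prompt_file: str, call_model: Callable[[str, str], str]) -> str:
    """Load the system prompt from disk and return the model's raw reply."""
    system_prompt = Path(prompt_file).read_text(encoding="utf-8")
    return call_model(system_prompt, ticket_text)


def fake_model(system_prompt: str, user_text: str) -> str:
    """Hypothetical stand-in for an LLM call; always returns a fixed JSON reply."""
    return json.dumps({"labels": ["Account", "ProductIssue"], "priority": "P0"})


# Write a throwaway prompt file so the sketch does not depend on the repository.
with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "promptv1.txt"
    prompt_path.write_text("You categorize a short customer support ticket...", encoding="utf-8")
    reply = run_prompt("SSO via Okta bounces me back to /login.", str(prompt_path), fake_model)

parsed = json.loads(reply)
print(parsed["labels"], parsed["priority"])
```

Keeping the model call behind a plain function argument makes it easy to swap in a real client later, or a stub when testing the evaluation code.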

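Metrics for scoring

Simple metrics are generally preferable to complex ones. Use metrics that are relevant to your use case. See Core Concepts - Metrics for more information. Here we use two discrete metrics: labels_exact_match and priority_accuracy. Keeping them separate makes it easier to analyze and fix different failure modes.

priority_accuracy: checks whether the predicted priority matches the expected priority; important for correct urgency triage.
labels_exact_match: checks whether the predicted label set exactly matches the expected label set; important for avoiding over-tagging and under-tagging, and it measures how accurately the system labels cases.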

# examples/iterate_prompt/evals.py
import json
from ragas.metrics.discrete import discrete_metric
from ragas.metrics.result import MetricResult

@discrete_metric(name="labels_exact_match", allowed_values=["correct", "incorrect"])
def labels_exact_match(prediction: str, expected_labels: str):
    try:
        predicted = set(json.loads(prediction).get("labels", []))
        expected = set(expected_labels.split(";")) if expected_labels else set()
        return MetricResult(
            value="correct" if predicted == expected else "incorrect",
            reason=f"Expected={sorted(expected)}; Got={sorted(predicted)}",
        )
    except Exception as e:
        return MetricResult(value="incorrect", reason=f"Parse error: {e}")

@discrete_metric(name="priority_accuracy", allowed_values=["correct", "incorrect"])
def priority_accuracy(prediction: str, expected_priority: str):
    try:
        predicted = json.loads(prediction).get("priority")
        return MetricResult(
            value="correct" if predicted == expected_priority else "incorrect",
            reason=f"Expected={expected_priority}; Got={predicted}",
        )
    except Exception as e:
        return MetricResult(value="incorrect", reason=f"Parse error: {e}")
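To see what "exact match" means in practice, here is a plain-Python mirror of the comparison inside labels_exact_match, without the Ragas decorator. `labels_match` is a hypothetical name used only for this illustration.

```python
import json


def labels_match(prediction: str, expected_labels: str) -> str:
    """Plain-Python mirror of the comparison done inside labels_exact_match."""
    try:
        predicted = set(json.loads(prediction).get("labels", []))
    except Exception:
        return "incorrect"  # unparseable output counts as a miss
    expected = set(expected_labels.split(";")) if expected_labels else set()
    return "correct" if predicted == expected else "incorrect"


# The order of labels does not matter because sets are compared.
print(labels_match('{"labels": ["RefundCancel", "Billing"]}', "Billing;RefundCancel"))
# One extra label is enough to fail the exact match.
print(labels_match('{"labels": ["Account", "Billing"]}', "Account"))
# Output that is not valid JSON is scored as incorrect.
print(labels_match("Sure! The labels are Billing.", "Billing"))
```

Exact match is strict: a prediction with one correct and one extra label scores the same as a completely wrong one, which is why the per-row reason string is useful during error analysis.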

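Experiment function

The experiment function runs the prompt over the dataset. See Core Concepts - Experimentation for more information.

Note that we pass prompt_file as a parameter so that we can run experiments with different prompts. You can also pass other parameters (model, temperature, and so on) to the experiment function and try different configurations. Change only one parameter at a time, so that any change in the scores can be attributed to it.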

# examples/iterate_prompt/evals.py
import asyncio, json
from ragas import experiment
from run_prompt import run_prompt

@experiment()
async def support_triage_experiment(row, prompt_file: str, experiment_name: str):
    response = await asyncio.to_thread(run_prompt, row["text"], prompt_file=prompt_file)
    try:
        parsed = json.loads(response)
        predicted_labels = ";".join(parsed.get("labels", [])) or ""
        predicted_priority = parsed.get("priority")
    except Exception:
        predicted_labels, predicted_priority = "", None

    return {
        "id": row["id"],
        "text": row["text"],
        "response": response,
        "experiment_name": experiment_name,
        "expected_labels": row["labels"],
        "predicted_labels": predicted_labels,
        "expected_priority": row["priority"],
        "predicted_priority": predicted_priority,
        "labels_score": labels_exact_match.score(prediction=response, expected_labels=row["labels"]).value,
        "priority_score": priority_accuracy.score(prediction=response, expected_priority=row["priority"]).value,
    }
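The advice to change one parameter at a time can be made mechanical. The sketch below is independent of Ragas: it derives a list of experiment configurations from a baseline, each differing from the baseline in exactly one field. The model names and the temperature value are placeholders for illustration, not recommendations.

```python
BASELINE = {"prompt_file": "promptv1.txt", "model": "model-a", "temperature": 0.0}

# Candidate values per parameter (placeholders chosen for illustration only).
CANDIDATES = {
    "prompt_file": ["promptv2_fewshot.txt"],
    "model": ["model-b"],
    "temperature": [0.7],
}


def one_at_a_time(baseline: dict, candidates: dict) -> list[dict]:
    """Return the baseline plus variants that each differ in exactly one field."""
    configs = [dict(baseline)]
    for name, values in candidates.items():
        for value in values:
            if value != baseline[name]:
                configs.append({**baseline, name: value})
    return configs


for config in one_at_a_time(BASELINE, CANDIDATES):
    print(config)
```

Each configuration could then be passed to the experiment function as keyword arguments, provided the function accepts those parameters.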

Dataset loader (CSV)

The dataset loader reads the CSV file into a Ragas dataset object. See Core Concepts - Evaluation Dataset for more information.

# examples/iterate_prompt/evals.py
import os, pandas as pd
from ragas import Dataset

def load_dataset():
    current_dir = os.path.dirname(os.path.abspath(__file__))
    df = pd.read_csv(os.path.join(current_dir, "datasets", "support_triage.csv"))
    dataset = Dataset(name="support_triage", backend="local/csv", root_dir=".")
    for _, row in df.iterrows():
        dataset.append({
            "id": str(row["id"]),
            "text": row["text"],
            "labels": row["labels"],
            "priority": row["priority"],
        })
    return dataset

Run the experiment with the current prompt

uv run evals.py run --prompt_file promptv1.txt

This runs the given prompt over the dataset and saves the results to the experiments/ directory.

Sample output

$ uv run evals.py run --prompt_file promptv1.txt        

Loading dataset...
Dataset loaded with 20 samples
Running evaluation with prompt file: promptv1.txt
Running experiment: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:11<00:00,  1.79it/s]
✅ promptv1: 20 cases evaluated
Results saved to: experiments/20250826-041332-promptv1.csv
promptv1 Labels Accuracy: 80.00%
promptv1 Priority Accuracy: 75.00%

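Improve the prompt

Analyze errors in the results

Open experiments/{timestamp}-promptv1.csv in your preferred spreadsheet editor and analyze the results. Look for cases where labels_score or priority_score is incorrect.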
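If you prefer a script to a spreadsheet, the same filtering takes a few lines. This is a minimal sketch using only the standard library; the rows are toy data in a subset of the columns returned by support_triage_experiment, not real experiment output, and `failing_rows` is a hypothetical helper.

```python
import csv
import io

# Toy rows in the result-file column layout (not real results).
RESULTS_CSV = """id,expected_labels,predicted_labels,expected_priority,predicted_priority,labels_score,priority_score
a,Billing,Billing,P1,P0,correct,incorrect
b,Account,Account;Billing,P1,P1,incorrect,correct
c,HowTo,HowTo,P2,P2,correct,correct
"""


def failing_rows(csv_text: str) -> list[dict]:
    """Keep rows where either metric is marked incorrect."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows if "incorrect" in (r["labels_score"], r["priority_score"])]


for row in failing_rows(RESULTS_CSV):
    print(
        row["id"],
        row["expected_labels"], "->", row["predicted_labels"],
        "|",
        row["expected_priority"], "->", row["predicted_priority"],
    )
```

For a real run, read the text from the saved results file instead of the inline string.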

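From our promptv1 experiment, we can identify several error patterns.

Priority errors: over-prioritization (P1 → P0)

The model consistently assigns P0 (highest priority) to billing-related issues that should be P1.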

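Case	Issue	Expected	Got	Pattern
ID 19	Automatic charge after pausing the workspace	P1	P0	Billing dispute treated as urgent
ID 1	Duplicate charge on the same day	P1	P0	Billing dispute treated as urgent
ID 5	Cancellation with refund request	P1	P0	Routine cancellation treated as urgent
ID 13	Follow-up on a cancellation	P1	P0	Follow-up treated as urgent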

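Pattern: the model treats any billing, refund, or cancellation ticket as urgent (P0), while most of them are routine business operations (P1).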
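A pattern like this one is easier to spot when the mismatches are counted by direction. The sketch below tallies expected-to-predicted transitions; the first four pairs mirror the table above, and the last two are made-up correct cases added for contrast.

```python
from collections import Counter

# (expected, predicted) priority pairs; see the lead-in for where they come from.
PAIRS = [("P1", "P0"), ("P1", "P0"), ("P1", "P0"), ("P1", "P0"), ("P0", "P0"), ("P2", "P2")]


def priority_confusions(pairs):
    """Count each expected -> predicted transition where the two differ."""
    return Counter(f"{exp} -> {got}" for exp, got in pairs if exp != got)


print(priority_confusions(PAIRS).most_common())
```

When a single transition dominates the tally, as P1 -> P0 does here, the prompt fix can target that one boundary instead of rewriting all the priority rules.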

Label errors: over-tagging and confusion

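Case	Issue	Expected	Got	Pattern
ID 9	GST tax question from a US user	`Billing;HowTo`	`Billing;Account`	Confuses an informational question with an account action
ID 10	Account ownership transfer	`Account`	`Account;Billing`	Adds Billing when money or plans are mentioned
ID 20	API rate limit question	`ProductIssue;HowTo`	`ProductIssue;Billing;HowTo`	Adds Billing when plans are mentioned
ID 16	Feature request for offline mode	`Feature`	`Feature;HowTo`	Adds HowTo to a feature request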

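Patterns identified:

Over-tagging Billing: adds the "Billing" label even when the ticket is not primarily about billing
HowTo vs. Account confusion: misclassifies informational questions as account management actions
Over-tagging HowTo: adds the "HowTo" label to feature requests when the user asks "how" but actually means "can you build this"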
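The label patterns above can also be derived by counting which labels were added or dropped relative to the expected set. A minimal sketch, using the four expected/predicted pairs from the label error table; `tag_errors` is a hypothetical helper.

```python
from collections import Counter

# (expected, predicted) label strings copied from the label error table.
CASES = [
    ("Billing;HowTo", "Billing;Account"),
    ("Account", "Account;Billing"),
    ("ProductIssue;HowTo", "ProductIssue;Billing;HowTo"),
    ("Feature", "Feature;HowTo"),
]


def tag_errors(cases):
    """Return (extra, missing) counters of labels across all cases."""
    extra, missing = Counter(), Counter()
    for expected, predicted in cases:
        exp, got = set(expected.split(";")), set(predicted.split(";"))
        extra.update(got - exp)    # labels the model added
        missing.update(exp - got)  # labels the model dropped
    return extra, missing


extra, missing = tag_errors(CASES)
print("extra:", dict(extra))
print("missing:", dict(missing))
```

On these four cases, Billing is the only label added more than once, which matches the over-tagging pattern named first in the list.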

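Improve the prompt

Based on our error analysis, we create promptv2_fewshot.txt with targeted improvements. You can use an LLM to generate the prompt or edit it manually. In this example, we gave the error patterns and the original prompt to an LLM and had it generate a revised prompt with few-shot examples.

Key additions in promptv2_fewshot

1. Enhanced priority definitions focused on business impact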

- P0: Blocked from core functionality OR money/data at risk OR business operations halted
- P1: Degraded experience OR needs timely help BUT has workarounds OR not fully blocked  
- P2: Minor issues OR information requests OR feature requests OR non-urgent how-to

2. Conservative multi-label rules to prevent over-tagging

## Multi-label Guidelines
Use single label for PRIMARY issue unless both aspects are equally important:
- Billing + RefundCancel: Always co-label. Cancellation/refund requests must include Billing.  
- Account + ProductIssue: For auth/login malfunctions (loops, "invalid_token", state mismatch, bounce-backs)
- Avoid adding Billing to account-only administration unless there is an explicit billing operation

Avoid over-tagging: Focus on which department should handle this ticket first.

3. Detailed priority guidelines with specific scenarios

## Priority Guidelines  
- Ignore emotional tone - focus on business impact and available workarounds
- Billing disputes/adjustments (refunds, duplicate charges, incorrect taxes/pricing) = P1 unless causing an operational block
- Login workarounds: If Incognito/another account works, prefer P1; if cannot access at all, P0
- Core business functions failing (webhooks, API, sync) = P0

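4. Comprehensive examples with reasoning: seven examples covering different scenarios were added, each with explicit reasoning that demonstrates the correct classification. Two of them are shown here.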

## Examples with Reasoning

Input: "My colleague left and I need to change the team lead role to my email address."
Output: {"labels":["Account"], "priority":"P1"}
Reasoning: Administrative role change; avoid adding Billing unless a concrete billing action is requested.

Input: "Dashboard crashes when I click reports tab, but works fine in mobile app."
Output: {"labels":["ProductIssue"], "priority":"P1"}
Reasoning: Malfunction exists but workaround available (mobile app works); single label since primary issue is product malfunction.

Avoid adding examples taken directly from the dataset, since this can overfit the prompt to the dataset and cause it to fail on other inputs.
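A quick guard against accidental leakage is to check the few-shot inputs against the dataset texts before running the evaluation. The sketch below only catches near-verbatim copies (case and whitespace differences), not paraphrases; `normalize` and `leaked_examples` are hypothetical helpers, and the texts are illustrative.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a copy."""
    return " ".join(text.lower().split())


def leaked_examples(few_shot_inputs: list[str], dataset_texts: list[str]) -> list[str]:
    """Return few-shot inputs that also occur (after normalization) in the dataset."""
    seen = {normalize(t) for t in dataset_texts}
    return [ex for ex in few_shot_inputs if normalize(ex) in seen]


dataset_texts = ["Need to export a board to PDF for audit; deadline next week."]
few_shot_inputs = [
    "Dashboard crashes when I click reports tab, but works fine in mobile app.",
    "need to export a board to PDF  for audit; deadline next week.",
]
print(leaked_examples(few_shot_inputs, dataset_texts))
```

Any example this check reports should be rewritten or replaced before the prompt is evaluated, otherwise the score on that row says little about how the prompt generalizes.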

Evaluate the new prompt

After creating promptv2_fewshot.txt with the improvements, run the experiment with the new prompt:

uv run evals.py run --prompt_file promptv2_fewshot.txt

This evaluates the improved prompt on the same dataset and saves the results to a new timestamped file.

Sample output

$ uv run evals.py run --prompt_file promptv2_fewshot.txt

Loading dataset...
Dataset loaded with 20 samples
Running evaluation with prompt file: promptv2_fewshot.txt
Running experiment: 100%|██████████████████████████████████████████████████████████████| 20/20 [00:11<00:00,  1.75it/s]
✅ promptv2_fewshot: 20 cases evaluated
Results saved to: experiments/20250826-231414-promptv2_fewshot.csv
promptv2_fewshot Labels Accuracy: 90.00%
promptv2_fewshot Priority Accuracy: 95.00%

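The experiment creates a new CSV file in the experiments/ directory with the same structure as the first run, which allows a direct comparison.

Analyze and compare results

We created a simple utility function that takes multiple CSV files and merges them for easy comparison: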

uv run evals.py compare --inputs experiments/20250826-041332-promptv1.csv experiments/20250826-231414-promptv2_fewshot.csv

This prints the accuracy of each experiment and saves a combined CSV file in the experiments/ directory.

Sample output

$ uv run evals.py compare --inputs experiments/20250826-041332-promptv1.csv experiments/20250826-231414-promptv2_fewshot.csv 

promptv1 Labels Accuracy: 80.00%
promptv1 Priority Accuracy: 75.00%
promptv2_fewshot Labels Accuracy: 90.00%
promptv2_fewshot Priority Accuracy: 95.00%
Combined comparison saved to: experiments/20250826-231545-comparison.csv
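Aggregate accuracy can hide per-case movement, so it helps to also list which cases were fixed and which regressed between two runs. The following sketch is not the compare utility from the repository; `accuracy` and `changed_cases` are hypothetical helpers, and the scores are toy data keyed by ticket id.

```python
def accuracy(scores: list[str]) -> float:
    """Share of 'correct' entries, as a percentage."""
    return 100.0 * sum(s == "correct" for s in scores) / len(scores)


def changed_cases(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Split ids present in both runs into fixed and regressed cases."""
    fixed = [i for i in old if i in new and old[i] == "incorrect" and new[i] == "correct"]
    regressed = [i for i in old if i in new and old[i] == "correct" and new[i] == "incorrect"]
    return {"fixed": fixed, "regressed": regressed}


# Toy priority scores keyed by ticket id (illustrative, not the real experiment output).
v1 = {"a": "incorrect", "b": "correct", "c": "incorrect", "d": "correct"}
v2 = {"a": "correct", "b": "correct", "c": "incorrect", "d": "incorrect"}

print(f"v1: {accuracy(list(v1.values())):.2f}%  v2: {accuracy(list(v2.values())):.2f}%")
print(changed_cases(v1, v2))
```

In this toy data both runs score the same overall, yet one case was fixed and another regressed, which is exactly the situation a merged, per-row comparison is meant to expose.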

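Here we can see that promptv2_fewshot improved both labels accuracy (80% to 90%) and priority accuracy (75% to 95%). Some cases still fail, however; we can analyze those errors and improve the prompt further.

Stop iterating when improvements plateau or when accuracy meets your business requirements.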
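The stopping rule can be written down explicitly so that the decision does not drift from run to run. This is one possible formulation, assuming accuracy is tracked as a percentage per iteration; the `target`, `min_gain`, and `patience` values are placeholders to be set from your own requirements.

```python
def should_stop(history: list[float], target: float, min_gain: float = 1.0, patience: int = 2) -> bool:
    """Stop when the target is met, or when each of the last `patience` runs gained less than `min_gain` points."""
    if not history:
        return False
    if history[-1] >= target:
        return True
    if len(history) <= patience:
        return False  # not enough runs yet to call it a plateau
    recent_gains = [history[i] - history[i - 1] for i in range(len(history) - patience, len(history))]
    return all(gain < min_gain for gain in recent_gains)


# Still improving and below target: keep iterating.
print(should_stop([75.0, 95.0], target=98.0))
# Two consecutive runs with under one point of gain: plateau.
print(should_stop([75.0, 95.0, 95.0, 95.5], target=98.0))
# Target already met.
print(should_stop([75.0, 95.0], target=90.0))
```

With only 20 samples, one case is worth five percentage points, so small thresholds such as the default `min_gain` are more meaningful on a larger dataset.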

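If prompt improvements alone no longer raise accuracy, you can experiment with a better model.

Apply this loop to your use case

Create a dataset, metrics, and an experiment for your use case
Run the evaluation and analyze the errors
Improve the prompt based on the error analysis
Re-run the evaluation and compare the results
Stop when improvements plateau or accuracy meets your business requirements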

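Once your dataset and evaluation loop are set up, you can extend them to test more parameters, such as the model.

The Ragas framework automatically handles orchestration, parallel execution, and result aggregation, so that you can focus on your use case!

Advanced: Align the LLM judge

If you evaluate with LLM-based metrics, consider first aligning your judge with human expert judgments to make sure the evaluation is reliable. See How to Align an LLM as a Judge.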
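A simple first check of alignment is the raw agreement rate between the judge's verdicts and human verdicts on the same items. The sketch below is a generic illustration with made-up verdicts, not the procedure from the linked guide; `agreement_rate` is a hypothetical helper.

```python
def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the LLM judge and the human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


# Made-up verdicts for five items.
judge_verdicts = ["pass", "fail", "pass", "pass", "fail"]
human_verdicts = ["pass", "fail", "fail", "pass", "fail"]
print(agreement_rate(judge_verdicts, human_verdicts))
```

Raw agreement does not correct for chance, so on heavily imbalanced verdicts a chance-corrected statistic may be a better fit.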