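Migrating from v0.3 to v0.4

Ragas v0.4 makes a fundamental shift to an **experiment-based architecture**. This is the most significant change since v0.2: isolated metric evaluation gives way to a cohesive experimentation framework in which evaluation, analysis, and iteration are tightly integrated.

This architectural change brings several concrete improvements:

Collections-based metrics system - a standardized approach to metrics that works seamlessly inside experiments
Unified LLM factory system - simplified LLM initialization with universal provider support
Modern prompt system - function-based prompts that are more composable and reusable

This guide walks you through the key changes and provides step-by-step migration instructions.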

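Overview of Major Changes

The move to an experiment-based architecture centers on three core improvements:

Experiment-centric design - from one-off metric runs to structured experiment workflows with integrated analysis
Collections-based metrics - metrics designed to work inside experiments, returning structured results for better analysis and tracking
Enhanced LLM and prompt system - universal provider support and modern prompt patterns that make experimentation easier

Key Statistics

Metrics migrated: 20+ core metrics moved to the new collections system
Breaking changes: 7+ major API changes
Deprecations: legacy wrapper classes and old prompt definitions
New features: GPT-5/o-series support, automatic constraint handling, universal provider support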

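Understanding the Experiment-Based Architecture

Before migrating, it helps to understand the shift in mindset.

v0.3 (metric-centric)

Data → Individual Metric → Score → Analysis

Each metric run was relatively independent. You ran a metric, got back a float score, and handled tracking and analysis externally.

v0.4 (experiment-centric)

Data → Experiment → [Metrics Collection] → Structured Results → Integrated Analysis

Metrics now operate within an experiment context, where evaluation, analysis, and iteration are integrated. This enables:

Better tracking of metric results, with explanations
Easier comparison of results across experiment runs
Built-in support for analyzing metric behavior
A cleaner workflow for iterating on your system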
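To make the experiment-centric flow concrete, here is a minimal stand-alone sketch of the pattern: each row is scored by an async function that returns a structured record (score plus explanation), and the records for a run are kept together. It does not use Ragas at all; `RowResult`, `score_row`, `run_experiment`, and the toy word-overlap score are hypothetical names and logic chosen only for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class RowResult:
    """Structured per-row result (hypothetical stand-in for an experiment result model)."""
    row_id: int
    score: float
    reason: str


async def score_row(row: dict) -> RowResult:
    # Toy metric: fraction of response words that also appear in the context.
    response_words = row["response"].lower().split()
    context_words = set(row["context"].lower().split())
    hits = sum(1 for word in response_words if word in context_words)
    score = hits / len(response_words) if response_words else 0.0
    reason = f"{hits}/{len(response_words)} response words found in context"
    return RowResult(row_id=row["id"], score=score, reason=reason)


async def run_experiment(rows: List[dict]) -> List[RowResult]:
    # Score all rows concurrently; results come back in input order.
    return list(await asyncio.gather(*(score_row(row) for row in rows)))
```

Because every run yields the same record shape, two runs can be compared row by row (for example by `row_id`) without any external bookkeeping.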

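Migration Path

We recommend migrating in this order:

Update your evaluation approach (section: From Evaluation to Experiment) - switch from evaluate() to experiment()
Update your LLM setup (section: LLM Initialization)
Migrate metrics (section: Metrics Migration)
Migrate embeddings (section: Embeddings Migration)
Update prompts (section: Prompt System Migration) - only if you customize prompts
Update data schemas (section: Data Schema Changes)
Refactor custom metrics (section: Custom Metrics)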

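From Evaluation to Experiment

v0.4 replaces the evaluate() function with an experiment()-based approach to better support iterative evaluation workflows and structured result tracking.

What Changed

The key shift: from a **simple evaluation function** (evaluate()) that returns scores to an **experiment decorator** (@experiment()) that supports structured workflows with built-in tracking and versioning.

Before (v0.3)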

from ragas import evaluate
from ragas.metrics.collections import Faithfulness, AnswerRelevancy

# Setup
dataset = ...  # Your dataset
metrics = [Faithfulness(llm=llm), AnswerRelevancy(llm=llm)]

# Simple evaluation
result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=llm,
    embeddings=embeddings
)

print(result)  # Returns EvaluationResult with scores

After (v0.4)

from ragas import experiment
from ragas.metrics.collections import Faithfulness, AnswerRelevancy
from pydantic import BaseModel

# Define experiment result structure
class ExperimentResult(BaseModel):
    faithfulness: float
    answer_relevancy: float

# Create experiment function
@experiment(ExperimentResult)
async def run_evaluation(row):
    faithfulness = Faithfulness(llm=llm)
    answer_relevancy = AnswerRelevancy(llm=llm)

    faith_result = await faithfulness.ascore(
        response=row.response,
        retrieved_contexts=row.contexts
    )

    relevancy_result = await answer_relevancy.ascore(
        user_input=row.user_input,
        response=row.response
    )

    return ExperimentResult(
        faithfulness=faith_result.value,
        answer_relevancy=relevancy_result.value
    )

# Run experiment
exp_results = await run_evaluation(dataset)

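Benefits of `experiment()`

Structured results - define exactly what you want to track
Per-row control - customize evaluation for each sample if needed
Version tracking - optional git integration via version_experiment()
Iterative workflow - easy to modify and re-run experiments
Better integration - works seamlessly with modern metrics and datasets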

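LLM Initialization

What Changed

The v0.3 system required different factory functions depending on your use case:

instructor_llm_factory() for metrics that need instructor
llm_factory() for general LLM operations
Various wrapper classes for LangChain and LlamaIndex

v0.4 consolidates everything into a **single unified factory**: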

from ragas.llms import llm_factory

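This factory:

Returns InstructorBaseRagasLLM, which guarantees structured outputs
Automatically detects and configures provider-specific constraints
Supports GPT-5 and o-series models, setting temperature and top_p constraints automatically
Works with all major providers: OpenAI, Anthropic, Cohere, Google, Azure, Bedrock, and others

Before (v0.3)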

from ragas.llms import instructor_llm_factory, llm_factory
from openai import AsyncOpenAI

# For metrics that need instructor
llm = instructor_llm_factory("openai", model="gpt-4o-mini", client=AsyncOpenAI(api_key="..."))

# Or, the old way (not recommended, still supported in 0.3)
client = AsyncOpenAI(api_key="sk-...")
llm = llm_factory("openai", model="gpt-4o-mini", client=client)

After (v0.4)

from ragas.llms import llm_factory
from openai import AsyncOpenAI

# Single unified approach - works everywhere
client = AsyncOpenAI(api_key="sk-...")
llm = llm_factory("gpt-4o-mini", client=client)

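Key Differences

Aspect	v0.3	v0.4
Factory function	`instructor_llm_factory()` or `llm_factory()`	`llm_factory()`
Provider detection	Manual, via a provider string	Automatic, from the model name
Return type	`BaseRagasLLM` (various)	`InstructorBaseRagasLLM`
Constraint handling	Manual configuration	Automatic for GPT-5/o-series
Async client required	Yes	Yes

Migration Steps

Update your imports: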

# Remove this
from ragas.llms import instructor_llm_factory

# Use this instead
from ragas.llms import llm_factory

Replace factory calls:

# Old - v0.3
llm = instructor_llm_factory("openai", model="gpt-4o", client=client)

# New - v0.4
llm = llm_factory("gpt-4o", client=client)

Update calls for other providers (detection from the model name works automatically):

# OpenAI
llm = llm_factory("gpt-4o-mini", client=AsyncOpenAI(api_key="..."))

# Anthropic
llm = llm_factory("claude-3-sonnet-20240229", client=AsyncAnthropic(api_key="..."))

# Google
llm = llm_factory("gemini-2.0-flash", client=...)

LLM Wrapper Classes (Deprecated)

If you use the wrapper classes, note that they are now deprecated and will be removed in a future release:

# Deprecated - will be removed
from ragas.llms import LangchainLLMWrapper, LlamaIndexLLMWrapper

# Recommended - use llm_factory directly
from ragas.llms import llm_factory

Migration: replace wrapper-class initialization with direct llm_factory() calls. The factory now handles provider detection automatically.

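Metrics Migration

Why Metrics Changed

The move to an experiment-based architecture requires metrics that integrate better with experiment workflows:

Structured results: metrics now return MetricResult objects (score plus reasoning) instead of raw floats, which enables richer analysis and tracking within experiments
Keyword arguments: moving from sample objects to direct keyword arguments makes metrics easier to compose and integrate into experiment pipelines
Standardized input/output: collections-based metrics follow a consistent pattern, which makes it easier to build meta-analysis and experimentation features on top of them

Architecture Changes

The metrics system has been completely redesigned to support experiment workflows. The core differences are:

Base Class Changes

Aspect	v0.3	v0.4
Import	`from ragas.metrics import Metric`	`from ragas.metrics.collections import Metric`
Base class	`MetricWithLLM`, `SingleTurnMetric`	`BaseMetric` (from collections)
Scoring method	`async def single_turn_ascore(sample: SingleTurnSample)`	`async def ascore(**kwargs)`
Input type	`SingleTurnSample` object	Individual keyword arguments
Output type	`float` score	`MetricResult` (with `.value` and optional `.reason`)
LLM parameter	Required at initialization	Required at initialization

Scoring Workflow

v0.3 approach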

# 1. Create a sample object containing all data
sample = SingleTurnSample(
    user_input="What is AI?",
    response="AI is artificial intelligence...",
    retrieved_contexts=["Context 1", "Context 2"],
    ground_truths=["AI definition"]
)

# 2. Call metric with the sample
metric = Faithfulness(llm=llm)
score = await metric.single_turn_ascore(sample)  # Returns: 0.85

v0.4 approach

# 1. Call metric with individual arguments
metric = Faithfulness(llm=llm)
result = await metric.ascore(
    user_input="What is AI?",
    response="AI is artificial intelligence...",
    retrieved_contexts=["Context 1", "Context 2"]
)

# 2. Access result properties
print(result.value)      # Score: 0.85 (float)
print(result.reason)     # Optional explanation

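Metrics Available in v0.4

The following metrics have been migrated to the collections system in v0.4:

RAG Evaluation Metrics

Faithfulness - Is the response grounded in the retrieved contexts? (v0.3.9+)
AnswerRelevancy - Is the response relevant to the user query? (v0.3.9+)
AnswerCorrectness - Does the response match the reference answer? (v0.3.9+)
AnswerAccuracy - Is the answer factually accurate?
ContextPrecision - Are the retrieved contexts ranked by relevance? (v0.3.9+)
    With reference: ContextPrecisionWithReference
    Without reference: ContextPrecisionWithoutReference
    Legacy name: ContextUtilization (now a wrapper around ContextPrecisionWithoutReference)
ContextRecall - Were all relevant contexts successfully retrieved? (v0.3.9+)
ContextRelevance - What percentage of the retrieved contexts is relevant? (v0.3.9+)
ContextEntityRecall - Are the important entities from the reference present in the contexts? (v0.3.9+)
NoiseSensitivity - How robust is the metric to irrelevant contexts? (v0.3.9+)
ResponseGroundedness - Are all claims grounded in the retrieved contexts?

Text Comparison Metrics

SemanticSimilarity - Do the two texts have similar meaning? (v0.3.9+)
FactualCorrectness - Are the factual claims verified as correct? (v0.3.9+)
BleuScore - Bilingual Evaluation Understudy score (v0.3.9+)
RougeScore - Recall-Oriented Understudy for Gisting Evaluation score (v0.3.9+)

String-Based Metrics (non-LLM)

ExactMatch - exact string match
StringPresence - substring presence check
LevenshteinDistance - edit-distance similarity
MatchingSubstrings - number of matching substrings
NonLLMStringSimilarity - various string similarity algorithms

Summarization Metrics

SummaryScore - overall summary quality assessment (v0.3.9+)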

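Removed Metrics (no longer available)

AspectCritic - use the @discrete_metric() decorator instead
SimpleCriteria - use the @discrete_metric() decorator instead
AnswerSimilarity - use SemanticSimilarity instead

Agent and Tool Metrics (not yet migrated)

ToolCallAccuracy - still on the legacy architecture (migration pending)
ToolCallF1 - still on the legacy architecture (migration pending)
TopicAdherence - still on the legacy architecture (migration pending)
AgentGoalAccuracy - still on the legacy architecture (migration pending)

SQL Metrics (not yet migrated)

DataCompy Score - still on the legacy architecture (migration pending)
SQL Query Equivalence - still on the legacy architecture (migration pending)

General-Purpose and Rubric Metrics (not yet migrated)

Domain-Specific Rubrics - still on the legacy architecture (migration pending)
Instance-Specific Rubrics - still on the legacy architecture (migration pending)

Specialized Metrics (not yet migrated)

Multi-Modal Faithfulness - still on the legacy architecture (migration pending)
Multi-Modal Relevance - still on the legacy architecture (migration pending)
CHRF Score - still on the legacy architecture (migration pending)
Quoted Spans - still on the legacy architecture (migration pending)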

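Migration Status

About 43% of core metrics have been migrated to the collections system (16 of roughly 37 metrics).

The remaining metrics will be migrated in v0.4.x releases. You can still use legacy metrics through the old API, but they emit deprecation warnings.

Step-by-Step Migration

Step 1: Update imports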

# v0.3
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall
)

# v0.4
from ragas.metrics.collections import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall
)

Step 2: Initialize metrics (no change required)

# v0.3
metric = Faithfulness(llm=llm)

# v0.4 - Same initialization
metric = Faithfulness(llm=llm)

Step 3: Update metric scoring calls

Replace single_turn_ascore(sample) with ascore(**kwargs):

# v0.3
sample = SingleTurnSample(
    user_input="What is AI?",
    response="AI is artificial intelligence.",
    retrieved_contexts=["AI is a technology..."],
    ground_truths=["AI definition"]
)

score = await metric.single_turn_ascore(sample)
print(score)  # Output: 0.85

# v0.4
result = await metric.ascore(
    user_input="What is AI?",
    response="AI is artificial intelligence.",
    retrieved_contexts=["AI is a technology..."]
)

print(result.value)   # Output: 0.85
print(result.reason)  # Optional: "Response is faithful to context"
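When many call sites still build sample objects, a small helper can unpack them into keyword arguments during the transition. The sketch below is a stand-alone illustration: `sample_to_kwargs` and `SAMPLE_FIELDS` are hypothetical names, the field names are the ones used in this guide's examples, and `SimpleNamespace` stands in for a real sample object. Whether a given metric accepts every one of these keywords is an assumption to check per metric; pass a narrower `fields` tuple if it does not.

```python
from types import SimpleNamespace

# Field names taken from the examples in this guide.
SAMPLE_FIELDS = ("user_input", "response", "retrieved_contexts", "reference")


def sample_to_kwargs(sample, fields=SAMPLE_FIELDS) -> dict:
    """Collect the non-empty fields of a sample-like object as keyword arguments."""
    kwargs = {}
    for name in fields:
        value = getattr(sample, name, None)
        if value is not None:
            kwargs[name] = value
    return kwargs


# Stand-in for a real sample object, for illustration only.
sample = SimpleNamespace(
    user_input="What is AI?",
    response="AI is artificial intelligence.",
    retrieved_contexts=["AI is a technology..."],
    reference=None,
)
kwargs = sample_to_kwargs(sample)
# The call site would then become: result = await metric.ascore(**kwargs)
```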

Step 4: Handle MetricResult objects

In v0.4, metrics return MetricResult objects instead of raw floats:

from ragas.metrics.collections.base import MetricResult

result = await metric.ascore(...)

# Access the score
score_value = result.value  # float between 0 and 1

# Access the explanation (if available)
if result.reason:
    print(f"Reason: {result.reason}")

# Convert to float for compatibility
score_float = float(result.value)
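While a codebase mixes legacy metrics (raw floats) and collections metrics (result objects), downstream code can normalize both shapes in one place. This is a minimal sketch under that assumption; `ResultLike` is a hypothetical stand-in that only mirrors the `.value` and `.reason` attributes described above, and `as_float` is an illustrative helper, not part of Ragas.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ResultLike:
    """Hypothetical stand-in exposing the two attributes this guide describes."""
    value: float
    reason: Optional[str] = None


def as_float(result: Union[float, int, ResultLike]) -> float:
    # New-style results expose `.value`; old-style scores are already numbers.
    if hasattr(result, "value"):
        return float(result.value)
    return float(result)
```

Aggregation code such as averaging can then call `as_float()` on every score without caring which API produced it.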

Metric-Specific Migrations

Faithfulness

Before (v0.3)

sample = SingleTurnSample(
    user_input="What is machine learning?",
    response="ML is a subset of AI.",
    retrieved_contexts=["ML involves algorithms..."]
)
score = await metric.single_turn_ascore(sample)

After (v0.4)

result = await metric.ascore(
    user_input="What is machine learning?",
    response="ML is a subset of AI.",
    retrieved_contexts=["ML involves algorithms..."]
)
score = result.value

AnswerRelevancy

Before (v0.3)

sample = SingleTurnSample(
    user_input="What is Python?",
    response="Python is a programming language..."
)
score = await metric.single_turn_ascore(sample)

After (v0.4)

result = await metric.ascore(
    user_input="What is Python?",
    response="Python is a programming language..."
)
score = result.value

AnswerCorrectness

Note: this metric now uses reference instead of ground_truths.

Before (v0.3)

sample = SingleTurnSample(
    user_input="What is AI?",
    response="AI is artificial intelligence.",
    ground_truths=["AI is artificial intelligence and machine learning."]
)
score = await metric.single_turn_ascore(sample)

After (v0.4)

result = await metric.ascore(
    user_input="What is AI?",
    response="AI is artificial intelligence.",
    reference="AI is artificial intelligence and machine learning."
)
score = result.value

ContextPrecision

Before (v0.3)

sample = SingleTurnSample(
    user_input="What is RAG?",
    response="RAG improves LLM accuracy.",
    retrieved_contexts=["RAG = Retrieval Augmented Generation...", "..."],
    ground_truths=["RAG definition"]
)
score = await metric.single_turn_ascore(sample)

After (v0.4)

result = await metric.ascore(
    user_input="What is RAG?",
    response="RAG improves LLM accuracy.",
    retrieved_contexts=["RAG = Retrieval Augmented Generation...", "..."],
    reference="RAG definition"
)
score = result.value

Prompt System Migration

Why Prompts Changed

The shift to a modular architecture makes prompts **first-class components** that are:

Customizable per metric - each metric has a well-defined prompt interface
Type-safe - input/output models define the exact structure expected
Reusable - prompt classes follow a consistent pattern across metrics
Testable - prompts can be generated and inspected independently

v0.3 used simple string-based or dataclass-based prompts scattered across metrics. v0.4 consolidates them into a unified BasePrompt architecture with dedicated input/output models.

Architecture Changes

Base Prompt System

Aspect	v0.3	v0.4
Prompt definition	`PydanticPrompt` dataclass or string	`BasePrompt` class with a `to_string()` method
Input/output types	Generic Pydantic models	Metric-specific input/output models
Access	Scattered across metric code	Centralized in the metric's `util.py` module
Customization	Difficult; requires deep changes	Simple subclassing via the `instruction` and `examples` attributes
Organization	Mixed into metric files	Organized in separate `util.py` files

Metric Prompts Available in v0.4

The following metrics now have well-defined, customizable prompts:

Faithfulness - FaithfulnessPrompt, FaithfulnessInput, FaithfulnessOutput
Context Recall - ContextRecallPrompt, ContextRecallInput, ContextRecallOutput
Context Precision - ContextPrecisionPrompt, ContextPrecisionInput, ContextPrecisionOutput
Answer Relevancy - AnswerRelevancyPrompt, AnswerRelevancyInput, AnswerRelevancyOutput
Answer Correctness - AnswerCorrectnessPrompt, AnswerCorrectnessInput, AnswerCorrectnessOutput
Response Groundedness - ResponseGroundednessPrompt, ResponseGroundednessInput, ResponseGroundednessOutput
Answer Accuracy - AnswerAccuracyPrompt, AnswerAccuracyInput, AnswerAccuracyOutput
Context Relevance - ContextRelevancePrompt, ContextRelevanceInput, ContextRelevanceOutput
Context Entity Recall - ContextEntityRecallPrompt, ContextEntityRecallInput, ContextEntityRecallOutput
Factual Correctness - ClaimDecompositionPrompt, VerificationPrompt, and related input/output models
Noise Sensitivity - NoiseAugmentationPrompt and related models
Summary Score - SummaryScorePrompt, SummaryScoreInput, SummaryScoreOutput

Step-by-Step Migration

Step 1: Access the prompt on a metric

from ragas.metrics.collections import Faithfulness
from ragas.llms import llm_factory

# Create metric instance
metric = Faithfulness(llm=llm)

# Access the prompt object
print(metric.prompt)  # <ragas.metrics.collections.faithfulness.util.FaithfulnessPrompt>

Step 2: View the prompt string

from ragas.metrics.collections.faithfulness.util import FaithfulnessInput

# Create sample input
sample_input = FaithfulnessInput(
    response="The Eiffel Tower is in Paris.",
    context="The Eiffel Tower is located in Paris, France."
)

# Generate prompt string
prompt_string = metric.prompt.to_string(sample_input)
print(prompt_string)

Step 3: Customize the prompt (if needed)

Option A: Subclass the default prompt

from ragas.metrics.collections import Faithfulness
from ragas.metrics.collections.faithfulness.util import FaithfulnessPrompt

# Create custom prompt by subclassing
class CustomFaithfulnessPrompt(FaithfulnessPrompt):
    @property
    def instruction(self):
        return """Your custom instruction here."""

# Apply to metric
metric = Faithfulness(llm=llm)
metric.prompt = CustomFaithfulnessPrompt()

Option B: Customize examples for domain-specific evaluation

from ragas.metrics.collections.faithfulness.util import (
    FaithfulnessInput,
    FaithfulnessOutput,
    FaithfulnessPrompt,
    StatementFaithfulnessAnswer,
)

class DomainSpecificPrompt(FaithfulnessPrompt):
    examples = [
        (
            FaithfulnessInput(
                response="ML uses statistical techniques.",
                context="Machine learning is a field that uses algorithms to learn from data.",
            ),
            FaithfulnessOutput(
                statements=[
                    StatementFaithfulnessAnswer(
                        statement="ML uses statistical techniques.",
                        reason="Related to learning from data, but context doesn't explicitly mention statistical techniques.",
                        verdict=0
                    ),
                ]
            ),
        ),
    ]

# Apply custom prompt
metric = Faithfulness(llm=llm)
metric.prompt = DomainSpecificPrompt()

Common Prompt Customizations

Changing the instruction

Most metrics let you override the instruction property:

class StrictFaithfulnessPrompt(FaithfulnessPrompt):
    @property
    def instruction(self):
        return """Be very strict when judging faithfulness.
Only mark statements as faithful (verdict=1) if they are directly stated or strongly implied."""

Adding domain examples

Domain-specific examples can noticeably improve metric accuracy (an improvement of 10-20%):

class MedicalFaithfulnessPrompt(FaithfulnessPrompt):
    examples = [
        # Medical domain examples here
    ]

Changing the output format

For advanced customization, subclass the prompt and override the to_string() method:

class CustomPrompt(FaithfulnessPrompt):
    def to_string(self, input: FaithfulnessInput) -> str:
        # Custom prompt generation logic
        return "..."

Validating custom prompts

Always validate a custom prompt before using it:

# Test prompt generation
sample_input = FaithfulnessInput(
    response="Test response.",
    context="Test context."
)

custom_metric = Faithfulness(llm=llm)
custom_metric.prompt = MyCustomPrompt()

# View the generated prompt
prompt_string = custom_metric.prompt.to_string(sample_input)
print(prompt_string)

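# Then use it for evaluation. ascore() takes the metric's keyword arguments
# (as in the Faithfulness examples above), not the prompt input model's fields.
result = await custom_metric.ascore(
    user_input="Test question.",
    response="Test response.",
    retrieved_contexts=["Test context."]
)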

Migrating from v0.3 custom prompts

If you have custom prompts built on PydanticPrompt in v0.3:

Before (v0.3) - dataclass approach

from ragas.prompt.pydantic_prompt import PydanticPrompt
from pydantic import BaseModel

class MyInput(BaseModel):
    response: str
    context: str

class MyOutput(BaseModel):
    is_faithful: bool

class MyPrompt(PydanticPrompt[MyInput, MyOutput]):
    instruction = "Check if response is faithful to context"
    input_model = MyInput
    output_model = MyOutput
    examples = [...]

After (v0.4) - BasePrompt approach

from ragas.metrics.collections.base import BasePrompt
from pydantic import BaseModel

class MyInput(BaseModel):
    response: str
    context: str

class MyOutput(BaseModel):
    is_faithful: bool

class MyPrompt(BasePrompt):
    @property
    def instruction(self):
        return "Check if response is faithful to context"

    @property
    def input_model(self):
        return MyInput

    @property
    def output_model(self):
        return MyOutput

    @property
    def examples(self):
        return [...]

    def to_string(self, input: MyInput) -> str:
        # Generate prompt string from input
        return f"Check if this is faithful: {input.response}"

Language adaptation with BasePrompt.adapt()

v0.4 introduces an adapt() method on BasePrompt instances for language translation, replacing the deprecated PromptMixin.adapt_prompts() method.

Before (v0.3) - PromptMixin approach

from ragas.prompt.mixin import PromptMixin
from ragas.metrics import Faithfulness

# Metrics inherited from PromptMixin to use adapt_prompts
class MyFaithfulness(Faithfulness, PromptMixin):
    pass

metric = MyFaithfulness(llm=llm)

# Adapt ALL prompts to another language
adapted_prompts = await metric.adapt_prompts(
    language="spanish",
    llm=llm,
    adapt_instruction=True
)

# Apply all adapted prompts
metric.set_prompts(**adapted_prompts)

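Problems with the v0.3 approach:

Required mixin inheritance (tight coupling)
All prompts were adapted together (inflexible)
Mixin methods were scattered across the codebase

After (v0.4) - BasePrompt.adapt() approach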

from ragas.metrics.collections import Faithfulness

# Create metric with default prompt
metric = Faithfulness(llm=llm)

# Adapt individual prompt to another language
adapted_prompt = await metric.prompt.adapt(
    target_language="spanish",
    llm=llm,
    adapt_instruction=True
)

# Apply adapted prompt
metric.prompt = adapted_prompt

# Use metric with adapted language
result = await metric.ascore(
    response="...",
    retrieved_contexts=[...]
)

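Saving and loading prompts through BasePrompt will arrive in a future v0.4.x release. For now, only PromptMixin supports it.

Language adaptation examples

Adapting without instruction text (lightweight)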

from ragas.metrics.collections import AnswerRelevancy

metric = AnswerRelevancy(llm=llm)

# Only update language field, keep instruction in English
adapted_prompt = await metric.prompt.adapt(
    target_language="french",
    llm=llm,
    adapt_instruction=False  # Default - just updates language
)

metric.prompt = adapted_prompt
print(metric.prompt.language)  # "french"

Adapting with instruction translation (full translation)

# Translate both instruction and examples
adapted_prompt = await metric.prompt.adapt(
    target_language="german",
    llm=llm,
    adapt_instruction=True  # Translate instruction text too
)

metric.prompt = adapted_prompt

# Examples are also automatically translated
# Both instruction and examples in German now

Adapting a custom prompt

from ragas.metrics.collections.faithfulness.util import FaithfulnessPrompt

class CustomFaithfulnessPrompt(FaithfulnessPrompt):
    @property
    def instruction(self):
        return "Custom instruction in English"

prompt = CustomFaithfulnessPrompt(language="english")

# Adapt to Italian
adapted = await prompt.adapt(
    target_language="italian",
    llm=llm,
    adapt_instruction=True
)

# Check language was updated
assert adapted.language == "italian"

Migrating from v0.3 to v0.4

Step 1: Remove PromptMixin inheritance

# v0.3
from ragas.prompt.mixin import PromptMixin
from ragas.metrics import Faithfulness

class MyMetric(Faithfulness, PromptMixin):  # ← Remove PromptMixin
    pass

# v0.4
from ragas.metrics.collections import Faithfulness

# No mixin needed - just use the metric directly
metric = Faithfulness(llm=llm)

Step 2: Replace adapt_prompts() with adapt()

# v0.3
adapted_prompts = await metric.adapt_prompts(
    language="spanish",
    llm=llm,
    adapt_instruction=True
)
metric.set_prompts(**adapted_prompts)

# v0.4
adapted_prompt = await metric.prompt.adapt(
    target_language="spanish",
    llm=llm,
    adapt_instruction=True
)
metric.prompt = adapted_prompt

Complete migration example

Before (v0.3)

from ragas.prompt.mixin import PromptMixin
from ragas.metrics import Faithfulness, AnswerRelevancy

class MyMetrics(Faithfulness, AnswerRelevancy, PromptMixin):
    pass

# Setup
metrics = MyMetrics(llm=llm)

# Adapt multiple metrics to Spanish
adapted = await metrics.adapt_prompts(
    language="spanish",
    llm=best_llm,
    adapt_instruction=True
)

metrics.set_prompts(**adapted)
metrics.save_prompts("./spanish_prompts")

After (v0.4)

from ragas.metrics.collections import Faithfulness, AnswerRelevancy

# Setup individual metrics
faith_metric = Faithfulness(llm=llm)
answer_metric = AnswerRelevancy(llm=llm)

# Adapt each metric's prompt independently
faith_adapted = await faith_metric.prompt.adapt(
    target_language="spanish",
    llm=best_llm,
    adapt_instruction=True
)
faith_metric.prompt = faith_adapted

answer_adapted = await answer_metric.prompt.adapt(
    target_language="spanish",
    llm=best_llm,
    adapt_instruction=True
)
answer_metric.prompt = answer_adapted

# Use metrics with adapted prompts
faith_result = await faith_metric.ascore(...)
answer_result = await answer_metric.ascore(...)
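With more than a couple of metrics, the per-metric adapt-and-assign steps can be folded into a loop. The sketch below assumes only what the examples above show: each metric has a `prompt` attribute whose async `adapt()` accepts `target_language`, `llm`, and `adapt_instruction` and returns a new prompt. `adapt_all`, `StubPrompt`, and `StubMetric` are hypothetical names; the stubs exist only so the sketch runs without Ragas or an LLM.

```python
import asyncio


async def adapt_all(metrics, target_language, llm, adapt_instruction=True):
    """Adapt each metric's prompt independently, then assign the results back."""
    adapted = await asyncio.gather(
        *(
            metric.prompt.adapt(
                target_language=target_language,
                llm=llm,
                adapt_instruction=adapt_instruction,
            )
            for metric in metrics
        )
    )
    for metric, prompt in zip(metrics, adapted):
        metric.prompt = prompt
    return metrics


class StubPrompt:
    """Hypothetical prompt stub: adapt() just records the new language."""

    def __init__(self, language="english"):
        self.language = language

    async def adapt(self, target_language, llm, adapt_instruction=False):
        return StubPrompt(language=target_language)


class StubMetric:
    """Hypothetical metric stub holding a prompt."""

    def __init__(self):
        self.prompt = StubPrompt()
```

Each prompt is still adapted on its own, so a single metric can be left in its original language simply by not passing it in.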

Data Schema Changes

SingleTurnSample updates

The SingleTurnSample schema has been updated with breaking changes.

`ground_truths` → `reference`

The ground_truths parameter has been renamed to reference everywhere.

Before (v0.3)

sample = SingleTurnSample(
    user_input="...",
    response="...",
    ground_truths=["correct answer"]  # List of strings
)

After (v0.4)

sample = SingleTurnSample(
    user_input="...",
    response="...",
    reference="correct answer"  # Single string
)

v0.3 used ground_truths as a **list**
v0.4 uses reference as a **single string**
For multiple references, use separate evaluation runs
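For data held as plain dictionaries, the rename and the list-to-string change can be handled in one pass. The following is a minimal sketch, not a Ragas utility: `to_v04_rows` is a hypothetical helper that emits one row per reference, which is one way to prepare the separate evaluation runs mentioned above.

```python
from typing import List


def to_v04_rows(row: dict) -> List[dict]:
    """Convert one v0.3-style row into one or more v0.4-style rows."""
    # Copy every field except the old list-valued key.
    new_row = {key: value for key, value in row.items() if key != "ground_truths"}
    references = row.get("ground_truths") or []
    if not references:
        return [new_row]
    # One output row per reference string.
    return [{**new_row, "reference": reference} for reference in references]
```

A row with a single ground truth maps to exactly one new row, so the common case stays one-to-one.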

Updated schema

from ragas import SingleTurnSample

# v0.4 complete sample
sample = SingleTurnSample(
    user_input="What is AI?",                      # Required
    response="AI is artificial intelligence.",     # Required
    retrieved_contexts=["Context 1", "Context 2"], # Optional
    reference="Correct definition of AI"           # Optional (was ground_truths)
)

EvaluationDataset updates

If you use EvaluationDataset, update how you load your data:

Before (v0.3)

dataset = EvaluationDataset(
    samples=[
        SingleTurnSample(
            user_input="Q1",
            response="A1",
            ground_truths=["correct"]
        )
    ]
)

After (v0.4)

dataset = EvaluationDataset(
    samples=[
        SingleTurnSample(
            user_input="Q1",
            response="A1",
            reference="correct"
        )
    ]
)

If you load from CSV/JSON, update your data files:

Before (v0.3) CSV format

user_input,response,retrieved_contexts,ground_truths
"Q1","A1","[""ctx1""]","[""correct""]"

After (v0.4) CSV format

user_input,response,retrieved_contexts,reference
"Q1","A1","[""ctx1""]","correct"

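The CSV change above can be scripted with the standard library. This is a sketch under two assumptions taken from the sample rows: the old ground_truths cell holds a JSON-encoded list, and each list contains exactly one string (rows with several references need a decision about how to split them, so the sketch raises instead of guessing). `convert_csv` is a hypothetical helper name.

```python
import csv
import io
import json


def convert_csv(old_text: str) -> str:
    """Rewrite a CSV with a ground_truths column as one with a reference column."""
    reader = csv.DictReader(io.StringIO(old_text))
    fieldnames = [
        "reference" if name == "ground_truths" else name
        for name in (reader.fieldnames or [])
    ]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        truths = json.loads(row.pop("ground_truths"))
        if len(truths) != 1:
            raise ValueError(f"expected exactly one ground truth, got {len(truths)}")
        row["reference"] = truths[0]
        writer.writerow(row)
    return out.getvalue()
```

Other columns, including the JSON-encoded retrieved_contexts cell, are passed through unchanged.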
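Custom Metrics

For metrics on the collections-based architecture

If you have already written custom metrics that extend BaseMetric from collections, only minimal changes are needed: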

from ragas.metrics.collections.base import BaseMetric, MetricResult
from pydantic import BaseModel

class MyCustomMetric(BaseMetric):
    name: str = "my_metric"
    dimensions: list[str] = ["my_dimension"]

    async def ascore(self, **kwargs) -> MetricResult:
        # Your metric logic
        score = 0.85
        reason = "Explanation of the score"
        return MetricResult(value=score, reason=reason)

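Key considerations

Extend BaseMetric, not the legacy MetricWithLLM
Implement async def ascore(**kwargs) instead of single_turn_ascore(sample)
Return a MetricResult object, not a raw float
Use keyword arguments instead of SingleTurnSample

For metrics on the legacy architecture

If you have custom metrics that extend SingleTurnMetric or MetricWithLLM: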

# v0.3 - Legacy approach
from ragas.metrics.base import MetricWithLLM

class MyMetric(MetricWithLLM):
    async def single_turn_ascore(self, sample: SingleTurnSample) -> float:
        # Extract values from sample
        user_input = sample.user_input
        response = sample.response
        contexts = sample.retrieved_contexts or []

        # Your logic
        return 0.85

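Migration path

Extend BaseMetric from collections instead
Change the method signature to use keyword arguments
Return MetricResult instead of a float
Add a dimensions attribute if one does not exist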

# v0.4 - Collections approach
from ragas.metrics.collections.base import BaseMetric, MetricResult

class MyMetric(BaseMetric):
    name: str = "my_metric"
    dimensions: list[str] = ["quality"]

    async def ascore(self,
                    user_input: str,
                    response: str,
                    retrieved_contexts: list[str] | None = None,
                    **kwargs) -> MetricResult:
        # Use keyword arguments directly
        contexts = retrieved_contexts or []

        # Your logic
        score = 0.85
        return MetricResult(value=score, reason="Optional explanation")

Prompt system updates

v0.3 - dataclass-based prompts

from ragas.prompt.pydantic_prompt import PydanticPrompt
from pydantic import BaseModel

class Input(BaseModel):
    query: str
    document: str

class Output(BaseModel):
    is_relevant: bool

class RelevancePrompt(PydanticPrompt[Input, Output]):
    instruction = "Is the document relevant to the query?"
    input_model = Input
    output_model = Output
    examples = [...]

v0.4 - function-based prompts

The new approach uses plain functions:

def relevance_prompt(query: str, document: str) -> str:
    return f"""Determine if the document is relevant to the query.

Query: {query}
Document: {document}

Respond with YES or NO."""

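Benefits

Simpler and more composable
No boilerplate class definitions
Easier to test and modify
Native Python type hints

Migration

Identify where prompts are defined in your custom metrics
Convert the dataclass definitions to functions
Update the metrics to use the functions directly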
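As one illustration of the composability claimed for function-based prompts, few-shot examples can be rendered by a second function and slotted in. This is a sketch only: the `Example` tuple layout, `format_examples`, and the optional `examples` parameter are hypothetical choices, not something Ragas prescribes.

```python
from typing import Optional, Sequence, Tuple

# (query, document, expected answer) - a hypothetical example layout.
Example = Tuple[str, str, str]


def format_examples(examples: Sequence[Example]) -> str:
    # Render few-shot examples as plain text blocks.
    return "\n\n".join(
        f"Query: {query}\nDocument: {document}\nAnswer: {answer}"
        for query, document, answer in examples
    )


def relevance_prompt(
    query: str, document: str, examples: Optional[Sequence[Example]] = None
) -> str:
    # Assemble instruction, optional examples, and the actual question.
    parts = ["Determine if the document is relevant to the query."]
    if examples:
        parts.append(format_examples(examples))
    parts.append(f"Query: {query}\nDocument: {document}\n\nRespond with YES or NO.")
    return "\n\n".join(parts)
```

Because the prompt is an ordinary function returning a string, it can be checked with plain assertions before any LLM call is made.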

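Removed Features

The following features have been removed entirely from v0.4 and will raise errors if used.

Functions

instructor_llm_factory() - removed entirely

Merged into: the llm_factory() function
Migration: replace every call to instructor_llm_factory() with llm_factory()
Impact: a direct breaking change with no fallback

Before (v0.3) - no longer works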

llm = instructor_llm_factory("openai", model="gpt-4o", client=client)

After (v0.4) - use this instead

llm = llm_factory("gpt-4o", client=client)
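For codebases with many such calls, the rewrite can be partly mechanized. The sketch below is a rough regex-based pass that assumes calls follow exactly the shape shown in this guide (a quoted provider string first, then `model="..."`); anything else is left untouched, and import lines still need to be updated by hand. `rewrite_factory_calls` and `remaining_uses` are hypothetical helper names.

```python
import re

# Matches: instructor_llm_factory("provider", model="name",
_CALL = re.compile(
    r'instructor_llm_factory\(\s*"[^"]*"\s*,\s*model\s*=\s*("[^"]*")\s*,'
)


def rewrite_factory_calls(source: str) -> str:
    """Rewrite instructor_llm_factory("p", model="m", ...) to llm_factory("m", ...)."""
    return _CALL.sub(r"llm_factory(\1,", source)


def remaining_uses(source: str) -> int:
    """Count leftover mentions (imports, unusual call shapes) that need manual review."""
    return source.count("instructor_llm_factory")
```

Running `remaining_uses()` on the rewritten source shows how much manual cleanup is left.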

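Evaluation Metrics

Three metrics have been removed from the collections API entirely and are no longer available. AspectCritic and SimpleCriteria have no drop-in replacement class and are rebuilt with the @discrete_metric() decorator; AnswerSimilarity is replaced by SemanticSimilarity.

1. AspectCritic - removed

Reason: superseded by the more flexible discrete-metric pattern
Alternative: use the @discrete_metric() decorator for custom aspect evaluation

Usage: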

# Instead of AspectCritic, use:
from ragas.metrics import discrete_metric

@discrete_metric(name="aspect_critic", allowed_values=["positive", "negative", "neutral"])
def evaluate_aspect(response: str, aspect: str) -> str:
    # Your evaluation logic
    return "positive"

2. SimpleCriteria - removed

Reason: superseded by the more flexible discrete-metric pattern
Alternative: use the @discrete_metric() decorator for custom criteria evaluation

Usage:

from ragas.metrics import discrete_metric

@discrete_metric(name="custom_criteria", allowed_values=["pass", "fail"])
def evaluate_criteria(response: str, criteria: str) -> str:
    return "pass" if criteria in response else "fail"

3. AnswerSimilarity - removed (redundant)

Reason: its functionality is fully covered by SemanticSimilarity
Direct replacement: SemanticSimilarity

Usage:

# v0.3 - No longer available
from ragas.metrics import AnswerSimilarity  # ERROR

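# v0.4 - Use this instead (semantic similarity is embedding-based,
# so the metric is initialized with an embeddings object rather than an LLM)
from ragas.metrics.collections import SemanticSimilarity
metric = SemanticSimilarity(embeddings=embeddings)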
result = await metric.ascore(
    reference="Expected answer",
    response="Actual answer"
)

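Deprecated Methods (removed in v0.4)

Metric.ascore() and Metric.score() - removed

Timeline: marked for removal in v0.3, removed in v0.4
Reason: replaced by the collections-based ascore(**kwargs) pattern
Migration: use collections metrics instead

Legacy sample-based methods - removed

single_turn_ascore(sample: SingleTurnSample) - available only on legacy metrics
Replacement: collections metrics with ascore(**kwargs)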

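Deprecated Features

These features still work but emit deprecation warnings. They will be removed in a **future release**.

evaluate() function - deprecated

Status: still works, but discouraged
Reason: replaced by the @experiment() decorator for better-structured workflows
Migration: see the From Evaluation to Experiment section

Before (v0.3) - deprecated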

from ragas import evaluate

result = evaluate(dataset=dataset, metrics=metrics, llm=llm, embeddings=embeddings)

After (v0.4) - recommended

from ragas import experiment
from pydantic import BaseModel

class Results(BaseModel):
    score: float

@experiment(Results)
async def run(row):
    result = await metric.ascore(**row.dict())
    return Results(score=result.value)

result = await run(dataset)

LLM wrapper classes

LangchainLLMWrapper - deprecated

Status: still works, but discouraged

Deprecation warning:

Direct usage of LangChain LLMs with Ragas prompts is deprecated and will be
removed in a future version. Use Ragas LLM interfaces instead

Migration: use llm_factory() with a native client instead

Before (v0.3) - deprecated

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

langchain_llm = ChatOpenAI(model="gpt-4o")
ragas_llm = LangchainLLMWrapper(langchain_llm)

After (v0.4) - recommended

from ragas.llms import llm_factory
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="...")
ragas_llm = llm_factory("gpt-4o", client=client)

LlamaIndexLLMWrapper - deprecated

Status: still works, but discouraged
Warning: similar to the one shown for LangchainLLMWrapper
Migration: use llm_factory() with a native client

Before (v0.3) - deprecated

from ragas.llms import LlamaIndexLLMWrapper
from llama_index.llms.openai import OpenAI

llamaindex_llm = OpenAI(model="gpt-4o")
ragas_llm = LlamaIndexLLMWrapper(llamaindex_llm)

After (v0.4) - recommended

from ragas.llms import llm_factory
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="...")
ragas_llm = llm_factory("gpt-4o", client=client)

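Embeddings Migration

LangchainEmbeddingsWrapper and LlamaIndexEmbeddingsWrapper - deprecated

Status: still work, but emit deprecation warnings
Reason: replaced by native embedding providers that integrate directly with the client
Migration: follow the provider-specific examples in this section

v0.4 replaces the wrapper classes with **native embedding providers** that integrate directly with client libraries instead of going through LangChain wrappers.

What Changed

Aspect	v0.3	v0.4
Classes	`LangchainEmbeddingsWrapper`, `LlamaIndexEmbeddingsWrapper`	`OpenAIEmbeddings`, `GoogleEmbeddings`, `HuggingFaceEmbeddings`
Client	LangChain/LlamaIndex wrappers	Native clients (OpenAI, Google, etc.)
Methods	`embed_query()`, `embed_documents()`	`embed_text()`, `embed_texts()`
Setup	Wrap an existing LangChain object	Pass the native client directly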
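If other code still calls the old method names, a thin adapter can bridge the rename while call sites are updated. This is a sketch under the assumption that the new-style object exposes synchronous `embed_text(text)` and `embed_texts(texts)` as named in the table; `LegacyEmbeddingAdapter` and `FakeEmbeddings` are hypothetical classes, and the fake exists only to exercise the adapter without a real provider.

```python
from typing import List, Protocol


class NewStyleEmbeddings(Protocol):
    """Shape assumed for a v0.4-style embeddings object."""

    def embed_text(self, text: str) -> List[float]: ...

    def embed_texts(self, texts: List[str]) -> List[List[float]]: ...


class LegacyEmbeddingAdapter:
    """Expose the old method names on top of an object with the new ones."""

    def __init__(self, inner: NewStyleEmbeddings) -> None:
        self._inner = inner

    def embed_query(self, text: str) -> List[float]:
        return self._inner.embed_text(text)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self._inner.embed_texts(texts)


class FakeEmbeddings:
    """Hypothetical stub: embeds a text as a one-element vector of its length."""

    def embed_text(self, text: str) -> List[float]:
        return [float(len(text))]

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_text(text) for text in texts]
```

The adapter is meant as a temporary shim; renaming the call sites directly remains the end state.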

OpenAI migration

Before (v0.3)

from langchain_openai import OpenAIEmbeddings as LangChainEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

embeddings = LangchainEmbeddingsWrapper(
    LangChainEmbeddings(api_key="sk-...")
)
embedding = embeddings.embed_query("text")

After (v0.4)

from openai import AsyncOpenAI
from ragas.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    client=AsyncOpenAI(api_key="sk-..."),
    model="text-embedding-3-small"
)
embedding = embeddings.embed_text("text")  # Different method name

Google Embeddings migration

Before (v0.3)

from langchain_community.embeddings import VertexAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

embeddings = LangchainEmbeddingsWrapper(
    VertexAIEmbeddings(model_name="textembedding-gecko@001", project="my-project")
)

After (v0.4)

from ragas.embeddings import GoogleEmbeddings

embeddings = GoogleEmbeddings(
    model="text-embedding-004",
    use_vertex=True,
    project_id="my-project"
)

HuggingFace migration

Before (v0.3)

from ragas.embeddings import HuggingfaceEmbeddings

embeddings = HuggingfaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

After (v0.4)

from ragas.embeddings import HuggingFaceEmbeddings  # Capitalization changed

embeddings = HuggingFaceEmbeddings(
    model="sentence-transformers/all-MiniLM-L6-v2",
    device="cuda"  # Optional GPU acceleration
)

Using embedding_factory()

Before (v0.3)

from ragas.embeddings import embedding_factory

embeddings = embedding_factory()  # Defaults to OpenAI

After (v0.4)

from ragas.embeddings import embedding_factory
from openai import AsyncOpenAI

embeddings = embedding_factory(
    provider="openai",
    model="text-embedding-3-small",
    client=AsyncOpenAI(api_key="sk-...")
)

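Prompt system

Dataclass-based prompts (PydanticPrompt) - deprecated

Status: legacy prompts still work, but are discouraged
Deprecation: the modular BasePrompt architecture is now preferred
Migration: see the Prompt System Migration section

Before (v0.3) - deprecated approach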

from ragas.prompt.pydantic_prompt import PydanticPrompt
from pydantic import BaseModel

class Input(BaseModel):
    query: str

class Output(BaseModel):
    is_relevant: bool

class RelevancePrompt(PydanticPrompt[Input, Output]):
    instruction = "Is this relevant?"
    input_model = Input
    output_model = Output

After (v0.4) - recommended approach

# Use BasePrompt classes instead - see Prompt System Migration section
from ragas.metrics.collections.faithfulness.util import FaithfulnessPrompt

class CustomPrompt(FaithfulnessPrompt):
    @property
    def instruction(self):
        return "Your custom instruction here"

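Legacy metric methods

`single_turn_ascore(sample)` - deprecated

Status: available only on legacy (non-collections) metrics
Deprecation: use collections metrics with ascore()
Timeline: will be removed in a future release, once all metrics have been migrated

Before (v0.3) - deprecated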

sample = SingleTurnSample(user_input="...", response="...", ...)
score = await metric.single_turn_ascore(sample)

After (v0.4) - recommended

result = await metric.ascore(user_input="...", response="...")
score = result.value

ContextUtilization

ContextUtilization is now a wrapper around ContextPrecisionWithoutReference, kept for backward compatibility.

Before (v0.3)

from ragas.metrics import ContextUtilization
metric = ContextUtilization(llm=llm)
score = await metric.single_turn_ascore(sample)

After (v0.4)

from ragas.metrics.collections import ContextUtilization
# or use the modern name directly:
from ragas.metrics.collections import ContextPrecisionWithoutReference

metric = ContextUtilization(llm=llm)  # Still works (wrapper)
# or
metric = ContextPrecisionWithoutReference(llm=llm)  # Preferred

result = await metric.ascore(
    user_input="...",
    response="...",
    retrieved_contexts=[...]
)
score = result.value

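Breaking Changes Summary

The complete list of breaking changes between v0.3 and v0.4:

Change	v0.3	v0.4	Migration
Evaluation approach	`evaluate()` function	`@experiment()` decorator	See From Evaluation to Experiment
Metrics location	`ragas.metrics`	`ragas.metrics.collections`	Update import paths
Scoring method	`single_turn_ascore(sample)`	`ascore(**kwargs)`	Change method calls
Score return type	`float`	`MetricResult`	Use the `.value` attribute
LLM factory	`instructor_llm_factory()`	`llm_factory()`	Use the unified factory
Embeddings	Wrapper classes (LangChain)	Native providers	See Embeddings Migration
Embedding methods	`embed_query()`, `embed_documents()`	`embed_text()`, `embed_texts()`	Update method calls
ground_truths parameter	`ground_truths: list[str]`	`reference: str`	Rename and change type
Sample type	`SingleTurnSample`	`SingleTurnSample` (updated)	Update sample creation
Prompt system	Dataclass-based	Function-based	Refactor custom prompts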
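
The renames in this table lend themselves to a quick mechanical check of an existing codebase. The following is a minimal sketch, not part of Ragas: the `LEGACY_NAMES` mapping and the `find_legacy_usages` helper are hypothetical names introduced here, and the mapping covers only the identifier-level renames listed above. It reports where legacy identifiers appear; it does not rewrite code, since several changes (such as `ground_truths` becoming a single string) need a human decision.

```python
import re

# Legacy identifier -> v0.4 replacement, copied from the breaking-changes table.
# (Hypothetical helper data for illustration; not shipped with Ragas.)
LEGACY_NAMES = {
    "instructor_llm_factory": "llm_factory",
    "single_turn_ascore": "ascore",
    "embed_query": "embed_text",
    "embed_documents": "embed_texts",
    "ground_truths": "reference",
}

# One pattern that matches any legacy name as a whole identifier.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, LEGACY_NAMES)) + r")\b")


def find_legacy_usages(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, legacy_name, suggested_replacement) for each hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            name = match.group(1)
            hits.append((lineno, name, LEGACY_NAMES[name]))
    return hits


example = """from ragas.llms import instructor_llm_factory
score = await metric.single_turn_ascore(sample)
vec = embeddings.embed_query("hello")
"""
for lineno, old, new in find_legacy_usages(example):
    print(f"line {lineno}: {old} -> {new}")
```

Running a check like this over a project before upgrading gives a rough worklist; each hit still has to be migrated by hand following the relevant section of this guide.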

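Deprecations and Removals

Removed in v0.4

These features have been removed entirely; using them raises an error:

instructor_llm_factory() - Use llm_factory() instead
AspectCritic from collections - No direct replacement in collections; the custom metrics migration section describes a decorator-based alternative
SimpleCriteriaScore from collections - No direct replacement in collections; the custom metrics migration section describes a decorator-based alternative
AnswerSimilarity - Use SemanticSimilarity instead

Deprecated (Will Be Removed in a Future Release)

These features still work but emit deprecation warnings:

LangchainLLMWrapper - Use llm_factory() directly
LlamaIndexLLMWrapper - Use llm_factory() directly
Legacy prompt classes - Migrate to function-based prompts
single_turn_ascore() on legacy metrics - Use collections metrics with ascore()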
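
Because deprecated features keep working, they are easy to overlook. One way to surface them is to promote deprecation warnings to errors while running tests. The sketch below uses only Python's standard `warnings` module and assumes the library emits the standard `DeprecationWarning` category, which this guide does not confirm; `legacy_call` is a hypothetical stand-in, not a Ragas function.

```python
import warnings


def legacy_call() -> float:
    """Hypothetical stand-in for a deprecated API; not a Ragas function."""
    warnings.warn("legacy_call() is deprecated", DeprecationWarning, stacklevel=2)
    return 0.5


def uses_deprecated_api(func) -> bool:
    """Return True if calling func emits a DeprecationWarning."""
    with warnings.catch_warnings():
        # Inside this block, DeprecationWarning is raised as an exception.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            func()
        except DeprecationWarning:
            return True
    return False


print(uses_deprecated_api(legacy_call))  # True
print(uses_deprecated_api(lambda: 1.0))  # False
```

The same effect is available for a whole test run with the interpreter flag `python -W error::DeprecationWarning`.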

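New Features in v0.4 (Reference)

v0.4 introduces several features beyond what migration requires. None of them is needed to move off v0.3, but they may be useful as you upgrade:

GPT-5 and o-series support - Automatic constraint handling for the latest OpenAI models
Universal provider support - A single llm_factory() works with all major providers (Anthropic, Google, Azure, and others)
Function-based prompts - More flexible, more composable prompt definitions
Metric decorators - Simplified custom metric creation with @discrete_metric, @numeric_metric, and @ranking_metric
MetricResult with reasoning - Structured results with an optional explanation
Enhanced metric save/load - Easy serialization of metric configurations
Better embeddings support - Both synchronous and asynchronous embedding operations

For details on the new features, see the v0.4 release notes.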

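Custom Metrics Migration

If you use removed metrics such as AspectCritic or SimpleCriteria, v0.4 provides decorator-based alternatives to replace them. You can also use the new simplified metric system for other custom metrics.

Discrete Metrics (Categorical Outputs)

Before (v0.3) - AspectCritic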

from ragas.metrics import AspectCritic
metric = AspectCritic(name="clarity", allowed_values=["clear", "unclear"])
result = await metric.single_turn_ascore(sample)

After (v0.4) - @discrete_metric decorator

from ragas.metrics import discrete_metric

@discrete_metric(name="clarity", allowed_values=["clear", "unclear"])
def clarity(response: str) -> str:
    return "clear" if len(response) > 50 else "unclear"

metric = clarity()
result = await metric.ascore(response="...")
print(result.value)  # "clear" or "unclear"

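Use discrete metrics for any classification task. All removed metrics (AspectCritic, SimpleCriteria) can be replaced this way.

Numeric Metrics (Continuous Values)

Use @numeric_metric for any score on a numeric scale: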

from ragas.metrics import numeric_metric

@numeric_metric(name="length_score", allowed_values=(0.0, 1.0))
def length_score(response: str) -> float:
    return min(len(response) / 500, 1.0)

# Custom range
@numeric_metric(name="quality_score", allowed_values=(0.0, 10.0))
def quality_score(response: str) -> float:
    return 7.5

metric = length_score()
result = await metric.ascore(response="...")
print(result.value)  # float between 0 and 1

Ranking Metrics (Ordered Lists)

Use @ranking_metric to rank or order multiple items:

from ragas.metrics import ranking_metric

@ranking_metric(name="context_rank", allowed_values=5)
def context_ranking(question: str, contexts: list[str]) -> list[str]:
    """Rank contexts by relevance."""
    scored = [(len(set(question.split()) & set(c.split())), c) for c in contexts]
    return [c for _, c in sorted(scored, reverse=True)]

metric = context_ranking()
result = await metric.ascore(question="...", contexts=[...])
print(result.value)  # Ranked list

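Summary

These decorators provide automatic validation, type safety, error handling, and result wrapping, reducing 50+ lines of custom metric code in v0.3 to just 5-10 lines in v0.4.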
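
To make "validation and result wrapping" concrete, here is a toy decorator that does both for a categorical metric. It is a sketch of the general pattern only and is not how Ragas implements `@discrete_metric`; `ToyResult` and `toy_discrete_metric` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Optional


@dataclass
class ToyResult:
    """Hypothetical stand-in for a structured result; not ragas MetricResult."""
    value: object
    reason: Optional[str] = None


def toy_discrete_metric(name: str, allowed_values: list[str]) -> Callable:
    """Toy decorator: validates the function's output and wraps it in ToyResult."""
    def decorator(func: Callable[..., str]) -> Callable[..., ToyResult]:
        @wraps(func)
        def wrapper(**kwargs) -> ToyResult:
            value = func(**kwargs)
            # Validation step: reject anything outside the declared categories.
            if value not in allowed_values:
                raise ValueError(f"{name}: {value!r} not in {allowed_values}")
            # Wrapping step: return a structured result instead of a bare value.
            return ToyResult(value=value)
        return wrapper
    return decorator


@toy_discrete_metric(name="clarity", allowed_values=["clear", "unclear"])
def clarity(response: str) -> str:
    return "clear" if len(response) > 50 else "unclear"


print(clarity(response="short answer").value)  # unclear
```

The metric author writes only the scoring function; the checks and the result object come from the decorator, which is where the reduction in boilerplate comes from.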

Common Issues and Solutions

Issue: ImportError for `instructor_llm_factory`

Error

ImportError: cannot import name 'instructor_llm_factory' from 'ragas.llms'

Solution

# Instead of this
from ragas.llms import instructor_llm_factory

# Use this
from ragas.llms import llm_factory

Issue: Metric returns `MetricResult` instead of a float

Error

score = await metric.ascore(...)
print(score)  # Prints: MetricResult(value=0.85, reason=None)

Solution

result = await metric.ascore(...)
score = result.value  # Access the float value
print(score)  # Prints: 0.85
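
During a gradual migration, some call sites may still receive plain floats from legacy metrics while others receive result objects. A small normalizing helper can keep downstream code unchanged. This is a sketch under that assumption; `score_value` and `FakeMetricResult` are hypothetical names, and the stand-in class models only the `.value` and `reason` fields shown in the output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMetricResult:
    """Hypothetical stand-in exposing the .value attribute described above."""
    value: float
    reason: Optional[str] = None


def score_value(result: object) -> float:
    """Return a plain float from either a legacy float or a result object."""
    if isinstance(result, (int, float)):
        return float(result)  # legacy path: the score is already a number
    return float(getattr(result, "value"))  # v0.4-style path: unwrap .value


print(score_value(0.85))                    # 0.85
print(score_value(FakeMetricResult(0.85)))  # 0.85
```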

Issue: `SingleTurnSample` is missing `ground_truths`

Error

TypeError: ground_truths is not a valid keyword

Solution

# Change from
sample = SingleTurnSample(..., ground_truths=["correct"])

# To
sample = SingleTurnSample(..., reference="correct")
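
For stored datasets, the same rename has to be applied to every row, and the type change from a list to a single string needs a policy. The sketch below works on plain dictionaries and assumes each legacy row holds exactly one ground truth; it raises rather than guessing when that is not the case. `to_reference` and `migrate_row` are hypothetical helpers, not Ragas functions.

```python
def to_reference(ground_truths: list[str]) -> str:
    """Collapse a legacy ground_truths list into a single reference string.

    Assumption: a one-item list maps directly. Longer (or empty) lists need a
    project-specific decision, so this sketch refuses to guess.
    """
    if len(ground_truths) != 1:
        raise ValueError(
            f"expected exactly one ground truth, got {len(ground_truths)}"
        )
    return ground_truths[0]


def migrate_row(row: dict) -> dict:
    """Return a copy of row with ground_truths replaced by reference."""
    migrated = {k: v for k, v in row.items() if k != "ground_truths"}
    if "ground_truths" in row:
        migrated["reference"] = to_reference(row["ground_truths"])
    return migrated


legacy = {"user_input": "What is 2 + 2?", "response": "4", "ground_truths": ["4"]}
print(migrate_row(legacy))
# {'user_input': 'What is 2 + 2?', 'response': '4', 'reference': '4'}
```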

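Getting Help

If you run into problems during migration:

Check the documentation
GitHub Issues
- Search existing issues
- Open a new issue with migration-specific details
Community support
- Join our Discord community
- Schedule a call with the maintainers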

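Summary

v0.4 represents a fundamental shift to an experiment-based architecture that better integrates evaluation, analysis, and iteration workflows. There are breaking changes, but each of them serves to make Ragas a better experimentation platform.

The migration path is straightforward:

Update LLM initialization to use llm_factory()
Import metrics from ragas.metrics.collections
Replace single_turn_ascore() with ascore()
Rename ground_truths to reference
Handle MetricResult objects instead of floats

These technical changes bring:

Better experimentation - Structured metric results with reasoning, for deeper analysis
Cleaner API - Keyword arguments instead of sample objects, which makes composition easier
Integrated workflows - Metrics designed to work seamlessly in experiment pipelines
Enhanced capabilities - Universal provider support and automatic constraints
Future-ready - Built on industry standards (the instructor library, standardized patterns)

The experiment-based architecture will keep improving in future releases, with more features for managing, analyzing, and iterating on evaluations.

Happy migrating! If you get stuck, we're here to help. 🎉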
