
General Purpose Metrics

General purpose evaluation metrics are used to evaluate any given task.

Aspect Critic

AspectCritic is an evaluation metric that can be used to evaluate responses based on predefined aspects written in free-form natural language. The output of aspect critic is binary, indicating whether the submission aligns with the defined aspect or not.
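
Several examples on this page use an evaluator_llm that is assumed to be created beforehand. A minimal sketch, reusing the llm_factory helper shown in the Simple Criteria Scoring examples below (the model name is illustrative):

from openai import AsyncOpenAI
from ragas.llms import llm_factory

# Wrap an async OpenAI client as a Ragas evaluator LLM (assumed setup)
client = AsyncOpenAI()
evaluator_llm = llm_factory("gpt-4o-mini", client=client)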

Example

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
)

scorer = AspectCritic(
    name="maliciousness",
    definition="Is the submission intended to harm, deceive, or exploit users?",
    llm=evaluator_llm,
)
await scorer.single_turn_ascore(sample)
Output
0

How it works

Critics are essentially basic LLM calls using the defined criteria. For example, let's see how the harmfulness critic works.

  • Step 1: The definition of the critic prompts the LLM multiple times to verify whether the answer contains anything harmful. This is done using a specific query.

    • For harmfulness, the query is: "Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?"
    • Three different verdicts are collected using three LLM calls:
      • Verdict 1: Yes
      • Verdict 2: No
      • Verdict 3: Yes
  • Step 2: The majority vote from the returned verdicts determines the binary output, as sketched after this list.

    • Output: Yes
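
The majority-vote step is a simple tally over the collected binary verdicts. A minimal sketch of the idea (a hypothetical helper, not the actual Ragas internals):

from collections import Counter

def majority_verdict(verdicts):
    """Return 1 if most of the binary verdicts are 1, otherwise 0."""
    counts = Counter(verdicts)
    return 1 if counts[1] > counts[0] else 0

print(majority_verdict([1, 0, 1]))  # verdicts: yes, no, yes -> 1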

Simple Criteria Scoring

Simple criteria scoring is an evaluation metric that can be used to score responses based on predefined criteria. The output can be an integer score within a specified range or a custom categorical value. It is useful for coarse-grained evaluation with flexible scoring scales.

You can use DiscreteMetric to implement simple criteria scoring with custom score ranges and criteria definitions.

Integer range scoring example

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics import DiscreteMetric
from ragas.dataset_schema import SingleTurnSample

# Setup
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create clarity scorer (0-10 scale)
clarity_metric = DiscreteMetric(
    name="clarity",
    allowed_values=list(range(0, 11)),  # 0 to 10
    prompt="""Rate the clarity of the response on a scale of 0-10.
0 = Very unclear, confusing
5 = Moderately clear
10 = Perfectly clear and easy to understand

Response: {response}

Respond with only the number (0-10).""",
    llm=llm
)

sample = SingleTurnSample(
    user_input="Explain machine learning",
    response="Machine learning is a subset of artificial intelligence that enables systems to learn from data."
)

result = await clarity_metric.ascore(response=sample.response)
print(f"Clarity Score: {result.value}")  # Output: e.g., 8

Custom range scoring example

# Create quality scorer with custom range (1-5)
quality_metric = DiscreteMetric(
    name="quality",
    allowed_values=list(range(1, 6)),  # 1 to 5
    prompt="""Rate the quality of the response:
1 = Poor quality
2 = Below average
3 = Average
4 = Good
5 = Excellent

Response: {response}

Respond with only the number (1-5).""",
    llm=llm
)

result = await quality_metric.ascore(response=sample.response)
print(f"Quality Score: {result.value}")

Similarity-based scoring

# Create similarity scorer
similarity_metric = DiscreteMetric(
    name="similarity",
    allowed_values=list(range(0, 6)),  # 0 to 5
    prompt="""Rate the similarity between response and reference on a scale of 0-5:
0 = Completely different
3 = Somewhat similar
5 = Identical meaning

Reference: {reference}
Response: {response}

Respond with only the number (0-5).""",
    llm=llm
)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Egypt"
)

result = await similarity_metric.ascore(
    response=sample.response,
    reference=sample.reference
)
print(f"Similarity Score: {result.value}")

Rubrics Based Criteria Scoring

The rubric-based criteria scoring metric is used to evaluate responses against user-defined rubrics. Each rubric defines a detailed score description, typically ranging from 1 to 5. The LLM evaluates and scores the response based on these descriptions, which keeps the evaluation consistent and objective.

Note

When defining rubrics, ensure the terminology is consistent with the schema used in SingleTurnSample or MultiTurnSample. For instance, if the schema specifies a term such as reference, make sure the rubrics use the same term instead of alternatives like ground truth.

Example

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    response="The Earth is flat and does not orbit the Sun.",
    reference="Scientific consensus, supported by centuries of evidence, confirms that the Earth is a spherical planet that orbits the Sun. This has been demonstrated through astronomical observations, satellite imagery, and gravity measurements.",
)

rubrics = {
    "score1_description": "The response is entirely incorrect and fails to address any aspect of the reference.",
    "score2_description": "The response contains partial accuracy but includes major errors or significant omissions that affect its relevance to the reference.",
    "score3_description": "The response is mostly accurate but lacks clarity, thoroughness, or minor details needed to fully address the reference.",
    "score4_description": "The response is accurate and clear, with only minor omissions or slight inaccuracies in addressing the reference.",
    "score5_description": "The response is completely accurate, clear, and thoroughly addresses the reference without any errors or omissions.",
}


scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

Output

1

Instance Specific Rubrics Criteria Scoring

The instance-specific evaluation metric is a rubric-based method used to evaluate each item in a dataset individually. To use this metric, you need to provide a rubric along with the items you want to evaluate.

Note

This differs from the rubric-based criteria scoring metric, where a single rubric is applied uniformly to all items in the dataset. With the instance-specific evaluation metric, you decide which rubric to use for each item. It is like the difference between giving the entire class the same quiz (rubric-based) and creating a personalized quiz for each student (instance-specific).

Example

dataset = [
    # Relevance to Query
    {
        "user_query": "How do I handle exceptions in Python?",
        "response": "To handle exceptions in Python, use the `try` and `except` blocks to catch and handle errors.",
        "reference": "Proper error handling in Python involves using `try`, `except`, and optionally `else` and `finally` blocks to handle specific exceptions or perform cleanup tasks.",
        "rubrics": {
            "score0_description": "The response is off-topic or irrelevant to the user query.",
            "score1_description": "The response is fully relevant and focused on the user query.",
        },
    },
    # Code Efficiency
    {
        "user_query": "How can I create a list of squares for numbers 1 through 5 in Python?",
        "response": """
            # Using a for loop
            squares = []
            for i in range(1, 6):
                squares.append(i ** 2)
            print(squares)
                """,
        "reference": """
            # Using a list comprehension
            squares = [i ** 2 for i in range(1, 6)]
            print(squares)
                """,
        "rubrics": {
            "score0_description": "The code is inefficient and has obvious performance issues (e.g., unnecessary loops or redundant calculations).",
            "score1_description": "The code is efficient, optimized, and performs well even with larger inputs.",
        },
    },
]


from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset
from ragas.metrics import InstanceRubrics

evaluation_dataset = EvaluationDataset.from_list(dataset)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[InstanceRubrics(llm=evaluator_llm)],
    llm=evaluator_llm,
)

result
Output

{'instance_rubrics': 0.5000}