
General Purpose Metrics

General purpose evaluation metrics are used to evaluate any given task.

Aspect Critic

AspectCritic is an evaluation metric that can be used to evaluate responses based on predefined aspects written in free-form natural language. The output of aspect critic is binary, indicating whether the submission aligns with the defined aspect or not.

Example

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
)

scorer = AspectCritic(
    name="maliciousness",
    definition="Is the submission intended to harm, deceive, or exploit users?",
    llm=evaluator_llm,
)
await scorer.single_turn_ascore(sample)
Output
0
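
The same scorer can also be run over a whole dataset with evaluate(). The following is a minimal sketch, assuming evaluator_llm and the scorer defined above are available; the dataset contents are illustrative.

from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset

# Build a small dataset from dicts that follow the SingleTurnSample schema.
eval_dataset = EvaluationDataset.from_list([
    {
        "user_input": "Where is the Eiffel Tower located?",
        "response": "The Eiffel Tower is located in Paris.",
    },
])

# evaluate() runs the metric over every sample and aggregates the scores.
result = evaluate(dataset=eval_dataset, metrics=[scorer], llm=evaluator_llm)
result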

How it works

Critics are essentially basic LLM calls using the defined criteria. For example, let's take a look at how the harmfulness critic works:

  • Step 1: The definition of the critic prompts the LLM multiple times to verify whether the answer contains anything harmful. This is done using a specific query.

    • For harmfulness the query is: "Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?"
    • Three different verdicts are collected using three LLM calls:
      • Verdict 1: Yes
      • Verdict 2: No
      • Verdict 3: Yes
  • Step 2: The majority vote from the returned verdicts determines the binary output (see the sketch below).

    • Output: Yes
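
To make the majority-vote step concrete, here is an illustrative Python sketch of how three binary verdicts collapse into a single score. It mirrors the description above and is not the internal ragas implementation.

# Illustrative only: reduce three verdicts to a binary score by majority vote.
verdicts = ["Yes", "No", "Yes"]  # hypothetical verdicts from three LLM calls
majority = max(set(verdicts), key=verdicts.count)
score = 1 if majority == "Yes" else 0
print(majority, score)  # -> Yes 1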

Simple Criteria Scoring

Coarse-grained evaluation is an evaluation metric that can be used to score (as an integer) a response against a single predefined free-form scoring criterion. The output of coarse-grained evaluation is an integer score within the range specified in the criterion.

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SimpleCriteriaScore


sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Egypt"
)

scorer = SimpleCriteriaScore(
    name="course_grained_score",
    definition="Score 0 to 5 by similarity",
    llm=evaluator_llm,
)

await scorer.single_turn_ascore(sample)
Output
0
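
Because the score range is part of the free-form definition itself, changing the wording changes the scale. The sketch below assumes the same evaluator_llm and sample as above; the name and definition are only illustrative.

# Same metric class, different scale: the range lives in the natural-language definition.
scorer_0_10 = SimpleCriteriaScore(
    name="similarity_0_10",
    definition="Score the response from 0 to 10 based on its similarity to the reference",
    llm=evaluator_llm,
)
await scorer_0_10.single_turn_ascore(sample)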

Rubrics Based Criteria Scoring

The rubrics-based criteria scoring metric is used to evaluate responses against user-defined rubrics. Each rubric defines a detailed score description, typically on a scale of 1 to 5. The LLM assesses and scores the response according to these descriptions, ensuring a consistent and objective evaluation.

Note

When defining rubrics, ensure consistent terminology that matches the schema used in SingleTurnSample or MultiTurnSample. For instance, if the schema specifies a term such as reference, make sure the rubrics use the same term instead of alternatives like ground truth.

Example

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    response="The Earth is flat and does not orbit the Sun.",
    reference="Scientific consensus, supported by centuries of evidence, confirms that the Earth is a spherical planet that orbits the Sun. This has been demonstrated through astronomical observations, satellite imagery, and gravity measurements.",
)

rubrics = {
    "score1_description": "The response is entirely incorrect and fails to address any aspect of the reference.",
    "score2_description": "The response contains partial accuracy but includes major errors or significant omissions that affect its relevance to the reference.",
    "score3_description": "The response is mostly accurate but lacks clarity, thoroughness, or minor details needed to fully address the reference.",
    "score4_description": "The response is accurate and clear, with only minor omissions or slight inaccuracies in addressing the reference.",
    "score5_description": "The response is completely accurate, clear, and thoroughly addresses the reference without any errors or omissions.",
}


scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

Output

1

Instance-Specific Rubrics Criteria Scoring

The instance-specific evaluation metric is a rubrics-based method used to evaluate each item in a dataset individually. To use this metric, you provide a rubric along with each item you want to evaluate.

Note

This differs from the Rubrics Based Criteria Scoring metric, where a single rubric is applied uniformly to all items in the dataset. With the instance-specific metric, you decide which rubric to use for each individual item. It is like the difference between giving the whole class the same quiz (rubrics-based) and creating a personalized quiz for each student (instance-specific).

Example

from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset
from ragas.metrics import InstanceRubrics

dataset = [
    # Relevance to Query
    {
        "user_query": "How do I handle exceptions in Python?",
        "response": "To handle exceptions in Python, use the `try` and `except` blocks to catch and handle errors.",
        "reference": "Proper error handling in Python involves using `try`, `except`, and optionally `else` and `finally` blocks to handle specific exceptions or perform cleanup tasks.",
        "rubrics": {
            "score0_description": "The response is off-topic or irrelevant to the user query.",
            "score1_description": "The response is fully relevant and focused on the user query.",
        },
    },
    # Code Efficiency
    {
        "user_query": "How can I create a list of squares for numbers 1 through 5 in Python?",
        "response": """
            # Using a for loop
            squares = []
            for i in range(1, 6):
                squares.append(i ** 2)
            print(squares)
                """,
        "reference": """
            # Using a list comprehension
            squares = [i ** 2 for i in range(1, 6)]
            print(squares)
                """,
        "rubrics": {
            "score0_description": "The code is inefficient and has obvious performance issues (e.g., unnecessary loops or redundant calculations).",
            "score1_description": "The code is efficient, optimized, and performs well even with larger inputs.",
        },
    },
]


evaluation_dataset = EvaluationDataset.from_list(dataset)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[InstanceRubrics(llm=evaluator_llm)],
    llm=evaluator_llm,
)

result
Output

{'instance_rubrics': 0.5000}