General Purpose Metrics

General purpose evaluation metrics are used to evaluate any given task.
Aspect Critic

AspectCritic is an evaluation metric that can be used to evaluate responses based on predefined aspects written in free-form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.
Example
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
)

scorer = AspectCritic(
    name="maliciousness",
    definition="Is the submission intended to harm, deceive, or exploit users?",
    llm=evaluator_llm,
)
await scorer.single_turn_ascore(sample)
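The examples on this page assume an evaluator_llm has already been configured. As a minimal sketch, one way to construct it is to wrap a LangChain chat model with Ragas's LangchainLLMWrapper (the model choice below is illustrative, not prescribed by this page):

# Illustrative setup only: any LangChain-compatible chat model can be wrapped.
# Assumes langchain-openai is installed and OPENAI_API_KEY is set.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))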
How It Works

Critics are essentially basic LLM calls using the defined criteria. For example, let's see how the harmfulness critic works:
- Step 1: The definition of the critic prompts the LLM multiple times to verify whether the answer contains anything harmful. This is done using a specific query.
    - For harmfulness the query is: "Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?"
    - Three different verdicts are collected using three LLM calls:
        - Verdict 1: Yes
        - Verdict 2: No
        - Verdict 3: Yes
- Step 2: The majority vote from the returned verdicts determines the binary output, as sketched below.
    - Output: Yes
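A minimal sketch of the majority-vote step described above (illustrative only, not Ragas internals; the three verdicts are hard-coded here where the real metric would make three LLM calls):

verdicts = [True, False, True]  # Yes, No, Yes from three hypothetical LLM calls
# Majority vote: the aspect is flagged if more than half the verdicts are "Yes".
output = sum(verdicts) > len(verdicts) / 2
print("Yes" if output else "No")  # -> "Yes"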
Simple Criteria Scoring

Coarse-grained evaluation is an evaluation metric that can be used to score (as an integer) responses based on a single predefined free-form scoring criterion. The output of coarse-grained evaluation is an integer score within the range specified in the criterion.
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SimpleCriteriaScore

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Egypt",
)

scorer = SimpleCriteriaScore(
    name="coarse_grained_score",
    definition="Score 0 to 5 by similarity",
    llm=evaluator_llm,
)
await scorer.single_turn_ascore(sample)
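The await calls in these snippets assume an async context such as a Jupyter notebook. In a plain Python script, one way to drive them is asyncio.run, as in this sketch:

import asyncio

# In a script (no running event loop), wrap the async scorer call:
score = asyncio.run(scorer.single_turn_ascore(sample))
print(score)  # an integer within the range defined by the criterion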
Rubrics-Based Criteria Scoring

The rubrics-based criteria scoring metric is used to perform evaluations against a user-defined rubric. Each rubric entry defines a detailed score description, typically on a scale of 1 to 5. The LLM assesses and scores the response according to these descriptions, ensuring consistent and objective evaluation.
Note

When defining rubrics, ensure the terminology matches the schema used in SingleTurnSample or MultiTurnSample. For instance, if the schema specifies a term such as reference, ensure the rubrics use the same term instead of alternatives like ground truth.
Example
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    response="The Earth is flat and does not orbit the Sun.",
    reference="Scientific consensus, supported by centuries of evidence, confirms that the Earth is a spherical planet that orbits the Sun. This has been demonstrated through astronomical observations, satellite imagery, and gravity measurements.",
)

rubrics = {
    "score1_description": "The response is entirely incorrect and fails to address any aspect of the reference.",
    "score2_description": "The response contains partial accuracy but includes major errors or significant omissions that affect its relevance to the reference.",
    "score3_description": "The response is mostly accurate but lacks clarity, thoroughness, or minor details needed to fully address the reference.",
    "score4_description": "The response is accurate and clear, with only minor omissions or slight inaccuracies in addressing the reference.",
    "score5_description": "The response is completely accurate, clear, and thoroughly addresses the reference without any errors or omissions.",
}

scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
Instance-Specific Rubrics Criteria Scoring

The instance-specific evaluation metric is a rubrics-based method used to evaluate each item in a dataset individually. To use this metric, you need to provide a rubric along with the item you want to evaluate.
Note

This differs from the Rubric Based Criteria Scoring Metric, which applies a single rubric uniformly to all items in the dataset. With the Instance-Specific Evaluation Metric, you decide which rubric to use for each item. It's like the difference between giving the entire class the same quiz (rubric-based) and creating a personalized quiz for each student (instance-specific).
Example
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset
from ragas.metrics import InstanceRubrics

dataset = [
    # Relevance to Query
    {
        "user_input": "How do I handle exceptions in Python?",
        "response": "To handle exceptions in Python, use the `try` and `except` blocks to catch and handle errors.",
        "reference": "Proper error handling in Python involves using `try`, `except`, and optionally `else` and `finally` blocks to handle specific exceptions or perform cleanup tasks.",
        "rubrics": {
            "score0_description": "The response is off-topic or irrelevant to the user query.",
            "score1_description": "The response is fully relevant and focused on the user query.",
        },
    },
    # Code Efficiency
    {
        "user_input": "How can I create a list of squares for numbers 1 through 5 in Python?",
        "response": """
# Using a for loop
squares = []
for i in range(1, 6):
    squares.append(i ** 2)
print(squares)
""",
        "reference": """
# Using a list comprehension
squares = [i ** 2 for i in range(1, 6)]
print(squares)
""",
        "rubrics": {
            "score0_description": "The code is inefficient and has obvious performance issues (e.g., unnecessary loops or redundant calculations).",
            "score1_description": "The code is efficient, optimized, and performs well even with larger inputs.",
        },
    },
]
evaluation_dataset = EvaluationDataset.from_list(dataset)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[InstanceRubrics(llm=evaluator_llm)],
    llm=evaluator_llm,
)

result
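To inspect per-sample scores, the returned result can be converted to a DataFrame, a brief usage sketch:

# Convert the evaluation result to a pandas DataFrame of per-sample scores.
df = result.to_pandas()
print(df)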