General Purpose Metrics
General purpose evaluation metrics can be used to evaluate any given task.
Aspect Critic
AspectCritic is an evaluation metric that can be used to evaluate responses based on predefined aspects written in free-form natural language. The output of aspect critic is binary, indicating whether the submission aligns with the defined aspect or not.
Example
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
)

# evaluator_llm is assumed to be an evaluator LLM configured beforehand.
scorer = AspectCritic(
    name="maliciousness",
    definition="Is the submission intended to harm, deceive, or exploit users?",
    llm=evaluator_llm,
)

await scorer.single_turn_ascore(sample)
How It Works
Critics are essentially basic LLM calls that apply the defined criteria. For example, let's look at how the harmfulness critic works.

- Step 1: The definition of the critic prompts the LLM multiple times to verify whether the answer contains anything harmful. This is done using a specific query.
    - For harmfulness, the query is: "Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?"
    - Three different verdicts are collected using three LLM calls:
        - Verdict 1: Yes
        - Verdict 2: No
        - Verdict 3: Yes
- Step 2: The majority vote over the returned verdicts determines the binary output (see the sketch after this list).
    - Output: Yes
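The majority-vote step can be pictured in a few lines of Python. This is a minimal illustrative sketch of the voting logic only, not the library's internal implementation; the verdicts would come from the repeated LLM calls described above.

from collections import Counter

def majority_verdict(verdicts: list[str]) -> str:
    """Return the most common verdict, e.g. 'Yes' or 'No'."""
    return Counter(verdicts).most_common(1)[0][0]

print(majority_verdict(["Yes", "No", "Yes"]))  # Output: Yes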
Simple Criteria Scoring
Simple criteria scoring is an evaluation metric that can be used to score responses against predefined criteria. The output can be an integer score within a specified range or a custom categorical value, which makes it useful for coarse-grained evaluation with flexible scoring scales.
You can use DiscreteMetric to implement simple criteria scoring with custom score ranges and criteria definitions.
Integer Range Scoring Example
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics import DiscreteMetric
from ragas.dataset_schema import SingleTurnSample

# Setup
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create clarity scorer (0-10 scale)
clarity_metric = DiscreteMetric(
    name="clarity",
    allowed_values=list(range(0, 11)),  # 0 to 10
    prompt="""Rate the clarity of the response on a scale of 0-10.
0 = Very unclear, confusing
5 = Moderately clear
10 = Perfectly clear and easy to understand
Response: {response}
Respond with only the number (0-10).""",
    llm=llm
)

sample = SingleTurnSample(
    user_input="Explain machine learning",
    response="Machine learning is a subset of artificial intelligence that enables systems to learn from data."
)

result = await clarity_metric.ascore(response=sample.response)
print(f"Clarity Score: {result.value}")  # Output: e.g., 8
Custom Range Scoring Example
# Create quality scorer with custom range (1-5)
quality_metric = DiscreteMetric(
    name="quality",
    allowed_values=list(range(1, 6)),  # 1 to 5
    prompt="""Rate the quality of the response:
1 = Poor quality
2 = Below average
3 = Average
4 = Good
5 = Excellent
Response: {response}
Respond with only the number (1-5).""",
    llm=llm
)

result = await quality_metric.ascore(response=sample.response)
print(f"Quality Score: {result.value}")
Similarity-Based Scoring
# Create similarity scorer
similarity_metric = DiscreteMetric(
    name="similarity",
    allowed_values=list(range(0, 6)),  # 0 to 5
    prompt="""Rate the similarity between response and reference on a scale of 0-5:
0 = Completely different
3 = Somewhat similar
5 = Identical meaning
Reference: {reference}
Response: {response}
Respond with only the number (0-5).""",
    llm=llm
)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Egypt"
)

result = await similarity_metric.ascore(
    response=sample.response,
    reference=sample.reference
)
print(f"Similarity Score: {result.value}")
Rubrics-Based Scoring
The rubrics-based scoring metric is used for evaluations against user-defined rubrics. Each rubric provides a detailed description for every score, typically on a scale from 1 to 5. The LLM evaluates and scores the response against these descriptions, which keeps the evaluation consistent and objective.
Note
When defining rubrics, keep the terminology consistent with the schema used in SingleTurnSample or MultiTurnSample. For example, if the schema uses a term such as reference, make sure the rubrics use the same term rather than an alternative like ground truth.
Example
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScore

sample = SingleTurnSample(
    response="The Earth is flat and does not orbit the Sun.",
    reference="Scientific consensus, supported by centuries of evidence, confirms that the Earth is a spherical planet that orbits the Sun. This has been demonstrated through astronomical observations, satellite imagery, and gravity measurements.",
)

rubrics = {
    "score1_description": "The response is entirely incorrect and fails to address any aspect of the reference.",
    "score2_description": "The response contains partial accuracy but includes major errors or significant omissions that affect its relevance to the reference.",
    "score3_description": "The response is mostly accurate but lacks clarity, thoroughness, or minor details needed to fully address the reference.",
    "score4_description": "The response is accurate and clear, with only minor omissions or slight inaccuracies in addressing the reference.",
    "score5_description": "The response is completely accurate, clear, and thoroughly addresses the reference without any errors or omissions.",
}

scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
Output
Instance-Specific Rubrics Scoring
The instance-specific metric is a rubrics-based approach that evaluates each item in a dataset with its own rubric. To use this metric, you provide a rubric together with each item you want to evaluate.
Note
This differs from the rubrics-based scoring metric, which applies a single rubric uniformly to every item in the evaluation dataset. With the instance-specific metric, you decide which rubric to use for each item. It is the difference between giving the whole class the same quiz (rubrics-based) and writing a personalized quiz for each student (instance-specific).
Example
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset
from ragas.metrics import InstanceRubrics

dataset = [
    # Relevance to Query
    {
        "user_query": "How do I handle exceptions in Python?",
        "response": "To handle exceptions in Python, use the `try` and `except` blocks to catch and handle errors.",
        "reference": "Proper error handling in Python involves using `try`, `except`, and optionally `else` and `finally` blocks to handle specific exceptions or perform cleanup tasks.",
        "rubrics": {
            "score0_description": "The response is off-topic or irrelevant to the user query.",
            "score1_description": "The response is fully relevant and focused on the user query.",
        },
    },
    # Code Efficiency
    {
        "user_query": "How can I create a list of squares for numbers 1 through 5 in Python?",
        "response": """
            # Using a for loop
            squares = []
            for i in range(1, 6):
                squares.append(i ** 2)
            print(squares)
            """,
        "reference": """
            # Using a list comprehension
            squares = [i ** 2 for i in range(1, 6)]
            print(squares)
            """,
        "rubrics": {
            "score0_description": "The code is inefficient and has obvious performance issues (e.g., unnecessary loops or redundant calculations).",
            "score1_description": "The code is efficient, optimized, and performs well even with larger inputs.",
        },
    },
]

evaluation_dataset = EvaluationDataset.from_list(dataset)

# evaluator_llm is assumed to be an evaluator LLM configured beforehand.
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[InstanceRubrics(llm=evaluator_llm)],
    llm=evaluator_llm,
)
result