Task Metrics

Summarization Score

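The SummarizationScore metric measures how well the summary (response) captures the important information in the reference_contexts. The intuition is that a good summary should contain all the important information present in the context (that is, the source text).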

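We first extract a set of important keyphrases from the context, then use these keyphrases to generate a set of questions. By construction, the answer to each of these questions is always yes (1) for the context. We then ask the same questions of the summary and compute the summarization score as the ratio of correctly answered questions to the total number of questions.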

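The answers form a list of 1s and 0s, one per question. The QA score is the fraction of questions answered correctly (answer = 1) out of the total number of questions.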

\[ \text{QA score} = \frac{|\text{correctly answered questions}|}{|\text{total questions}|} \]
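As a minimal sketch of the ratio above, assuming the answers have already been collected as a list of 0/1 integers. The function name qa_score is hypothetical and is not part of the ragas API.

```python
def qa_score(answers: list[int]) -> float:
    """Fraction of questions answered correctly (1) out of all questions.

    Hypothetical helper for illustration; not a ragas function.
    """
    if not answers:
        raise ValueError("answers must contain at least one entry")
    # Each entry is 1 (answered correctly from the summary) or 0 (not).
    return sum(answers) / len(answers)


# Three of four questions can be answered from the summary.
print(qa_score([1, 1, 0, 1]))  # 0.75
```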

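We also provide an option to penalize longer summaries through a conciseness score. When this option is enabled, the final score is the weighted average of the QA score and the conciseness score. The conciseness score ensures that a summary that merely copies the text does not score highly, since such a summary would obviously answer every question correctly.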

\[ \text{conciseness score} = 1 - \frac{\min(\text{length of summary}, \text{length of context})}{\text{length of context} + 10^{-10}} \]
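The formula above can be sketched as follows. This is an illustration only: the function name conciseness_score is hypothetical, and measuring length in characters is an assumption, since the text does not state the unit.

```python
def conciseness_score(summary: str, context: str) -> float:
    """1 minus the summary-to-context length ratio, capped at the context length.

    Hypothetical helper; lengths are counted in characters (an assumption).
    """
    # The small constant avoids division by zero for an empty context.
    return 1 - min(len(summary), len(context)) / (len(context) + 1e-10)


context = "x" * 100
short = conciseness_score("x" * 25, context)  # close to 0.75
copied = conciseness_score(context, context)  # close to 0: copying earns nothing
print(round(short, 6), round(copied, 6))
```

A summary that is as long as the context, or longer, scores approximately zero, which is what keeps verbatim copies from scoring highly.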

We also provide a coefficient, coeff (default 0.5), that controls the weight given to each of the two scores.

The final summarization score is then computed as:

\[ \text{Summarization Score} = \text{QA score} \cdot (1 - \text{coeff}) + \text{conciseness score} \cdot \text{coeff} \]
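A small sketch of this weighted average, assuming the two component scores have already been computed. The function name combine_scores is hypothetical and not part of ragas; only the coeff parameter and its default of 0.5 come from the text.

```python
def combine_scores(qa: float, conciseness: float, coeff: float = 0.5) -> float:
    """Weighted average of the QA score and the conciseness score.

    Hypothetical helper for illustration; not a ragas function.
    """
    if not 0.0 <= coeff <= 1.0:
        raise ValueError("coeff must lie in [0, 1]")
    return qa * (1 - coeff) + conciseness * coeff


# A summary that answers every question and is half the context's length.
print(combine_scores(qa=1.0, conciseness=0.5))  # 0.75
# A verbatim copy: every question answered, but conciseness is about 0.
print(combine_scores(qa=1.0, conciseness=0.0))  # 0.5
```

With the default coefficient, a verbatim copy of the context cannot score above roughly 0.5, whereas a shorter summary that still answers every question scores higher.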

Example

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SummarizationScore


# response is the summary being evaluated; reference_contexts holds the source text.
sample = SingleTurnSample(
    response="A company is launching a fitness tracking app that helps users set exercise goals, log meals, and track water intake, with personalized workout suggestions and motivational reminders.",
    reference_contexts=[
        "A company is launching a new product, a smartphone app designed to help users track their fitness goals. The app allows users to set daily exercise targets, log their meals, and track their water intake. It also provides personalized workout recommendations and sends motivational reminders throughout the day."
    ]
)

# evaluator_llm is assumed to be an evaluator LLM object that was set up beforehand.
scorer = SummarizationScore(llm=evaluator_llm)
# Top-level await works in a notebook; in a script, run this inside an async function.
await scorer.single_turn_ascore(sample)

Output

0.6423387096775146