
Domain Specific Evaluation

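The domain-specific evaluation metric is a rubric-based metric for assessing a model's performance in a particular domain. A rubric consists of a description for each score, typically on a scale from 1 to 5. An LLM evaluates each response and assigns a score according to the descriptions in the rubric. The metric comes in both reference-free and reference-based variants.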
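To make the idea concrete, the sketch below shows one way a rubric and a sample could be rendered into a grading instruction for an LLM judge. It is not the prompt Ragas uses internally: the function name `build_grading_prompt` and the prompt wording are hypothetical, and only the `scoreN_description` key format is taken from the custom rubric shown later on this page.

```python
from typing import Dict, Optional


def build_grading_prompt(
    rubrics: Dict[str, str],
    question: str,
    answer: str,
    ground_truth: Optional[str] = None,
) -> str:
    """Render a rubric and one sample into a plain-text grading instruction.

    Hypothetical helper for illustration only; not part of Ragas.
    """
    # Extract the numeric score from keys such as "score3_description".
    def score_of(key: str) -> int:
        return int(key[len("score"):-len("_description")])

    lines = ["Score the answer using this rubric:"]
    for key in sorted(rubrics, key=score_of):
        lines.append(f"{score_of(key)}: {rubrics[key]}")
    lines.append(f"Question: {question}")
    lines.append(f"Answer: {answer}")
    # Reference-based variant: include the ground truth only when it is given.
    if ground_truth is not None:
        lines.append(f"Ground truth: {ground_truth}")
    return "\n".join(lines)


example_rubrics = {
    "score1_description": "answer and ground truth are completely different",
    "score2_description": "answer and ground truth are somewhat different",
    "score3_description": "answer and ground truth are somewhat similar",
    "score4_description": "answer and ground truth are similar",
    "score5_description": "answer and ground truth are exactly the same",
}

# Reference-free call: no ground truth is passed, so none appears in the prompt.
prompt = build_grading_prompt(
    example_rubrics,
    question="What's the longest river in the world?",
    answer="The Nile.",
)
print(prompt)
```

Passing `ground_truth` switches the same helper to the reference-based form, which mirrors the two variants of the metric described above.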

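For example, in RAG, if you have question, contexts, answer, and ground_truth (optional), you can define a rubric suited to your domain (or use the default rubrics provided by Ragas) and evaluate the model with this metric.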

Example

from ragas import evaluate
from datasets import Dataset

from ragas.metrics import rubrics_score_without_reference, rubrics_score_with_reference

rows = {
    "question": [
        "What's the longest river in the world?",
    ],
    "ground_truth": [
        "The Nile is a major north-flowing river in northeastern Africa.",
    ],
    "answer": [
        "The longest river in the world is the Nile, stretching approximately 6,650 kilometers (4,130 miles) through northeastern Africa, flowing through countries such as Uganda, Sudan, and Egypt before emptying into the Mediterranean Sea. There is some debate about this title, as recent studies suggest the Amazon River could be longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers (4,350 miles).",
    ],
    "contexts": [
        [
            "Scientists debate whether the Amazon or the Nile is the longest river in the world. Traditionally, the Nile is considered longer, but recent information suggests that the Amazon may be longer.",
            "The Nile River was central to the Ancient Egyptians' rise to wealth and power. Since rainfall is almost non-existent in Egypt, the Nile River and its yearly floodwaters offered the people a fertile oasis for rich agriculture.",
            "The world's longest rivers are defined as the longest natural streams whose water flows within a channel, or streambed, with defined banks.",
            "The Amazon River could be considered longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers."
        ],
    ]
}



dataset = Dataset.from_dict(rows)

result = evaluate(
    dataset,
    metrics=[
        rubrics_score_without_reference,
        rubrics_score_with_reference
    ],
)

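Here the evaluation uses both the reference-free and the reference-based rubrics. You can also define your own rubric and pass it through the rubrics parameter.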

from ragas.metrics.rubrics import RubricsScoreWithReference

my_custom_rubrics = {
    "score1_description": "answer and ground truth are completely different",
    "score2_description": "answer and ground truth are somewhat different",
    "score3_description": "answer and ground truth are somewhat similar",
    "score4_description": "answer and ground truth are similar",
    "score5_description": "answer and ground truth are exactly the same",
}

rubrics_score_with_reference = RubricsScoreWithReference(rubrics=my_custom_rubrics)
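
Because a custom rubric is a plain dictionary, a typo in a key is easy to miss. The sketch below checks a rubric before it is handed to the metric. It assumes that the five-key layout from `score1_description` to `score5_description` shown above is what the metric expects; `validate_rubrics` is a hypothetical helper, not a Ragas function.

```python
from typing import Dict, List

# Assumed key layout, taken from the custom rubric example above.
EXPECTED_KEYS = [f"score{i}_description" for i in range(1, 6)]


def validate_rubrics(rubrics: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a rubric dict (empty if none)."""
    problems = []
    for key in EXPECTED_KEYS:
        if key not in rubrics:
            problems.append(f"missing key: {key}")
        elif not rubrics[key].strip():
            problems.append(f"empty description: {key}")
    for key in rubrics:
        if key not in EXPECTED_KEYS:
            problems.append(f"unexpected key: {key}")
    return problems


my_custom_rubrics = {
    "score1_description": "answer and ground truth are completely different",
    "score2_description": "answer and ground truth are somewhat different",
    "score3_description": "answer and ground truth are somewhat similar",
    "score4_description": "answer and ground truth are similar",
    "score5_description": "answer and ground truth are exactly the same",
}

problems = validate_rubrics(my_custom_rubrics)
print(problems)
```

Returning a list of problems rather than raising on the first one lets every mistake in a rubric be reported in a single pass.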