Domain Specific Evaluation

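Domain specific evaluation is a rubric-based metric for evaluating a model's performance on a specific domain. The rubric consists of a description for each score, typically ranging from 1 to 5. An LLM evaluates and scores the model's response according to the descriptions specified in the rubric. The metric comes in two variants: reference-free and reference-based.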

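For example, in RAG, if you have the question, contexts, answer, and (optionally) ground_truth, you can choose a rubric suited to your domain (or use the default rubrics provided by ragas) and evaluate the model with this metric.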
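
The column requirements described above can be sketched as a small check that runs before any evaluation. The helper below is hypothetical and not part of ragas. It assumes that question, contexts, and answer are all needed, and that the reference-based variant applies only when a non-empty ground_truth column is present.

```python
# Hypothetical helper, not part of ragas: decide which rubric variants
# a set of rows can support, based on the columns it provides.
# Assumption: question, contexts, and answer are all required.
REQUIRED_COLUMNS = {"question", "contexts", "answer"}
REFERENCE_COLUMN = "ground_truth"


def applicable_variants(rows: dict) -> list:
    """Return the rubric variants that the given columns allow."""
    missing = REQUIRED_COLUMNS - set(rows)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    # The reference-free variant needs only the required columns.
    variants = ["without_reference"]

    # The reference-based variant additionally needs ground_truth values.
    if rows.get(REFERENCE_COLUMN):
        variants.append("with_reference")
    return variants
```

A check like this lets you fail early, before any LLM calls are made, when a dataset lacks the columns a chosen variant depends on.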

Example

from ragas import evaluate
from datasets import Dataset

# Default rubric metrics: one reference-free, one reference-based
from ragas.metrics import rubrics_score_without_reference, rubrics_score_with_reference

rows = {
    "question": [
        "What's the longest river in the world?",
    ],
    "ground_truth": [
        "The Nile is a major north-flowing river in northeastern Africa.",
    ],
    "answer": [
        "The longest river in the world is the Nile, stretching approximately 6,650 kilometers (4,130 miles) through northeastern Africa, flowing through countries such as Uganda, Sudan, and Egypt before emptying into the Mediterranean Sea. There is some debate about this title, as recent studies suggest the Amazon River could be longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers (4,350 miles).",
    ],
    "contexts": [
        [
            "Scientists debate whether the Amazon or the Nile is the longest river in the world. Traditionally, the Nile is considered longer, but recent information suggests that the Amazon may be longer.",
            "The Nile River was central to the Ancient Egyptians' rise to wealth and power. Since rainfall is almost non-existent in Egypt, the Nile River and its yearly floodwaters offered the people a fertile oasis for rich agriculture.",
            "The world's longest rivers are defined as the longest natural streams whose water flows within a channel, or streambed, with defined banks.",
            "The Amazon River could be considered longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers."
        ],
    ]
}



dataset = Dataset.from_dict(rows)

result = evaluate(
    dataset,
    metrics=[
        rubrics_score_without_reference,
        rubrics_score_with_reference
    ],
)

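Here the evaluation uses both the reference-free and the reference-based rubrics. You can also declare and use your own rubric by passing it through the rubrics parameter.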

from ragas.metrics.rubrics import RubricsScoreWithReference

my_custom_rubrics = {
    "score1_description": "answer and ground truth are completely different",
    "score2_description": "answer and ground truth are somewhat different",
    "score3_description": "answer and ground truth are somewhat similar",
    "score4_description": "answer and ground truth are similar",
    "score5_description": "answer and ground truth are exactly the same",
}

# Build the reference-based metric from the custom rubric
rubrics_score_with_reference = RubricsScoreWithReference(rubrics=my_custom_rubrics)
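
Because a custom rubric is a plain dictionary, a typo in one of its keys is easy to miss. The sketch below shows one way to sanity-check a rubric and render it as text for a judge prompt. Both helpers are hypothetical and not part of ragas, the rendered layout is illustrative rather than the prompt ragas actually sends, and the five-point scale is assumed from the key names used in the example above.

```python
# Hypothetical helpers, not part of ragas: validate a custom rubric
# and render it as numbered lines for a judge prompt.
# Assumption: a five-point scale with keys score1_description..score5_description.
EXPECTED_KEYS = [f"score{i}_description" for i in range(1, 6)]


def validate_rubrics(rubrics: dict) -> None:
    """Raise ValueError if the rubric has missing, extra, or empty entries."""
    missing = [k for k in EXPECTED_KEYS if k not in rubrics]
    extra = [k for k in rubrics if k not in EXPECTED_KEYS]
    if missing or extra:
        raise ValueError(f"missing keys: {missing}, unexpected keys: {extra}")
    empty = [k for k in EXPECTED_KEYS if not str(rubrics[k]).strip()]
    if empty:
        raise ValueError(f"empty descriptions: {empty}")


def render_rubrics(rubrics: dict) -> str:
    """Return one 'score: description' line per score, lowest score first."""
    validate_rubrics(rubrics)
    return "\n".join(
        f"{score}: {rubrics[key]}"
        for score, key in enumerate(EXPECTED_KEYS, start=1)
    )


example_rubrics = {
    "score1_description": "answer and ground truth are completely different",
    "score2_description": "answer and ground truth are somewhat different",
    "score3_description": "answer and ground truth are somewhat similar",
    "score4_description": "answer and ground truth are similar",
    "score5_description": "answer and ground truth are exactly the same",
}

print(render_rubrics(example_rubrics))
```

Running the validation before constructing the metric surfaces a misspelled or missing score description immediately, rather than partway through an evaluation run.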