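Aligning LLM Evaluators with Human Judgment

This tutorial is one of a three-part series on using Vertex AI models with Ragas. We recommend starting with Getting Started: Ragas with Vertex AI, although this tutorial is easy to follow without it. You can also follow the link to the Model Comparison tutorial.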

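Overview

In this tutorial, you will learn how to train and align your own custom LLM-based metric using Ragas. LLM-based evaluators are a powerful way to score AI applications, but their judgments can diverge from human expectations because of differences in style, context, or nuance. By following this guide, you will refine your metric so that it reflects human judgment more accurately.

In this tutorial, you will:

Define a model-based metric using Ragas.
Build an EvaluationDataset from the "helpful" subset of the HHH dataset.
Run an initial evaluation to measure the metric's performance.
Review and annotate 15–20 evaluation examples.
Train the metric on your annotated data.
Re-evaluate the metric and observe the improvement in alignment with human judgment.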

Getting Started

Install dependencies

%pip install --upgrade --user --quiet langchain-core langchain-google-vertexai langchain ragas

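Restart the runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. Run the cell below, which restarts the current kernel.

The restart may take a minute or longer. Once it completes, continue to the next step.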

import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the cell below to authenticate your environment.

import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

Set Google Cloud project information and initialize the Vertex AI SDK

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    raise ValueError("Please set your PROJECT_ID")


import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

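Set up the evaluation metric

LLM-based metrics have great potential, but they sometimes misjudge responses compared with human evaluators. To bridge this gap, we use a feedback loop to align the model-based metric with human judgment.

Define evaluator_llm

Import the required wrappers and define your evaluator LLM and embedder.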

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings


evaluator_llm = LangchainLLMWrapper(VertexAI(model_name="gemini-2.0-flash-001"))
evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(model_name="text-embedding-004"))

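Ragas metrics

Ragas offers a variety of model-based metrics that can be fine-tuned to align with human evaluators. For this demonstration, we use the Aspect Critic metric, a user-defined binary metric. See the Aspect Critic documentation for details.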

from ragas.metrics import AspectCritic

helpfulness_critic = AspectCritic(
    name="helpfulness",
    definition="Evaluate how helpful the assistant's response is to the user's query.",
    llm=evaluator_llm
)

You can preview the prompt that will be passed to the LLM (before alignment) by running:

print(helpfulness_critic.get_prompts()["single_turn_aspect_critic_prompt"].instruction)

Output

Evaluate the Input based on the criterial defined. Use only 'Yes' (1) and 'No' (0) as verdict.
Criteria Definition: Evaluate how helpful the assistant's response is to the user's query.

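Define the alignment score

Because the metric is binary, we measure alignment with the F1-score. Depending on the metric you are aligning, you can modify this function to use a different measure of alignment.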

from typing import List
from sklearn.metrics import f1_score

def alignment_score(human_score: List[int], llm_score: List[int]) -> float:
    """
    Computes the alignment between human-annotated binary scores and LLM-generated binary scores
    using the F1-score metric.

    Args:
        human_score (List[int]): Binary labels from human evaluation (0 or 1).
        llm_score (List[int]): Binary labels from LLM predictions (0 or 1).

    Returns:
        float: The F1-score measuring alignment.
    """
    return f1_score(human_score, llm_score)
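To make the score concrete, the sketch below recomputes F1 by hand on a small set of made-up labels and, as one possible alternative measure, computes Cohen's kappa. The label lists and the helper names f1_binary and cohen_kappa are illustrative assumptions; they are not part of Ragas or of the tutorial's data.

```python
from typing import List


def f1_binary(human: List[int], llm: List[int]) -> float:
    """F1 for binary labels, treating 1 as the positive class."""
    tp = sum(1 for h, m in zip(human, llm) if h == 1 and m == 1)
    fp = sum(1 for h, m in zip(human, llm) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, llm) if h == 1 and m == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(human: List[int], llm: List[int]) -> float:
    """Agreement between two binary raters, corrected for chance agreement."""
    n = len(human)
    observed = sum(1 for h, m in zip(human, llm) if h == m) / n
    p_human = sum(human) / n
    p_llm = sum(llm) / n
    expected = p_human * p_llm + (1 - p_human) * (1 - p_llm)
    if expected == 1.0:
        # Both raters used a single identical class; treated here as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for illustration only.
toy_human = [1, 0, 1, 1, 0, 1]
toy_llm = [1, 1, 1, 0, 0, 1]

toy_f1 = f1_binary(toy_human, toy_llm)
toy_kappa = cohen_kappa(toy_human, toy_llm)
print(f"F1: {toy_f1:.2f}, kappa: {toy_kappa:.2f}")
```

On these toy labels the two measures differ noticeably, which is one reason to choose the alignment measure deliberately: F1 ignores cases where both raters say 0, while kappa counts them and discounts agreement expected by chance.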

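Prepare the dataset

The process_hhh_dataset function prepares data from the HHH dataset for training and aligning the LLM evaluator. Examples alternate between target labels: even-indexed examples take the response labeled 1 (helpful) and receive a score of 1, while odd-indexed examples take the response labeled 0 (unhelpful) and receive a score of 0.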

import numpy as np
from datasets import load_dataset
from ragas import EvaluationDataset


def process_hhh_dataset(split: str = "helpful", total_count: int = 50):
    dataset = load_dataset("HuggingFaceH4/hhh_alignment", split, split=f"test[:{total_count}]")
    data = []
    expert_scores = []

    for idx, entry in enumerate(dataset):
        # Extract input and target details
        user_input = entry['input']
        choices = entry['targets']['choices']
        labels = entry['targets']['labels']

        # Choose target based on whether the index is even or odd
        if idx % 2 == 0:
            target_label = 1
            score = 1
        else:
            target_label = 0
            score = 0

        label_index = labels.index(target_label)

        response = choices[label_index]

        data.append({
            'user_input': user_input,
            'response': response,
        })
        expert_scores.append(score)

    return EvaluationDataset.from_list(data), expert_scores

eval_dataset, expert_scores = process_hhh_dataset()
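The alternating selection can be checked without downloading anything. The sketch below applies the same even/odd rule to three made-up entries that mimic the input / targets.choices / targets.labels shape used above. The entries and the helper name pick_response are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Made-up entries in the same shape as the rows read by process_hhh_dataset.
toy_entries: List[Dict[str, Any]] = [
    {
        "input": "How do I boil an egg?",
        "targets": {
            "choices": ["Simmer it for about ten minutes.", "I can't help with that."],
            "labels": [1, 0],
        },
    },
    {
        "input": "What is the capital of France?",
        "targets": {
            "choices": ["Why do you want to know?", "The capital of France is Paris."],
            "labels": [0, 1],
        },
    },
    {
        "input": "Can you suggest a book on Python?",
        "targets": {
            "choices": ["Books are boring.", "Try an introductory Python textbook."],
            "labels": [0, 1],
        },
    },
]


def pick_response(entry: Dict[str, Any], idx: int) -> Tuple[str, int]:
    """Return the helpful choice for even idx and the unhelpful one for odd idx."""
    target_label = 1 if idx % 2 == 0 else 0
    labels = entry["targets"]["labels"]
    choices = entry["targets"]["choices"]
    return choices[labels.index(target_label)], target_label


toy_rows = []
toy_scores = []
for idx, entry in enumerate(toy_entries):
    response, score = pick_response(entry, idx)
    toy_rows.append({"user_input": entry["input"], "response": response})
    toy_scores.append(score)

print(toy_scores)
```

Because the rule depends only on the index, the resulting expert scores are balanced between 0 and 1 regardless of the order of choices within each entry.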

Run the evaluation

With the evaluation dataset and the helpfulness metric defined, you can now run the evaluation:

from ragas import evaluate

results = evaluate(eval_dataset, metrics=[helpfulness_critic])

Evaluating: 100%|██████████| 50/50 [00:00<?, ?it/s]

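This initial run shows how far the LLM-based evaluator is from human judgment, which the subsequent training addresses.

Next, benchmark the metric's performance against the expert scores: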

human_score = expert_scores
llm_score = results.to_pandas()["helpfulness"].values

initial_score = alignment_score(human_score, llm_score)
initial_score

Output

0.8076923076923077

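Review and annotate

Now that you have the evaluation results, it is time to review and annotate them. As discussed in the blog post Aligning LLM as judge with human evaluators, collecting detailed feedback is essential to bridging the gap between LLM-based and human evaluation. Annotate at least 15–20 examples that cover the different scenarios in which the metric may be misaligned.

A sample annotation file for the examples above is available; you can download and use it.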
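One way to decide which examples to review first is to list the cases where the LLM verdict and the expert score disagree. The sketch below does this on made-up scores. The helper name find_disagreements is hypothetical, and starting from disagreements is a suggestion rather than part of the Ragas workflow.

```python
from typing import List, Sequence


def find_disagreements(human: Sequence[int], llm: Sequence[int]) -> List[int]:
    """Return the indices where the human and LLM binary verdicts differ."""
    return [i for i, (h, m) in enumerate(zip(human, llm)) if int(h) != int(m)]


# Made-up scores for illustration only.
toy_expert = [1, 0, 1, 0, 1, 0]
toy_verdicts = [1, 1, 1, 0, 0, 0]

to_review = find_disagreements(toy_expert, toy_verdicts)
print(to_review)  # [1, 4]
```

In this tutorial, the same helper could be called with expert_scores and llm_score from the previous step. Including a few examples where the two already agree keeps the annotated set from covering only failure cases.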

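Train and align

The next step is to train your metric on the annotated examples. Training uses a gradient-free prompt optimization approach that adjusts the instruction and the few-shot demonstrations based on the annotated feedback.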

from ragas.config import InstructionConfig, DemonstrationConfig

demo_config = DemonstrationConfig(embedding=evaluator_embeddings)
inst_config = InstructionConfig(llm=evaluator_llm)

helpfulness_critic.train(
    path="annotated_data.json",
    instruction_config=inst_config,
    demonstration_config=demo_config,
)

Overall Progress: 100%|██████████| 170/170 [00:00<?, ?it/s]

Few-shot examples [single_turn_aspect_critic_prompt]: 100%|██████████| 16/16 [00:00<?, ?it/s]
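As a rough intuition for what "gradient-free" means here, the toy sketch below scores a handful of candidate judges against annotated labels and keeps the one that agrees most often; no gradients or model weights are involved. This is not Ragas's optimizer: the candidate judges are hypothetical keyword and length rules standing in for LLM calls made with different instructions, and the annotated examples are made up.

```python
from typing import Callable, Dict, List, Tuple

Judge = Callable[[str], int]

# Made-up (response, human label) pairs standing in for annotated data.
toy_annotations: List[Tuple[str, int]] = [
    ("Here are three steps you can follow to reset your password.", 1),
    ("I don't know.", 0),
    ("You could try the official docs; the install guide walks through everything.", 1),
    ("That is not my problem.", 0),
]

# Hypothetical candidates; in a real optimizer each would be an LLM call
# driven by a different candidate instruction.
candidate_judges: Dict[str, Judge] = {
    "always_helpful": lambda response: 1,
    "long_is_helpful": lambda response: int(len(response.split()) >= 8),
    "mentions_steps": lambda response: int("steps" in response.lower()),
}


def agreement(judge: Judge, annotations: List[Tuple[str, int]]) -> float:
    """Fraction of annotated examples on which the judge matches the human label."""
    hits = sum(1 for response, label in annotations if judge(response) == label)
    return hits / len(annotations)


# Evaluate every candidate on the annotations and keep the best one.
candidate_scores = {
    name: agreement(judge, toy_annotations)
    for name, judge in candidate_judges.items()
}
best_candidate = max(candidate_scores, key=candidate_scores.get)
print(candidate_scores, best_candidate)
```

The search only needs a way to propose candidates and a score computed against human annotations, which is why a small, varied set of annotated examples matters for the training step.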

After training, review the updated instruction that was optimized for the metric:

print(helpfulness_critic.get_prompts()["single_turn_aspect_critic_prompt"].instruction)

Output

You are provided with a user input and an assistant/model response. Your task is to evaluate the quality of the response based on how well it addresses the user input, considering all requests and constraints. Assign a score/verdict of 1 if the response is helpful, appropriate, and effective, and 0 if it is not. A good response should be accurate, complete, relevant, and provide a tangible improvement or solution, without omitting key information. Provide a brief explanation for your score/verdict.

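Re-evaluate

Now that your metric is aligned with human feedback, rerun the evaluation on your dataset. This step lets you benchmark the improvement and quantify how much the alignment process increased the metric's reliability.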

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings


# Use the same evaluator model as in the initial run so the two scores are comparable.
evaluator_llm = LangchainLLMWrapper(VertexAI(model_name="gemini-2.0-flash-001"))
evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(model_name="text-embedding-004"))

from ragas import evaluate

results2 = evaluate(eval_dataset, metrics=[helpfulness_critic])

Evaluating: 100%|██████████| 50/50 [00:00<?, ?it/s]

Benchmark the updated results against the expert scores:

human_score = expert_scores
llm_score = results2.to_pandas()["helpfulness"].values

new_score = alignment_score(human_score, llm_score)
new_score

Output

0.8444444444444444
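To put the two reported scores side by side, the short sketch below computes the absolute and relative change in F1. The two values are copied from the outputs shown in this tutorial; the variable names are illustrative.

```python
# F1 scores reported above, before and after training the metric.
initial_f1 = 0.8076923076923077
aligned_f1 = 0.8444444444444444

absolute_gain = aligned_f1 - initial_f1
relative_gain = absolute_gain / initial_f1

print(f"absolute gain: {absolute_gain:.4f}")  # absolute gain: 0.0368
print(f"relative gain: {relative_gain:.1%}")  # relative gain: 4.6%
```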

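Explore other tutorials in this series

Ragas with Vertex AI: Learn how to use Vertex AI models with Ragas to evaluate your LLM workflows.
Model Comparison: Use Ragas metrics to compare the models available on Vertex AI on a RAG-based question-answering task.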