
Compare models provided by Vertex AI on a RAG-based Q&A task using Ragas metrics

This tutorial is part of a three-part series on using Vertex AI models with Ragas. It is recommended that you read Getting Started: Ragas with Vertex AI first, but you can follow this tutorial without it. The "Align LLM Metrics" tutorial is available here.

Overview

In this tutorial, you will learn how to use Ragas to score and evaluate different LLMs on a question-answering (QA) task, and then visualize and compare the evaluation results to choose a generation model.

Getting Started

Install dependencies

%pip install --upgrade --user --quiet langchain-core langchain-google-vertexai langchain ragas rouge_score

Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the cell below to authenticate your environment.

import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

Set Google Cloud project information and initialize the Vertex AI SDK

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    raise ValueError("Please set your PROJECT_ID")


import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

Helper functions

Below are some helper functions for displaying evaluation reports and visualizing the evaluation results.

import pandas as pd
import plotly.graph_objects as go
from IPython.display import HTML, Markdown, display


def display_eval_report(eval_result, metrics=None):
    """Display the evaluation results."""

    title, summary_metrics, report_df = eval_result
    metrics_df = pd.DataFrame.from_dict(summary_metrics, orient="index").T
    if metrics:
        metrics_df = metrics_df.filter(
            [
                metric
                for metric in metrics_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )
        report_df = report_df.filter(
            [
                metric
                for metric in report_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    # Display the title with Markdown for emphasis
    display(Markdown(f"## {title}"))

    # Display the metrics DataFrame
    display(Markdown("### Summary Metrics"))
    display(metrics_df)

    # Display the detailed report DataFrame
    display(Markdown("### Report Metrics"))
    display(report_df)


def plot_radar_plot(eval_results, max_score=5, metrics=None):
    fig = go.Figure()

    for eval_result in eval_results:
        title, summary_metrics, report_df = eval_result

        if metrics:
            summary_metrics = {
                k: summary_metrics[k]
                for k, v in summary_metrics.items()
                if any(selected_metric in k for selected_metric in metrics)
            }

        fig.add_trace(
            go.Scatterpolar(
                r=list(summary_metrics.values()),
                theta=list(summary_metrics.keys()),
                fill="toself",
                name=title,
            )
        )

    fig.update_layout(
        polar=dict(radialaxis=dict(visible=True, range=[0, max_score])), showlegend=True
    )

    fig.show()


def plot_bar_plot(eval_results, metrics=None):
    data = []

    for eval_result in eval_results:
        title, summary_metrics, _ = eval_result
        if metrics:
            summary_metrics = {
                k: summary_metrics[k]
                for k, v in summary_metrics.items()
                if any(selected_metric in k for selected_metric in metrics)
            }

        data.append(
            go.Bar(
                x=list(summary_metrics.keys()),
                y=list(summary_metrics.values()),
                name=title,
            )
        )

    fig = go.Figure(data=data)

    # Change the bar mode
    fig.update_layout(barmode="group")
    fig.show()

Set up evaluation using Ragas metrics

Define the evaluator_llm

To use model-based metrics, first define your evaluator LLM and embeddings.

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings


evaluator_llm = LangchainLLMWrapper(VertexAI(model_name="gemini-pro"))
evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(model_name="text-embedding-004"))
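
If you want to confirm that your project can reach these models before running a full evaluation, a minimal sanity check like the following can help. It simply calls the same LangChain wrappers directly; the prompt text is arbitrary.

# Optional sanity check: confirm the evaluator LLM and embedding model respond.
print(VertexAI(model_name="gemini-pro").invoke("Reply with the single word OK."))
print(len(VertexAIEmbeddings(model_name="text-embedding-004").embed_query("hello")))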

Ragas metrics

Select and define the metrics that are most relevant to your application.

from ragas import evaluate
from ragas.metrics import ContextPrecision, Faithfulness, RubricsScore, RougeScore

rouge_score = RougeScore()

helpfulness_rubrics = {
    "score1_description": "Response is useless/irrelevant, contains inaccurate/deceptive/misleading information, and/or contains harmful/offensive content. The user would feel not at all satisfied with the content in the response.",
    "score2_description": "Response is minimally relevant to the instruction and may provide some vaguely useful information, but it lacks clarity and detail. It might contain minor inaccuracies. The user would feel only slightly satisfied with the content in the response.",
    "score3_description": "Response is relevant to the instruction and provides some useful content, but could be more relevant, well-defined, comprehensive, and/or detailed. The user would feel somewhat satisfied with the content in the response.",
    "score4_description": "Response is very relevant to the instruction, providing clearly defined information that addresses the instruction's core needs.  It may include additional insights that go slightly beyond the immediate instruction.  The user would feel quite satisfied with the content in the response.",
    "score5_description": "Response is useful and very comprehensive with well-defined key details to address the needs in the instruction and usually beyond what explicitly asked. The user would feel very satisfied with the content in the response.",
}

rubrics_score = RubricsScore(name="helpfulness", rubrics=helpfulness_rubrics)
context_precision = ContextPrecision(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)

Prepare your dataset

To perform evaluations using Ragas metrics, you need to convert your data into an EvaluationDataset, a core data type in Ragas. For more details on its structure, refer to the Ragas documentation.

# questions or query from user
user_inputs = [
    "Which part of the brain does short-term memory seem to rely on?",
    "What provided the Roman senate with exuberance?",
    "What area did the Hasan-jalalians command?",
]

# retrieved data used in answer generation
retrieved_contexts = [
    ["Short-term memory is supported by transient patterns of neuronal communication, dependent on regions of the frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe. Long-term memory, on the other hand, is maintained by more stable and permanent changes in neural connections widely spread throughout the brain. The hippocampus is essential (for learning new information) to the consolidation of information from short-term to long-term memory, although it does not seem to store information itself. Without the hippocampus, new memories are unable to be stored into long-term memory, as learned from patient Henry Molaison after removal of both his hippocampi, and there will be a very short attention span. Furthermore, it may be involved in changing neural connections for a period of three months or more after the initial learning."],
    ["In 62 BC, Pompey returned victorious from Asia. The Senate, elated by its successes against Catiline, refused to ratify the arrangements that Pompey had made. Pompey, in effect, became powerless. Thus, when Julius Caesar returned from a governorship in Spain in 61 BC, he found it easy to make an arrangement with Pompey. Caesar and Pompey, along with Crassus, established a private agreement, now known as the First Triumvirate. Under the agreement, Pompey's arrangements would be ratified. Caesar would be elected consul in 59 BC, and would then serve as governor of Gaul for five years. Crassus was promised a future consulship."],
    ["The Seljuk Empire soon started to collapse. In the early 12th century, Armenian princes of the Zakarid noble family drove out the Seljuk Turks and established a semi-independent Armenian principality in Northern and Eastern Armenia, known as Zakarid Armenia, which lasted under the patronage of the Georgian Kingdom. The noble family of Orbelians shared control with the Zakarids in various parts of the country, especially in Syunik and Vayots Dzor, while the Armenian family of Hasan-Jalalians controlled provinces of Artsakh and Utik as the Kingdom of Artsakh."],
]

# expected responses or ground truth
references = [
    "frontal lobe and the parietal lobe",
    "Due to successes against Catiline.",
    "The Hasan-Jalalians commanded the area of Artsakh and Utik.",
]

Next, generate responses for each question with the two candidate Gemini models, using the same prompt template for both.

from vertexai.generative_models import GenerativeModel

generation_config = {
    "max_output_tokens": 128,
    "temperature": 0.1,
}

model_a_name = "gemini-1.5-pro"
model_b_name = "gemini-1.0-pro"

gemini_model_15 = GenerativeModel(
    model_a_name,
    generation_config=generation_config,
)

gemini_model_1 = GenerativeModel(
    model_b_name,
    generation_config=generation_config,
)
responses_a = []
responses_b = []

# Template for creating the prompt
template = """Answer the question based only on the following context:
{context}

Question: {query}
"""

# Iterate through each user input and corresponding context
for i in range(len(user_inputs)):
    # Join the list of retrieved contexts into a single string
    context_str = "\n".join(retrieved_contexts[i])

    # Build the prompt and generate a response with the Gemini 1.5 Pro model
    gemini_15_prompt = template.format(context=context_str, query=user_inputs[i])

    gemini_15_response = gemini_model_15.generate_content(gemini_15_prompt)
    responses_a.append(gemini_15_response.text)

    # Build the prompt and generate a response with the Gemini 1.0 Pro model
    gemini_1_prompt = template.format(context=context_str, query=user_inputs[i])

    gemini_1_response = gemini_model_1.generate_content(gemini_1_prompt)
    responses_b.append(gemini_1_response.text)

Convert these into a Ragas EvaluationDataset

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset

n = len(user_inputs)

samples_a = []
samples_b = []

for i in range(n):
    sample_a = SingleTurnSample(
        user_input=user_inputs[i],
        retrieved_contexts=retrieved_contexts[i],
        response=responses_a[i],
        reference=references[i],
    )
    sample_b = SingleTurnSample(
        user_input=user_inputs[i],
        retrieved_contexts=retrieved_contexts[i],
        response=responses_b[i],
        reference=references[i],
    )

    samples_a.append(sample_a)
    samples_b.append(sample_b)

ragas_eval_dataset_a = EvaluationDataset(samples=samples_a)
ragas_eval_dataset_b = EvaluationDataset(samples=samples_b)

ragas_eval_dataset_a.to_pandas()
Output

|   | user_input | retrieved_contexts | response | reference |
|---|------------|--------------------|----------|-----------|
| 0 | Which part of the brain does short-term memo... | [Short-term memory is supported by transient p... | Short-term memory relies on regions of the **frontal lobe** and **parietal lobe**... | frontal lobe and the parietal lobe |
| 1 | What provided the Roman senate with exuberance? | [In 62 BC, Pompey returned victorious from Asia... | The Roman Senate was elated by its successes against Catiline... | Due to successes against Catiline. |
| 2 | What area did the Hasan-jalalians command? | [The Seljuk Empire soon started to collapse. In... | The Hasan-Jalalians controlled the provinces of Artsakh and Utik... | The Hasan-Jalalians commanded the area of Artsakh... |

ragas_eval_dataset_b.to_pandas()
Output

|   | user_input | retrieved_contexts | response | reference |
|---|------------|--------------------|----------|-----------|
| 0 | Which part of the brain does short-term memo... | [Short-term memory is supported by transient p... | The frontal lobe, especially the dorsolateral prefrontal cortex... | frontal lobe and the parietal lobe |
| 1 | What provided the Roman senate with exuberance? | [In 62 BC, Pompey returned victorious from Asia... | The Roman Senate's exuberance stemmed from its... | Due to successes against Catiline. |
| 2 | What area did the Hasan-jalalians command? | [The Seljuk Empire soon started to collapse. In... | The Hasan-Jalalians controlled the provinces of Artsakh and Utik... | The Hasan-Jalalians commanded the area of Artsakh... |

Run the evaluation

Evaluate the dataset with Ragas by passing the dataset and the list of desired metrics to the evaluate function.

from ragas import evaluate

ragas_metrics = [
    context_precision,
    faithfulness,
    rouge_score,
    rubrics_score,
]

ragas_result_rag_a = evaluate(
    dataset=ragas_eval_dataset_a, metrics=ragas_metrics, llm=evaluator_llm
)

ragas_result_rag_b = evaluate(
    dataset=ragas_eval_dataset_b, metrics=ragas_metrics, llm=evaluator_llm
)
Evaluating: 100%|██████████| 12/12 [00:00<?, ?it/s]

Evaluating: 100%|██████████| 12/12 [00:00<?, ?it/s]
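
If you want a quick look at the aggregate scores before wrapping them, you can print the Ragas result objects directly; their repr summarizes the averaged score for each metric.

# Quick look at the aggregate Ragas scores for each model.
print(ragas_result_rag_a)
print(ragas_result_rag_b)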

Wrap the results in Google's EvalResult structure

from vertexai.evaluation import EvalResult

result_rag_a = EvalResult(
    summary_metrics=ragas_result_rag_a._repr_dict,
    metrics_table=ragas_result_rag_a.to_pandas(),
)

result_rag_b = EvalResult(
    summary_metrics=ragas_result_rag_b._repr_dict,
    metrics_table=ragas_result_rag_b.to_pandas(),
)

Compare the evaluation results

View summary results

To see a combined summary of all evaluation metrics in a single table, simply call the display_eval_report() helper function.

display_eval_report(
    eval_result=(
        f"{model_a_name} Eval Result",
        result_rag_a.summary_metrics,
        result_rag_a.metrics_table,
    ),
)
Output

gemini-1.5-pro Eval Result

Summary Metrics

|   | context_precision | faithfulness | rouge_score(mode=fmeasure) | helpfulness |
|---|-------------------|--------------|----------------------------|-------------|
| 0 | 0.666667 | 1.0 | 0.56 | 4.333333 |

Report Metrics

|   | user_input | retrieved_contexts | response | reference | context_precision | faithfulness | rouge_score(mode=fmeasure) | helpfulness |
|---|------------|--------------------|----------|-----------|-------------------|--------------|----------------------------|-------------|
| 0 | Which part of the brain does short-term memo... | [Short-term memory is supported by transient p... | Short-term memory relies on regions of the **frontal lobe** and **parietal lobe**... | frontal lobe and the parietal lobe | 1.0 | 1.0 | 0.48 | 5 |
| 1 | What provided the Roman senate with exuberance? | [In 62 BC, Pompey returned victorious from Asia... | The Roman Senate was elated by its successes against Catiline... | Due to successes against Catiline. | 0.0 | 1.0 | 0.40 | 4 |
| 2 | What area did the Hasan-jalalians command? | [The Seljuk Empire soon started to collapse. In... | The Hasan-Jalalians controlled the provinces of Artsakh and Utik... | The Hasan-Jalalians commanded the area of Artsakh... | 1.0 | 1.0 | 0.80 | 4 |

display_eval_report(
    (
        f"{model_b_name} Eval Result",
        result_rag_b.summary_metrics,
        result_rag_b.metrics_table,
    )
)
Output

gemini-1.0-pro Eval Result

Summary Metrics

|   | context_precision | faithfulness | rouge_score(mode=fmeasure) | helpfulness |
|---|-------------------|--------------|----------------------------|-------------|
| 0 | 1.0 | 0.916667 | 0.479034 | 4.0 |

Report Metrics

|   | user_input | retrieved_contexts | response | reference | context_precision | faithfulness | rouge_score(mode=fmeasure) | helpfulness |
|---|------------|--------------------|----------|-----------|-------------------|--------------|----------------------------|-------------|
| 0 | Which part of the brain does short-term memo... | [Short-term memory is supported by transient p... | The frontal lobe, especially the dorsolateral prefrontal cortex... | frontal lobe and the parietal lobe | 1.0 | 1.00 | 0.666667 | 4 |
| 1 | What provided the Roman senate with exuberance? | [In 62 BC, Pompey returned victorious from Asia... | The Roman Senate's exuberance stemmed from its... | Due to successes against Catiline. | 1.0 | 0.75 | 0.130435 | 4 |
| 2 | What area did the Hasan-jalalians command? | [The Seljuk Empire soon started to collapse. In... | The Hasan-Jalalians controlled the provinces of Artsakh and Utik... | The Hasan-Jalalians commanded the area of Artsakh... | 1.0 | 1.00 | 0.640000 | 4 |

Visualize the evaluation results

eval_results = []

eval_results.append(
    (model_a_name, result_rag_a.summary_metrics, result_rag_a.metrics_table)
)
eval_results.append(
    (model_b_name, result_rag_b.summary_metrics, result_rag_b.metrics_table)
)

plot_radar_plot(eval_results, max_score=5)
Radar Plot

plot_bar_plot(eval_results)
Bar Plot
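
Both plotting helpers also accept an optional metrics argument (see the helper functions defined earlier) that filters scores by substring match, which is convenient when you only want to compare a subset of the metrics.

# Focus the charts on a subset of metrics; names are matched by substring.
plot_radar_plot(eval_results, max_score=5, metrics=["faithfulness", "helpfulness"])
plot_bar_plot(eval_results, metrics=["rouge_score"])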

See the other tutorials in this series

  • Ragas with Vertex AI: Learn how to use Vertex AI models with Ragas to evaluate your LLM workflows.
  • Align LLM Metrics: Train and align your LLM evaluators to better match human judgments.