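Getting Started: Using Ragas with Vertex AI

This tutorial is part of a three-part series on using Vertex AI models with Ragas. This first part lays the foundation; the other two can be taken in either order. Use the links below to navigate to the other tutorials:

Align LLM Metrics: Train and align your LLM evaluator so that it better matches human judgment.
Model Comparison: Compare models provided by VertexAI on RAG-based Q&A tasks using Ragas metrics.

Let's get started!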

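Overview

This notebook shows how to get started with generative AI evaluation using Ragas and the generative models in Vertex AI Studio.

Ragas is a comprehensive evaluation library designed to improve how you evaluate LLM applications. It provides a suite of tools and metrics that let developers systematically assess and optimize AI applications.

In this tutorial, we will explore:

Preparing data for Ragas evaluation
An overview of the different types of metrics provided by Ragas

For more use cases and advanced features, see the evaluation use cases in the documentation and how-to guides sections.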

Getting Started

Install dependencies

!pip install --upgrade --user --quiet langchain-core langchain-google-vertexai langchain ragas rouge_score

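Restart the runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart may take a minute or longer. Once it has finished, continue to the next step.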

import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the cell below to authenticate your environment.

import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

Set Google Cloud project information and initialize the Vertex AI SDK

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    raise ValueError("Please set your PROJECT_ID")


import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)
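
As an optional variation on the cell above, the project ID could be read from an environment variable instead of being hard-coded. The sketch below is an assumption and not part of the original notebook: the helper name `resolve_project_id` is hypothetical, and `GOOGLE_CLOUD_PROJECT` is used as the variable name on the assumption that your environment sets it.

```python
import os

PLACEHOLDER = "[your-project-id]"


def resolve_project_id(explicit: str = PLACEHOLDER, env_var: str = "GOOGLE_CLOUD_PROJECT") -> str:
    """Return a usable project ID or raise, mirroring the check in the cell above."""
    # Prefer an explicitly set value; otherwise fall back to the environment.
    if explicit and explicit != PLACEHOLDER:
        return explicit
    from_env = os.environ.get(env_var, "").strip()
    if from_env:
        return from_env
    raise ValueError("Please set your PROJECT_ID")
```

The returned value could then be passed as `project=` to `vertexai.init`, exactly as `PROJECT_ID` is above.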

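Set up evaluation with Ragas metrics

Ragas is an open-source library that gives you a comprehensive set of tools for evaluating AI applications. Whether you need computation-based metrics, model-based metrics, or the flexibility to define your own custom metrics, Ragas has you covered. For more details and a full list of supported metrics, see the Ragas documentation.

In the sections below, you will learn how to use the different types of metrics available in Ragas:

Custom metrics: Define and integrate the metrics that best fit your application's evaluation needs.
Model-based metrics: Evaluations that use LLM calls to analyze model output against specific criteria, with or without a reference.
Computation-based metrics: Quantitative measures based on mathematical formulas that require no LLM calls.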

Define the evaluator LLM

To use and define model-based metrics, you need to configure an evaluator LLM and evaluator embeddings.

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings


evaluator_llm = LangchainLLMWrapper(VertexAI(model_name="gemini-pro"))
evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(model_name="text-embedding-004"))

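Build your own model-based pointwise metrics with Ragas

Custom, or user-defined, metrics let you tailor the evaluation criteria with scoring rules, rubrics, or any custom logic executed through LLM calls, so that responses are assessed against the requirements of your specific task. For more details, see the general-purpose metrics documentation.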

from ragas import evaluate
from ragas.metrics import AspectCritic, RubricsScore

maliciousness_definition = (
    "Is the response intended to harm, deceive, or exploit users?"
)

aspect_critic = AspectCritic(
    name="maliciousness",
    definition=maliciousness_definition,
    llm=evaluator_llm,
)

# adapted from Google's helpfulness_prompt_template
helpfulness_rubrics = {
    "score1_description": "Response is useless/irrelevant, contains inaccurate/deceptive/misleading information, and/or contains harmful/offensive content. The user would feel not at all satisfied with the content in the response.",
    "score2_description": "Response is minimally relevant to the instruction and may provide some vaguely useful information, but it lacks clarity and detail. It might contain minor inaccuracies. The user would feel only slightly satisfied with the content in the response.",
    "score3_description": "Response is relevant to the instruction and provides some useful content, but could be more relevant, well-defined, comprehensive, and/or detailed. The user would feel somewhat satisfied with the content in the response.",
    "score4_description": "Response is very relevant to the instruction, providing clearly defined information that addresses the instruction's core needs.  It may include additional insights that go slightly beyond the immediate instruction.  The user would feel quite satisfied with the content in the response.",
    "score5_description": "Response is useful and very comprehensive with well-defined key details to address the needs in the instruction and usually beyond what explicitly asked. The user would feel very satisfied with the content in the response.",
}

rubrics_score = RubricsScore(name="helpfulness", rubrics=helpfulness_rubrics, llm=evaluator_llm)
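
Before handing a rubric to the metric, it can help to check that the dictionary is well formed. The sketch below is a hypothetical helper (`check_rubrics` is not part of Ragas); the `scoreN_description` key pattern is inferred from the rubric dictionary above, and whether the library enforces it is an assumption.

```python
import re


def check_rubrics(rubrics: dict) -> list:
    """Return the score levels found, raising if keys are malformed or non-consecutive."""
    pattern = re.compile(r"^score(\d+)_description$")
    levels = []
    for key, text in rubrics.items():
        match = pattern.match(key)
        if match is None:
            raise ValueError(f"Unexpected rubric key: {key!r}")
        if not text.strip():
            raise ValueError(f"Empty description for {key!r}")
        levels.append(int(match.group(1)))
    levels.sort()
    # Expect levels 1..N with no gaps, like the five-level rubric above.
    if levels != list(range(1, len(levels) + 1)):
        raise ValueError(f"Score levels are not consecutive from 1: {levels}")
    return levels


# Shortened, illustrative descriptions (not the tutorial's full rubric).
example = {
    "score1_description": "Response is useless or irrelevant.",
    "score2_description": "Response is useful and comprehensive.",
}
print(check_rubrics(example))  # [1, 2]
```

Calling `check_rubrics(helpfulness_rubrics)` on the dictionary defined above would be one way to catch a missing or misspelled level before any LLM calls are made.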

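Ragas model-based metrics

Model-based metrics use pre-trained language models to assess generated text by comparing responses against specific criteria, providing nuanced, context-aware evaluation that mimics human judgment. These metrics are computed through LLM calls. For more details, see the model-based metrics documentation.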

from ragas import evaluate
from ragas.metrics import ContextPrecision, Faithfulness

context_precision = ContextPrecision(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)
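
To make the two scores above more concrete, the sketch below works through the arithmetic as it is commonly described for these metrics: context precision as the mean of precision@k over the ranks that hold a relevant chunk, and faithfulness as the fraction of response claims supported by the retrieved context. This is an assumption worth checking against the documentation for your installed Ragas version. In Ragas the relevance verdicts and claim checks come from the evaluator LLM; here they are hard-coded inputs, and both function names are hypothetical.

```python
def context_precision_at_k(relevance):
    """Mean of precision@k over the ranks k whose chunk is relevant (1) rather than not (0)."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def faithfulness_ratio(claim_supported):
    """Fraction of claims in the response that the context supports."""
    if not claim_supported:
        return 0.0  # convention chosen for this sketch when there are no claims
    return sum(claim_supported) / len(claim_supported)


# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2
print(round(context_precision_at_k([1, 0, 1]), 4))  # 0.8333
# One of two claims is supported by the context.
print(faithfulness_ratio([True, False]))  # 0.5
```

Under this definition, a sample with a single retrieved chunk, as in this tutorial's data, can only score 0.0 or 1.0 on context precision.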

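Ragas computation-based metrics

These metrics use established string-matching, n-gram, and statistical methods to quantify text similarity and quality. They are computed purely mathematically, with no LLM calls. For more details, see the computation-based metrics documentation.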

from ragas.metrics import RougeScore

rouge_score = RougeScore()
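
As an illustration of what a computation-based metric does, the sketch below computes a ROUGE-L style F-measure from the longest common subsequence of tokens. It is a simplified stand-in and not the implementation behind `RougeScore`: the tokenizer (lowercased alphanumeric runs) and the function names are assumptions made for this example.

```python
import re


def tokenize(text):
    # Lowercase and keep alphanumeric runs; a simplifying assumption for this sketch.
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_fmeasure(reference, response):
    ref, hyp = tokenize(reference), tokenize(response)
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# One of the reference/response pairs used later in this tutorial.
score = rouge_l_fmeasure(
    "Due to successes against Catiline.",
    "The Roman Senate was filled with exuberance due to successes against Catiline.",
)
print(round(score, 6))  # 0.588235
```

Here all 5 reference tokens appear in order within the 12-token response, so precision is 5/12, recall is 1, and the F-measure is 10/17. That agrees with the `rouge_score(mode=fmeasure)` value reported for this row in the results table later in the tutorial.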

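Prepare your dataset

To run an evaluation with Ragas metrics, you need to convert your data into an EvaluationDataset, a data type defined in Ragas. You can read more about it in the Ragas documentation.

For example, consider the following sample data: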

# questions or query from user
user_inputs = [
    "Which part of the brain does short-term memory seem to rely on?",
    "What provided the Roman senate with exuberance?",
    "What area did the Hasan-jalalians command?",
]

# retrieved data used in answer generation
retrieved_contexts = [
    ["Short-term memory is supported by transient patterns of neuronal communication, dependent on regions of the frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe. Long-term memory, on the other hand, is maintained by more stable and permanent changes in neural connections widely spread throughout the brain. The hippocampus is essential (for learning new information) to the consolidation of information from short-term to long-term memory, although it does not seem to store information itself. Without the hippocampus, new memories are unable to be stored into long-term memory, as learned from patient Henry Molaison after removal of both his hippocampi, and there will be a very short attention span. Furthermore, it may be involved in changing neural connections for a period of three months or more after the initial learning."],
    ["In 62 BC, Pompey returned victorious from Asia. The Senate, elated by its successes against Catiline, refused to ratify the arrangements that Pompey had made. Pompey, in effect, became powerless. Thus, when Julius Caesar returned from a governorship in Spain in 61 BC, he found it easy to make an arrangement with Pompey. Caesar and Pompey, along with Crassus, established a private agreement, now known as the First Triumvirate. Under the agreement, Pompey's arrangements would be ratified. Caesar would be elected consul in 59 BC, and would then serve as governor of Gaul for five years. Crassus was promised a future consulship."],
    ["The Seljuk Empire soon started to collapse. In the early 12th century, Armenian princes of the Zakarid noble family drove out the Seljuk Turks and established a semi-independent Armenian principality in Northern and Eastern Armenia, known as Zakarid Armenia, which lasted under the patronage of the Georgian Kingdom. The noble family of Orbelians shared control with the Zakarids in various parts of the country, especially in Syunik and Vayots Dzor, while the Armenian family of Hasan-Jalalians controlled provinces of Artsakh and Utik as the Kingdom of Artsakh."],
]

# answers generated by the rag
responses = [
    "frontal lobe and the parietal lobe",
    "The Roman Senate was filled with exuberance due to successes against Catiline.",
    "The Hasan-Jalalians commanded the area of Syunik and Vayots Dzor.",
]

# expected responses or ground truth
references = [
    "frontal lobe and the parietal lobe",
    "Due to successes against Catiline.",
    "The Hasan-Jalalians commanded the area of Artsakh and Utik.",
]

Convert this data into a Ragas EvaluationDataset:

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset

n = len(user_inputs)
samples = []


for i in range(n):

    sample = SingleTurnSample(
        user_input=user_inputs[i],
        retrieved_contexts=retrieved_contexts[i],
        response=responses[i],
        reference=references[i],
    )
    samples.append(sample)


ragas_eval_dataset = EvaluationDataset(samples=samples)
ragas_eval_dataset.to_pandas()

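Output

	user_input	retrieved_contexts	response	reference
0	Which part of the brain does short-term memory...	[Short-term memory is supported by transient p...]	frontal lobe and the parietal lobe	frontal lobe and the parietal lobe
1	What provided the Roman senate with exuberance?	[In 62 BC, Pompey returned victorious from Asia...]	The Roman Senate was filled with exuberance due...	Due to successes against Catiline.
2	What area did the Hasan-jalalians command?	[The Seljuk Empire soon started to collapse. In...]	The Hasan-Jalalians commanded the area of Syun...	The Hasan-Jalalians commanded the area of Arts...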
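
The conversion loop above indexes four parallel lists, so a length mismatch or a bare string in place of a list of contexts would only surface as an error partway through. The sketch below is a hypothetical pre-check (`build_rows` is not a Ragas function) that validates the lists and returns one dictionary per sample, using the same field names as the `SingleTurnSample` call above.

```python
def build_rows(user_inputs, retrieved_contexts, responses, references):
    """Zip four parallel lists into per-sample dicts, failing fast on a length mismatch."""
    columns = {
        "user_input": user_inputs,
        "retrieved_contexts": retrieved_contexts,
        "response": responses,
        "reference": references,
    }
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    for i, contexts in enumerate(retrieved_contexts):
        # Each sample's contexts should be a list of strings, not a bare string.
        if isinstance(contexts, str):
            raise TypeError(f"retrieved_contexts[{i}] must be a list of strings")
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


rows = build_rows(
    ["What area did the Hasan-jalalians command?"],
    [["The Armenian family of Hasan-Jalalians controlled provinces of Artsakh and Utik."]],
    ["The Hasan-Jalalians commanded the area of Syunik and Vayots Dzor."],
    ["The Hasan-Jalalians commanded the area of Artsakh and Utik."],
)
print(sorted(rows[0]))
```

Because the keys match the keyword arguments used above, each row could presumably be unpacked as `SingleTurnSample(**row)` when building the sample list.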

Run the evaluation

With the evaluation dataset and the desired metrics defined, you can run the evaluation by passing them to the Ragas evaluate function:

from ragas import evaluate

ragas_metrics = [aspect_critic, context_precision, faithfulness, rouge_score, rubrics_score]

result = evaluate(
    metrics=ragas_metrics,
    dataset=ragas_eval_dataset
)
result

Evaluating: 100%|██████████| 15/15 [00:00<?, ?it/s]

View the detailed scores for each row in the dataset:

result.to_pandas()

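Output

	user_input	retrieved_contexts	response	reference	context_precision	faithfulness	rouge_score(mode=fmeasure)	helpfulness
0	Which part of the brain does short-term memory...	[Short-term memory is supported by transient p...]	frontal lobe and the parietal lobe	frontal lobe and the parietal lobe	1.0	1.0	1.000000	4
1	What provided the Roman senate with exuberance?	[In 62 BC, Pompey returned victorious from Asia...]	The Roman Senate was filled with exuberance due...	Due to successes against Catiline.	0.0	1.0	0.588235	5
2	What area did the Hasan-jalalians command?	[The Seljuk Empire soon started to collapse. In...]	The Hasan-Jalalians commanded the area of Syun...	The Hasan-Jalalians commanded the area of Arts...	1.0	0.0	0.761905	4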
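
Per-row scores are often easier to act on once they are summarized. The sketch below is a standard-library example with hypothetical helper names: it averages each metric and lists the rows that fall under a threshold. The scores are copied by hand from the table above, with the ROUGE column shortened to `rouge_score`. On a real run you would more likely work on the DataFrame returned by `result.to_pandas()`.

```python
from statistics import mean

# Per-row scores copied from the results table above.
rows = [
    {"context_precision": 1.0, "faithfulness": 1.0, "rouge_score": 1.000000, "helpfulness": 4},
    {"context_precision": 0.0, "faithfulness": 1.0, "rouge_score": 0.588235, "helpfulness": 5},
    {"context_precision": 1.0, "faithfulness": 0.0, "rouge_score": 0.761905, "helpfulness": 4},
]


def summarize(rows):
    """Average each metric column across rows."""
    return {name: mean(row[name] for row in rows) for name in rows[0]}


def rows_below(rows, metric, threshold):
    """Return the indices of rows whose score for `metric` is under `threshold`."""
    return [i for i, row in enumerate(rows) if row[metric] < threshold]


summary = summarize(rows)
print({name: round(value, 3) for name, value in summary.items()})
print(rows_below(rows, "faithfulness", 0.5))  # [2]
```

Row 2 is flagged on faithfulness: its response names Syunik and Vayots Dzor, while the reference gives Artsakh and Utik.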

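Check out the other tutorials in this series

Align LLM Metrics: Train and align your LLM evaluator so that it better matches human judgment.
Model Comparison: Compare models provided by VertexAI on RAG-based Q&A tasks using Ragas metrics.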