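Getting Started: Using Ragas with Vertex AI

This tutorial is part of a three-part series on using Vertex AI models with Ragas. The first tutorial lays the groundwork for the later ones; the remaining two can be followed in any order. Use the links below to navigate to the other tutorials:

  • Align LLM metrics: train and align your LLM evaluator to better match human judgment.
  • Model comparison: use Ragas metrics to compare the models offered by VertexAI on a RAG-based question-answering task.

Let's get started!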

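Overview

This notebook demonstrates how to get started with generative AI evaluation using Ragas and the generative models in Vertex AI Studio.

Ragas is a comprehensive evaluation library designed to improve the evaluation of LLM applications. It provides a set of tools and metrics that let developers systematically assess and optimize AI applications.

In this tutorial, we will explore:

  1. Preparing data for Ragas evaluation
  2. An overview of the types of metrics Ragas provides

For additional use cases and advanced features, see the documentation and the evaluation use cases in the How-to section.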

Getting Started

Install dependencies

!pip install --upgrade --user --quiet langchain-core langchain-google-vertexai langchain ragas rouge_score

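Restart the runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart may take a minute or longer. Once it completes, continue to the next step.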

import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the cell below to authenticate your environment.

import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

Set Google Cloud project information and initialize the Vertex AI SDK

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    raise ValueError("Please set your PROJECT_ID")


import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

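Set up evaluation using Ragas metrics

Ragas is an open-source library that gives you a comprehensive set of tools for evaluating AI applications. Whether you need computation-based metrics, model-based metrics, or the flexibility to define your own custom metrics, Ragas has you covered. For more details and to explore all supported metrics, visit the Ragas documentation.

In the following sections, you will learn how to use the different types of metrics available in Ragas:

  • Custom metrics: define and integrate custom metrics that best fit the evaluation of your application.
  • Model-based metrics: evaluations that use LLM calls to analyze model output against specific criteria, with or without a reference.
  • Computation-based metrics: quantitative measures based on mathematical formulas that require no LLM calls.

Define evaluator_llm

To use and define model-based metrics, you need to configure an evaluator LLM and evaluator embeddings.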

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings


evaluator_llm = LangchainLLMWrapper(VertexAI(model_name="gemini-pro"))
evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(model_name="text-embedding-004"))

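Build your own model-based point-wise metrics using Ragas

Custom or user-defined metrics let you tailor the evaluation criteria using scoring rules, rubrics, or any custom logic executed through LLM calls, so that responses are evaluated against the requirements of your specific task. For more details, see the general-purpose metrics documentation.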

from ragas import evaluate
from ragas.metrics import AspectCritic, RubricsScore

maliciousness_definition = (
    "Is the response intended to harm, deceive, or exploit users?"
)

aspect_critic = AspectCritic(
    name="maliciousness",
    definition=maliciousness_definition,
    llm=evaluator_llm,
)

# adapted from Google's helpfulness_prompt_template
helpfulness_rubrics = {
    "score1_description": "Response is useless/irrelevant, contains inaccurate/deceptive/misleading information, and/or contains harmful/offensive content. The user would feel not at all satisfied with the content in the response.",
    "score2_description": "Response is minimally relevant to the instruction and may provide some vaguely useful information, but it lacks clarity and detail. It might contain minor inaccuracies. The user would feel only slightly satisfied with the content in the response.",
    "score3_description": "Response is relevant to the instruction and provides some useful content, but could be more relevant, well-defined, comprehensive, and/or detailed. The user would feel somewhat satisfied with the content in the response.",
    "score4_description": "Response is very relevant to the instruction, providing clearly defined information that addresses the instruction's core needs.  It may include additional insights that go slightly beyond the immediate instruction.  The user would feel quite satisfied with the content in the response.",
    "score5_description": "Response is useful and very comprehensive with well-defined key details to address the needs in the instruction and usually beyond what explicitly asked. The user would feel very satisfied with the content in the response.",
}

rubrics_score = RubricsScore(name="helpfulness", rubrics=helpfulness_rubrics, llm=evaluator_llm)

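Ragas model-based metrics

Model-based metrics use pre-trained language models to evaluate generated text by comparing responses against specific criteria, providing nuanced, context-aware assessments that emulate human judgment. These metrics are computed through LLM calls. For more details, see the model-based metrics documentation.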

from ragas import evaluate
from ragas.metrics import ContextPrecision, Faithfulness

context_precision = ContextPrecision(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)
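
To make the two scores above more concrete, here is a minimal sketch of the arithmetic applied once the LLM has produced its verdicts. It assumes the definitions given in the Ragas documentation: context precision averages precision@k over the ranks that hold a relevant chunk, and faithfulness is the fraction of response claims supported by the retrieved context. The function names and the verdict lists are hypothetical; in Ragas the verdicts come from evaluator_llm, not from hand-written lists.

```python
from typing import List


def context_precision_at_k(relevance: List[int]) -> float:
    """relevance[i] is 1 if the chunk at rank i + 1 was judged relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        # Choice made for this sketch: no relevant chunk means a score of 0.
        return 0.0
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        # precision@rank only contributes when the chunk at this rank is relevant
        score += (hits / rank) * is_relevant
    return score / total_relevant


def faithfulness_ratio(claim_supported: List[bool]) -> float:
    """Fraction of response claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts, for illustration only.
print(context_precision_at_k([1, 0, 1]))
print(faithfulness_ratio([True, False]))
```

With a single retrieved chunk per sample, as in the example data used later in this tutorial, context precision can only be 0.0 or 1.0 under this definition.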

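Ragas computation-based metrics

These metrics use established string-matching, n-gram, and statistical methods to quantify text similarity and quality purely through mathematical computation, with no LLM calls. For more details, see the computation-based metrics documentation.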

from ragas.metrics import RougeScore

rouge_score = RougeScore()
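
As an illustration of what a computation-based metric does, the sketch below computes a ROUGE-L F-measure in pure Python using the longest common subsequence of tokens. It is not the Ragas implementation, which relies on the rouge_score package installed earlier; it assumes the metric defaults to ROUGE-L and uses a simplified lowercase alphanumeric tokenizer. The function names are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    # Simplified tokenizer (assumption): lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_fmeasure(reference: str, response: str) -> float:
    ref, hyp = tokenize(reference), tokenize(response)
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # share of response tokens in the LCS
    recall = lcs / len(ref)     # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_fmeasure(
    "Due to successes against Catiline.",
    "The Roman Senate was filled with exuberance due to successes against Catiline.",
)
print(round(score, 6))  # 0.588235
```

Here the common subsequence has 5 tokens, the response has 12, and the reference has 5, so precision is 5/12, recall is 1, and the F-measure is 10/17.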

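Prepare your dataset

To run an evaluation with Ragas metrics, you need to convert your data into an EvaluationDataset, a data type defined in Ragas. You can read more about it in the Ragas documentation.

For example, consider the following sample data: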

# questions or query from user
user_inputs = [
    "Which part of the brain does short-term memory seem to rely on?",
    "What provided the Roman senate with exuberance?",
    "What area did the Hasan-jalalians command?",
]

# retrieved data used in answer generation
retrieved_contexts = [
    ["Short-term memory is supported by transient patterns of neuronal communication, dependent on regions of the frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe. Long-term memory, on the other hand, is maintained by more stable and permanent changes in neural connections widely spread throughout the brain. The hippocampus is essential (for learning new information) to the consolidation of information from short-term to long-term memory, although it does not seem to store information itself. Without the hippocampus, new memories are unable to be stored into long-term memory, as learned from patient Henry Molaison after removal of both his hippocampi, and there will be a very short attention span. Furthermore, it may be involved in changing neural connections for a period of three months or more after the initial learning."],
    ["In 62 BC, Pompey returned victorious from Asia. The Senate, elated by its successes against Catiline, refused to ratify the arrangements that Pompey had made. Pompey, in effect, became powerless. Thus, when Julius Caesar returned from a governorship in Spain in 61 BC, he found it easy to make an arrangement with Pompey. Caesar and Pompey, along with Crassus, established a private agreement, now known as the First Triumvirate. Under the agreement, Pompey's arrangements would be ratified. Caesar would be elected consul in 59 BC, and would then serve as governor of Gaul for five years. Crassus was promised a future consulship."],
    ["The Seljuk Empire soon started to collapse. In the early 12th century, Armenian princes of the Zakarid noble family drove out the Seljuk Turks and established a semi-independent Armenian principality in Northern and Eastern Armenia, known as Zakarid Armenia, which lasted under the patronage of the Georgian Kingdom. The noble family of Orbelians shared control with the Zakarids in various parts of the country, especially in Syunik and Vayots Dzor, while the Armenian family of Hasan-Jalalians controlled provinces of Artsakh and Utik as the Kingdom of Artsakh."],
]

# answers generated by the rag
responses = [
    "frontal lobe and the parietal lobe",
    "The Roman Senate was filled with exuberance due to successes against Catiline.",
    "The Hasan-Jalalians commanded the area of Syunik and Vayots Dzor.",
]

# expected responses or ground truth
references = [
    "frontal lobe and the parietal lobe",
    "Due to successes against Catiline.",
    "The Hasan-Jalalians commanded the area of Artsakh and Utik.",
]

Convert these into a Ragas EvaluationDataset

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset

n = len(user_inputs)
samples = []


for i in range(n):

    sample = SingleTurnSample(
        user_input=user_inputs[i],
        retrieved_contexts=retrieved_contexts[i],
        response=responses[i],
        reference=references[i],
    )
    samples.append(sample)


ragas_eval_dataset = EvaluationDataset(samples=samples)
ragas_eval_dataset.to_pandas()
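Output

  | user_input | retrieved_contexts | response | reference
0 | Which part of the brain does short-term memory... | [Short-term memory is supported by transient p... | frontal lobe and the parietal lobe | frontal lobe and the parietal lobe
1 | What provided the Roman senate with exuberance? | [In 62 BC, Pompey returned victorious from Asi... | The Roman Senate was filled with exuberance du... | Due to successes against Catiline.
2 | What area did the Hasan-jalalians command? | [The Seljuk Empire soon started to collapse. I... | The Hasan-Jalalians commanded the area of Syun... | The Hasan-Jalalians commanded the area of Arts...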
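
The index-based loop above assumes the four lists have the same length. A minimal sketch of a guard for that assumption follows; it uses only the standard library, and build_rows is a hypothetical helper, not part of Ragas. It checks the lengths up front and pairs the columns with zip, producing one dict per sample whose keys mirror the keyword arguments passed to SingleTurnSample above.

```python
from typing import Dict, List


def build_rows(
    user_inputs: List[str],
    retrieved_contexts: List[List[str]],
    responses: List[str],
    references: List[str],
) -> List[Dict[str, object]]:
    """Pair the parallel columns row by row, failing loudly on a length mismatch."""
    lengths = {
        "user_inputs": len(user_inputs),
        "retrieved_contexts": len(retrieved_contexts),
        "responses": len(responses),
        "references": len(references),
    }
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return [
        {
            "user_input": question,
            "retrieved_contexts": contexts,
            "response": response,
            "reference": reference,
        }
        for question, contexts, response, reference in zip(
            user_inputs, retrieved_contexts, responses, references
        )
    ]


# Placeholder data, for illustration only.
rows = build_rows(
    ["question 1", "question 2"],
    [["context 1"], ["context 2"]],
    ["answer 1", "answer 2"],
    ["reference 1", "reference 2"],
)
print(len(rows))
```

Because the keys match the keyword arguments used earlier, each row could then be expanded into a sample with SingleTurnSample(**row).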

Run the evaluation

With the evaluation dataset and the desired metrics defined, you can run the evaluation by passing them to the Ragas evaluate function:

from ragas import evaluate

ragas_metrics = [aspect_critic, context_precision, faithfulness, rouge_score, rubrics_score]

result = evaluate(
    metrics=ragas_metrics,
    dataset=ragas_eval_dataset
)
result
Evaluating: 100%|██████████| 15/15 [00:00<?, ?it/s]

View the detailed scores for each row in the dataset

result.to_pandas()

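Output

  | user_input | retrieved_contexts | response | reference | maliciousness | context_precision | faithfulness | rouge_score(mode=fmeasure) | helpfulness
0 | Which part of the brain does short-term memory... | [Short-term memory is supported by transient p... | frontal lobe and the parietal lobe | frontal lobe and the parietal lobe | 0 | 1.0 | 1.0 | 1.000000 | 4
1 | What provided the Roman senate with exuberance? | [In 62 BC, Pompey returned victorious from Asi... | The Roman Senate was filled with exuberance du... | Due to successes against Catiline. | 0 | 0.0 | 1.0 | 0.588235 | 5
2 | What area did the Hasan-jalalians command? | [The Seljuk Empire soon started to collapse. I... | The Hasan-Jalalians commanded the area of Syun... | The Hasan-Jalalians commanded the area of Arts... | 0 | 1.0 | 0.0 | 0.761905 | 4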
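
Per-row scores are often easier to act on once they are summarized. The sketch below is one possible way to do that with the standard library only: it copies the numeric scores from the table above into plain dicts, averages each metric, and lists the rows that fall below a threshold. The helper names and the 0.5 threshold are assumptions made for illustration; with pandas available, the same could be done directly on result.to_pandas().

```python
from statistics import mean
from typing import Dict, List

# Scores copied from the output table above (one dict per row).
rows: List[Dict[str, float]] = [
    {"context_precision": 1.0, "faithfulness": 1.0, "rouge_score": 1.000000, "helpfulness": 4},
    {"context_precision": 0.0, "faithfulness": 1.0, "rouge_score": 0.588235, "helpfulness": 5},
    {"context_precision": 1.0, "faithfulness": 0.0, "rouge_score": 0.761905, "helpfulness": 4},
]


def summarize(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean of each metric across all rows."""
    return {metric: mean(row[metric] for row in rows) for metric in rows[0]}


def rows_below(rows: List[Dict[str, float]], metric: str, threshold: float) -> List[int]:
    """Indices of the rows whose score for `metric` is below `threshold`."""
    return [i for i, row in enumerate(rows) if row[metric] < threshold]


print(summarize(rows))
print(rows_below(rows, "faithfulness", 0.5))  # [2]
```

Row 2 is flagged on faithfulness: its response names Syunik and Vayots Dzor, while the retrieved context attributes Artsakh and Utik to the Hasan-Jalalians.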

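Check out the other tutorials in this series

  • Align LLM metrics: train and align your LLM evaluator to better match human judgment.
  • Model comparison: use Ragas metrics to compare the models offered by VertexAI on a RAG-based question-answering task.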