Griptape Integration
If you are familiar with Griptape's RAG Engine and want to start evaluating your RAG system's performance, you are in the right place. In this tutorial, we will explore how to use Ragas to evaluate the responses generated by your Griptape RAG Engine.
Griptape Setup
Setting up our environment
First, let's make sure all the required packages are installed.
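For example, in a notebook environment (a minimal sketch; adjust the package list and versions to your setup):

%pip install ragas griptape langchain-openai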
Creating our dataset
We will use a small dataset of text chunks about the major LLM providers and build a simple RAG pipeline on top of it.
chunks = [
"OpenAI is one of the most recognized names in the large language model space, known for its GPT series of models. These models excel at generating human-like text and performing tasks like creative writing, answering questions, and summarizing content. GPT-4, their latest release, has set benchmarks in understanding context and delivering detailed responses.",
"Anthropic is well-known for its Claude series of language models, designed with a strong focus on safety and ethical AI behavior. Claude is particularly praised for its ability to follow complex instructions and generate text that aligns closely with user intent.",
"DeepMind, a division of Google, is recognized for its cutting-edge Gemini models, which are integrated into various Google products like Bard and Workspace tools. These models are renowned for their conversational abilities and their capacity to handle complex, multi-turn dialogues.",
"Meta AI is best known for its LLaMA (Large Language Model Meta AI) series, which has been made open-source for researchers and developers. LLaMA models are praised for their ability to support innovation and experimentation due to their accessibility and strong performance.",
"Meta AI with it's LLaMA models aims to democratize AI development by making high-quality models available for free, fostering collaboration across industries. Their open-source approach has been a game-changer for researchers without access to expensive resources.",
"Microsoft’s Azure AI platform is famous for integrating OpenAI’s GPT models, enabling businesses to use these advanced models in a scalable and secure cloud environment. Azure AI powers applications like Copilot in Office 365, helping users draft emails, generate summaries, and more.",
"Amazon’s Bedrock platform is recognized for providing access to various language models, including its own models and third-party ones like Anthropic’s Claude and AI21’s Jurassic. Bedrock is especially valued for its flexibility, allowing users to choose models based on their specific needs.",
"Cohere is well-known for its language models tailored for business use, excelling in tasks like search, summarization, and customer support. Their models are recognized for being efficient, cost-effective, and easy to integrate into workflows.",
"AI21 Labs is famous for its Jurassic series of language models, which are highly versatile and capable of handling tasks like content creation and code generation. The Jurassic models stand out for their natural language understanding and ability to generate detailed and coherent responses.",
"In the rapidly advancing field of artificial intelligence, several companies have made significant contributions with their large language models. Notable players include OpenAI, known for its GPT Series (including GPT-4); Anthropic, which offers the Claude Series; Google DeepMind with its Gemini Models; Meta AI, recognized for its LLaMA Series; Microsoft Azure AI, which integrates OpenAI’s GPT Models; Amazon AWS (Bedrock), providing access to various models including Claude (Anthropic) and Jurassic (AI21 Labs); Cohere, which offers its own models tailored for business use; and AI21 Labs, known for its Jurassic Series. These companies are shaping the landscape of AI by providing powerful models with diverse capabilities.",
]
Ingesting the data into a vector store
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

from griptape.drivers.embedding.openai import OpenAiEmbeddingDriver
from griptape.drivers.vector.local import LocalVectorStoreDriver

# Set up a simple vector store with our data
vector_store = LocalVectorStoreDriver(embedding_driver=OpenAiEmbeddingDriver())
vector_store.upsert_collection({"major_llm_providers": chunks})
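Optionally, as a quick sanity check (a minimal sketch assuming the generic query method on Griptape's vector store drivers), you can query the store directly before wiring it into the RAG Engine:

# Retrieve the two most similar chunks for a test query from our namespace
results = vector_store.query(
    "Who provides the Claude models?",
    count=2,
    namespace="major_llm_providers",
)
for entry in results:
    # Each entry carries a similarity score plus metadata about the stored chunk
    print(entry.score, entry.meta)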
Setting up the RAG Engine
from griptape.engines.rag import RagContext, RagEngine
from griptape.engines.rag.modules import (
    PromptResponseRagModule,
    VectorStoreRetrievalRagModule,
)
from griptape.engines.rag.stages import (
    ResponseRagStage,
    RetrievalRagStage,
)

# Create a basic RAG pipeline
rag_engine = RagEngine(
    # Stage for retrieving relevant chunks
    retrieval_stage=RetrievalRagStage(
        retrieval_modules=[
            VectorStoreRetrievalRagModule(
                name="VectorStore_Retriever",
                vector_store_driver=vector_store,
                query_params={"namespace": "major_llm_providers"},
            ),
        ],
    ),
    # Stage for generating a response
    response_stage=ResponseRagStage(
        response_modules=[
            PromptResponseRagModule(),
        ]
    ),
)
Testing our RAG pipeline
Let's run a sample query through the pipeline to make sure it works as expected.
rag_context = RagContext(query="What makes Meta AI’s LLaMA models stand out?")
rag_context = rag_engine.process(rag_context)
rag_context.outputs[0].to_text()
"Meta AI's LLaMA models stand out for their open-source nature, which makes them accessible to researchers and developers. This accessibility supports innovation and experimentation, allowing for collaboration across industries. By making high-quality models available for free, Meta AI aims to democratize AI development, which has been a game-changer for researchers without access to expensive resources."
Ragas Evaluation
Creating a Ragas evaluation dataset
questions = [
    "Who are the major players in the large language model space?",
    "What is Microsoft’s Azure AI platform known for?",
    "What kind of models does Cohere provide?",
]

references = [
    "The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
    "Microsoft’s Azure AI platform is known for integrating OpenAI’s GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
    "Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]
griptape_rag_contexts = []
for que in questions:
    rag_context = RagContext(query=que)
    griptape_rag_contexts.append(rag_engine.process(rag_context))

from ragas.integrations.griptape import transform_to_ragas_dataset

ragas_eval_dataset = transform_to_ragas_dataset(
    grip_tape_rag_contexts=griptape_rag_contexts, references=references
)
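To inspect the resulting evaluation dataset, you can view it as a DataFrame (assuming the standard to_pandas() helper on Ragas evaluation datasets):

ragas_eval_dataset.to_pandas()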
| | user_input | retrieved_contexts | response | reference |
|---|---|---|---|---|
| 0 | Who are the major players in the large language model space?... | [In the rapidly advancing field of artificial intelligence,... | The major players in the large language model space include... | The major players include OpenAI (GPT Series),... |
| 1 | What is Microsoft’s Azure AI platform known for? | [Microsoft’s Azure AI platform is famous for integrating... | Microsoft’s Azure AI platform is known for integrating... | Microsoft’s Azure AI platform is known for integrating... |
| 2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models,... | Cohere provides language models tailored for business use,... | Cohere provides language models tailored for business use,... |
Running the Ragas evaluation
Now, let's evaluate our RAG system using Ragas metrics.
Evaluating retrieval
To evaluate retrieval performance, we can use Ragas' built-in metrics or create custom metrics tailored to our specific needs. For a complete list of available metrics and customization options, see the documentation.
We will use ContextPrecision, ContextRecall, and ContextRelevance to measure retrieval performance:
- ContextPrecision: Measures how well the RAG system's retriever ranks relevant chunks higher in the retrieved contexts for a given query, computed as the mean precision@k across all chunks.
- ContextRecall: Measures the proportion of relevant information that was successfully retrieved from the knowledge base.
- ContextRelevance: Evaluates how relevant the retrieved contexts are to the user query, using dual LLM-as-a-judge ratings.
from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevance
from ragas import evaluate
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)
ragas_metrics = [
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
    ContextRelevance(llm=evaluator_llm),
]
retrieval_results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
retrieval_results.to_pandas()
| | user_input | retrieved_contexts | response | reference | context_precision | context_recall | nv_context_relevance |
|---|---|---|---|---|---|---|---|
| 0 | Who are the major players in the large language model space?... | [In the rapidly advancing field of artificial intelligence,... | The major players in the large language model space include... | The major players include OpenAI (GPT Series),... | 1.000000 | 1.0 | 1.0 |
| 1 | What is Microsoft’s Azure AI platform known for? | [Microsoft’s Azure AI platform is famous for integrating... | Microsoft’s Azure AI platform is known for integrating... | Microsoft’s Azure AI platform is known for integrating... | 1.000000 | 1.0 | 1.0 |
| 2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models,... | Cohere provides language models tailored for business use,... | Cohere provides language models tailored for business use,... | 0.833333 | 1.0 | 1.0 |
Evaluating generation
To measure generation performance, we will use FactualCorrectness, Faithfulness, and ResponseGroundedness:
- FactualCorrectness: Checks whether all statements in the response are supported by the reference answer.
- Faithfulness: Measures how factually consistent the response is with the retrieved context.
- ResponseGroundedness: Measures whether the response is grounded in the provided context, helping identify hallucinated or fabricated information.
from ragas.metrics import FactualCorrectness, Faithfulness, ResponseGroundedness

ragas_metrics = [
    FactualCorrectness(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]

generation_results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
generation_results.to_pandas()
| | user_input | retrieved_contexts | response | reference | factual_correctness(mode=f1) | faithfulness | nv_response_groundedness |
|---|---|---|---|---|---|---|---|
| 0 | Who are the major players in the large language model space?... | [In the rapidly advancing field of artificial intelligence,... | The major players in the large language model space include... | The major players include OpenAI (GPT Series),... | 1.00 | 1.000000 | 1.0 |
| 1 | What is Microsoft’s Azure AI platform known for? | [Microsoft’s Azure AI platform is famous for integrating... | Microsoft’s Azure AI platform is known for integrating... | Microsoft’s Azure AI platform is known for integrating... | 0.57 | 0.833333 | 1.0 |
| 2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models,... | Cohere provides language models tailored for business use,... | Cohere provides language models tailored for business use,... | 0.57 | 1.000000 | 1.0 |
Conclusion
Congratulations! You have successfully set up a Ragas evaluation pipeline for your Griptape RAG system. This evaluation gives you valuable insight into how well your system retrieves relevant information and generates accurate responses.
Remember, RAG evaluation is an iterative process. Use these metrics to identify weaknesses in your system, make improvements, and re-evaluate until you reach your desired level of performance.
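For example, to track scores across iterations, you can persist each run's results (a minimal sketch; assumes pandas is available and the file names are illustrative):

# Save this run's retrieval and generation scores for later comparison
retrieval_results.to_pandas().to_csv("retrieval_eval_run_01.csv", index=False)
generation_results.to_pandas().to_csv("generation_eval_run_01.csv", index=False)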
Happy RAGging! 😄