
Griptape Integration

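If you're familiar with Griptape's RAG Engine and want to start evaluating your RAG system's performance, you're in the right place. In this tutorial, we'll explore how to use Ragas to evaluate the responses generated by your Griptape RAG Engine.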

Griptape Setup

Setting Up Our Environment

First, let's make sure all the required packages are installed:

%pip install "griptape[all]" ragas -q

Creating Our Dataset

We'll use a small dataset of text chunks about major LLM providers and build a simple RAG pipeline on top of it.

chunks = [
    "OpenAI is one of the most recognized names in the large language model space, known for its GPT series of models. These models excel at generating human-like text and performing tasks like creative writing, answering questions, and summarizing content. GPT-4, their latest release, has set benchmarks in understanding context and delivering detailed responses.",
    "Anthropic is well-known for its Claude series of language models, designed with a strong focus on safety and ethical AI behavior. Claude is particularly praised for its ability to follow complex instructions and generate text that aligns closely with user intent.",
    "DeepMind, a division of Google, is recognized for its cutting-edge Gemini models, which are integrated into various Google products like Bard and Workspace tools. These models are renowned for their conversational abilities and their capacity to handle complex, multi-turn dialogues.",
    "Meta AI is best known for its LLaMA (Large Language Model Meta AI) series, which has been made open-source for researchers and developers. LLaMA models are praised for their ability to support innovation and experimentation due to their accessibility and strong performance.",
    "Meta AI with it's LLaMA models aims to democratize AI development by making high-quality models available for free, fostering collaboration across industries. Their open-source approach has been a game-changer for researchers without access to expensive resources.",
    "Microsoft’s Azure AI platform is famous for integrating OpenAI’s GPT models, enabling businesses to use these advanced models in a scalable and secure cloud environment. Azure AI powers applications like Copilot in Office 365, helping users draft emails, generate summaries, and more.",
    "Amazon’s Bedrock platform is recognized for providing access to various language models, including its own models and third-party ones like Anthropic’s Claude and AI21’s Jurassic. Bedrock is especially valued for its flexibility, allowing users to choose models based on their specific needs.",
    "Cohere is well-known for its language models tailored for business use, excelling in tasks like search, summarization, and customer support. Their models are recognized for being efficient, cost-effective, and easy to integrate into workflows.",
    "AI21 Labs is famous for its Jurassic series of language models, which are highly versatile and capable of handling tasks like content creation and code generation. The Jurassic models stand out for their natural language understanding and ability to generate detailed and coherent responses.",
    "In the rapidly advancing field of artificial intelligence, several companies have made significant contributions with their large language models. Notable players include OpenAI, known for its GPT Series (including GPT-4); Anthropic, which offers the Claude Series; Google DeepMind with its Gemini Models; Meta AI, recognized for its LLaMA Series; Microsoft Azure AI, which integrates OpenAI’s GPT Models; Amazon AWS (Bedrock), providing access to various models including Claude (Anthropic) and Jurassic (AI21 Labs); Cohere, which offers its own models tailored for business use; and AI21 Labs, known for its Jurassic Series. These companies are shaping the landscape of AI by providing powerful models with diverse capabilities.",
]

Ingesting Data into the Vector Store

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

from griptape.drivers.embedding.openai import OpenAiEmbeddingDriver
from griptape.drivers.vector.local import LocalVectorStoreDriver

# Set up a simple vector store with our data
vector_store = LocalVectorStoreDriver(embedding_driver=OpenAiEmbeddingDriver())
vector_store.upsert_collection({"major_llm_providers": chunks})
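
To make the ingestion step less of a black box, here is a minimal sketch of what an in-memory vector store does conceptually: embed each chunk, keep the vectors under a namespace, and rank them by cosine similarity at query time. This is a toy illustration and not Griptape's implementation. `ToyVectorStore`, `embed`, and `cosine` are hypothetical names, and the bag-of-words "embedding" merely stands in for a real embedding model such as the one behind `OpenAiEmbeddingDriver`.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store mapping a namespace to (vector, text) pairs."""

    def __init__(self) -> None:
        self._entries: dict[str, list[tuple[Counter, str]]] = {}

    def upsert(self, namespace: str, texts: list[str]) -> None:
        # Embed every text once at ingestion time.
        self._entries.setdefault(namespace, []).extend((embed(t), t) for t in texts)

    def query(self, namespace: str, question: str, k: int = 2) -> list[str]:
        # Rank stored texts by similarity to the embedded question.
        q = embed(question)
        ranked = sorted(
            self._entries.get(namespace, []),
            key=lambda entry: cosine(q, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.upsert(
    "demo",
    [
        "Cohere builds language models for business search and summarization",
        "Meta AI releases the open LLaMA models for researchers",
    ],
)
top = store.query("demo", "Which models does Cohere build", k=1)
print(top)
```

The namespace plays the same role here as the `"major_llm_providers"` key in the tutorial: queries only see chunks stored under the namespace they name.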

Setting Up the RAG Engine

from griptape.engines.rag import RagContext, RagEngine
from griptape.engines.rag.modules import (
    PromptResponseRagModule,
    VectorStoreRetrievalRagModule,
)
from griptape.engines.rag.stages import (
    ResponseRagStage,
    RetrievalRagStage,
)

# Create a basic RAG pipeline
rag_engine = RagEngine(
    # Stage for retrieving relevant chunks
    retrieval_stage=RetrievalRagStage(
        retrieval_modules=[
            VectorStoreRetrievalRagModule(
                name="VectorStore_Retriever",
                vector_store_driver=vector_store,
                query_params={"namespace": "major_llm_providers"},
            ),
        ],
    ),
    # Stage for generating a response
    response_stage=ResponseRagStage(
        response_modules=[
            PromptResponseRagModule(),
        ]
    ),
)
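
The two stages configured above follow a retrieve-then-generate flow. The sketch below shows that flow in plain Python so it runs without an API key. It is only an approximation under stated assumptions: `build_prompt`, `run_pipeline`, and the stub stages are hypothetical, and the prompt template is invented for illustration rather than taken from Griptape.

```python
from typing import Callable


def build_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user query into a single prompt.

    Hypothetical template; the prompt Griptape actually sends will differ.
    """
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )


def run_pipeline(
    query: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> dict:
    """Two-stage flow: retrieve chunks first, then generate from the prompt."""
    chunks = retrieve(query)
    answer = generate(build_prompt(query, chunks))
    return {"query": query, "retrieved_contexts": chunks, "response": answer}


def fake_retrieve(query: str) -> list[str]:
    # Stub retrieval stage: always returns one fixed chunk.
    return ["Cohere offers models tailored for business use."]


def fake_generate(prompt: str) -> str:
    # Stub response stage: a real one would call an LLM with the prompt.
    return "stub answer"


result = run_pipeline("What does Cohere offer?", fake_retrieve, fake_generate)
print(result["retrieved_contexts"])
```

Keeping the retrieved chunks next to the response matters later: the evaluation metrics need both to judge retrieval and generation separately.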

Testing Our RAG Pipeline

Let's run an example query through the RAG pipeline to confirm that it works as expected.

rag_context = RagContext(query="What makes Meta AI’s LLaMA models stand out?")
rag_context = rag_engine.process(rag_context)
rag_context.outputs[0].to_text()
Output
"Meta AI's LLaMA models stand out for their open-source nature, which makes them accessible to researchers and developers. This accessibility supports innovation and experimentation, allowing for collaboration across industries. By making high-quality models available for free, Meta AI aims to democratize AI development, which has been a game-changer for researchers without access to expensive resources."

Ragas Evaluation

Creating the Ragas Evaluation Dataset

questions = [
    "Who are the major players in the large language model space?",
    "What is Microsoft’s Azure AI platform known for?",
    "What kind of models does Cohere provide?",
]

references = [
    "The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
    "Microsoft’s Azure AI platform is known for integrating OpenAI’s GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
    "Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]

griptape_rag_contexts = []

for que in questions:
    rag_context = RagContext(query=que)
    griptape_rag_contexts.append(rag_engine.process(rag_context))

from ragas.integrations.griptape import transform_to_ragas_dataset

ragas_eval_dataset = transform_to_ragas_dataset(
    grip_tape_rag_contexts=griptape_rag_contexts, references=references
)
ragas_eval_dataset.to_pandas()
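|   | user_input | retrieved_contexts | response | reference |
| 0 | Who are the major players in the large language ... | [In the rapidly advancing field of artificial ... | The major players in the large language model ... | The major players include OpenAI (GPT Series), ... |
| 1 | What is Microsoft’s Azure AI platform known for? | [Microsoft’s Azure AI platform is famous for ... | Microsoft’s Azure AI platform is known for integrating ... | Microsoft’s Azure AI platform is known for integrating ... |
| 2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models ... | Cohere provides language models tailored for ... | Cohere provides language models tailored for ... |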
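
Each row of the table above is one evaluation sample with four fields: `user_input`, `retrieved_contexts`, `response`, and `reference`. As a rough sketch of that shape, the hypothetical helper below zips parallel lists into such records. `build_rows` is an assumption made for illustration and is not part of Ragas or Griptape; the field names are the column names shown in the table.

```python
REQUIRED_FIELDS = ("user_input", "retrieved_contexts", "response", "reference")


def build_rows(questions, contexts, responses, references):
    """Zip parallel lists into one record per question (hypothetical helper)."""
    lengths = {len(questions), len(contexts), len(responses), len(references)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")
    return [
        {
            "user_input": q,
            "retrieved_contexts": list(c),  # one list of chunks per question
            "response": r,
            "reference": ref,
        }
        for q, c, r, ref in zip(questions, contexts, responses, references)
    ]


rows = build_rows(
    questions=["What kind of models does Cohere provide?"],
    contexts=[["Cohere is well-known for its language models tailored for business use."]],
    responses=["Cohere provides language models tailored for business use."],
    references=["Cohere provides language models tailored for business use."],
)
print(sorted(rows[0]))
```

Note that `references` must be in the same order as the questions, since rows are matched by position.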

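Running the Ragas Evaluation

Now let's evaluate our RAG system with Ragas metrics.

Evaluating Retrieval

To evaluate retrieval performance, we can use Ragas's built-in metrics or create custom metrics tailored to our needs. For the full list of available metrics and customization options, see the Ragas documentation.

We'll use ContextPrecision, ContextRecall, and ContextRelevance to measure retrieval performance.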

from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevance
from ragas import evaluate
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
    ContextRelevance(llm=evaluator_llm),
]

retrieval_results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
retrieval_results.to_pandas()
Evaluating: 100%|██████████| 9/9 [00:15<00:00,  1.77s/it]

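|   | user_input | retrieved_contexts | response | reference | context_precision | context_recall | nv_context_relevance |
| 0 | Who are the major players in the large language ... | [In the rapidly advancing field of artificial ... | The major players in the large language model ... | The major players include OpenAI (GPT Series), ... | 1.000000 | 1.0 | 1.0 |
| 1 | What is Microsoft’s Azure AI platform known for? | [Microsoft’s Azure AI platform is famous for ... | Microsoft’s Azure AI platform is known for integrating ... | Microsoft’s Azure AI platform is known for integrating ... | 1.000000 | 1.0 | 1.0 |
| 2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models ... | Cohere provides language models tailored for ... | Cohere provides language models tailored for ... | 0.833333 | 1.0 | 1.0 |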
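
To help read these scores, here is a small sketch of the arithmetic behind context precision and context recall, assuming the per-chunk and per-claim verdicts have already been produced by the evaluator LLM. Context precision averages precision@k over the ranks that hold a relevant chunk, so it rewards ranking relevant chunks first. The verdict lists below are hypothetical: the tutorial output does not show the actual verdicts, and `[1, 0, 1]` is simply one pattern that yields 0.833333.

```python
def context_precision(verdicts: list[int]) -> float:
    """Mean of precision@k over the ranks k whose chunk was judged relevant.

    `verdicts` holds 1 (relevant) or 0 (not relevant) per retrieved chunk,
    in ranked order.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    hits = 0
    weighted = 0.0
    for k, verdict in enumerate(verdicts, start=1):
        hits += verdict
        if verdict:
            weighted += hits / k  # precision@k, counted only at relevant ranks
    return weighted / relevant_total


def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: relevant, irrelevant, relevant.
print(round(context_precision([1, 0, 1]), 6))  # 0.833333
print(context_recall([True, True, True]))      # 1.0
```

Under this reading, a score below 1.0 with perfect recall suggests that an irrelevant chunk was ranked above a relevant one, not that needed information was missing.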

Evaluating Generation

To measure generation performance, we'll use FactualCorrectness, Faithfulness, and ResponseGroundedness.

from ragas.metrics import FactualCorrectness, Faithfulness, ResponseGroundedness

ragas_metrics = [
    FactualCorrectness(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]

generation_results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
generation_results.to_pandas()
Evaluating: 100%|██████████| 9/9 [00:17<00:00,  1.90s/it]

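|   | user_input | retrieved_contexts | response | reference | factual_correctness(mode=f1) | faithfulness | nv_response_groundedness |
| 0 | Who are the major players in the large language ... | [In the rapidly advancing field of artificial ... | The major players in the large language model ... | The major players include OpenAI (GPT Series), ... | 1.00 | 1.000000 | 1.0 |
| 1 | What is Microsoft’s Azure AI platform known for? | [Microsoft’s Azure AI platform is famous for ... | Microsoft’s Azure AI platform is known for integrating ... | Microsoft’s Azure AI platform is known for integrating ... | 0.57 | 0.833333 | 1.0 |
| 2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models ... | Cohere provides language models tailored for ... | Cohere provides language models tailored for ... | 0.57 | 1.000000 | 1.0 |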
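
The generation scores are claim-level ratios, and a short sketch of the arithmetic makes them easier to interpret. It assumes claims have already been extracted and judged by the evaluator LLM. The counts used below are hypothetical, chosen only because they reproduce the values in the table; the real claim counts are not shown in the output.

```python
def f1_from_claims(tp: int, fp: int, fn: int) -> float:
    """F1 over claims.

    tp: response claims supported by the reference
    fp: response claims not supported by the reference
    fn: reference claims missing from the response
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def faithfulness(supported_claims: int, total_claims: int) -> float:
    """Share of response claims that the retrieved context supports."""
    return supported_claims / total_claims if total_claims else 0.0


# Hypothetical counts that happen to match the table values.
print(round(f1_from_claims(tp=2, fp=1, fn=2), 2))  # 0.57
print(round(faithfulness(5, 6), 6))                # 0.833333
```

A factual correctness of 0.57 alongside a faithfulness of 1.0 is therefore plausible: the response can be fully grounded in the retrieved chunks while still adding or omitting claims relative to a short reference answer.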

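Conclusion

Congratulations! You've successfully set up a Ragas evaluation pipeline for your Griptape RAG system. This evaluation shows how well your system retrieves relevant information and how accurately it generates responses.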

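Remember that RAG evaluation is an iterative process. Use these metrics to identify weaknesses in your system, make improvements, and re-evaluate until you reach the performance level you need.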
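
One simple way to start that loop is to flag the samples whose scores fall below a target, then inspect those first. The sketch below does this over plain dictionaries. The scores are copied from the generation table above with shortened column names, while `find_weak_rows` and the threshold values are assumptions chosen for illustration, not recommended cut-offs.

```python
def find_weak_rows(rows: list[dict], thresholds: dict) -> list[tuple]:
    """Return (row index, metric, score) for every score below its threshold."""
    weak = []
    for index, row in enumerate(rows):
        for metric, minimum in thresholds.items():
            score = row.get(metric)
            if score is not None and score < minimum:
                weak.append((index, metric, score))
    return weak


# Scores taken from the generation results table.
scores = [
    {"factual_correctness": 1.00, "faithfulness": 1.000000},
    {"factual_correctness": 0.57, "faithfulness": 0.833333},
    {"factual_correctness": 0.57, "faithfulness": 1.000000},
]

# Assumed cut-offs; pick values that suit your own application.
thresholds = {"factual_correctness": 0.7, "faithfulness": 0.9}

weak = find_weak_rows(scores, thresholds)
for index, metric, score in weak:
    print(f"row {index}: {metric} = {score}")
```

With these thresholds, rows 1 and 2 are flagged, which points at the Azure AI and Cohere questions as the first places to look.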

Happy RAGing! 😄