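Evaluate a Simple LLM Application

This guide walks through a simple workflow for testing and evaluating an LLM application with ragas. It assumes minimal knowledge of building and evaluating AI applications. See the installation instructions to install ragas.

Evaluation

In this guide, you will evaluate a text summarization pipeline. The goal is to ensure that the output summary accurately captures all the key details specified in the text, such as growth figures, market insights, and other essential information.

ragas offers a variety of methods, called metrics, for analyzing the performance of an LLM application. Each metric requires a predefined set of data points, which it uses to compute a score that indicates performance.

Evaluating using a non-LLM metric

Here is a simple example that uses BleuScore to score a summary: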

from ragas import SingleTurnSample
from ragas.metrics import BleuScore

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
    "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter."
}
metric = BleuScore()
test_data = SingleTurnSample(**test_data)
metric.single_turn_score(test_data)

Output

0.137

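Here we used:

- A test sample containing user_input (the user's input), response (the output from the LLM), and reference (the expected output from the LLM) as the data points for evaluating the summary.
- A non-LLM metric called BleuScore.

As you may have noticed, this approach has two major limitations:

- Time-consuming preparation: Evaluating the application requires preparing an expected output (reference) for every input, which is both time-consuming and challenging.
- Inaccurate scoring: Even though the response and reference are similar, the output score is low. This is a known limitation of non-LLM metrics such as BleuScore.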
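
To see why an n-gram overlap metric can score a faithful paraphrase low, the sketch below computes clipped n-gram precision, the building block of BLEU, for the response and reference used above. It is a simplified illustration and not the BleuScore implementation in ragas: the helper name ngram_precision and the whitespace tokenization are assumptions made for this example, and the brevity penalty and smoothing are omitted.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = candidate.lower().split()  # naive whitespace tokenization (assumption)
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per n-gram, which is the clipping step
    overlap = sum((cand_ngrams & ref_ngrams).values())
    return overlap / total


response = (
    "The company experienced an 8% increase in Q3 2024, largely due to effective "
    "marketing strategies and product adaptation, with expectations of continued "
    "growth in the coming quarter."
)
reference = (
    "The company reported an 8% growth in Q3 2024, primarily driven by strong sales "
    "in the Asian market, attributed to strategic marketing and localized products, "
    "with continued growth anticipated in the next quarter."
)

# Precision drops quickly as n grows, because a paraphrase shares few long word runs
for n in range(1, 5):
    print(f"{n}-gram precision: {ngram_precision(response, reference, n):.3f}")
```

BLEU combines these precisions multiplicatively, so a low value at higher n pulls the overall score down even when the two texts say the same thing.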

Info

A non-LLM metric is a metric that does not rely on an LLM for evaluation.

To address these issues, let's try an LLM-based metric.

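Evaluating using an LLM-based metric

Choose your LLM

Setup instructions follow for OpenAI, AWS, Google Cloud, Azure, and other providers. Use the one that matches your provider.

For OpenAI, install the langchain-openai package: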

pip install langchain-openai

Ensure that your OpenAI key is ready and available in your environment.

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
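
If you prefer not to hardcode the key in your script, one option is to export it in your shell and fail early with a clear message when it is missing. The sketch below is a minimal illustration of that idea; require_env and the RAGAS_DEMO_KEY variable are hypothetical names introduced for this example and are not part of ragas or langchain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running the evaluation."
        )
    return value


# Demonstration with a placeholder variable (hypothetical name, not a real credential)
os.environ["RAGAS_DEMO_KEY"] = "placeholder-value"
print(require_env("RAGAS_DEMO_KEY"))
```

In a real run you would call require_env("OPENAI_API_KEY") before constructing the evaluator LLM.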

Wrap the LLM in LangchainLLMWrapper so that it can be used with ragas.

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

For AWS, install the langchain-aws package:

pip install langchain-aws

Then set up your AWS credentials and configuration:

config = {
    "credentials_profile_name": "your-profile-name",  # E.g "default"
    "region_name": "your-region-name",  # E.g. "us-east-1"
    "llm": "your-llm-model-id",  # E.g "anthropic.claude-3-5-sonnet-20241022-v2:0"
    "embeddings": "your-embedding-model-id",  # E.g "amazon.titan-embed-text-v2:0"
    "temperature": 0.4,
}

Define your LLM and embeddings, then wrap them in LangchainLLMWrapper and LangchainEmbeddingsWrapper so that they can be used with ragas.

from langchain_aws import ChatBedrockConverse
from langchain_aws import BedrockEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatBedrockConverse(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    base_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
    model=config["llm"],
    temperature=config["temperature"],
))
evaluator_embeddings = LangchainEmbeddingsWrapper(BedrockEmbeddings(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    model_id=config["embeddings"],
))

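For more information on using other AWS services, refer to the langchain-aws documentation.

Google offers two ways to access its models: Google AI Studio and Google Cloud Vertex AI. Google AI Studio requires only a Google account and an API key, while Vertex AI requires a Google Cloud account. If you are just getting started, use Google AI Studio.

First, install the required package (install only the one for the API you chose):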

# for Google AI Studio
pip install langchain-google-genai
# for Google Cloud Vertex AI
pip install langchain-google-vertexai

Then set up your credentials for the API you chose.

For Google AI Studio:

import os
os.environ["GOOGLE_API_KEY"] = "your-google-ai-key"  # From https://ai.google.dev/

For Google Cloud Vertex AI:

import os

# Ensure you have credentials configured (gcloud, workload identity, etc.)
# Or set the service account JSON path:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/service-account.json"

Define your configuration:

config = {
    "model": "gemini-1.5-pro",  # or other model IDs
    "temperature": 0.4,
    "max_tokens": None,
    "top_p": 0.8,
    # For Vertex AI only:
    "project": "your-project-id",  # Required for Vertex AI
    "location": "us-central1",     # Required for Vertex AI
}

Initialize the LLM and wrap it for use with ragas:

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Choose the appropriate import based on your API:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_vertexai import ChatVertexAI

# Initialize with Google AI Studio
evaluator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
    model=config["model"],
    temperature=config["temperature"],
    max_tokens=config["max_tokens"],
    top_p=config["top_p"],
))

# Or initialize with Vertex AI
evaluator_llm = LangchainLLMWrapper(ChatVertexAI(
    model=config["model"],
    temperature=config["temperature"],
    max_tokens=config["max_tokens"],
    top_p=config["top_p"],
    project=config["project"],
    location=config["location"],
))

You can optionally configure safety settings:

from langchain_google_genai import HarmCategory, HarmBlockThreshold

safety_settings = {
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    # Add other safety settings as needed
}

# Apply to your LLM initialization
evaluator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
    model=config["model"],
    temperature=config["temperature"],
    safety_settings=safety_settings,
))

Initialize the embeddings and wrap them for use with ragas (choose one of the following):

# Google AI Studio Embeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings

evaluator_embeddings = LangchainEmbeddingsWrapper(GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",  # Google's text embedding model
    task_type="retrieval_document"  # Optional: specify the task type
))

# Vertex AI Embeddings
from langchain_google_vertexai import VertexAIEmbeddings

evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(
    model_name="textembedding-gecko@001",  # or other available model
    project=config["project"],  # Your GCP project ID
    location=config["location"]  # Your GCP location
))

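For more information on available models, features, and configurations, refer to the Google AI Studio documentation, the Google Cloud Vertex AI documentation, the LangChain Google AI integration, and the LangChain Vertex AI integration.

For Azure OpenAI, install the langchain-openai package: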

pip install langchain-openai

Ensure that your Azure OpenAI key is ready and available in your environment.

import os
os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-openai-key"

# other configuration
azure_config = {
    "base_url": "",  # your endpoint
    "model_deployment": "",  # your model deployment name
    "model_name": "",  # your model name
    "embedding_deployment": "",  # your embedding deployment name
    "embedding_name": "",  # your embedding name
}

Define your LLM and embeddings, then wrap them in LangchainLLMWrapper and LangchainEmbeddingsWrapper so that they can be used with ragas.

from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    validate_base_url=False,
))

# init the embeddings for answer_relevancy, answer_correctness and answer_similarity
evaluator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
))

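For more information on using other Azure services, refer to the langchain-azure documentation.

If you use a different LLM provider and interact with it through LangChain, you can wrap your LLM in LangchainLLMWrapper so that it can be used with ragas.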

from ragas.llms import LangchainLLMWrapper
evaluator_llm = LangchainLLMWrapper(your_llm_instance)

For a more detailed guide, see the guide on customizing models.

If you use LlamaIndex, you can wrap your LLM in LlamaIndexLLMWrapper so that it can be used with ragas.

from ragas.llms import LlamaIndexLLMWrapper
evaluator_llm = LlamaIndexLLMWrapper(your_llm_instance)

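For more information on using LlamaIndex, refer to the LlamaIndex integration guide.

If you still cannot get ragas to work with your preferred LLM provider, let us know by commenting on this issue and we will add support for it 🙂.

Evaluation

Here we will use AspectCritic, an LLM-based metric that outputs pass or fail based on the evaluation criteria.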

from ragas import SingleTurnSample
from ragas.metrics import AspectCritic

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
}

metric = AspectCritic(name="summary_accuracy", llm=evaluator_llm, definition="Verify if the summary is accurate.")
test_data = SingleTurnSample(**test_data)
# Top-level await assumes a notebook or another async context
await metric.single_turn_ascore(test_data)

Output

Success! Here 1 means pass and 0 means fail.
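
The scoring call above uses a bare await, which works in a notebook but not in a plain Python script. The sketch below shows one common way to drive a coroutine from a script with asyncio.run. To keep it runnable without an API key, fake_single_turn_ascore is a hypothetical stand-in that always returns 1; in real use you would await metric.single_turn_ascore(test_data) in its place.

```python
import asyncio


async def fake_single_turn_ascore(sample: dict) -> int:
    """Hypothetical stand-in for an async metric call; always returns a pass verdict."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return 1


async def main() -> int:
    sample = {
        "user_input": "summarise given text\n(example input)",
        "response": "(example summary)",
    }
    return await fake_single_turn_ascore(sample)


# asyncio.run creates an event loop, runs the coroutine to completion, and closes the loop
score = asyncio.run(main())
print(score)
```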

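Info

ragas includes many other types of metrics (with and without reference), and you can also create your own metric if none of them fits your case. To learn more, see the page on metrics.

Evaluating on a dataset

In the example above, we used a single sample to evaluate our application. However, evaluating on one sample is not enough to trust the results. To make the evaluation reliable, you should add more test samples to your test data.

Here we will load a dataset from the Hugging Face Hub, but you can load data from any source, such as production logs or other datasets. Just make sure that each sample contains all the attributes required by the chosen metric.

In our case, the required attributes are:
- user_input: The input provided to the application (here, the input text report).
- response: The output generated by the application (here, the generated summary).

For example: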

[
    # Sample 1
    {
        "user_input": "summarise given text\nThe Q2 earnings report revealed a significant 15% increase in revenue, ...",
        "response": "The Q2 earnings report showed a 15% revenue increase, ...",
    },
    # Additional samples in the dataset
    ....,
    # Sample N
    {
        "user_input": "summarise given text\nIn 2023, North American sales experienced a 5% decline, ...",
        "response": "Companies are strategizing to adapt to market challenges and ...",
    }
]
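
Before building an evaluation dataset from your own data, it can help to check that every sample carries the required attributes. The sketch below is one way to do that in plain Python; find_incomplete_samples and the two short samples are hypothetical and exist only for illustration.

```python
REQUIRED_KEYS = {"user_input", "response"}


def find_incomplete_samples(samples: list[dict]) -> list[tuple[int, set[str]]]:
    """Return (index, missing_keys) for each sample lacking a required, non-empty field."""
    problems = []
    for index, sample in enumerate(samples):
        missing = {key for key in REQUIRED_KEYS if not sample.get(key)}
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical samples: the second one has no response
samples = [
    {"user_input": "summarise given text\nQ2 revenue rose.", "response": "Revenue rose in Q2."},
    {"user_input": "summarise given text\nSales declined."},
]
print(find_incomplete_samples(samples))
```

An empty result means every sample has the fields that the metric in this guide needs.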

from datasets import load_dataset
from ragas import EvaluationDataset
eval_dataset = load_dataset("explodinggradients/earning_report_summary", split="train")
eval_dataset = EvaluationDataset.from_hf_dataset(eval_dataset)
print("Features in dataset:", eval_dataset.features())
print("Total samples in dataset:", len(eval_dataset))

Output

Features in dataset: ['user_input', 'response']
Total samples in dataset: 50

Evaluate using the dataset

from ragas import evaluate

results = evaluate(eval_dataset, metrics=[metric])
results

Output

{'summary_accuracy': 0.84}

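This score shows that, of all the samples in our test data, only 84% of the summaries pass the given evaluation criteria. It is now important to understand why.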
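
As a quick check on how to read the aggregate, the sketch below assumes the reported score is the mean of the per-sample pass/fail verdicts, which matches the interpretation above. The split of 42 passes and 8 failures is inferred from 0.84 and the 50 samples in the dataset; it is not taken from the actual results.

```python
# Assumed verdicts: 42 passes and 8 failures across 50 samples
verdicts = [1] * 42 + [0] * 8

pass_rate = sum(verdicts) / len(verdicts)
failed = len(verdicts) - sum(verdicts)
print(f"pass rate: {pass_rate:.2f}, failed samples: {failed}")
```

Under that assumption, 8 summaries would need a closer look.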

Export the sample-level scores to a pandas DataFrame

results.to_pandas()

Output

    user_input                                          response                                            summary_accuracy
0   summarise given text\nThe Q2 earnings report r...   The Q2 earnings report showed a 15% revenue in...   1
1   summarise given text\nIn 2023, North American ...   Companies are strategizing to adapt to market ...   1
2   summarise given text\nIn 2022, European expans...   Many companies experienced a notable 15% growt...   1
3   summarise given text\nSupply chain challenges ...   Supply chain challenges in North America, caus...   1
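
To find out why some samples fail, a natural first step is to filter the DataFrame down to the rows with a score of 0. The sketch below uses a small hand-made DataFrame with the same column names as the output above; its rows are hypothetical, and in practice you would start from results.to_pandas().

```python
import pandas as pd

# Hypothetical rows with the same columns as the sample-level results
df = pd.DataFrame(
    {
        "user_input": ["report A", "report B", "report C"],
        "response": ["summary A", "summary B", "summary C"],
        "summary_accuracy": [1, 0, 1],
    }
)

# Boolean indexing keeps only the samples that failed the criterion
failures = df[df["summary_accuracy"] == 0]
print(failures[["user_input", "response"]])
```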

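Viewing sample-level results in a table or CSV file, as shown above, is fine for quick checks but not ideal for detailed analysis or for comparing results across evaluation runs.

Want help improving your AI application using evals?

Over the past two years, we have seen and helped improve many AI applications using evals.

We are compressing this knowledge into a product that replaces vibe checks with eval loops, so that you can focus on building great AI applications.

If you want help improving and scaling up your AI application using evals, get in touch.

🔗 Book a slot or drop us a line: founders@explodinggradients.com.

Next steps

Evaluate a simple RAG application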