LlamaIndex

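LlamaIndex is a data framework for LLM applications that ingests, structures, and accesses private or domain-specific data. It makes connecting LLMs to your own data straightforward. To find the best configuration of LlamaIndex for your data, however, you need an objective measure of performance. This is where Ragas comes in: it helps you evaluate your QueryEngine and gives you the confidence to tune the configuration for the highest score.

This guide assumes you are familiar with the LlamaIndex framework.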

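Building the test set

You need a test set to evaluate your QueryEngine. You can build one yourself, or use the test set generation module in Ragas to create a small synthetic one to get started.

Let's see how that works with LlamaIndex.

Load the documents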

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./nyc_wikipedia").load_data()

Now, let's initialize the TestsetGenerator object with a generator LLM and an embedding model.

from ragas.testset import TestsetGenerator

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# generator with openai models
generator_llm = OpenAI(model="gpt-4o")
embeddings = OpenAIEmbedding(model="text-embedding-3-large")

generator = TestsetGenerator.from_llama_index(
    llm=generator_llm,
    embedding_model=embeddings,
)

Now you are ready to generate the dataset.

# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents,
    testset_size=5,
)

df = testset.to_pandas()
df.head()

	user_input	reference_contexts	reference	synthesizer_name
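0	Cud yu pleese explane the role of New York City...	[New York, often called New York City or NYC, ...	New York City is geographically and culturally...	single_hop_specifc_query_synthesizer
1	So, what was it called before it was known as New York City...	[History == === Early history === In the pre-Columbian era...	Before it was called New York, the area was known as...	single_hop_specifc_query_synthesizer
2	What happened with slavery in New York, and how did it...	[and renamed it "New Orange" after William...	In the early 18th century, New York became a...	single_hop_specifc_query_synthesizer
3	What is the historical significance of Long Island in...?	[<1-hop>\n\nHistory == === Early history === In...	Long Island is historically significant in...	multi_hop_specific_query_synthesizer
4	What role does the Staten Island Ferry play in...?	[<1-hop>\n\nbegan service in 2017; this would...	The Staten Island Ferry plays an important role...	multi_hop_specific_query_synthesizer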
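
Before moving on, it can help to check how the generated questions are spread across synthesizers. The following is a minimal sketch in plain Python, assuming the rows have been exported with something like `df.to_dict("records")`; the `records` list is a hand-copied stand-in that reproduces only the `synthesizer_name` column from the table above.

```python
from collections import Counter

# Stand-in for df.to_dict("records"); only the synthesizer_name column
# from the table above is reproduced here (an assumption for illustration).
records = [
    {"synthesizer_name": "single_hop_specifc_query_synthesizer"},
    {"synthesizer_name": "single_hop_specifc_query_synthesizer"},
    {"synthesizer_name": "single_hop_specifc_query_synthesizer"},
    {"synthesizer_name": "multi_hop_specific_query_synthesizer"},
    {"synthesizer_name": "multi_hop_specific_query_synthesizer"},
]

# Count how many questions each synthesizer produced.
counts = Counter(row["synthesizer_name"] for row in records)

for name, n in counts.most_common():
    print(f"{name}: {n}")
```

A heavily skewed split would be a hint to generate a larger test set before drawing conclusions from the scores.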

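With a test dataset for our QueryEngine in hand, let's now build one and evaluate it.

Building the `QueryEngine`

To start, let's use the Wikipedia page for New York City as an example: we build a VectorStoreIndex over it and use Ragas to evaluate it.

Since we have already loaded the dataset into documents, let's use it directly.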

# build query engine
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

query_engine = vector_index.as_query_engine()

Let's try a sample question from the generated test set to see whether it works.

# convert the test set to a pandas DataFrame and look at the first question
df = testset.to_pandas()
df["user_input"][0]

'Cud yu pleese explane the role of New York City within the Northeast megalopolis, and how it contributes to the cultural and economic vibrancy of the region?'

response_vector = query_engine.query(df["user_input"][0])

print(response_vector)

New York City serves as a key hub within the Northeast megalopolis, playing a significant role in enhancing the cultural and economic vibrancy of the region. Its status as a global center of creativity, entrepreneurship, and cultural diversity contributes to the overall dynamism of the area. The city's renowned arts scene, including Broadway theatre and numerous cultural institutions, attracts artists and audiences from around the world, enriching the cultural landscape of the Northeast megalopolis. Economically, New York City's position as a leading financial and fintech center, home to major stock exchanges and a bustling real estate market, bolsters the region's economic strength and influence. Additionally, the city's diverse culinary scene, influenced by its immigrant history, adds to the cultural richness of the region, making New York City a vital component of the Northeast megalopolis's cultural and economic tapestry.
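
A single query is a useful sanity check, but it can be worth running every test question once before spending evaluator-LLM calls. The sketch below is a hypothetical helper, not part of Ragas or LlamaIndex: `collect_responses` and `fake_ask` are names invented here, and the sample question is illustrative. In practice `ask` would be something like `lambda q: str(query_engine.query(q))`.

```python
from typing import Callable, Dict, List


def collect_responses(
    questions: List[str], ask: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Run every question through `ask`, recording the answer or the error."""
    rows = []
    for question in questions:
        try:
            rows.append({"user_input": question, "response": ask(question), "error": ""})
        except Exception as exc:  # keep going so one failure does not hide the rest
            rows.append({"user_input": question, "response": "", "error": repr(exc)})
    return rows


# Hypothetical stand-in for the real query engine, so the sketch runs offline.
def fake_ask(question: str) -> str:
    if not question.strip():
        raise ValueError("empty question")
    return f"stub answer to: {question}"


rows = collect_responses(["What was New York City called before?", ""], fake_ask)
failed = [row for row in rows if row["error"]]
print(f"{len(rows) - len(failed)} answered, {len(failed)} failed")
```

Catching the exception per question is a deliberate choice: a smoke test is more useful when it reports all failing questions at once instead of stopping at the first one.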

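Evaluating the `QueryEngine`

Now that we have a QueryEngine for the VectorStoreIndex, we can evaluate it with the llama_index integration that Ragas provides.

To evaluate with Ragas and LlamaIndex, you need three things:

LlamaIndex QueryEngine: the object we are going to evaluate.
Metrics: Ragas defines a set of metrics that measure different aspects of the QueryEngine. The available metrics and their meanings are listed in the Ragas documentation.
Questions: the list of questions Ragas will use to test the QueryEngine.

First, the questions. Ideally you should use questions seen in production, so that the distribution of questions used for evaluation matches the distribution seen in production. That ensures the scores reflect the performance observed in production. To get started, though, we will use the sample questions from the test set generated above.

Now, let's import the metrics we will use for evaluation.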

# import metrics
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
)

# init metrics with evaluator LLM
from ragas.llms import LlamaIndexLLMWrapper

evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))
metrics = [
    Faithfulness(llm=evaluator_llm),
    AnswerRelevancy(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
]

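The evaluate() function expects an evaluation dataset whose samples carry the question as "user_input" and the ground truth as "reference". You can easily convert the testset to that format.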

# convert to Ragas Evaluation Dataset
ragas_dataset = testset.to_evaluation_dataset()
ragas_dataset

EvaluationDataset(features=['user_input', 'reference_contexts', 'reference'], len=6)
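
If you build the test set by hand instead, each sample needs at least the `user_input` and `reference` fields shown in the dataset representation above. The following sketch only validates plain dictionaries. `REQUIRED_FIELDS`, `validate_samples`, and the sample content are illustrative assumptions, and the records would still have to be loaded into an evaluation dataset afterwards (Ragas documents `EvaluationDataset.from_list` for this; check it against your installed version).

```python
from typing import Dict, List

# Field names taken from the dataset representation above.
REQUIRED_FIELDS = ("user_input", "reference")


def validate_samples(samples: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means the samples look usable."""
    problems = []
    for i, sample in enumerate(samples):
        for field in REQUIRED_FIELDS:
            value = sample.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"sample {i}: missing or empty '{field}'")
    return problems


# Hypothetical hand-written samples; the second one deliberately lacks a reference.
samples = [
    {
        "user_input": "What was New York City called before it was renamed?",
        "reference": "New Amsterdam.",
    },
    {"user_input": "What role does the Staten Island Ferry play?"},
]

problems = validate_samples(samples)
for problem in problems:
    print(problem)
```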

Finally, let's run the evaluation.

from ragas.integrations.llama_index import evaluate

result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=ragas_dataset,
)

# final scores
print(result)

{'faithfulness': 0.7454, 'answer_relevancy': 0.9348, 'context_precision': 0.6667, 'context_recall': 0.4667}
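
With aggregate scores in hand, a quick way to decide what to tune first is to rank the metrics from weakest to strongest. The sketch below copies the printed scores into a plain dict (an assumption made so that it runs without Ragas) and sorts them.

```python
# Aggregate scores copied from the printed result above.
scores = {
    "faithfulness": 0.7454,
    "answer_relevancy": 0.9348,
    "context_precision": 0.6667,
    "context_recall": 0.4667,
}

# Sort ascending so the weakest metric comes first.
ranked = sorted(scores.items(), key=lambda item: item[1])
weakest_metric, weakest_score = ranked[0]

for name, value in ranked:
    print(f"{name:<20} {value:.4f}")
print(f"weakest: {weakest_metric}")
```

In this run the weakest metric is `context_recall`, which suggests looking at the retrieval settings before the generation side.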

You can convert the result to a pandas DataFrame for further analysis.

result.to_pandas()

	user_input	retrieved_contexts	reference_contexts	response	reference	faithfulness	answer_relevancy	context_precision	context_recall
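0	Cud yu pleese explane the role of New York City...	[and its ideals of liberty and peace. In 2...	[New York, often called New York City or NYC, ...	New York City plays an important role in...	New York City is geographically and culturally...	0.615385	0.918217	0.0	0.0
1	So, what was it called before it was known as New York City...	[New York City is the headquarters of the global...	[History == === Early history === In the pre-Columbian era...	Before it was known as New York City, it was named New Amsterdam...	Before it was called New York, the area was known as...	1.000000	0.967821	1.0	1.0
2	What happened with slavery in New York, and how did it...	[=== Province of New York and slavery ===\n\nIn...	[and renamed it "New Orange" after William...	Slavery became an integral part of New York's...	In the early 18th century, New York became a...	1.000000	0.919264	1.0	1.0
3	What is the historical significance of Long Island in...?	[==== River crossings ====\n\nNew York City is...	[<1-hop>\n\nHistory == === Early history === In...	Long Island played an important role in the early...	Long Island is historically significant in...	0.500000	0.931895	0.0	0.0
4	What role does the Staten Island Ferry play in...?	[==== Buses ====\n\nNew York City's buses...	[<1-hop>\n\nbegan service in 2017; this would...	The Staten Island Ferry is an important mode of transportation for...	The Staten Island Ferry plays an important role...	0.500000	0.936920	1.0	0.0
5	What is the role of Central Park as a cultural and...?	[==== State parks ====\n\nThere are seven state...	[<1-hop>\n\nThe city has over 28,000 acres (110 km2)...	Central Park's role as a cultural and historical...	Central Park, located in middle-upper Manhattan...	0.857143	0.934841	1.0	0.8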
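
The per-sample table makes it possible to check where the aggregate numbers come from and which questions drag them down. The sketch below uses only the score columns copied from the table above (a stand-in for `result.to_pandas()`), averages each metric, and lists the samples whose `context_recall` is zero. That the aggregates equal plain means is an observation from these numbers, not a statement about Ragas internals.

```python
from statistics import mean

# Score columns copied row by row from the table above.
per_sample = [
    {"faithfulness": 0.615385, "answer_relevancy": 0.918217, "context_precision": 0.0, "context_recall": 0.0},
    {"faithfulness": 1.000000, "answer_relevancy": 0.967821, "context_precision": 1.0, "context_recall": 1.0},
    {"faithfulness": 1.000000, "answer_relevancy": 0.919264, "context_precision": 1.0, "context_recall": 1.0},
    {"faithfulness": 0.500000, "answer_relevancy": 0.931895, "context_precision": 0.0, "context_recall": 0.0},
    {"faithfulness": 0.500000, "answer_relevancy": 0.936920, "context_precision": 1.0, "context_recall": 0.0},
    {"faithfulness": 0.857143, "answer_relevancy": 0.934841, "context_precision": 1.0, "context_recall": 0.8},
]

# Average every metric over all samples, rounded like the printed result.
metric_names = list(per_sample[0])
averages = {
    name: round(mean(row[name] for row in per_sample), 4) for name in metric_names
}

# Row indices where nothing from the reference was recalled.
zero_recall = [i for i, row in enumerate(per_sample) if row["context_recall"] == 0.0]

print(averages)
print("zero context_recall rows:", zero_recall)
```

The rows flagged here are natural starting points when inspecting `retrieved_contexts` against `reference_contexts`.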
