
Haystack Integration

Haystack is an LLM orchestration framework for building customizable, production-ready LLM applications.

The core idea of Haystack is that every individual task, such as storing documents, retrieving relevant data, and generating responses, is handled by modular components (such as Document Stores, Retrievers, and Generators) that are seamlessly connected and orchestrated through pipelines.

Overview

In this tutorial, we will build a RAG pipeline using Haystack and evaluate it with Ragas. We will begin by setting up the various components of the RAG pipeline and, for evaluation, initializing the RagasEvaluator component. Once the components are set up, we will connect them to form the complete pipeline. Later in the tutorial, we will explore how to evaluate with custom Ragas metrics.

Installing Dependencies

%pip install ragas-haystack

Getting the data

dataset = [
    "OpenAI is one of the most recognized names in the large language model space, known for its GPT series of models. These models excel at generating human-like text and performing tasks like creative writing, answering questions, and summarizing content. GPT-4, their latest release, has set benchmarks in understanding context and delivering detailed responses.",
    "Anthropic is well-known for its Claude series of language models, designed with a strong focus on safety and ethical AI behavior. Claude is particularly praised for its ability to follow complex instructions and generate text that aligns closely with user intent.",
    "DeepMind, a division of Google, is recognized for its cutting-edge Gemini models, which are integrated into various Google products like Bard and Workspace tools. These models are renowned for their conversational abilities and their capacity to handle complex, multi-turn dialogues.",
    "Meta AI is best known for its LLaMA (Large Language Model Meta AI) series, which has been made open-source for researchers and developers. LLaMA models are praised for their ability to support innovation and experimentation due to their accessibility and strong performance.",
    "Meta AI with it's LLaMA models aims to democratize AI development by making high-quality models available for free, fostering collaboration across industries. Their open-source approach has been a game-changer for researchers without access to expensive resources.",
    "Microsoft’s Azure AI platform is famous for integrating OpenAI’s GPT models, enabling businesses to use these advanced models in a scalable and secure cloud environment. Azure AI powers applications like Copilot in Office 365, helping users draft emails, generate summaries, and more.",
    "Amazon’s Bedrock platform is recognized for providing access to various language models, including its own models and third-party ones like Anthropic’s Claude and AI21’s Jurassic. Bedrock is especially valued for its flexibility, allowing users to choose models based on their specific needs.",
    "Cohere is well-known for its language models tailored for business use, excelling in tasks like search, summarization, and customer support. Their models are recognized for being efficient, cost-effective, and easy to integrate into workflows.",
    "AI21 Labs is famous for its Jurassic series of language models, which are highly versatile and capable of handling tasks like content creation and code generation. The Jurassic models stand out for their natural language understanding and ability to generate detailed and coherent responses.",
    "In the rapidly advancing field of artificial intelligence, several companies have made significant contributions with their large language models. Notable players include OpenAI, known for its GPT Series (including GPT-4); Anthropic, which offers the Claude Series; Google DeepMind with its Gemini Models; Meta AI, recognized for its LLaMA Series; Microsoft Azure AI, which integrates OpenAI’s GPT Models; Amazon AWS (Bedrock), providing access to various models including Claude (Anthropic) and Jurassic (AI21 Labs); Cohere, which offers its own models tailored for business use; and AI21 Labs, known for its Jurassic Series. These companies are shaping the landscape of AI by providing powerful models with diverse capabilities.",
]

Initialize components for the RAG pipeline

Initializing the DocumentStore

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
docs = [Document(content=doc) for doc in dataset]

Initialize the Document and Text Embedder

from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder

document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
text_embedder = OpenAITextEmbedder(model="text-embedding-3-small")

Now that we have our document store and document embedder, we will use them to populate our vector database.

docs_with_embeddings = document_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])
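
As a quick sanity check (not part of the original tutorial), you can confirm that the embedded documents were actually written to the store; count_documents() is part of Haystack's standard DocumentStore interface.

# Optional: verify that all ten documents were written to the store.
print(document_store.count_documents())  # expected: 10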

Initialize the Retriever

from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

retriever = InMemoryEmbeddingRetriever(document_store, top_k=2)
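
If you want to try the retriever on its own before assembling the pipeline, a minimal sketch looks like this. It reuses the text_embedder defined above; the example query string is arbitrary.

# Embed an ad-hoc query and fetch the two most similar documents.
query_embedding = text_embedder.run(text="Who develops the LLaMA models?")["embedding"]
for doc in retriever.run(query_embedding=query_embedding)["documents"]:
    print(doc.content)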

Define a Template Prompt

from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage

template = [
    ChatMessage.from_user(
        """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
    )
]

prompt_builder = ChatPromptBuilder(template=template)
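
To preview what the rendered prompt looks like, you can run the builder standalone. This is a minimal sketch that passes one of the in-memory docs directly instead of retrieved documents; the sample question is arbitrary.

# Render the template with a single document and a sample question.
preview = prompt_builder.run(documents=docs[:1], question="What is OpenAI known for?")
print(preview["prompt"])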

Initialize the ChatGenerator

from haystack.components.generators.chat import OpenAIChatGenerator

chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")
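
The generator can also be exercised on its own before being added to the pipeline; a minimal smoke test (assuming OPENAI_API_KEY is set in your environment):

# Send a single user message and inspect the returned ChatMessage.
smoke_test = chat_generator.run(messages=[ChatMessage.from_user("Say hello in one word.")])
print(smoke_test["replies"][0])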

Setting up the RagasEvaluator

Pass in all the Ragas metrics you want to use for evaluation, and make sure that all the information needed to compute each selected metric is provided.

For example:

  • AnswerRelevancy: requires the query and the response
  • ContextPrecision: requires the query, the retrieved documents, and the reference
  • Faithfulness: requires the query, the retrieved documents, and the response

Make sure to include all the relevant data for each metric to guarantee accurate evaluation.

from haystack_integrations.components.evaluators.ragas import RagasEvaluator

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

ragas_evaluator = RagasEvaluator(
    ragas_metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
    evaluator_llm=evaluator_llm,
)

Building and Assembling the Pipeline

Creating the Pipeline

from haystack import Pipeline

rag_pipeline = Pipeline()

Adding the Components

from haystack.components.builders import AnswerBuilder

rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())
rag_pipeline.add_component("ragas_evaluator", ragas_evaluator)

Connecting the Components

rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
rag_pipeline.connect("retriever", "ragas_evaluator.documents")
rag_pipeline.connect("llm.replies", "ragas_evaluator.response")

Running the Pipeline

question = "What makes Meta AI’s LLaMA models stand out?"

reference = "Meta AI’s LLaMA models stand out for being open-source, supporting innovation and experimentation due to their accessibility and strong performance."


result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
        "ragas_evaluator": {"query": question, "reference": reference},
        # Each metric expects a specific set of parameters as input. Refer to the
        # Ragas class' documentation for more details.
    }
)

print(result['answer_builder']['answers'][0].data, '\n')
print(result['ragas_evaluator']['result'])
Output
Evaluating: 100%|██████████| 3/3 [00:14<00:00,  4.72s/it]

Meta AI's LLaMA models stand out due to their open-source nature, which allows researchers and developers easy access to high-quality language models without the need for expensive resources. This accessibility fosters innovation and experimentation, enabling collaboration across various industries. Moreover, the strong performance of the LLaMA models further enhances their appeal, making them valuable tools for advancing AI development.

{'answer_relevancy': 0.9782, 'context_precision': 1.0000, 'faithfulness': 1.0000}
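
The printed dictionary is the aggregate view of a Ragas evaluation result. Assuming the integration returns a standard Ragas EvaluationResult (as the printed score dictionary suggests), you can also export per-sample scores:

# Inspect per-sample metric scores as a pandas DataFrame.
evaluation_result = result["ragas_evaluator"]["result"]
print(evaluation_result.to_pandas())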

Advanced Usage

Instead of using the default Ragas metrics, you can adapt them to your needs or even create your own custom metrics, and then pass these metrics to the RagasEvaluator component. To learn more about how to customize Ragas metrics, check out the documentation.

In the example below, we will define two custom Ragas metrics:

  1. SportsRelevanceMetric: this metric evaluates whether the question and its response are related to sports.
  2. AnswerQualityMetric: this metric measures how well the response provided by the LLM answers the user's question.

from ragas.metrics import RubricsScore, AspectCritic

SportsRelevanceMetric = AspectCritic(
    name="sports_relevance_metric",
    definition="Were the question and response related to sports?",
    llm=evaluator_llm,
)

rubrics = {
    "score1_description": "The response does not answer the user input.",
    "score2_description": "The response partially answers the user input.",
    "score3_description": "The response fully answer the user input"
}

evaluator = RagasEvaluator(
    ragas_metrics=[SportsRelevanceMetric, RubricsScore(llm=evaluator_llm, rubrics=rubrics)],
    evaluator_llm=evaluator_llm
)

output = evaluator.run(
    query="Which is the most popular global sport?",
    documents=[
        "Football is undoubtedly the world's most popular sport with"
        " major events like the FIFA World Cup and sports personalities"
        " like Ronaldo and Messi, drawing a followership of more than 4"
        " billion people."
    ],
    response="Football is the most popular sport with around 4 billion"
                " followers worldwide",
)

output['result']
Output
Evaluating: 100%|██████████| 2/2 [00:01<00:00,  1.62it/s]

{'sports_relevance_metric': 1.0000, 'domain_specific_rubrics': 3.0000}