
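Evaluating the Factuality of LlamaStack Web Search with Llama 4

In this tutorial, we will measure the factuality of responses generated by a LlamaStack web search agent. LlamaStack is an open-source framework maintained by Meta that simplifies the development and deployment of large language model (LLM) applications. The evaluation uses Ragas metrics, with Meta Llama 4 Maverick as the judge.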

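Set up and run the LlamaStack server

This command installs all the dependencies the LlamaStack server needs and builds it with the together inference provider.

Run this command if you are using conda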

!pip install ragas langchain-together uv 
!uv run --with llama-stack llama stack build --template together --image-type conda

Run this command if you are using venv

!pip install ragas langchain-together uv 
!uv run --with llama-stack llama stack build --template together --image-type venv
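The together provider and the web search tool both need API credentials in the environment before the server starts. The sketch below checks for them up front. The variable names `TOGETHER_API_KEY` and `TAVILY_SEARCH_API_KEY` are assumptions that this tutorial does not state, and `missing_env_vars` is a hypothetical helper, so adjust the names to match your distribution's configuration.

```python
import os

# Assumed variable names; this tutorial does not specify them.
REQUIRED_ENV_VARS = ("TOGETHER_API_KEY", "TAVILY_SEARCH_API_KEY")


def missing_env_vars(env=None, required=REQUIRED_ENV_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these before starting the server:", ", ".join(missing))
```

Running this check before the build step means a missing key is reported immediately, not as a provider error after the server is already up.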

import os
import subprocess


def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack run together --image-type venv",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True,
    )

    print(f"Starting LlamaStack server with PID: {process.pid}")
    return process


def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError
    import time

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
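    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            # Server is not accepting connections yet
            pass
        # Wait before the next attempt, whether the request failed
        # or returned a non-200 status
        print(".", end="", flush=True)
        time.sleep(retry_interval)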

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False


# use this helper if needed to kill the server
def kill_llama_stack_server():
    # Kill any existing llama stack server processes
    os.system(
        "ps aux | grep -v grep | grep llama_stack.distribution.server.server | awk '{print $2}' | xargs kill -9"
    )

Start the LlamaStack server

server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()
Starting LlamaStack server with PID: 95508
Waiting for server to start....
Server is ready!

Build the search agent

from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
)

agent = Agent(
    client,
    model="meta-llama/Llama-3.1-8B-Instruct",
    instructions="You are a helpful assistant. Use web search tool to answer the questions.",
    tools=["builtin::websearch"],
)
user_prompts = [
    "In which major did Demis Hassabis complete his undergraduate degree? Search the web for the answer.",
    "Ilya Sutskever is one of the key figures in AI. From which institution did he earn his PhD in machine learning? Search the web for the answer.",
    "Sam Altman, widely known for his role at OpenAI, was born in which American city? Search the web for the answer.",
]

session_id = agent.create_session("test-session")


for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in AgentEventLogger().log(response):
        log.print()

Now, let's take a closer look at the agent's execution steps to see how well it performed.

session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)
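To get a quick overview of what the agent did in a turn, you can count its steps by type. The sketch below is a minimal illustration, assuming each turn exposes `steps` and each step a `step_type`, as the extraction code later in this tutorial does. `summarize_steps` and `mock_turn` are hypothetical names, and `"inference"` is an assumed step type used only for the stand-in data.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_steps(turn):
    """Count how many steps of each type a turn contains."""
    return Counter(step.step_type for step in turn.steps)


# Hypothetical stand-in for one entry of session_response.turns.
mock_turn = SimpleNamespace(
    steps=[
        SimpleNamespace(step_type="inference"),
        SimpleNamespace(step_type="tool_execution"),
        SimpleNamespace(step_type="inference"),
    ]
)

print(dict(summarize_steps(mock_turn)))  # {'inference': 2, 'tool_execution': 1}
```

A turn with no `tool_execution` step means the agent answered without searching, so that turn will have no retrieved contexts to evaluate against.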

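Evaluate the agent's responses

We want to measure the factuality of the responses generated by the LlamaStack web search agent. To do this, we need an EvaluationDataset and metrics that assess how well the responses are grounded in facts. Ragas provides a range of off-the-shelf metrics for measuring different aspects of retrieval and generation.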

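To measure the factuality of the responses, we will use:

  1. Faithfulness
  2. Response Groundedness

The metric setup later in this tutorial also includes AnswerAccuracy, which scores each response against its reference answer.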
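As a rough illustration of what the Faithfulness score means: it is the fraction of claims in the response that the retrieved contexts support. The sketch below only shows that arithmetic. In Ragas, the claims and their verdicts come from the judge LLM, whereas here the verdicts are hard-coded, hypothetical values and `faithfulness_score` is an illustrative helper, not a Ragas function.

```python
def faithfulness_score(verdicts):
    """Fraction of response claims judged as supported by the context.

    `verdicts` holds one boolean per claim: True if the claim is supported.
    """
    if not verdicts:
        raise ValueError("at least one claim is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts: the response makes two claims, one is supported.
print(faithfulness_score([True, False]))  # 0.5
```

A response can therefore be correct and still score below 1.0 if part of it is not backed by the search results.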

Build a Ragas evaluation dataset

To evaluate with Ragas, we will create an EvaluationDataset.

import json

# This function extracts the search results for the trace of each query
def extract_retrieved_contexts(turn_object):
    results = []
    for step in turn_object.steps:
        if step.step_type == "tool_execution":
            tool_responses = step.tool_responses
            for response in tool_responses:
                content = response.content
                if content:
                    try:
                        parsed_result = json.loads(content)
                        results.append(parsed_result)
                    except json.JSONDecodeError:
                        print("Warning: Unable to parse tool response content as JSON.")
                        continue

    retrieved_context = []
    for result in results:
        top_content_list = [item["content"] for item in result["top_k"]]
        retrieved_context.extend(top_content_list)
    return retrieved_context
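The function above assumes that each tool response is a JSON string holding a `top_k` list whose items have a `content` field. The sketch below repeats the same extraction on a hand-built stand-in for a turn, so the expected shape is visible without a running server. `contexts_from_turn` and `mock_turn` are hypothetical names, and this variant uses `.get` so that a payload without `top_k` is skipped and does not raise `KeyError`.

```python
import json
from types import SimpleNamespace


def contexts_from_turn(turn):
    """Collect the `content` of every `top_k` item in a turn's tool responses."""
    contexts = []
    for step in turn.steps:
        if step.step_type != "tool_execution":
            continue
        for tool_response in step.tool_responses:
            try:
                payload = json.loads(tool_response.content)
            except (TypeError, json.JSONDecodeError):
                continue  # not JSON text, so there is nothing to extract
            if not isinstance(payload, dict):
                continue
            contexts.extend(item["content"] for item in payload.get("top_k", []))
    return contexts


# Hypothetical turn with one search call that returned two snippets.
payload = json.dumps(
    {"top_k": [{"content": "first snippet"}, {"content": "second snippet"}]}
)
mock_turn = SimpleNamespace(
    steps=[
        SimpleNamespace(step_type="inference"),
        SimpleNamespace(
            step_type="tool_execution",
            tool_responses=[SimpleNamespace(content=payload)],
        ),
    ]
)

print(contexts_from_turn(mock_turn))  # ['first snippet', 'second snippet']
```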

from ragas.dataset_schema import EvaluationDataset

samples = []

references = [
    "Demis Hassabis completed his undergraduate degree in Computer Science.",
    "Ilya Sutskever earned his PhD from the University of Toronto.",
    "Sam Altman was born in Chicago, Illinois.",
]

for i, turn in enumerate(session_response.turns):
    samples.append(
        {
            "user_input": turn.input_messages[0].content,
            "response": turn.output_message.content,
            "reference": references[i],
            "retrieved_contexts": extract_retrieved_contexts(turn),
        }
    )

ragas_eval_dataset = EvaluationDataset.from_list(samples)
ragas_eval_dataset.to_pandas()
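   user_input                                          retrieved_contexts                                        response                                              reference
0  In which major did Demis Hassabis complete his...   [Demis Hassabis holds a bachelor's degree in computer...  Demis Hassabis completed his undergraduate degree...  Demis Hassabis completed his undergraduate degree...
1  Ilya Sutskever is one of the key figures in AI...   [Skip to content Main menu Search Donate Create...        Ilya Sutskever earned his PhD in machine learning...  Ilya Sutskever earned his PhD from the University...
2  Sam Altman, widely known for his role at OpenAI...  [Sam Altman | Biography, OpenAI, Microsoft, &...          Sam Altman was born in Chicago, Illinois, USA.        Sam Altman was born in Chicago, Illinois.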

Set up the Ragas metrics

from ragas.metrics import AnswerAccuracy, Faithfulness, ResponseGroundedness
from langchain_together import ChatTogether
from ragas.llms import LangchainLLMWrapper

llm = ChatTogether(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
)
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
    AnswerAccuracy(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]

Evaluate

Finally, we run the evaluation.

from ragas import evaluate

results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
results.to_pandas()
Evaluating: 100%|██████████| 9/9 [00:04<00:00,  2.03it/s]

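   user_input                                          retrieved_contexts                                        response                                              reference                                             nv_accuracy  faithfulness  nv_response_groundedness
0  In which major did Demis Hassabis complete his...   [Demis Hassabis holds a bachelor's degree in computer...  Demis Hassabis completed his undergraduate degree...  Demis Hassabis completed his undergraduate degree...  1.0          1.0           1.00
1  Ilya Sutskever is one of the key figures in AI...   [Skip to content Main menu Search Donate Create...        Ilya Sutskever earned his PhD in machine learning...  Ilya Sutskever earned his PhD from the University...  1.0          0.5           0.75
2  Sam Altman, widely known for his role at OpenAI...  [Sam Altman | Biography, OpenAI, Microsoft, &...          Sam Altman was born in Chicago, Illinois, USA.        Sam Altman was born in Chicago, Illinois.             1.0          1.0           1.00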
kill_llama_stack_server()
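
The per-row scores in the results table can be condensed into one average per metric, plus a list of the rows that deserve a closer look. The sketch below does this in plain Python with the values copied from the table. The `THRESHOLD` cutoff is a hypothetical choice for illustration, not something Ragas prescribes.

```python
from statistics import mean

# Scores copied from the results table in this tutorial.
scores = {
    "nv_accuracy": [1.0, 1.0, 1.0],
    "faithfulness": [1.0, 0.5, 1.0],
    "nv_response_groundedness": [1.0, 0.75, 1.0],
}

THRESHOLD = 1.0  # hypothetical cutoff: flag any row that is not a perfect score

# Average each metric over all rows.
averages = {name: round(mean(values), 3) for name, values in scores.items()}

# Row indices where at least one metric falls below the cutoff.
flagged_rows = sorted(
    {i for values in scores.values() for i, v in enumerate(values) if v < THRESHOLD}
)

print(averages)
print(flagged_rows)  # [1]
```

Here only row 1 (the Ilya Sutskever question) is flagged: the answer matches the reference, but not every claim in the response is backed by the retrieved contexts.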