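Evaluating the Factuality of LlamaStack Web Search with Llama 4
In this tutorial, we will measure the factuality of responses generated by a LlamaStack web search agent. LlamaStack is an open-source framework maintained by Meta that simplifies the development and deployment of large language model (LLM) applications. The evaluation uses Ragas metrics, with Meta Llama 4 Maverick as the judge.
Set up and run the LlamaStack server
The following command installs all the dependencies needed for the LlamaStack server and uses the together inference provider.
If you use conda, run: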
!pip install ragas langchain-together uv
!uv run --with llama-stack llama stack build --template together --image-type conda
If you use venv, run:
!pip install ragas langchain-together uv
!uv run --with llama-stack llama stack build --template together --image-type venv
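Both the server and the judge model used later call hosted services, so the relevant API keys need to be in the environment before anything starts. The sketch below is only a pre-flight check. The variable names `TOGETHER_API_KEY` and `TAVILY_SEARCH_API_KEY` are assumptions (the Together provider and a Tavily-backed web search tool); adjust them to whatever your distribution actually reads.

```python
import os

# Assumed variable names; check your distribution's configuration.
REQUIRED_KEYS = ("TOGETHER_API_KEY", "TAVILY_SEARCH_API_KEY")


def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Set these before starting the server:", ", ".join(missing))
```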
import os
import subprocess


def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    # This assumes the venv build above; adjust --image-type if you built with conda.
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack run together --image-type venv",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True,
    )

    print(f"Starting LlamaStack server with PID: {process.pid}")
    return process


def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError
    import time

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            print(".", end="", flush=True)
        # Sleep on every failed attempt, including non-200 responses
        time.sleep(retry_interval)

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False


# use this helper if needed to kill the server
def kill_llama_stack_server():
    # Kill any existing llama stack server processes
    os.system(
        "ps aux | grep -v grep | grep llama_stack.distribution.server.server | awk '{print $2}' | xargs kill -9"
    )
Start the LlamaStack server
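One way to combine the two helpers defined above is to launch the process and fail loudly if the health check never passes. This is a minimal sketch, not the tutorial's original cell: `start_server` is a hypothetical wrapper, and it takes the launch and readiness functions as arguments so it can be exercised without a real server.

```python
def start_server(launch, wait_ready):
    """Launch the server, then block until the readiness check passes."""
    process = launch()
    if not wait_ready():
        # Best-effort cleanup; a shell-launched child may need a separate kill.
        process.terminate()
        raise RuntimeError("LlamaStack server did not become ready in time")
    return process
```

With the helpers from this tutorial, the call would look like `server_process = start_server(run_llama_stack_server_background, wait_for_server_to_start)`. If the server still lingers after a failure, `kill_llama_stack_server()` can be used to clean up.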
Build a search agent
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
)

agent = Agent(
    client,
    model="meta-llama/Llama-3.1-8B-Instruct",
    instructions="You are a helpful assistant. Use web search tool to answer the questions.",
    tools=["builtin::websearch"],
)
user_prompts = [
    "In which major did Demis Hassabis complete his undergraduate degree? Search the web for the answer.",
    "Ilya Sutskever is one of the key figures in AI. From which institution did he earn his PhD in machine learning? Search the web for the answer.",
    "Sam Altman, widely known for his role at OpenAI, was born in which American city? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in AgentEventLogger().log(response):
        log.print()
Now, let's take a closer look at the agent's execution steps to see how well it performed.
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)
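To see at a glance what the agent did in each turn, one option is to list the step types in order. The sketch below relies only on the `steps` and `step_type` attributes that this tutorial uses elsewhere. The helper names are hypothetical, and `demo_turn` is a hand-built stand-in so the snippet runs without a server.

```python
from types import SimpleNamespace


def summarize_turns(turns):
    """Return, for each turn, the ordered list of step types the agent took."""
    return [[step.step_type for step in turn.steps] for turn in turns]


def used_tool(turn):
    """True if the turn contains at least one tool execution step."""
    return any(step.step_type == "tool_execution" for step in turn.steps)


# Hand-built stand-in for a retrieved turn (not real agent output).
demo_turn = SimpleNamespace(
    steps=[
        SimpleNamespace(step_type="inference"),
        SimpleNamespace(step_type="tool_execution"),
        SimpleNamespace(step_type="inference"),
    ]
)

print(summarize_turns([demo_turn]))  # [['inference', 'tool_execution', 'inference']]
print(used_tool(demo_turn))          # True
```

On the real session, `summarize_turns(session_response.turns)` would show whether each question actually triggered a web search before the answer was produced.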
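Evaluate the agent's responses
We want to measure the factuality of the responses generated by the LlamaStack web search agent. To do that, we need an EvaluationDataset and metrics that assess how well the responses are grounded in facts. Ragas provides a range of off-the-shelf metrics for measuring different aspects of retrieval and generation.
To measure the factuality of the responses, we will use AnswerAccuracy, Faithfulness, and ResponseGroundedness.
Build the Ragas EvaluationDataset
To evaluate with Ragas, we will create an EvaluationDataset.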
import json


# This function extracts the search results for the trace of each query
def extract_retrieved_contexts(turn_object):
    results = []
    for step in turn_object.steps:
        if step.step_type == "tool_execution":
            tool_responses = step.tool_responses
            for response in tool_responses:
                content = response.content
                if content:
                    try:
                        parsed_result = json.loads(content)
                        results.append(parsed_result)
                    except json.JSONDecodeError:
                        print("Warning: Unable to parse tool response content as JSON.")
                        continue

    retrieved_context = []
    for result in results:
        top_content_list = [item["content"] for item in result["top_k"]]
        retrieved_context.extend(top_content_list)
    return retrieved_context
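The function above relies on each tool response carrying a JSON string whose `top_k` list holds items with a `content` field. A minimal sketch of that shape follows, using a made-up payload: the `query`, `title`, and `url` fields and all values are illustrative assumptions, not actual search output. Unlike the function above, this hypothetical variant also tolerates a missing `top_k` key instead of raising `KeyError`.

```python
import json

# Hypothetical tool response content; only "top_k" and "content" are relied on.
raw = json.dumps(
    {
        "query": "example query",
        "top_k": [
            {"title": "Page A", "url": "https://example.com/a", "content": "First snippet."},
            {"title": "Page B", "url": "https://example.com/b", "content": "Second snippet."},
        ],
    }
)


def contents_from_payload(raw_content):
    """Return the `content` strings from one tool response, or [] if unusable."""
    try:
        parsed = json.loads(raw_content)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, dict):
        return []
    return [item["content"] for item in parsed.get("top_k", []) if "content" in item]


print(contents_from_payload(raw))         # ['First snippet.', 'Second snippet.']
print(contents_from_payload("not json"))  # []
```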
from ragas.dataset_schema import EvaluationDataset

samples = []

references = [
    "Demis Hassabis completed his undergraduate degree in Computer Science.",
    "Ilya Sutskever earned his PhD from the University of Toronto.",
    "Sam Altman was born in Chicago, Illinois.",
]

for i, turn in enumerate(session_response.turns):
    samples.append(
        {
            "user_input": turn.input_messages[0].content,
            "response": turn.output_message.content,
            "reference": references[i],
            "retrieved_contexts": extract_retrieved_contexts(turn),
        }
    )

ragas_eval_dataset = EvaluationDataset.from_list(samples)
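| | user_input | retrieved_contexts | response | reference |
|---|---|---|---|---|
| 0 | In which major did Demis Hassabis complete his... | [Demis Hassabis holds a bachelor's degree in computer... | Demis Hassabis completed his undergraduate degree... | Demis Hassabis completed his undergraduate degree... |
| 1 | Ilya Sutskever is one of the key figures in AI... | [Skip to content Main menu Search Donate Create... | Ilya Sutskever earned his PhD in machine learning... | Ilya Sutskever earned his PhD from the University... |
| 2 | Sam Altman, widely known for his role at OpenAI... | [Sam Altman \| Biography, OpenAI, Microsoft, &... | Sam Altman was born in Chicago, Illinois, USA. | Sam Altman was born in Chicago, Illinois. |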
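Before handing samples to Ragas, it can help to check that each one has the four fields used in this tutorial and that the search actually returned something, since an empty `retrieved_contexts` list leaves the context-based metrics with nothing to check against. The helper below is a hypothetical pre-check written for this tutorial, not part of Ragas.

```python
REQUIRED_FIELDS = ("user_input", "response", "reference", "retrieved_contexts")


def check_sample(sample):
    """Return a list of problems found in one sample dict (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in sample]
    contexts = sample.get("retrieved_contexts")
    if contexts is not None:
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            problems.append("retrieved_contexts must be a list of strings")
        elif not contexts:
            problems.append("retrieved_contexts is empty (search returned nothing?)")
    return problems


# Illustrative sample; the context string is made up.
example = {
    "user_input": "Sam Altman was born in which American city?",
    "response": "Sam Altman was born in Chicago, Illinois.",
    "reference": "Sam Altman was born in Chicago, Illinois.",
    "retrieved_contexts": ["Example context text."],
}
print(check_sample(example))  # []
```

Running `[check_sample(s) for s in samples]` on the list built above would surface any turn where the tool call failed or returned unparseable content.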
Set up the Ragas metrics
from ragas.metrics import AnswerAccuracy, Faithfulness, ResponseGroundedness
from langchain_together import ChatTogether
from ragas.llms import LangchainLLMWrapper

llm = ChatTogether(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
)
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
    AnswerAccuracy(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]
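For intuition about what `Faithfulness` reports: Ragas has the judge LLM split the response into individual claims and check each one against the retrieved contexts, and the score is the fraction of claims that are supported. The sketch below reproduces only that final arithmetic. The function name and the verdict lists are hypothetical; in a real run the verdicts come from the judge model.

```python
def faithfulness_from_verdicts(verdicts):
    """Fraction of claims judged as supported by the retrieved contexts."""
    if not verdicts:
        raise ValueError("need at least one claim verdict")
    return sum(1 for v in verdicts if v) / len(verdicts)


print(faithfulness_from_verdicts([True, True]))   # 1.0
print(faithfulness_from_verdicts([True, False]))  # 0.5
```

A score of 0.5 therefore means that half of the claims in the response could be traced back to the retrieved text, regardless of whether the answer itself is correct.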
Evaluation
Finally, we run the evaluation.
from ragas import evaluate
results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
results.to_pandas()
| | user_input | retrieved_contexts | response | reference | nv_accuracy | faithfulness | nv_response_groundedness |
|---|---|---|---|---|---|---|---|
| 0 | In which major did Demis Hassabis complete his... | [Demis Hassabis holds a bachelor's degree in computer... | Demis Hassabis completed his undergraduate degree... | Demis Hassabis completed his undergraduate degree... | 1.0 | 1.0 | 1.00 |
| 1 | Ilya Sutskever is one of the key figures in AI... | [Skip to content Main menu Search Donate Create... | Ilya Sutskever earned his PhD in machine learning... | Ilya Sutskever earned his PhD from the University... | 1.0 | 0.5 | 0.75 |
| 2 | Sam Altman, widely known for his role at OpenAI... | [Sam Altman \| Biography, OpenAI, Microsoft, &... | Sam Altman was born in Chicago, Illinois, USA. | Sam Altman was born in Chicago, Illinois. | 1.0 | 1.0 | 1.00 |
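To read the table as a whole, one simple approach is to average each metric and flag the rows where any score falls short. The sketch below uses the scores from the table above, copied by hand. `summarize` and its `threshold` parameter are hypothetical helpers, and with a real run you would read the columns from `results.to_pandas()` instead.

```python
from statistics import mean

# Scores copied from the results table above.
scores = {
    "nv_accuracy": [1.0, 1.0, 1.0],
    "faithfulness": [1.0, 0.5, 1.0],
    "nv_response_groundedness": [1.0, 0.75, 1.0],
}


def summarize(scores, threshold=1.0):
    """Return per-metric means and the row indices with any score below threshold."""
    means = {name: round(mean(values), 3) for name, values in scores.items()}
    n_rows = len(next(iter(scores.values())))
    flagged = [
        i for i in range(n_rows)
        if any(values[i] < threshold for values in scores.values())
    ]
    return means, flagged


means, flagged = summarize(scores)
print(means)    # {'nv_accuracy': 1.0, 'faithfulness': 0.833, 'nv_response_groundedness': 0.917}
print(flagged)  # [1]
```

Row 1 is the only one flagged: the answer matches the reference (accuracy 1.0), but it is less well supported by the retrieved context, which in this run begins with page navigation text rather than article content.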