Evaluating LlamaIndex Agents

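Building agents that use tools intelligently and make decisions is only half the journey; ensuring that these agents are accurate, reliable, and performant is what truly defines their success. LlamaIndex provides several ways to create agents, including FunctionAgents, CodeActAgents, and ReActAgents. In this tutorial, we will explore how to evaluate these different agent types using both pre-built Ragas metrics and custom evaluation metrics.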

Let's get started.

The tutorial is divided into three comprehensive sections:

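Evaluating with Off-the-Shelf Ragas Metrics: Here we examine two fundamental evaluation tools. AgentGoalAccuracy measures how effectively an agent identifies and achieves the user's intended goal, and Tool Call Accuracy assesses the agent's ability to select and invoke the appropriate tools in the correct sequence to complete a task.
Custom Metrics for CodeActAgent Evaluation: This section focuses on LlamaIndex's prebuilt CodeActAgent and demonstrates how to develop evaluation metrics tailored to the specific requirements and capabilities of code-generating agents.
Query Engine Tool Assessment: The final section explores how to use Ragas RAG metrics to evaluate query engine functionality within agents, giving insight into retrieval effectiveness and response quality when the agent accesses information systems.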

Ragas Agentic Metrics

To demonstrate evaluation using Ragas metrics, we will create a simple workflow with a single LlamaIndex Function Agent and use it to cover the basic functionality.

Click to View the Function Agent Setup

from llama_index.llms.openai import OpenAI


async def send_message(to: str, content: str) -> str:
    """Dummy function to simulate sending an email."""
    return f"Successfully sent mail to {to}"

llm = OpenAI(model="gpt-4o-mini")

from llama_index.core.agent.workflow import FunctionAgent

agent = FunctionAgent(
    tools=[send_message],
    llm=llm,
    system_prompt="You are a helpful assistant of Jane",
)

Agent Goal Accuracy

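The true value of an AI agent lies in its ability to understand what users want and deliver it effectively. Agent Goal Accuracy is a fundamental metric that evaluates whether an agent successfully accomplishes the user's intended goal. It matters because it directly reflects how well the agent interprets user needs and takes appropriate actions to fulfill them.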

Ragas provides two key variants of this metric:

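AgentGoalAccuracyWithReference - A binary assessment (1 or 0) that compares the agent's final outcome against a predefined expected result.
AgentGoalAccuracyWithoutReference - A binary assessment (1 or 0) that evaluates whether the agent achieved the user's goal based on inferred intent rather than a predefined expectation.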

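The With Reference variant is ideal for scenarios where the expected outcome is well defined, such as controlled testing environments or tests against ground-truth data.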

from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    AgentStream, 
    ToolCall as LlamaToolCall,
    ToolCallResult,
)

handler =  agent.run(user_msg="Send a message to jhon asking for a meeting")

events = []

async for ev in handler.stream_events():
    if isinstance(ev, (AgentInput, AgentOutput, LlamaToolCall, ToolCallResult)):
        # Keep the events Ragas needs to reconstruct the conversation
        events.append(ev)
    elif isinstance(ev, AgentStream):
        # Stream the agent's text output as it is generated
        print(f"{ev.delta}", end="", flush=True)

response = await handler

Output

I have successfully sent a message to Jhon asking for a meeting.

from ragas.integrations.llama_index import convert_to_ragas_messages

ragas_messages = convert_to_ragas_messages(events)

from ragas.metrics import AgentGoalAccuracyWithoutReference
from ragas.llms import LlamaIndexLLMWrapper
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import ToolCall as RagasToolCall

evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

sample = MultiTurnSample(
    user_input=ragas_messages,
)

agent_goal_accuracy_without_reference = AgentGoalAccuracyWithoutReference(llm=evaluator_llm)
await agent_goal_accuracy_without_reference.multi_turn_ascore(sample)

Output

1.0

from ragas.metrics import AgentGoalAccuracyWithReference

sample = MultiTurnSample(
    user_input=ragas_messages,
    reference="Successfully sent a message to Jhon asking for a meeting"
)


agent_goal_accuracy_with_reference = AgentGoalAccuracyWithReference(llm=evaluator_llm)
await agent_goal_accuracy_with_reference.multi_turn_ascore(sample)

Output

1.0
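
Because both goal-accuracy variants return a binary score, results from several test conversations are often easier to read as a single success rate. The sketch below is illustrative only: the `scores` list holds made-up values and the `success_rate` helper is hypothetical, not part of Ragas.

```python
from statistics import mean

# Hypothetical binary goal-accuracy scores (1.0 = goal achieved, 0.0 = not),
# one per evaluated conversation. These values are illustrative only.
scores = [1.0, 1.0, 0.0, 1.0]


def success_rate(binary_scores):
    """Return the fraction of conversations in which the goal was achieved."""
    if not binary_scores:
        raise ValueError("need at least one score")
    return mean(binary_scores)


print(f"Goal success rate: {success_rate(scores):.2f}")  # prints "Goal success rate: 0.75"
```

In practice the list would be filled with the values returned by `multi_turn_ascore` for each sample.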

Tool Call Accuracy

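In agentic workflows, an AI agent's effectiveness depends heavily on its ability to select and use the right tools at the right time. The Tool Call Accuracy metric evaluates how precisely an agent identifies and invokes the appropriate tools, in the correct sequence, to complete a user's request. This ensures that the agent not only knows which tools are available but also how to orchestrate them effectively to achieve the intended outcome.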

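ToolCallAccuracy compares the agent's actual tool usage against a reference sequence of expected tool calls. If the agent's tool selection or sequence differs from the reference, the metric returns a score of 0, indicating that the agent did not follow the optimal path to task completion.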
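
To make the "same tools, same order" idea concrete, here is a minimal sketch of a strict sequence comparison. It is not the Ragas implementation: `SimpleToolCall` and `strict_tool_call_match` are hypothetical names, and the sketch compares tool names only. The real metric may also take tool arguments into account, which this sketch omits.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleToolCall:
    """Hypothetical stand-in for a recorded tool call (not the Ragas class)."""

    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def strict_tool_call_match(
    actual: List[SimpleToolCall], reference: List[SimpleToolCall]
) -> float:
    """Return 1.0 only if both sequences use the same tool names in the same order."""
    if len(actual) != len(reference):
        return 0.0
    for got, want in zip(actual, reference):
        if got.name != want.name:
            return 0.0
    return 1.0


reference = [SimpleToolCall("lookup_contact"), SimpleToolCall("send_message")]
in_order = [SimpleToolCall("lookup_contact"), SimpleToolCall("send_message")]
swapped = [SimpleToolCall("send_message"), SimpleToolCall("lookup_contact")]

print(strict_tool_call_match(in_order, reference))  # 1.0
print(strict_tool_call_match(swapped, reference))   # 0.0
```

The `lookup_contact` tool is invented for the example; the agent in this tutorial only exposes `send_message`.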

from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=ragas_messages,
    reference_tool_calls=[
        RagasToolCall(
            name="send_message",
            args={'to': 'jhon', 'content': 'Hi Jhon,\n\nI hope this message finds you well. I would like to schedule a meeting to discuss some important matters. Please let me know your availability.\n\nBest regards,\nJane'},
        ),
    ],
)

tool_accuracy_scorer = ToolCallAccuracy()
await tool_accuracy_scorer.multi_turn_ascore(sample)

Output

1.0

Evaluating LlamaIndex CodeAct Agents

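LlamaIndex offers a prebuilt CodeAct Agent that can write and execute code, inspired by the original CodeAct paper. The idea is that, instead of outputting a simple JSON object, a Code Agent generates an executable code block, typically in a high-level language such as Python. Writing actions in code rather than JSON-like snippets provides better: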

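Composability: Code naturally supports nesting and reuse of functions; JSON actions lack this flexibility.
Object management: Code handles operation outputs elegantly (for example, `image = generate_image()`); JSON has no clean equivalent.
Generality: Code can express any computational task; JSON imposes unnecessary constraints.
Representation in LLM training data: LLMs already understand code from their training data, which makes code a more natural interface than specialized JSON.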
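
The composability point can be illustrated with a small sketch. The JSON action format (`{"tool": ..., "args": ...}`) and the `run_json_action` dispatcher are assumptions made up for this example; they do not correspond to any specific agent framework.

```python
import json


def add(a, b):
    return a + b


def multiply(a, b):
    return a * b


TOOLS = {"add": add, "multiply": multiply}


def run_json_action(action_json: str):
    """Execute one JSON-style action of the form {"tool": name, "args": {...}}."""
    action = json.loads(action_json)
    return TOOLS[action["tool"]](**action["args"])


# JSON style: computing (2 + 3) * 4 takes two separate actions, and the
# intermediate result has to be carried between them by the caller.
step1 = run_json_action('{"tool": "add", "args": {"a": 2, "b": 3}}')
step2 = run_json_action(
    json.dumps({"tool": "multiply", "args": {"a": step1, "b": 4}})
)

# Code style: a single action composes both calls directly.
namespace = dict(TOOLS)
exec("result = multiply(add(2, 3), 4)", namespace)

print(step2, namespace["result"])  # prints "20 20"
```

Both routes reach the same answer, but the code action needs one round trip and no external bookkeeping for the intermediate value.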

Click to View the CodeActAgent Setup

Defining Functions

from llama_index.llms.openai import OpenAI

# Configure the LLM
llm = OpenAI(model="gpt-4o-mini")


# Define a few helper functions
def add(a: int, b: int) -> int:
    """Add two numbers together"""
    return a + b


def subtract(a: int, b: int) -> int:
    """Subtract two numbers"""
    return a - b


def multiply(a: int, b: int) -> int:
    """Multiply two numbers"""
    return a * b


def divide(a: int, b: int) -> float:
    """Divide two numbers"""
    return a / b

Create a Code Executor

The CodeActAgent requires a specific `code_execute_fn` to execute the code generated by the agent.

from typing import Any, Dict, Tuple
import io
import contextlib
import ast
import traceback


class SimpleCodeExecutor:
    """
    A simple code executor that runs Python code with state persistence.

    This executor maintains a global and local state between executions,
    allowing for variables to persist across multiple code runs.

    NOTE: not safe for production use! Use with caution.
    """

    def __init__(self, locals: Dict[str, Any], globals: Dict[str, Any]):
        """
        Initialize the code executor.

        Args:
            locals: Local variables to use in the execution context
            globals: Global variables to use in the execution context
        """
        # State that persists between executions
        self.globals = globals
        self.locals = locals

    def execute(self, code: str) -> str:
        """
        Execute Python code and capture output and return values.

        Args:
            code: Python code to execute

        Returns:
            str: Captured stdout/stderr output, followed by the value of the
            final expression (if any), or an error message with a traceback
        """
        # Capture stdout and stderr
        stdout = io.StringIO()
        stderr = io.StringIO()

        output = ""
        return_value = None
        try:
            # Execute with captured output
            with contextlib.redirect_stdout(
                stdout
            ), contextlib.redirect_stderr(stderr):
                # Try to detect if there's a return value (last expression)
                try:
                    tree = ast.parse(code)
                    last_node = tree.body[-1] if tree.body else None

                    # If the last statement is an expression, capture its value
                    if isinstance(last_node, ast.Expr):
                        # Split code to add a return value assignment
                        last_line = code.rstrip().split("\n")[-1]
                        exec_code = (
                            code[: -len(last_line)]
                            + "\n__result__ = "
                            + last_line
                        )

                        # Execute modified code
                        exec(exec_code, self.globals, self.locals)
                        return_value = self.locals.get("__result__")
                    else:
                        # Normal execution
                        exec(code, self.globals, self.locals)
                except SyntaxError:
                    # If parsing fails or the rewritten code is invalid,
                    # run the original code unchanged
                    exec(code, self.globals, self.locals)

            # Get output
            output = stdout.getvalue()
            if stderr.getvalue():
                output += "\n" + stderr.getvalue()

        except Exception as e:
            # Capture exception information
            output = f"Error: {type(e).__name__}: {str(e)}\n"
            output += traceback.format_exc()

        if return_value is not None:
            output += "\n\n" + str(return_value)

        return output

code_executor = SimpleCodeExecutor(
    # give access to our functions defined above
    locals={
        "add": add,
        "subtract": subtract,
        "multiply": multiply,
        "divide": divide,
    },
    globals={
        # give access to all builtins
        "__builtins__": __builtins__,
        # give access to numpy
        "np": __import__("numpy"),
    },
)

Setting Up the CodeAct Agent

from llama_index.core.agent.workflow import CodeActAgent
from llama_index.core.workflow import Context

agent = CodeActAgent(
    code_execute_fn=code_executor.execute,
    llm=llm,
    tools=[add, subtract, multiply, divide],
)

# context to hold the agent's session/state/chat history
ctx = Context(agent)

Running and Evaluating the CodeAct Agent

from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    AgentStream,
    ToolCall,
    ToolCallResult,
)

handler = agent.run("Calculate the sum of the first 10 fibonacci numbers", ctx=ctx)

events = []

async for event in handler.stream_events():
    if isinstance(event, (AgentInput, AgentOutput, ToolCall, ToolCallResult)):
        events.append(event)
    elif isinstance(event, AgentStream):
        print(f"{event.delta}", end="", flush=True)

The first 10 Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21, and 34. I will calculate their sum.

<execute>
def fibonacci(n):
    fib_sequence = [0, 1]
    for i in range(2, n):
        next_fib = fib_sequence[-1] + fib_sequence[-2]
        fib_sequence.append(next_fib)
    return fib_sequence

# Calculate the first 10 Fibonacci numbers
first_10_fib = fibonacci(10)

# Calculate the sum of the first 10 Fibonacci numbers
sum_fib = sum(first_10_fib)
print(sum_fib)
</execute>The sum of the first 10 Fibonacci numbers is 88.

Extracting the Tool Call

CodeAct_agent_tool_call = events[2]
agent_code = CodeAct_agent_tool_call.tool_kwargs["code"]

print(agent_code)

Output

    def fibonacci(n):
        fib_sequence = [0, 1]
        for i in range(2, n):
            next_fib = fib_sequence[-1] + fib_sequence[-2]
            fib_sequence.append(next_fib)
        return fib_sequence

    # Calculate the first 10 Fibonacci numbers
    first_10_fib = fibonacci(10)

    # Calculate the sum of the first 10 Fibonacci numbers
    sum_fib = sum(first_10_fib)
    print(sum_fib)

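When evaluating a CodeAct agent, we can start with foundational metrics that check basic functionality, such as whether the code compiles or whether appropriate arguments were chosen. These straightforward evaluations provide a solid foundation before moving on to more sophisticated assessment approaches.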

Ragas provides powerful custom metric capabilities that enable increasingly nuanced evaluation as your requirements evolve.

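AspectCritic - Provides a binary evaluation (pass/fail) that uses LLM-based judgment to determine whether an agent's response satisfies specific user-defined criteria, giving a clear indicator of success.
RubricScoreMetric - Evaluates agent responses against comprehensive, predefined quality rubrics with discrete scoring levels, enabling consistent performance assessment across multiple dimensions.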

def is_compilable(code_str: str, mode="exec") -> bool:
    try:
        compile(code_str, "<string>", mode)
        return True
    except Exception:
        return False

is_compilable(agent_code)

Output

True
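
Alongside compilability, another basic static check is to look at which functions the generated code calls. The following is a minimal sketch using the standard `ast` module; the `called_function_names` helper and the `FORBIDDEN` set are hypothetical choices for illustration, not part of Ragas or LlamaIndex, and the sample code is a shortened variant of the agent output shown earlier.

```python
import ast
from typing import Set


def called_function_names(code_str: str) -> Set[str]:
    """Collect the names of functions that the code calls directly by name."""
    tree = ast.parse(code_str)
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


# Assumption: names we would not want generated code to call.
FORBIDDEN = {"eval", "exec", "open", "__import__"}

sample_code = """
def fibonacci(n):
    fib_sequence = [0, 1]
    for i in range(2, n):
        fib_sequence.append(fib_sequence[-1] + fib_sequence[-2])
    return fib_sequence

print(sum(fibonacci(10)))
"""

calls = called_function_names(sample_code)
print(sorted(calls))             # ['fibonacci', 'print', 'range', 'sum']
print(bool(calls & FORBIDDEN))   # False
```

Method calls such as `fib_sequence.append(...)` are attribute calls and are deliberately not collected by this sketch.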

from ragas.metrics import AspectCritic
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LlamaIndexLLMWrapper

llm = OpenAI(model="gpt-4o-mini")
evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

correct_tool_args = AspectCritic(
    name="correct_tool_args",
    llm=evaluator_llm,
    definition="Score 1 if the tool arguments used in the tool call are correct and 0 otherwise",
)

sample = SingleTurnSample(
    user_input="Calculate the sum of the first 10 fibonacci numbers",
    response=agent_code,
)

await correct_tool_args.single_turn_ascore(sample)

Output

Evaluating Query Engine Tools

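When evaluating with Ragas metrics, we need to ensure that the data is formatted suitably for evaluation. When an agentic system uses a query engine tool, we can evaluate it as we would any retrieval-augmented generation (RAG) system.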

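We will extract every instance in which a query engine tool was called during the user interaction. From those instances, we can build a Ragas RAG evaluation dataset based on the event stream data. Once the dataset is ready, we can apply the full suite of Ragas evaluation metrics. In this section, we will set up a FunctionAgent with query engine tools. The agent has access to two "tools": one to query the 2021 Lyft 10-K and the other to query the 2021 Uber 10-K.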

Click to View the Agent Setup

Setting Up the LLMs

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

Building the Query Engine Tools

from llama_index.core import StorageContext, load_index_from_storage

try:
    storage_context = StorageContext.from_defaults(
        persist_dir="./storage/lyft"
    )
    lyft_index = load_index_from_storage(storage_context)

    storage_context = StorageContext.from_defaults(
        persist_dir="./storage/uber"
    )
    uber_index = load_index_from_storage(storage_context)

    index_loaded = True
except Exception:
    index_loaded = False

!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

if not index_loaded:
    # load data
    lyft_docs = SimpleDirectoryReader(
        input_files=["./data/10k/lyft_2021.pdf"]
    ).load_data()
    uber_docs = SimpleDirectoryReader(
        input_files=["./data/10k/uber_2021.pdf"]
    ).load_data()

    # build index
    lyft_index = VectorStoreIndex.from_documents(lyft_docs)
    uber_index = VectorStoreIndex.from_documents(uber_docs)

    # persist index
    lyft_index.storage_context.persist(persist_dir="./storage/lyft")
    uber_index.storage_context.persist(persist_dir="./storage/uber")

lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)
uber_engine = uber_index.as_query_engine(similarity_top_k=3)

from llama_index.core.tools import QueryEngineTool

query_engine_tools = [
    QueryEngineTool.from_defaults(
        query_engine=lyft_engine,
        name="lyft_10k",
        description=(
            "Provides information about Lyft financials for year 2021. "
            "Use a detailed plain text question as input to the tool."
        ),
    ),
    QueryEngineTool.from_defaults(
        query_engine=uber_engine,
        name="uber_10k",
        description=(
            "Provides information about Uber financials for year 2021. "
            "Use a detailed plain text question as input to the tool."
        ),
    ),
]

Agent Setup

from llama_index.core.agent.workflow import FunctionAgent, ReActAgent
from llama_index.core.workflow import Context

agent = FunctionAgent(tools=query_engine_tools, llm=OpenAI(model="gpt-4o-mini"))

# context to hold the session/state
ctx = Context(agent)

Running and Evaluating the Agent

from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    ToolCall,
    ToolCallResult,
    AgentStream, 
)

handler = agent.run("What's the revenue for Lyft in 2021 vs Uber?", ctx=ctx)

events = []

async for ev in handler.stream_events():
    if isinstance(ev, (AgentInput, AgentOutput, ToolCall, ToolCallResult)):
        events.append(ev)
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)

response = await handler

Output

In 2021, Lyft generated a total revenue of $3.21 billion, while Uber's total revenue was significantly higher at $17.455 billion.

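We will extract all `ToolCallResult` instances in which a query engine tool was called during the user interaction, and use them to build a RAG evaluation dataset from the event stream data.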

from ragas.dataset_schema import SingleTurnSample

ragas_samples = []

for event in events:
    if isinstance(event, ToolCallResult):
        if event.tool_name in ["lyft_10k", "uber_10k"]:
            sample = SingleTurnSample(
                user_input=event.tool_kwargs["input"],
                response=event.tool_output.content,
                retrieved_contexts=[
                    node.text
                    for node in event.tool_output.raw_output.source_nodes
                ],
            )
            ragas_samples.append(sample)

from ragas.dataset_schema import EvaluationDataset

dataset = EvaluationDataset(samples=ragas_samples)
dataset.to_pandas()

Output

	user_input	retrieved_contexts	response
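0	What was the total revenue for Uber in the year...	[Financial and Operational Highlights\nYear End...	The total revenue for Uber in the year 2021 was...
1	What was the total revenue for Lyft in the year...	[Significant items\nsubject to estimates and...	The total revenue for Lyft in the year 2021 was...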

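By default, the resulting dataset will not include reference answers, so we are limited to metrics that do not require references. If you want to run reference-based evaluations, you can add a reference column to the dataset and then apply the relevant Ragas metrics.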
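
As a rough illustration of adding a reference column, the sketch below attaches reference answers to plain dictionaries that mirror the dataset columns. Everything here is hypothetical: the row values are placeholders, the `references` mapping stands in for answers you would write yourself, and `attach_references` is not a Ragas function.

```python
# Hypothetical rows mirroring the dataset columns; the text values are placeholders.
rows = [
    {
        "user_input": "<question sent to uber_10k>",
        "retrieved_contexts": ["<retrieved context>"],
        "response": "<query engine answer>",
    },
    {
        "user_input": "<question sent to lyft_10k>",
        "retrieved_contexts": ["<retrieved context>"],
        "response": "<query engine answer>",
    },
]

# Hypothetical human-written reference answers, keyed by the question text.
references = {
    "<question sent to uber_10k>": "<reference answer for Uber>",
    "<question sent to lyft_10k>": "<reference answer for Lyft>",
}


def attach_references(rows, references):
    """Return copies of the rows with a `reference` field added where one is known."""
    enriched = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay untouched
        if row["user_input"] in references:
            new_row["reference"] = references[row["user_input"]]
        enriched.append(new_row)
    return enriched


rows_with_refs = attach_references(rows, references)
print([sorted(row.keys()) for row in rows_with_refs])
```

Each enriched row could then be turned back into a sample, for example by passing `reference=` when constructing a `SingleTurnSample`.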

Evaluating Using Ragas RAG Metrics

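Let's assess the effectiveness of the query engines, particularly regarding retrieval quality and hallucination prevention. For this evaluation we will use two key Ragas metrics: faithfulness and context relevance. More information is available in the Ragas documentation.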

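This evaluation approach lets us identify potential issues with retrieval quality or response generation that could affect overall system performance.

Faithfulness - Measures how accurately the generated response adheres to the facts presented in the retrieved context, ensuring that the claims the system makes are directly supported by the information provided.
Context Relevance - Evaluates how effectively the retrieved information addresses the user's specific query by assessing its relevance through a dual LLM judgment mechanism.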
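
The two definitions above can be made concrete with a small arithmetic sketch. It only shows how per-claim verdicts and judge ratings could be turned into scores; the LLM calls that produce those verdicts are left out. The helper names are hypothetical, and the 0 to 2 rating scale for each judge is an assumption of this sketch.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of the response's claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("the response must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


def context_relevance_score(judge_ratings: List[int], max_rating: int = 2) -> float:
    """Average of the judges' ratings after scaling each one to the range [0, 1]."""
    if not judge_ratings:
        raise ValueError("need at least one judge rating")
    return sum(r / max_rating for r in judge_ratings) / len(judge_ratings)


# Illustrative inputs: 3 of 4 claims are supported; two judges rate 2 and 1.
print(faithfulness_score([True, True, False, True]))  # 0.75
print(context_relevance_score([2, 1]))                # 0.75
```

A faithfulness score below 1.0 points at unsupported claims in the response, while a low context relevance score points at the retrieval step.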

from ragas import evaluate
from ragas.metrics import Faithfulness, ContextRelevance
from ragas.llms import LlamaIndexLLMWrapper
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

faithfulness = Faithfulness(llm=evaluator_llm)
context_relevance = ContextRelevance(llm=evaluator_llm)

result = evaluate(dataset, metrics=[faithfulness, context_relevance])

Evaluating: 100%|██████████| 4/4 [00:03<00:00,  1.19it/s]

result.to_pandas()

Output

	user_input	retrieved_contexts	response	faithfulness	nv_context_relevance
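0	What was the total revenue for Uber in the year...	[Financial and Operational Highlights\nYear End...	The total revenue for Uber in the year 2021 was...	1.0	1.0
1	What was the total revenue for Lyft in the year...	[Significant items\nsubject to estimates and...	The total revenue for Lyft in the year 2021 was...	1.0	1.0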