Agents or Tool Use

Agentic or tool-use workflows can be evaluated along several dimensions. Below are some metrics that can be used to evaluate the performance of an agent or tool in a given task.

Topic Adherence

AI systems deployed in real-world applications are expected to stick to the domain of interest while interacting with users, but LLMs sometimes ignore this restriction and answer general queries. The topic adherence metric evaluates the AI's ability to stay within predefined domains during an interaction. This metric is especially important for conversational AI systems, which should only assist with queries related to the predefined domains.

`TopicAdherenceScore` requires a predefined set of topics that the AI system should adhere to, supplied via `reference_topics`, along with the `user_input`. The metric can compute precision, recall, and F1 score for topic adherence, defined as follows:

\[ \text{Precision} = \frac{|\text{Queries answered that adhere to any reference topic}|}{|\text{Queries answered that adhere to any reference topic}| + |\text{Queries answered that do not adhere to any reference topic}|} \]
\[ \text{Recall} = \frac{|\text{Queries answered that adhere to any reference topic}|}{|\text{Queries answered that adhere to any reference topic}| + |\text{Queries refused that should have been answered}|} \]
\[ \text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Example

import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import TopicAdherenceScore
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI


evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))


async def evaluate_topic_adherence():

    sample_input_4 = [
        HumanMessage(
            content="Can you provide me with details about Einstein's theory of relativity?"
        ),
        AIMessage(
            content="Sure, let me retrieve the relevant information for you.",
            tool_calls=[
                ToolCall(
                    name="document_search",
                    args={"query": "Einstein's theory of relativity"},
                )
            ],
        ),
        ToolMessage(
            content="Found relevant documents: 1. Relativity: The Special and the General Theory, 2. General Theory of Relativity by A. Einstein."
        ),
        AIMessage(
            content="I found some documents on Einstein's theory of relativity. Which one would you like to know more about: 'Relativity: The Special and the General Theory' or 'General Theory of Relativity by A. Einstein'?"
        ),
        HumanMessage(content="Tell me about the 'General Theory of Relativity'."),
        AIMessage(
            content="Got it! Let me fetch more details from 'General Theory of Relativity by A. Einstein'.",
            tool_calls=[
                ToolCall(
                    name="document_retrieve",
                    args={"document": "General Theory of Relativity by A. Einstein"},
                )
            ],
        ),
        ToolMessage(
            content="The document discusses how gravity affects the fabric of spacetime, describing the relationship between mass and spacetime curvature."
        ),
        AIMessage(
            content="The 'General Theory of Relativity' explains how gravity affects the fabric of spacetime and the relationship between mass and spacetime curvature. Would you like more details or a specific explanation?"
        ),
        HumanMessage(
            content="No, that's perfect. By the way, do you know any good recipes for a chocolate cake?"
        ),
        AIMessage(
            content="Sure! Let me find a simple and delicious recipe for a chocolate cake.",
            tool_calls=[
                ToolCall(name="recipe_search", args={"query": "chocolate cake recipe"})
            ],
        ),
        ToolMessage(
            content="Here’s a popular recipe for a chocolate cake: Ingredients include flour, sugar, cocoa powder, eggs, milk, and butter. Instructions: Mix dry ingredients, add wet ingredients, and bake at 350°F for 30-35 minutes."
        ),
        AIMessage(
            content="I found a great recipe for chocolate cake! Would you like the full details, or is that summary enough?"
        ),
    ]

    sample = MultiTurnSample(user_input=sample_input_4, reference_topics=["science"])
    scorer = TopicAdherenceScore(llm=evaluator_llm, mode="precision")
    score = await scorer.multi_turn_ascore(sample)
    print(score)


if __name__ == "__main__":
    asyncio.run(evaluate_topic_adherence())
Output
0.6666666666444444
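
This precision score of roughly 0.67 is consistent with the formula above, assuming the evaluator LLM judges the two relativity queries as answered and adhering to the "science" topic, and the chocolate cake query as answered but not adhering:

\[ \text{Precision} = \frac{2}{2 + 1} \approx 0.67 \]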

To change the mode to recall, set the `mode` parameter to `"recall"`:

scorer = TopicAdherenceScore(llm=evaluator_llm, mode="recall")
Output
0.99999999995
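
A recall this close to 1.0 follows from the recall formula above: the evaluator found no queries that were refused but should have been answered. Note that the answered off-topic cake query does not affect recall; it only lowers precision.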

Tool Call Accuracy

`ToolCallAccuracy` is a metric that can be used to evaluate the LLM's performance in identifying and calling the tools required to complete a given task. It needs the `user_input` and the `reference_tool_calls`, and is computed by comparing the `reference_tool_calls` against the tool calls actually made by the AI. The score ranges from 0 to 1, with higher values indicating better performance.

import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy


async def evaluate_tool_call_accuracy():
    sample = [
        HumanMessage(content="What's the weather like in New York right now?"),
        AIMessage(
            content="The current temperature in New York is 75°F and it's partly cloudy.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
        ),
        HumanMessage(content="Can you translate that to Celsius?"),
        AIMessage(
            content="Let me convert that to Celsius for you.",
            tool_calls=[
                ToolCall(
                    name="temperature_conversion", args={"temperature_fahrenheit": 75}
                )
            ],
        ),
        ToolMessage(content="75°F is approximately 23.9°C."),
        AIMessage(content="75°F is approximately 23.9°C."),
    ]

    sample = MultiTurnSample(
        user_input=sample,
        reference_tool_calls=[
            ToolCall(name="weather_check", args={"location": "New York"}),
            ToolCall(
                name="temperature_conversion", args={"temperature_fahrenheit": 75}
            ),
        ],
    )

    scorer = ToolCallAccuracy()
    score = await scorer.multi_turn_ascore(sample)
    print(score)


if __name__ == "__main__":
    asyncio.run(evaluate_tool_call_accuracy())
Output
1.0

The sequence of tool calls specified in `reference_tool_calls` is used as the ideal outcome. If the tool calls made by the AI do not match the order or sequence of the `reference_tool_calls`, the metric returns a score of 0. This helps ensure that the AI identifies and calls the required tools in the correct order to complete a given task.
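
As a minimal sketch of this order sensitivity (a hypothetical variation on the example above, reusing the same conversation but reversing the reference order; the 0.0 outcome follows from the rule just described):

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy

# The same conversation as in the example above.
messages = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(
        content="The current temperature in New York is 75°F and it's partly cloudy.",
        tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
    ),
    HumanMessage(content="Can you translate that to Celsius?"),
    AIMessage(
        content="Let me convert that to Celsius for you.",
        tool_calls=[
            ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
        ],
    ),
    ToolMessage(content="75°F is approximately 23.9°C."),
    AIMessage(content="75°F is approximately 23.9°C."),
]

# Hypothetical: the reference lists the same two calls, but in reverse order.
reversed_sample = MultiTurnSample(
    user_input=messages,
    reference_tool_calls=[
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75}),
        ToolCall(name="weather_check", args={"location": "New York"}),
    ],
)

# score = await ToolCallAccuracy().multi_turn_ascore(reversed_sample)
# Expected: 0.0, because the call sequence does not match the reference order.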

By default, tool names and arguments are compared using exact string matching. This is not always optimal, for example when the arguments are natural-language strings. You can also use any ragas metric that produces values between 0 and 1 as the distance measure for comparing arguments. For example,

from ragas.metrics._string import NonLLMStringSimilarity
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

metric = ToolCallAccuracy()
metric.arg_comparison_metric = NonLLMStringSimilarity()
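
A sketch of where this helps, under the assumption that the arguments differ only in surface form ("NYC" vs "New York City"); the exact score depends on the similarity measure:

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics._string import NonLLMStringSimilarity
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

# Exact string matching would score this 0, since "NYC" != "New York City",
# even though the tool call is essentially correct.
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather like in New York City?"),
        AIMessage(
            content="Let me check.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "NYC"})],
        ),
    ],
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York City"})
    ],
)

metric = ToolCallAccuracy()
metric.arg_comparison_metric = NonLLMStringSimilarity()
# score = await metric.multi_turn_ascore(sample)
# Partial credit for the near-matching argument instead of a hard 0.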

Tool Call F1

`ToolCallF1` is a metric that returns an F1 score based on the precision and recall of the tool calls made by the agent, compared against a set of expected calls (`reference_tool_calls`). While `ToolCallAccuracy` yields a binary score based on exact order and content matching, `ToolCallF1` complements it with a softer evaluation that is useful in early stages and during iteration. It helps quantify how close the agent's behavior is to the expected behavior, even when the agent over-calls or under-calls tools.

Formula

`ToolCallF1` is based on classic information-retrieval (IR) metrics. It uses unordered matching: the order of the tool calls does not affect the result; only the presence and correctness of tool names and arguments are considered.

\[ \text{Precision} = \frac{\text{tool calls matched in both name and arguments}}{\text{tool calls matched in both name and arguments} + \text{unexpected extra tool calls made}} \]
\[ \text{Recall} = \frac{\text{tool calls matched in both name and arguments}}{\text{tool calls matched in both name and arguments} + \text{expected tool calls that were not made}} \]
\[ \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
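
For instance, suppose the reference expects three tool calls and the agent also makes three, of which two match in both name and arguments (so one extra call was made and one expected call was missed):

\[ \text{Precision} = \frac{2}{2 + 1} = \frac{2}{3}, \qquad \text{Recall} = \frac{2}{2 + 1} = \frac{2}{3}, \qquad \text{F1} = \frac{2 \cdot \frac{2}{3} \cdot \frac{2}{3}}{\frac{2}{3} + \frac{2}{3}} = \frac{2}{3} \approx 0.67 \]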

How does it differ from Topic Adherence?

Although both `ToolCallF1` and `TopicAdherenceScore` use precision, recall, and F1 score, they evaluate different aspects:

| Metric | What it evaluates | Based on |
| --- | --- | --- |
| `ToolCallF1` | Correctness of tool execution | Structured tool call objects |
| `TopicAdherenceScore` | Whether the conversation stays on topic | Comparison against domain topics |

Use `ToolCallF1` when you want to track whether the agent **executed tools** correctly. Use `TopicAdherenceScore` when evaluating whether the **content or intent** stays within the allowed topics.
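
Both checks can also be run side by side. A minimal sketch, assuming the `MultiTurnSample` defines both `reference_tool_calls` and `reference_topics`, with the evaluator LLM set up as in the earlier examples (`ToolCallF1` itself needs no LLM):

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import MultiTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import ToolCallF1, TopicAdherenceScore

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))


async def evaluate_both(sample: MultiTurnSample) -> None:
    # Structural check: were the right tools called with the right arguments?
    tool_f1 = await ToolCallF1().multi_turn_ascore(sample)
    # Content check: did the conversation stay within the reference topics?
    adherence = await TopicAdherenceScore(
        llm=evaluator_llm, mode="precision"
    ).multi_turn_ascore(sample)
    print(f"tool call F1: {tool_f1}, topic adherence: {adherence}")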

Example: matching the expected tool calls

from ragas.metrics import ToolCallF1
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall

sample = [
    HumanMessage(content="What's the weather like in Paris today?"),
    AIMessage(content="Let me check that for you.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Paris"})
    ]),
    HumanMessage(content="And the UV index?"),
    AIMessage(content="Sure, here's the UV index for Paris.", tool_calls=[
        ToolCall(name="uv_index_lookup", args={"location": "Paris"})
    ])
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "Paris"}),
        ToolCall(name="uv_index_lookup", args={"location": "Paris"})
    ]
)

scorer = ToolCallF1()
await scorer.multi_turn_ascore(sample)  # top-level await: run in an async context (e.g., a notebook)

Output

1.0

Example: an extra tool was called

sample = [
    HumanMessage(content="What's the weather like in Paris today?"),
    AIMessage(content="Let me check that for you.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Paris"})
    ]),
    HumanMessage(content="And the UV index?"),
    AIMessage(content="Sure, here's the UV index for Paris.", tool_calls=[
        ToolCall(name="uv_index_lookup", args={"location": "Paris"}),
        ToolCall(name="air_quality", args={"location": "Paris"})  # extra call
    ])
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="uv_index_lookup", args={"location": "Paris"}),
        ToolCall(name="weather_check", args={"location": "Paris"})
    ]
)

await scorer.multi_turn_ascore(sample)  # reuses the scorer from the previous example

Output

0.67

In this case, the agent called the two correct tools but added an extra `air_quality` call. The F1 score reflects this partial correctness rather than failing the example outright.

Agent Goal Accuracy

Agent goal accuracy is a metric that can be used to evaluate the LLM's performance in identifying and achieving the user's goals. It is a binary metric: 1 means the AI achieved the goal, 0 means it did not.

With reference

`AgentGoalAccuracyWithReference` needs the `user_input` and a `reference` to evaluate the LLM's performance in identifying and achieving the user's goals. The annotated `reference` is used as the ideal outcome, and the metric is computed by comparing the `reference` with the goal actually achieved at the end of the workflow.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import MultiTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import AgentGoalAccuracyWithReference

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))


async def evaluate_agent_goal_accuracy_with_reference():
    sample = MultiTurnSample(
        user_input=[
            HumanMessage(
                content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"
            ),
            AIMessage(
                content="Sure, let me find the best options for you.",
                tool_calls=[
                    ToolCall(
                        name="restaurant_search",
                        args={"cuisine": "Chinese", "time": "8:00pm"},
                    )
                ],
            ),
            ToolMessage(
                content="Found a few options: 1. Golden Dragon, 2. Jade Palace"
            ),
            AIMessage(
                content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"
            ),
            HumanMessage(content="Let's go with Golden Dragon."),
            AIMessage(
                content="Great choice! I'll book a table for 8:00pm at Golden Dragon.",
                tool_calls=[
                    ToolCall(
                        name="restaurant_book",
                        args={"name": "Golden Dragon", "time": "8:00pm"},
                    )
                ],
            ),
            ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
            AIMessage(
                content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"
            ),
            HumanMessage(content="thanks"),
        ],
        reference="Table booked at one of the chinese restaurants at 8 pm",
    )

    scorer = AgentGoalAccuracyWithReference(llm=evaluator_llm)
    score = await scorer.multi_turn_ascore(sample)
    print(score)


if __name__ == "__main__":
    asyncio.run(evaluate_agent_goal_accuracy_with_reference())
Output
1.0

Without reference

`AgentGoalAccuracyWithoutReference` works without a reference: it evaluates the LLM's performance in identifying and achieving the user's goals without any annotated reference. In this mode, the desired outcome is inferred from the human interactions in the workflow.

Example

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import MultiTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import AgentGoalAccuracyWithoutReference

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))


async def evaluate_agent_goal_accuracy_without_reference():

    sample = MultiTurnSample(
        user_input=[
            HumanMessage(
                content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"
            ),
            AIMessage(
                content="Sure, let me find the best options for you.",
                tool_calls=[
                    ToolCall(
                        name="restaurant_search",
                        args={"cuisine": "Chinese", "time": "8:00pm"},
                    )
                ],
            ),
            ToolMessage(
                content="Found a few options: 1. Golden Dragon, 2. Jade Palace"
            ),
            AIMessage(
                content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"
            ),
            HumanMessage(content="Let's go with Golden Dragon."),
            AIMessage(
                content="Great choice! I'll book a table for 8:00pm at Golden Dragon.",
                tool_calls=[
                    ToolCall(
                        name="restaurant_book",
                        args={"name": "Golden Dragon", "time": "8:00pm"},
                    )
                ],
            ),
            ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
            AIMessage(
                content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"
            ),
            HumanMessage(content="thanks"),
        ]
    )

    scorer = AgentGoalAccuracyWithoutReference(llm=evaluator_llm)
    score = await scorer.multi_turn_ascore(sample)
    print(score)

if __name__ == "__main__":
    asyncio.run(evaluate_agent_goal_accuracy_without_reference())
Output
1.0