Agents or Tool Use

Agentic or tool-use workflows can be evaluated along several dimensions. Below are some metrics that can be used to evaluate the performance of an agent or tool in a given task.

Topic Adherence

AI systems deployed in real-world applications are expected to stick to the domain of interest while interacting with users, but LLMs sometimes ignore this restriction and answer general queries. The topic adherence metric evaluates the AI's ability to stay within predefined domains during an interaction. This metric is especially important for conversational AI systems, which should only assist with queries related to the predefined domains.

`TopicAdherenceScore` requires a predefined set of topics that the AI system should adhere to, supplied via `reference_topics`, along with the `user_input`. The metric can compute precision, recall, and F1 score for topic adherence, defined as follows:

\[ \text{Precision} = \frac{|\text{Queries answered that adhere to any reference topic}|}{|\text{Queries answered that adhere to any reference topic}| + |\text{Queries answered that do not adhere to any reference topic}|} \]
\[ \text{Recall} = \frac{|\text{Queries answered that adhere to any reference topic}|}{|\text{Queries answered that adhere to any reference topic}| + |\text{Queries refused that should have been answered}|} \]
\[ \text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Example

import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import TopicAdherenceScore
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI


evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))


async def evaluate_topic_adherence():

    sample_input_4 = [
        HumanMessage(
            content="Can you provide me with details about Einstein's theory of relativity?"
        ),
        AIMessage(
            content="Sure, let me retrieve the relevant information for you.",
            tool_calls=[
                ToolCall(
                    name="document_search",
                    args={"query": "Einstein's theory of relativity"},
                )
            ],
        ),
        ToolMessage(
            content="Found relevant documents: 1. Relativity: The Special and the General Theory, 2. General Theory of Relativity by A. Einstein."
        ),
        AIMessage(
            content="I found some documents on Einstein's theory of relativity. Which one would you like to know more about: 'Relativity: The Special and the General Theory' or 'General Theory of Relativity by A. Einstein'?"
        ),
        HumanMessage(content="Tell me about the 'General Theory of Relativity'."),
        AIMessage(
            content="Got it! Let me fetch more details from 'General Theory of Relativity by A. Einstein'.",
            tool_calls=[
                ToolCall(
                    name="document_retrieve",
                    args={"document": "General Theory of Relativity by A. Einstein"},
                )
            ],
        ),
        ToolMessage(
            content="The document discusses how gravity affects the fabric of spacetime, describing the relationship between mass and spacetime curvature."
        ),
        AIMessage(
            content="The 'General Theory of Relativity' explains how gravity affects the fabric of spacetime and the relationship between mass and spacetime curvature. Would you like more details or a specific explanation?"
        ),
        HumanMessage(
            content="No, that's perfect. By the way, do you know any good recipes for a chocolate cake?"
        ),
        AIMessage(
            content="Sure! Let me find a simple and delicious recipe for a chocolate cake.",
            tool_calls=[
                ToolCall(name="recipe_search", args={"query": "chocolate cake recipe"})
            ],
        ),
        ToolMessage(
            content="Here’s a popular recipe for a chocolate cake: Ingredients include flour, sugar, cocoa powder, eggs, milk, and butter. Instructions: Mix dry ingredients, add wet ingredients, and bake at 350°F for 30-35 minutes."
        ),
        AIMessage(
            content="I found a great recipe for chocolate cake! Would you like the full details, or is that summary enough?"
        ),
    ]

    sample = MultiTurnSample(user_input=sample_input_4, reference_topics=["science"])
    scorer = TopicAdherenceScore(llm=evaluator_llm, mode="precision")
    score = await scorer.multi_turn_ascore(sample)
    print(score)


if __name__ == "__main__":
    asyncio.run(evaluate_topic_adherence())
Output
0.6666666666444444
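
This precision score of roughly 0.67 is consistent with the formula above, assuming the evaluator LLM judges the two relativity queries as answered and adhering to the "science" topic, and the chocolate cake query as answered but not adhering:

\[ \text{Precision} = \frac{2}{2 + 1} \approx 0.67 \]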

To change the mode to recall, set the `mode` parameter to `"recall"`:

scorer = TopicAdherenceScore(llm=evaluator_llm, mode="recall")
Output
0.99999999995
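
A recall this close to 1.0 follows from the recall formula above: the evaluator found no queries that were refused but should have been answered. Note that the answered off-topic cake query does not affect recall; it only lowers precision.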

Tool Call Accuracy

`ToolCallAccuracy` is a metric that can be used to evaluate the LLM's performance in identifying and calling the tools required to complete a given task. It needs the `user_input` and the `reference_tool_calls`, and is computed by comparing the `reference_tool_calls` against the tool calls actually made by the AI. The score ranges from 0 to 1, with higher values indicating better performance.

import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy


async def evaluate_tool_call_accuracy():
    sample = [
        HumanMessage(content="What's the weather like in New York right now?"),
        AIMessage(
            content="The current temperature in New York is 75°F and it's partly cloudy.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
        ),
        HumanMessage(content="Can you translate that to Celsius?"),
        AIMessage(
            content="Let me convert that to Celsius for you.",
            tool_calls=[
                ToolCall(
                    name="temperature_conversion", args={"temperature_fahrenheit": 75}
                )
            ],
        ),
        ToolMessage(content="75°F is approximately 23.9°C."),
        AIMessage(content="75°F is approximately 23.9°C."),
    ]

    sample = MultiTurnSample(
        user_input=sample,
        reference_tool_calls=[
            ToolCall(name="weather_check", args={"location": "New York"}),
            ToolCall(
                name="temperature_conversion", args={"temperature_fahrenheit": 75}
            ),
        ],
    )

    scorer = ToolCallAccuracy()
    score = await scorer.multi_turn_ascore(sample)
    print(score)


if __name__ == "__main__":
    asyncio.run(evaluate_tool_call_accuracy())
Output
1.0

The sequence of tool calls specified in `reference_tool_calls` is used as the ideal outcome. If the tool calls made by the AI do not match the order or sequence of the `reference_tool_calls`, the metric returns a score of 0. This helps ensure that the AI identifies and calls the required tools in the correct order to complete a given task.
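
As a minimal sketch of this order sensitivity (a hypothetical variation on the example above, reusing the same conversation but reversing the reference order; the 0.0 outcome follows from the rule just described):

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy

# The same conversation as in the example above.
messages = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(
        content="The current temperature in New York is 75°F and it's partly cloudy.",
        tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
    ),
    HumanMessage(content="Can you translate that to Celsius?"),
    AIMessage(
        content="Let me convert that to Celsius for you.",
        tool_calls=[
            ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
        ],
    ),
    ToolMessage(content="75°F is approximately 23.9°C."),
    AIMessage(content="75°F is approximately 23.9°C."),
]

# Hypothetical: the reference lists the same two calls, but in reverse order.
reversed_sample = MultiTurnSample(
    user_input=messages,
    reference_tool_calls=[
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75}),
        ToolCall(name="weather_check", args={"location": "New York"}),
    ],
)

# score = await ToolCallAccuracy().multi_turn_ascore(reversed_sample)
# Expected: 0.0, because the call sequence does not match the reference order.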

By default, tool names and arguments are compared using exact string matching. This is not always optimal, for example when the arguments are natural-language strings. You can also use any ragas metric that produces values between 0 and 1 as the distance measure for comparing arguments. For example,

from ragas.metrics._string import NonLLMStringSimilarity
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

metric = ToolCallAccuracy()
metric.arg_comparison_metric = NonLLMStringSimilarity()
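
A sketch of where this helps, under the assumption that the arguments differ only in surface form ("NYC" vs "New York City"); the exact score depends on the similarity measure:

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics._string import NonLLMStringSimilarity
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

# Exact string matching would score this 0, since "NYC" != "New York City",
# even though the tool call is essentially correct.
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather like in New York City?"),
        AIMessage(
            content="Let me check.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "NYC"})],
        ),
    ],
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York City"})
    ],
)

metric = ToolCallAccuracy()
metric.arg_comparison_metric = NonLLMStringSimilarity()
# score = await metric.multi_turn_ascore(sample)
# Partial credit for the near-matching argument instead of a hard 0.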

Tool Call F1

`ToolCallF1` is a metric that returns an F1 score based on the precision and recall of the tool calls made by the agent, compared against a set of expected calls (`reference_tool_calls`). While `ToolCallAccuracy` yields a binary score based on exact order and content matching, `ToolCallF1` complements it with a softer evaluation that is useful in early stages and during iteration. It helps quantify how close the agent's behavior is to the expected behavior, even when the agent over-calls or under-calls tools.

Formula

`ToolCallF1` is based on classic information-retrieval (IR) metrics. It uses unordered matching: the order of the tool calls does not affect the result; only the presence and correctness of tool names and arguments are considered.

\[ \text{Precision} = \frac{\text{tool calls matched in both name and arguments}}{\text{tool calls matched in both name and arguments} + \text{unexpected extra tool calls made}} \]
\[ \text{Recall} = \frac{\text{tool calls matched in both name and arguments}}{\text{tool calls matched in both name and arguments} + \text{expected tool calls that were not made}} \]
\[ \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
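
For instance, suppose the reference expects three tool calls and the agent also makes three, of which two match in both name and arguments (so one extra call was made and one expected call was missed):

\[ \text{Precision} = \frac{2}{2 + 1} = \frac{2}{3}, \qquad \text{Recall} = \frac{2}{2 + 1} = \frac{2}{3}, \qquad \text{F1} = \frac{2 \cdot \frac{2}{3} \cdot \frac{2}{3}}{\frac{2}{3} + \frac{2}{3}} = \frac{2}{3} \approx 0.67 \]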

How does it differ from Topic Adherence?

Although both `ToolCallF1` and `TopicAdherenceScore` use precision, recall, and F1 score, they evaluate different aspects:

| Metric | What it evaluates | Based on |
| --- | --- | --- |
| `ToolCallF1` | Correctness of tool execution | Structured tool call objects |
| `TopicAdherenceScore` | Whether the conversation stays on topic | Comparison against domain topics |

Use `ToolCallF1` when you want to track whether the agent **executed tools** correctly. Use `TopicAdherenceScore` when evaluating whether the **content or intent** stays within the allowed topics.
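
Both checks can also be run side by side. A minimal sketch, assuming the `MultiTurnSample` defines both `reference_tool_calls` and `reference_topics`, with the evaluator LLM set up as in the earlier examples (`ToolCallF1` itself needs no LLM):

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import MultiTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import ToolCallF1, TopicAdherenceScore

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))


async def evaluate_both(sample: MultiTurnSample) -> None:
    # Structural check: were the right tools called with the right arguments?
    tool_f1 = await ToolCallF1().multi_turn_ascore(sample)
    # Content check: did the conversation stay within the reference topics?
    adherence = await TopicAdherenceScore(
        llm=evaluator_llm, mode="precision"
    ).multi_turn_ascore(sample)
    print(f"tool call F1: {tool_f1}, topic adherence: {adherence}")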

Example: matching the expected tool calls

from ragas.metrics import ToolCallF1
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall

sample = [
    HumanMessage(content="What's the weather like in Paris today?"),
    AIMessage(content="Let me check that for you.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Paris"})
    ]),
    HumanMessage(content="And the UV index?"),
    AIMessage(content="Sure, here's the UV index for Paris.", tool_calls=[
        ToolCall(name="uv_index_lookup", args={"location": "Paris"})
    ])
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "Paris"}),
        ToolCall(name="uv_index_lookup", args={"location": "Paris"})
    ]
)

scorer = ToolCallF1()
await scorer.multi_turn_ascore(sample)  # top-level await: run in an async context (e.g., a notebook)

Output

1.0

Example: an extra tool was called

sample = [
    HumanMessage(content="What's the weather like in Paris today?"),
    AIMessage(content="Let me check that for you.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Paris"})
    ]),
    HumanMessage(content="And the UV index?"),
    AIMessage(content="Sure, here's the UV index for Paris.", tool_calls=[
        ToolCall(name="uv_index_lookup", args={"location": "Paris"}),
        ToolCall(name="air_quality", args={"location": "Paris"})  # extra call
    ])
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="uv_index_lookup", args={"location": "Paris"}),
        ToolCall(name="weather_check", args={"location": "Paris"})
    ]
)

await scorer.multi_turn_ascore(sample)  # reuses the scorer from the previous example

Output

0.67

In this case, the agent called the two correct tools but added an extra `air_quality` call. The F1 score reflects this partial correctness rather than failing the example outright.

Agent Goal Accuracy

Agent goal accuracy is a metric that can be used to evaluate the LLM's performance in identifying and achieving the user's goals. It is a binary metric: 1 means the AI achieved the goal, 0 means it did not.

With reference

`AgentGoalAccuracyWithReference` needs the `user_input` and a `reference` to evaluate the LLM's performance in identifying and achieving the user's goals. The annotated `reference` is used as the ideal outcome, and the metric is computed by comparing the `reference` with the goal actually achieved at the end of the workflow.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import MultiTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import AgentGoalAccuracyWithReference

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))


async def evaluate_agent_goal_accuracy_with_reference():
    sample = MultiTurnSample(
        user_input=[
            HumanMessage(
                content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"
            ),
            AIMessage(
                content="Sure, let me find the best options for you.",
                tool_calls=[
                    ToolCall(
                        name="restaurant_search",
                        args={"cuisine": "Chinese", "time": "8:00pm"},
                    )
                ],
            ),
            ToolMessage(
                content="Found a few options: 1. Golden Dragon, 2. Jade Palace"
            ),
            AIMessage(
                content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"
            ),
            HumanMessage(content="Let's go with Golden Dragon."),
            AIMessage(
                content="Great choice! I'll book a table for 8:00pm at Golden Dragon.",
                tool_calls=[
                    ToolCall(
                        name="restaurant_book",
                        args={"name": "Golden Dragon", "time": "8:00pm"},
                    )
                ],
            ),
            ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
            AIMessage(
                content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"
            ),
            HumanMessage(content="thanks"),
        ],
        reference="Table booked at one of the chinese restaurants at 8 pm",
    )

    scorer = AgentGoalAccuracyWithReference(llm=evaluator_llm)
    score = await scorer.multi_turn_ascore(sample)
    print(score)


if __name__ == "__main__":
    asyncio.run(evaluate_agent_goal_accuracy_with_reference())
Output
1.0

Without reference

`AgentGoalAccuracyWithoutReference` works without a reference: it evaluates the LLM's performance in identifying and achieving the user's goals without any annotated reference. In this mode, the desired outcome is inferred from the human interactions in the workflow.

Example

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import MultiTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import AgentGoalAccuracyWithoutReference

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))


async def evaluate_agent_goal_accuracy_without_reference():

    sample = MultiTurnSample(
        user_input=[
            HumanMessage(
                content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"
            ),
            AIMessage(
                content="Sure, let me find the best options for you.",
                tool_calls=[
                    ToolCall(
                        name="restaurant_search",
                        args={"cuisine": "Chinese", "time": "8:00pm"},
                    )
                ],
            ),
            ToolMessage(
                content="Found a few options: 1. Golden Dragon, 2. Jade Palace"
            ),
            AIMessage(
                content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"
            ),
            HumanMessage(content="Let's go with Golden Dragon."),
            AIMessage(
                content="Great choice! I'll book a table for 8:00pm at Golden Dragon.",
                tool_calls=[
                    ToolCall(
                        name="restaurant_book",
                        args={"name": "Golden Dragon", "time": "8:00pm"},
                    )
                ],
            ),
            ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
            AIMessage(
                content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"
            ),
            HumanMessage(content="thanks"),
        ]
    )

    scorer = AgentGoalAccuracyWithoutReference(llm=evaluator_llm)
    score = await scorer.multi_turn_ascore(sample)
    print(score)

if __name__ == "__main__":
    asyncio.run(evaluate_agent_goal_accuracy_without_reference())
Output
1.0