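Evaluating Multi-Turn Conversations

This tutorial is inspired by Hamel's notes on evaluating multi-turn conversations for LLM-based applications. The goal is to build a simple, actionable evaluation framework with Ragas metrics that clearly defines what makes a conversation successful. By the end of this tutorial, you will be able to run multi-turn evaluations based on insights gathered from the error analysis of your AI application.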

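Ragas Metrics

Ragas provides AspectCritic, an evaluation metric that assesses multi-turn conversations with a binary outcome. It helps determine whether a conversation meets predefined success criteria.

AspectCritic evaluates responses against a predefined aspect written in free-form natural language. It returns a binary output indicating whether the response aligns with the defined aspect.

This metric aligns with Hamel's suggestion to focus on binary decisions, which eliminate ambiguity and give a clear, actionable way to improve conversation quality.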
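
The definitions used in this tutorial all share one shape: "Return 1 if <condition>; otherwise, return 0." The sketch below shows one way to keep that shape consistent across metrics. The helper name `binary_definition` is hypothetical and is not part of Ragas; it only builds the string that you would pass as the `definition` argument.

```python
def binary_definition(condition: str) -> str:
    """Wrap a success condition in the binary 'Return 1 if ...' template.

    Hypothetical helper (not a Ragas API): it only formats a string.
    """
    # Normalize the condition so the template punctuation stays consistent.
    condition = condition.strip().rstrip(".")
    if not condition:
        raise ValueError("condition must not be empty")
    return f"Return 1 if {condition}; otherwise, return 0."


definition = binary_definition(
    "the AI completes all Human requests fully without any rerequests"
)
print(definition)
```

Keeping every rubric in the same pass/fail form makes it harder to accidentally write a graded or open-ended criterion.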

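Practical Example – Evaluating a Banking Chatbot

When evaluating, focus on metrics that directly align with your users' needs. Any change in the score should reflect a meaningful impact on the user experience.

Let's consider an example where you are building a chatbot for a bank.

After conducting error analysis, you find that the chatbot occasionally forgets tasks it was asked to complete, or completes them only partially. To improve the chatbot's performance, you need a reliable way to measure and evaluate this behavior.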

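Note: When defining the scoring criteria, use standardized terminology.
- Refer to the user's message as the human message.
- Refer to the chatbot's message as the AI message.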
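
As a rough illustration of the note above, the following sketch flags rubric text that uses non-standard role names. The function name and the word list are assumptions made for this example, not something Ragas provides; adjust the list to your own vocabulary.

```python
import re
from typing import List

# Assumed list of terms to avoid in rubrics; "Human" and "AI" are the
# standardized names recommended above.
NON_STANDARD_TERMS = ("user", "customer", "chatbot", "assistant", "bot")


def find_non_standard_terms(definition: str) -> List[str]:
    """Return the non-standard role terms that appear in a rubric definition."""
    found = []
    for term in NON_STANDARD_TERMS:
        # Whole-word, case-insensitive match; also catches a plural "s".
        if re.search(rf"\b{term}s?\b", definition, flags=re.IGNORECASE):
            found.append(term)
    return found


good = "Return 1 if the AI completes all Human requests fully without any rerequests; otherwise, return 0."
bad = "Return 1 if the chatbot answers the user; otherwise, return 0."

print(find_non_standard_terms(good))
print(find_non_standard_terms(bad))
```

A check like this can run before an evaluation starts, so that a rubric written with mixed terminology is caught early.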

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Load environment variables (such as OPENAI_API_KEY) from a local .env file.
load_dotenv()

# Wrap the LangChain chat model so Ragas can use it as the evaluator LLM.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset, MultiTurnSample
from ragas.messages import AIMessage, HumanMessage
from ragas.metrics import AspectCritic


sample1 = MultiTurnSample(
    user_input=[
        HumanMessage(
            content="I need to increase my credit limit and check why my last transaction at Walmart was declined."
        ),
        AIMessage(
            content="Let's check the declined transaction first, the $234.56 transaction at Walmart on November 20th was declined due to insufficient funds."
        ),
        HumanMessage(content="That's not possible, I had enough money in my account."),
        AIMessage(
            content="I understand your concern. Let me check your account details... I see that while your current balance is $1,500, there was a pending hotel reservation hold of $800 at the time of the Walmart transaction, which reduced your available balance to $700."
        ),
        HumanMessage(content="Oh, I forgot about the hotel booking."),
        AIMessage(content="Is there anything else I can assist you with?"),
        HumanMessage(content="What about increasing my credit limit?"),
        AIMessage(
            content="Based on your account history and current credit score of 740, you're eligible for a credit limit increase. The maximum increase available is $2,000. However, I notice you had three declined transactions this month, so I'd recommend reviewing your spending patterns before accepting an increase."
        ),
    ]
)

sample2 = MultiTurnSample(
    user_input=[
        HumanMessage(
            content="I need to check why my last transaction at Target was declined."
        ),
        AIMessage(
            content="Let's look at the declined transaction. The $145.30 transaction at Target on January 5th was declined due to a suspected fraud alert."
        ),
        HumanMessage(content="That’s strange, I didn’t report anything suspicious."),
        AIMessage(
            content="I understand. Let me look deeper. It seems a fraud prevention team flagged your recent purchase at a different store for unusually high activity, which triggered the decline at Target as well."
        ),
        HumanMessage(content="Ah, that makes sense. I did shop a lot that day."),
        AIMessage(content="Is there anything else I can assist you with?"),
    ]
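)

# Binary rubric: pass only if every Human request is fully completed without the Human having to ask again.
definition = "Return 1 if the AI completes all Human requests fully without any rerequests; otherwise, return 0."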

aspect_critic = AspectCritic(
    name="forgetfulness_aspect_critic",
    definition=definition,
    llm=evaluator_llm,
)

result = evaluate(
    dataset=EvaluationDataset(samples=[sample1, sample2]),
    metrics=[aspect_critic],
)

result.to_pandas()
Output
Evaluating: 100%|██████████| 2/2 [00:00<?, ?it/s]

   user_input                                                       forgetfulness_aspect_critic
0  [{'content': 'I need to increase my credit limit...'...          0
1  [{'content': 'I need to check why my last transaction...'...     1
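
Because each score is binary, a batch of results can be summarized as a pass rate plus the list of failing conversations. The sketch below is a minimal example, assuming the scores have already been pulled out as a plain list of 0/1 values (for instance from the metric's column in the DataFrame above). The helper `summarize_binary_scores` is hypothetical and not part of Ragas.

```python
from typing import Dict, List


def summarize_binary_scores(scores: List[int]) -> Dict[str, object]:
    """Summarize 0/1 metric scores as a pass rate and the failing row indices."""
    if any(score not in (0, 1) for score in scores):
        raise ValueError("scores must be 0 or 1")
    # Indices of conversations that failed the rubric, for manual review.
    failed = [index for index, score in enumerate(scores) if score == 0]
    pass_rate = sum(scores) / len(scores) if scores else 0.0
    return {"pass_rate": pass_rate, "failed_indices": failed}


# Scores taken from the output table above: sample 0 failed, sample 1 passed.
summary = summarize_binary_scores([0, 1])
print(summary)
```

The failing indices point back at the conversations worth reading in full during the next round of error analysis.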

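When evaluating with LLM-based metrics, each metric may involve one or more calls to the LLM. The traces of an evaluation can provide insight into the results and help diagnose issues. You can find more details on this process by visiting this page.

Another pattern identified during error analysis is that your banking chatbot occasionally drifts from discussing basic account services into offering unauthorized investment advice. To maintain user trust and ensure regulatory compliance, you want the system to transition gracefully when the conversation approaches these boundaries. You can achieve this by defining a metric like the one below.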

definition = "Return 1 if the AI stays within the banking domain, providing services like account-related information, credit card details, or general banking advice, and avoids offering financial or investment advice; otherwise, return 0"

aspect_critic = AspectCritic(
    name="Banking Compliance Metric(aspect critic)",
    definition=definition,
    llm=evaluator_llm,
)

# Sample 1: the AI declines the investment question and offers a handoff to a financial advisor.
sample1 = MultiTurnSample(
    user_input=[
        HumanMessage(
            content="I need to increase my credit limit and check why my last transaction at Walmart was declined."
        ),
        AIMessage(
            content="Let's check the declined transaction first, the $234.56 transaction at Walmart on November 20th was declined due to insufficient funds."
        ),
        HumanMessage(content="That's not possible, I had enough money in my account."),
        AIMessage(
            content="I understand your concern. Let me check your account details... I see that while your current balance is $1,500, there was a pending hotel reservation hold of $800 at the time of the Walmart transaction, which reduced your available balance to $700."
        ),
        HumanMessage(content="Oh, I forgot about the hotel booking."),
        AIMessage(content="Is there anything else I can assist you with?"),
        HumanMessage(content="What about increasing my credit limit?"),
        AIMessage(
            content="Based on your account history and current credit score of 740, you're eligible for a credit limit increase. The maximum increase available is $2,000. However, I notice you had three declined transactions this month, so I'd recommend reviewing your spending patterns before accepting an increase."
        ),
        HumanMessage(
            content="What investment options do you recommend with my spare money?"
        ),
        AIMessage(
            content="While I can help you with banking transactions and credit card matters, I'm not authorized to provide investment advice. I'd be happy to connect you with our certified financial advisor who can help you explore investment options."
        ),
    ]
)

sample2 = MultiTurnSample(
    user_input=[
        HumanMessage(content="Hi there!"),
        AIMessage(content="Hello! How can I help you today?"),
        HumanMessage(content="How much money do I have in my bank account?"),
        AIMessage(
            content="Your current balance is ₹85,750. Let me know if you’d like details on recent transactions or account services!"
        ),
        HumanMessage(
            content="Which are better financial instruments for wealth generation—crypto like Bitcoin or stocks?"
        ),
        AIMessage(
            content="Both stocks and cryptocurrencies have the potential for wealth generation, but they come with different risk levels. Stocks are usually more stable, linked to company performance, and can offer long-term growth. Cryptocurrencies, like Bitcoin, are highly volatile but can deliver quick, high returns. Diversification might be a smart approach based on your risk tolerance and goals."
        ),
        HumanMessage(content="Interesting! Thanks for the advice."),
    ]
)

result = evaluate(
    dataset=EvaluationDataset(samples=[sample1, sample2]),
    metrics=[aspect_critic],
)

result.to_pandas()
Output
Evaluating: 100%|██████████| 2/2 [00:00<?, ?it/s]

   user_input                                                  Banking Compliance Metric(aspect critic)
0  [{'content': 'I need to increase my credit limit...'...     1
1  [{'content': 'Hi there!', 'type': 'human'}, {'...           0

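Tone

In this section, we will explore how to evaluate whether your chatbot maintains a consistent tone across different regions and cultures, one of the most challenging aspects of multilingual deployment.

What is considered polite in one culture may be perceived differently in another. For example, politeness in Japan is often expressed with a formal, indirect, and respectful tone, while in Mexico it is generally conveyed with a warm, friendly, and engaging tone.

To ensure your chatbot adapts to these cultural nuances, you can define custom evaluation metrics that assess whether its tone aligns with the expectations of each target audience.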

japanese_polite_definition = "Return 1 if the AI maintains a formal, polite, and respectful tone throughout the conversation, using indirect expressions and avoiding overly casual or direct language; otherwise, return 0."

mexican_polite_definition = "Return 1 if the AI maintains a warm, friendly, and engaging tone, blending informality with respect and fostering a personable, approachable interaction; otherwise, return 0."

jpn_polite_aspect_critic = AspectCritic(
    name="japanese_polite_aspect_critic",
    definition=japanese_polite_definition,
    llm=evaluator_llm,
)

mex_polite_aspect_critic = AspectCritic(
    name="mexican_polite_aspect_critic",
    definition=mexican_polite_definition,
    llm=evaluator_llm,
)

# Sample 1: a formal, reserved exchange about resetting a PIN.
sample1 = MultiTurnSample(
    user_input=[
        HumanMessage(content="I forgot my PIN. What should I do?"),
        AIMessage(
            content="I can assist you in resetting your PIN. May I send a verification code to your registered phone number?"
        ),
        HumanMessage(content="Yes, please."),
        AIMessage(content="I have sent the code. Please check and enter it here."),
        HumanMessage(content="I have entered it."),
        AIMessage(content="Thank you. Please set a new PIN."),
        HumanMessage(content="I have set it."),
        AIMessage(
            content="Your PIN reset is complete. Thank you for choosing our bank."
        ),
    ]
)

sample2 = MultiTurnSample(
    user_input=[
        HumanMessage(content="Help! I forgot my PIN."),
        AIMessage(
            content="Don't worry! I’ll help you reset it. Do you want me to send a verification code to your registered number?"
        ),
        HumanMessage(content="Yes, please."),
        AIMessage(content="Done! Check your phone and enter the code here. "),
        HumanMessage(content="I have entered it."),
        AIMessage(content="Great! Now choose a new PIN and keep it in a safe place."),
        HumanMessage(content="Done."),
        AIMessage(content="Perfect! All set. If you need more help, I’ll be here."),
    ]
)

result = evaluate(
    dataset=EvaluationDataset(samples=[sample1, sample2]),
    metrics=[jpn_polite_aspect_critic, mex_polite_aspect_critic],
)

result.to_pandas()
Output
Evaluating: 100%|██████████| 4/4 [00:00<?, ?it/s]

   user_input                                                  japanese_polite_aspect_critic  mexican_polite_aspect_critic
0  [{'content': 'I forgot my PIN. What should I do...'...      1                              1
1  [{'content': 'Help! I forgot my PIN.', 'type':...           0                              1

The evaluation results above highlight that what is considered polite in Mexico may not be considered polite in Japan.
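
In practice, each conversation would usually be judged only against the metric for its own target audience. The sketch below shows one possible way to do that lookup, assuming each result row is a plain dictionary keyed by metric name. The locale tags and the helper `passes_for_locale` are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical mapping from a deployment locale to the tone metric that applies.
LOCALE_METRIC = {
    "ja-JP": "japanese_polite_aspect_critic",
    "es-MX": "mexican_polite_aspect_critic",
}


def passes_for_locale(row: Dict[str, int], locale: str) -> bool:
    """Return True if the row passed the tone metric of the given locale."""
    metric_name = LOCALE_METRIC[locale]
    return row[metric_name] == 1


# Scores copied from the output table above.
rows = [
    {"japanese_polite_aspect_critic": 1, "mexican_polite_aspect_critic": 1},
    {"japanese_polite_aspect_critic": 0, "mexican_polite_aspect_critic": 1},
]

for index, row in enumerate(rows):
    verdicts = {locale: passes_for_locale(row, locale) for locale in LOCALE_METRIC}
    print(index, verdicts)
```

Read this way, the second conversation is acceptable for the Mexican deployment but would be flagged for the Japanese one.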

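Checking for Brand Tone

In this section, we will explore how to evaluate whether the chatbot's tone is consistent with the business's values, target audience, and overall brand identity.

What is a brand tone of voice? A brand's tone of voice refers to its choice of words when communicating with its audience in written or spoken interactions. By defining a distinctive tone of voice, a brand can develop an authentic personality, style, and attitude. Reference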

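For example:

Google – Informative and Helpful Brand Voice. Have you noticed how simple and intuitive everything feels when you use Google products? But as soon as you switch to another tool, things suddenly feel more complicated. This seamless experience results from Google's mastery of its brand voice.

Google maintains a friendly and approachable tone while keeping its communication with users clear and concise. Its entire brand voice revolves around being helpful, clear, and accessible, which makes its products intuitive for everyone. Reference

You can assess whether your chatbot's responses align with your brand identity by defining a custom evaluation metric like the one below.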

definition = "Return 1 if the AI's communication is friendly, approachable, helpful, clear, and concise; otherwise, return 0."

aspect_critic = AspectCritic(
    name="Brand Voice Metric(aspect critic)",
    definition=definition,
    llm=evaluator_llm,
)

# A short credit card application conversation to score against the brand voice rubric.
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Hi! I want to apply for a credit card."),
        AIMessage(
            content="Sure! We’ve got a few great options. What kind of benefits are you interested in?"
        ),
        HumanMessage(content="Travel rewards, mostly."),
        AIMessage(
            content="Perfect! Our Travel Rewards Card gives you miles for every dollar spent on flights and hotels. Want to apply now?"
        ),
        HumanMessage(content="Yes, please."),
        AIMessage(
            content="Awesome! Here’s the link to apply. It’s quick and easy. Let me know if you need help!"
        ),
    ]
)

result = evaluate(
    dataset=EvaluationDataset(samples=[sample]),
    metrics=[aspect_critic],
)

result.to_pandas()
Output
Evaluating:   100%|██████████| 1/1 [00:00<?, ?it/s]

   user_input                                                  Brand Voice Metric(aspect critic)
0  [{'content': 'Hi! I want to apply for a credit card...'...  1
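
If the brand voice is maintained as a list of traits (for example in a style guide), the rubric text can be generated from that list rather than written by hand. The sketch below assumes such a list exists. The helper `brand_voice_definition` is hypothetical and only produces the definition string.

```python
from typing import List


def brand_voice_definition(traits: List[str]) -> str:
    """Build a binary brand voice rubric from a list of trait adjectives."""
    if not traits:
        raise ValueError("at least one trait is required")
    # Join the traits as natural English: "a", "a and b", or "a, b, and c".
    if len(traits) == 1:
        listed = traits[0]
    elif len(traits) == 2:
        listed = f"{traits[0]} and {traits[1]}"
    else:
        listed = ", ".join(traits[:-1]) + f", and {traits[-1]}"
    return f"Return 1 if the AI's communication is {listed}; otherwise, return 0."


definition = brand_voice_definition(
    ["friendly", "approachable", "helpful", "clear", "concise"]
)
print(definition)
```

With these five traits the generated text matches the definition used for the brand voice metric above, so the rubric and the style guide stay in sync when traits change.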