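Evaluating Multi-Turn Conversations

This tutorial is inspired by Hamel's notes on evaluating multi-turn conversations in LLM-based applications. The goal is to build a simple, actionable evaluation framework with Ragas metrics that states clearly what makes a conversation successful. By the end of this tutorial, you will be able to run multi-turn evaluations based on insights gathered from the error analysis of your AI application.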

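Ragas Metrics

Ragas provides AspectCritic, an evaluation metric for assessing multi-turn conversations with a binary outcome. It helps determine whether a conversation meets predefined success criteria.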

AspectCritic
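AspectCritic evaluates responses against a predefined aspect written in free-form natural language. It returns a binary output indicating whether the response aligns with the defined aspect.

This metric is in line with Hamel's advice to focus on binary decisions, which removes ambiguity and gives a clear, actionable way to improve conversation quality.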
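
To illustrate why binary verdicts are easy to act on, here is a minimal sketch in plain Python (no Ragas calls) that turns a set of 0/1 verdicts into a pass rate and a list of conversations to review. The verdict values and the helper name `summarize_verdicts` are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Tuple


def summarize_verdicts(verdicts: Dict[str, int]) -> Tuple[float, List[str]]:
    """Return (pass_rate, failed_ids) for a mapping of conversation id -> 0/1 verdict."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    for conv_id, verdict in verdicts.items():
        if verdict not in (0, 1):
            raise ValueError(f"verdict for {conv_id!r} must be 0 or 1, got {verdict!r}")
    # A verdict of 0 means the conversation failed the criterion.
    failed_ids = [conv_id for conv_id, verdict in verdicts.items() if verdict == 0]
    pass_rate = sum(verdicts.values()) / len(verdicts)
    return pass_rate, failed_ids


# Hypothetical verdicts, made up for illustration only.
example_verdicts = {"conv-1": 0, "conv-2": 1, "conv-3": 1, "conv-4": 0}
pass_rate, failed_ids = summarize_verdicts(example_verdicts)
print(f"pass rate: {pass_rate:.0%}")
print("conversations to review:", failed_ids)
```

With a binary outcome, every failing conversation is a concrete item to inspect, rather than a score whose meaning has to be interpreted first.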

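Practical Example: Evaluating a Banking Chatbot

When evaluating, focus on metrics that relate directly to your users' needs. Any change in a score should reflect a meaningful change in the user experience.

Consider an example in which you are building a chatbot for a bank.

After conducting error analysis, you find that the chatbot occasionally forgets a task it was asked to complete, or completes it only partially. To improve the chatbot's performance, you need a reliable way to measure and evaluate this behavior.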

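Note: When defining the scoring criteria, use standardized terminology.
- Refer to the user's message as the human message.
- Refer to the chatbot's message as the AI message.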
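
One way to apply the note above is to check a draft definition for informal role names before using it. The sketch below is only an illustration: the term mapping `PREFERRED_TERMS` and the helper `find_nonstandard_terms` are assumptions made for this example and are not part of Ragas.

```python
import re
from typing import List, Tuple

# Assumed mapping of informal terms to the standardized ones; extend as needed.
PREFERRED_TERMS = {
    "user": "human",
    "customer": "human",
    "chatbot": "AI",
    "assistant": "AI",
    "bot": "AI",
}


def find_nonstandard_terms(definition: str) -> List[Tuple[str, str]]:
    """Return (found_term, suggested_term) pairs for informal role names in a definition."""
    findings = []
    for term, suggestion in PREFERRED_TERMS.items():
        # Whole-word, case-insensitive match; the optional "s" covers simple plurals.
        if re.search(rf"\b{re.escape(term)}s?\b", definition, flags=re.IGNORECASE):
            findings.append((term, suggestion))
    return findings


draft = "Return 1 if the chatbot resolves every user request; otherwise, return 0."
revised = "Return 1 if the AI resolves every human request; otherwise, return 0."
print(find_nonstandard_terms(draft))    # informal terms found in the draft
print(find_nonstandard_terms(revised))  # nothing to flag
```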


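from dotenv import load_dotenv
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Load environment variables (such as OPENAI_API_KEY) from a local .env file.
load_dotenv()

# Wrap the LangChain chat model so Ragas can use it as the evaluator (judge) LLM.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))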


from ragas.metrics import AspectCritic
from ragas.dataset_schema import MultiTurnSample, EvaluationDataset
from ragas.messages import HumanMessage, AIMessage
from ragas import evaluate


sample1 = MultiTurnSample(
    user_input=[
        HumanMessage(
            content="I need to increase my credit limit and check why my last transaction at Walmart was declined."
        ),
        AIMessage(
            content="Let's check the declined transaction first, the $234.56 transaction at Walmart on November 20th was declined due to insufficient funds."
        ),
        HumanMessage(content="That's not possible, I had enough money in my account."),
        AIMessage(
            content="I understand your concern. Let me check your account details... I see that while your current balance is $1,500, there was a pending hotel reservation hold of $800 at the time of the Walmart transaction, which reduced your available balance to $700."
        ),
        HumanMessage(content="Oh, I forgot about the hotel booking."),
        AIMessage(content="Is there anything else I can assist you with?"),
        HumanMessage(content="What about increasing my credit limit?"),
        AIMessage(
            content="Based on your account history and current credit score of 740, you're eligible for a credit limit increase. The maximum increase available is $2,000. However, I notice you had three declined transactions this month, so I'd recommend reviewing your spending patterns before accepting an increase."
        ),
    ]
)

sample2 = MultiTurnSample(
    user_input=[
        HumanMessage(
            content="I need to check why my last transaction at Target was declined."
        ),
        AIMessage(
            content="Let's look at the declined transaction. The $145.30 transaction at Target on January 5th was declined due to a suspected fraud alert."
        ),
        HumanMessage(content="That’s strange, I didn’t report anything suspicious."),
        AIMessage(
            content="I understand. Let me look deeper. It seems a fraud prevention team flagged your recent purchase at a different store for unusually high activity, which triggered the decline at Target as well."
        ),
        HumanMessage(content="Ah, that makes sense. I did shop a lot that day."),
        AIMessage(content="Is there anything else I can assist you with?"),
    ]
)


definition = "Return 1 if the AI completes all Human requests fully without any rerequests; otherwise, return 0."

aspect_critic = AspectCritic(
    name="forgetfulness_aspect_critic",
    definition=definition,
    llm=evaluator_llm,
)


result = evaluate(
    dataset=EvaluationDataset(samples=[sample1, sample2]),
    metrics=[aspect_critic],
)

result.to_pandas()

Output


Evaluating: 100%|██████████| 2/2 [00:00<?, ?it/s]

	user_input	forgetfulness_aspect_critic
0	[{'content': 'I need to increase my credit lim...	0
1	[{'content': 'I need to check why my last tran...	1

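When you evaluate with LLM-based metrics, each metric may make one or more LLM calls. The evaluation traces can help you understand the results and diagnose any issues. You can learn more about this process by visiting this page.

Another pattern found during error analysis is that your banking chatbot sometimes drifts from discussing basic account services into giving unauthorized investment advice. To maintain user trust and ensure regulatory compliance, you want the system to transition gracefully when a conversation approaches these boundaries. You can check for this by defining a metric such as the following.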


definition = "Return 1 if the AI stays within the banking domain, providing services like account-related information, credit card details, or general banking advice, and avoids offering financial or investment advice; otherwise, return 0"

aspect_critic = AspectCritic(
    name="Banking Compliance Metric(aspect critic)",
    definition=definition,
    llm=evaluator_llm,
)


sample1 = MultiTurnSample(
    user_input=[
        HumanMessage(
            content="I need to increase my credit limit and check why my last transaction at Walmart was declined."
        ),
        AIMessage(
            content="Let's check the declined transaction first, the $234.56 transaction at Walmart on November 20th was declined due to insufficient funds."
        ),
        HumanMessage(content="That's not possible, I had enough money in my account."),
        AIMessage(
            content="I understand your concern. Let me check your account details... I see that while your current balance is $1,500, there was a pending hotel reservation hold of $800 at the time of the Walmart transaction, which reduced your available balance to $700."
        ),
        HumanMessage(content="Oh, I forgot about the hotel booking."),
        AIMessage(content="Is there anything else I can assist you with?"),
        HumanMessage(content="What about increasing my credit limit?"),
        AIMessage(
            content="Based on your account history and current credit score of 740, you're eligible for a credit limit increase. The maximum increase available is $2,000. However, I notice you had three declined transactions this month, so I'd recommend reviewing your spending patterns before accepting an increase."
        ),
        HumanMessage(
            content="What investment options do you recommend with my spare money?"
        ),
        AIMessage(
            content="While I can help you with banking transactions and credit card matters, I'm not authorized to provide investment advice. I'd be happy to connect you with our certified financial advisor who can help you explore investment options."
        ),
    ]
)

sample2 = MultiTurnSample(
    user_input=[
        HumanMessage(content="Hi there!"),
        AIMessage(content="Hello! How can I help you today?"),
        HumanMessage(content="How much money do I have in my bank account?"),
        AIMessage(
            content="Your current balance is ₹85,750. Let me know if you’d like details on recent transactions or account services!"
        ),
        HumanMessage(
            content="Which are better financial instruments for wealth generation—crypto like Bitcoin or stocks?"
        ),
        AIMessage(
            content="Both stocks and cryptocurrencies have the potential for wealth generation, but they come with different risk levels. Stocks are usually more stable, linked to company performance, and can offer long-term growth. Cryptocurrencies, like Bitcoin, are highly volatile but can deliver quick, high returns. Diversification might be a smart approach based on your risk tolerance and goals."
        ),
        HumanMessage(content="Interesting! Thanks for the advice."),
    ]
)


result = evaluate(
    dataset=EvaluationDataset(samples=[sample1, sample2]),
    metrics=[aspect_critic],
)

result.to_pandas()

Output


Evaluating: 100%|██████████| 2/2 [00:00<?, ?it/s]

	user_input	Banking Compliance Metric(aspect critic)
0	[{'content': 'I need to increase my credit lim...	1
1	[{'content': 'Hi there!', 'type': 'human'}, {'...	0

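Tone

In this section, we explore how to evaluate whether your chatbot maintains a consistent tone across regions and cultures, one of the most challenging aspects of multilingual deployment.

What is considered polite in one culture may be perceived differently in another. For example, politeness in Japan is usually expressed through a formal, indirect, and respectful tone, whereas in Mexico it is usually conveyed through a warm, friendly, and engaging tone.

To make sure your chatbot adapts to these cultural nuances, you can define custom evaluation metrics that assess whether the tone matches the expectations of each target audience.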
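
When more than a couple of audiences are involved, the per-audience definitions can be generated from a small table instead of being written by hand. The following is a minimal sketch in plain Python; the locale codes, the tone descriptions in `TONE_BY_LOCALE`, and the helper `build_tone_definition` are assumptions made for illustration.

```python
# Assumed locale codes and tone descriptions; adapt them to your own deployments.
TONE_BY_LOCALE = {
    "ja-JP": "formal, polite, and respectful",
    "es-MX": "warm, friendly, and engaging",
}


def build_tone_definition(locale: str) -> str:
    """Build a binary rubric string for the tone expected in the given locale."""
    try:
        tone = TONE_BY_LOCALE[locale]
    except KeyError:
        raise ValueError(f"no tone configured for locale {locale!r}") from None
    return (
        f"Return 1 if the AI maintains a {tone} tone throughout the conversation; "
        "otherwise, return 0."
    )


# One definition string per locale, all sharing the same binary wording.
definitions = {locale: build_tone_definition(locale) for locale in TONE_BY_LOCALE}
for locale, definition in definitions.items():
    print(locale, "->", definition)
```

Each generated string can then be passed as the `definition` argument of an `AspectCritic`, in the same way as the other metrics in this tutorial.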


japanese_polite_definition = "Return 1 if the AI maintains a formal, polite, and respectful tone throughout the conversation, using indirect expressions and avoiding overly casual or direct language; otherwise, return 0."

mexican_polite_definition = "Return 1 if the AI maintains a warm, friendly, and engaging tone, blending informality with respect and fostering a personable, approachable interaction; otherwise, return 0."

jpn_polite_aspect_critic = AspectCritic(
    name="japanese_polite_aspect_critic",
    definition=japanese_polite_definition,
    llm=evaluator_llm,
)

mex_polite_aspect_critic = AspectCritic(
    name="mexican_polite_aspect_critic",
    definition=mexican_polite_definition,
    llm=evaluator_llm,
)


sample1 = MultiTurnSample(
    user_input=[
        HumanMessage(content="I forgot my PIN. What should I do?"),
        AIMessage(
            content="I can assist you in resetting your PIN. May I send a verification code to your registered phone number?"
        ),
        HumanMessage(content="Yes, please."),
        AIMessage(content="I have sent the code. Please check and enter it here."),
        HumanMessage(content="I have entered it."),
        AIMessage(content="Thank you. Please set a new PIN."),
        HumanMessage(content="I have set it."),
        AIMessage(
            content="Your PIN reset is complete. Thank you for choosing our bank."
        ),
    ]
)

sample2 = MultiTurnSample(
    user_input=[
        HumanMessage(content="Help! I forgot my PIN."),
        AIMessage(
            content="Don't worry! I’ll help you reset it. Do you want me to send a verification code to your registered number?"
        ),
        HumanMessage(content="Yes, please."),
        AIMessage(content="Done! Check your phone and enter the code here. "),
        HumanMessage(content="I have entered it."),
        AIMessage(content="Great! Now choose a new PIN and keep it in a safe place."),
        HumanMessage(content="Done."),
        AIMessage(content="Perfect! All set. If you need more help, I’ll be here."),
    ]
)


result = evaluate(
    dataset=EvaluationDataset(samples=[sample1, sample2]),
    metrics=[jpn_polite_aspect_critic, mex_polite_aspect_critic],
)

result.to_pandas()

Output


Evaluating: 100%|██████████| 4/4 [00:00<?, ?it/s]

	user_input	japanese_polite_aspect_critic	mexican_polite_aspect_critic
0	[{'content': 'I forgot my PIN. What should I d...	1	1
1	[{'content': 'Help! I forgot my PIN.', 'type':...	0	1

The evaluation results above show that what is considered polite in Mexico may not be considered polite in Japan.
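
To surface such cases automatically, you can compare the two metric columns row by row. This sketch uses plain Python dictionaries holding the verdicts from the output table above; the `sample` key and the helper `find_disagreements` are illustrative assumptions.

```python
from typing import Dict, List

# Verdicts copied from the output table above; "sample" is the row index.
rows = [
    {"sample": 0, "japanese_polite_aspect_critic": 1, "mexican_polite_aspect_critic": 1},
    {"sample": 1, "japanese_polite_aspect_critic": 0, "mexican_polite_aspect_critic": 1},
]


def find_disagreements(rows: List[Dict[str, int]], metric_a: str, metric_b: str) -> List[int]:
    """Return the sample indices where the two binary metrics give different verdicts."""
    return [row["sample"] for row in rows if row[metric_a] != row[metric_b]]


disagreements = find_disagreements(
    rows, "japanese_polite_aspect_critic", "mexican_polite_aspect_critic"
)
print(disagreements)  # [1]
```

Conversations on which the metrics disagree are the ones whose tone suits one audience but not the other, so they are good candidates for locale-specific responses.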

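Checking the Brand Tone

In this section, we explore how to evaluate whether the chatbot's tone is consistent with the business's values, target audience, and overall brand identity.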

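What is a brand tone of voice? A brand's tone of voice is the choice of words it uses when communicating with its audience, in writing or in speech. By defining a distinctive tone of voice, a brand can develop an authentic personality, style, and attitude.
Reference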

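For example

Google: an informative and helpful brand voice
Have you noticed how simple and intuitive everything feels when you use Google products? Yet as soon as you switch to another tool, things feel more complicated. This seamless experience comes from Google's command of its brand voice.

Google keeps a friendly and approachable tone while making sure its communication with users is clear and concise. Its brand voice as a whole centers on being helpful, clear, and accessible, which makes its products intuitive for everyone.
Reference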

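You can assess whether your chatbot's responses are consistent with your brand identity by defining a custom evaluation metric such as the following.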


definition = "Return 1 if the AI's communication is friendly, approachable, helpful, clear, and concise; otherwise, return 0."

aspect_critic = AspectCritic(
    name="Brand Voice Metric(aspect critic)",
    definition=definition,
    llm=evaluator_llm,
)


sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Hi! I want to apply for a credit card."),
        AIMessage(
            content="Sure! We’ve got a few great options. What kind of benefits are you interested in?"
        ),
        HumanMessage(content="Travel rewards, mostly."),
        AIMessage(
            content="Perfect! Our Travel Rewards Card gives you miles for every dollar spent on flights and hotels. Want to apply now?"
        ),
        HumanMessage(content="Yes, please."),
        AIMessage(
            content="Awesome! Here’s the link to apply. It’s quick and easy. Let me know if you need help!"
        ),
    ]
)


result = evaluate(
    dataset=EvaluationDataset(samples=[sample]),
    metrics=[aspect_critic],
)

result.to_pandas()

Output


Evaluating:   100%|██████████| 1/1 [00:00<?, ?it/s]

	user_input	Brand Voice Metric(aspect critic)
0	[{'content': 'Hi! I want to apply for a credit...	1