
Benchmarking Gemini Models Using Ragas

In this tutorial, we will benchmark Gemini models on an academic question-answering task using Ragas metrics and AllenAI's QASPER dataset.

About the Dataset

QASPER (Question Answering over Scientific Papers) is a dataset of 5,049 questions over 1,585 natural language processing research papers. Annotators created the questions from the titles and abstracts of the papers, and a separate group of annotators answered them using the full text of each paper.

Data Collection Process

  1. Paper selection: NLP papers from arXiv (in LaTeX format) were selected from the S2ORC corpus.
  2. Question writing: Annotators wrote realistic, information-seeking questions based only on the paper titles and abstracts.
  3. Answer annotation: A different set of annotators reviewed the entire paper to find the answers and selected the minimal relevant evidence (text, tables, figures).

[Figure: Data collection process of the QASPER dataset]

For a link to the dataset and more details about QASPER, see here.

Loading the Dataset

For demonstration purposes, we will use a subset of 10 examples from the validation split.

from datasets import load_dataset
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

dataset = load_dataset("allenai/qasper", split="validation[:10]")
dataset
Output
Dataset({
    features: ['id', 'title', 'abstract', 'full_text', 'qas', 'figures_and_tables'],
    num_rows: 10
})
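
Before processing, it can help to peek at how a single sample is laid out. A quick, optional inspection (the field names match the ones used in the processing code below):

sample = dataset[0]

print(sample["title"])
print(sample["full_text"].keys())     # dict with "section_name" and "paragraphs"
print(sample["qas"]["question"][:2])  # first two questions for this paper
print(len(sample["qas"]["answers"]))  # one answer set per question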

Processing the Dataset

Since our goal is to benchmark the models on an academic question-answering task, we need answers that the LLM generates from the full text of each research paper. We extract the full text from the dataset's "full_text" column and format it as markdown, organizing it cleanly into sections and paragraphs for better readability and context.

To create question-answer pairs for evaluation, we use the dataset's "qas" column. This column provides questions paired with answers in one of three formats: extractive spans, yes/no responses, or free-form answers. We then combine these into a single "golden answer" column, which serves as the ground truth for evaluating model performance.
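
For intuition, a single answer record from the "qas" column looks roughly like the dictionary below (the field names are the ones used later; the values are invented for illustration), and the golden answer is simply the non-empty parts joined together:

# Hypothetical answer record, shaped like sample["qas"]["answers"][i]["answer"][0];
# the values are made up for illustration.
example_answer = {
    "unanswerable": False,
    "extractive_spans": ["Europarl", "MultiUN"],
    "yes_no": None,
    "free_form_answer": "",
    "evidence": ["We evaluate our cross-lingual pre-training based ..."],
    "highlighted_evidence": ["We evaluate our cross-lingual pre-training based ..."],
}
# After cleaning, its golden answer would be the single string "Europarl\nMultiUN".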

def convert_full_text_to_markdown(full_text_dict):
    """
    Converts a full_text dictionary into a markdown-formatted string.

    Expected keys:
      - "section_name": list of section titles.
      - "paragraphs": list of lists of paragraphs corresponding to each section.

    Each section becomes a markdown header (##) followed by its paragraphs.
    """
    sections = full_text_dict.get("section_name", [])
    paragraphs = full_text_dict.get("paragraphs", [])

    markdown_lines = []
    for section, paragraph in zip(sections, paragraphs):
        markdown_lines.append(f"## {section}")
        markdown_lines.append("")  # Blank line
        markdown_lines.append("\n".join(map(str, paragraph)))
        markdown_lines.append("")  # End of section
        markdown_lines.append("")  # Extra blank line for separation
    return "\n".join(markdown_lines)
def combine_responses(row):
    """
    Combines 'extractive_spans', 'yes_no', and 'free_form_answer'
    into one single string. Skips components that are missing.
    """
    responses = []
    if pd.notna(row.get("extractive_spans")):
        if isinstance(row["extractive_spans"], list):
            responses.append(" ".join(map(str, row["extractive_spans"])))
        else:
            responses.append(str(row["extractive_spans"]))
    if pd.notna(row.get("yes_no")):
        responses.append(str(row["yes_no"]))
    if pd.notna(row.get("free_form_answer")):
        responses.append(str(row["free_form_answer"]))
    return "\n".join(responses) if responses else np.nan
def preprocess_hf_dataset(hf_ds):
    """
    Processes a HuggingFace dataset split into a cleaned Pandas DataFrame.

    Steps:
      1. For each sample, convert 'full_text' to a markdown string.
      2. For every QA pair in the sample, extract the question and first answer.
      3. Build lists for answers, questions, and full_text (duplicated per question).
      4. Create a DataFrame from the collected data.
      5. Clean columns by replacing empty lists/strings with NaN and joining lists.
      6. Combine the answer components into a single 'golden response'.

    The function uses nested tqdm progress bars for real-time feedback.

    Returns:
        pd.DataFrame: The preprocessed DataFrame.
    """
    answers_list = []  # Stores the first answer for each question
    questions_list = []  # Stores each question text
    full_text_list = []  # Stores the formatted full text per QA pair

    # Outer loop: iterate over samples with progress bar
    for sample in tqdm(hf_ds, desc="Processing samples", unit="sample"):
        # Convert full text once per sample
        formatted_text = convert_full_text_to_markdown(sample["full_text"])
        # Create a list of QA pairs
        qa_pairs = list(zip(sample["qas"]["question"], sample["qas"]["answers"]))

        # Inner loop: iterate over each QA pair with its own progress bar
        for question, answer_set in tqdm(
            qa_pairs, desc="Processing QAs", total=len(qa_pairs), leave=False, unit="qa"
        ):
            answers_list.append(answer_set["answer"][0])
            questions_list.append(question)
            full_text_list.append(formatted_text)

    # Create DataFrame from the collected data
    df = pd.DataFrame(answers_list)
    df["question"] = questions_list
    df["full_text"] = full_text_list

    # Data Cleaning: Replace empty lists/strings with NaN and join lists if needed
    df["extractive_spans"] = df["extractive_spans"].apply(
        lambda x: np.nan if isinstance(x, list) and len(x) == 0 else x
    )
    df["free_form_answer"] = df["free_form_answer"].apply(
        lambda x: np.nan if isinstance(x, str) and x.strip() == "" else x
    )
    df["yes_no"] = df["yes_no"].apply(lambda x: np.nan if x is None else x)
    df["extractive_spans"] = df["extractive_spans"].apply(
        lambda x: "\n".join(x) if isinstance(x, list) else x
    )

    # Combine the answer components into a single 'golden response'
    df["golden response"] = df.apply(lambda row: combine_responses(row), axis=1)

    return df

processed_dataset = preprocess_hf_dataset(dataset)
processed_dataset.head()
Processing samples: 100%|██████████| 10/10 [00:00<00:00, 208.37sample/s]

unanswerable extractive_spans yes_no free_form_answer evidence highlighted_evidence question full_text golden response
0 False BIBREF19\nBIBREF20 NaN NaN [Tables TABREF19 and TABREF26 report the zero-sh... [We compare our approach with related multiling... Which multilingual approaches do they compare... ## Introduction\n\nAlthough neural machine transla... BIBREF19\nBIBREF20
1 False pivoting\npivoting$_{\rm m}$ NaN NaN [Tables TABREF19 and TABREF26 report the zero-sh... [We compare our approach with related multiling... What are the pivot-based baselines? ## Introduction\n\nAlthough neural machine transla... pivoting\npivoting$_{\rm m}$
2 False Europarl\nMultiUN NaN NaN [We evaluate our cross-lingual pre-training ba... [We evaluate our cross-lingual pre-training ba... Which datasets did they experiment with? ## Introduction\n\nAlthough neural machine transla... Europarl\nMultiUN
3 False NaN NaN De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... [For the MultiUN corpus, we use four languages: En... [For the MultiUN corpus, we use four languages: En... What language pairs are explored? ## Introduction\n\nAlthough neural machine transla... De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...
4 False Stanford NER\nspaCy 2.0 \nrecurrent model with... NaN NaN [In this section we describe a set of experim... [In this section we describe a set of experim... Which NER models were evaluated? ## Introduction\n\nNamed entity recognition is... Stanford NER\nspaCy 2.0 \nrecurrent model with...

Generating Responses from the Gemini Models

To generate responses with the Gemini models, we first need to instantiate the Google GenAI client. We also define a prompt template to use when generating responses.

import os
from google import genai
from dotenv import load_dotenv

load_dotenv()

client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

qa_prompt = (
    f"Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "If you cannot find answer to the query, just say that it cannot be answered.\n"
    "Query: {query_str}\n"
    "Answer: "
)
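
As a quick sanity check, the template can be filled in with str.format before sending it to a model (the context and query below are placeholders; the real values come from the processed dataset):

# Placeholder values, purely for illustration.
print(
    qa_prompt.format(
        context_str="## Introduction\n\nToy paper text goes here ...",
        query_str="Which datasets did they experiment with?",
    )
)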

Gemini 2.0 Flash

Code for the AsyncExecutor
# async_executor.py
from __future__ import annotations
import asyncio
import time
import logging
from typing import Callable, Any, List, Tuple
from dataclasses import dataclass, field
import nest_asyncio
from tqdm import tqdm

# Apply nest_asyncio to allow nested event loops (e.g., in Jupyter)
nest_asyncio.apply()

logger = logging.getLogger(__name__)


def is_event_loop_running() -> bool:
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return False
    else:
        return loop.is_running()


class RateLimiter:
    """
    An asynchronous rate limiter that enforces a minimum interval between calls.
    For example, with max_calls_per_minute=1250, it ensures that calls are spaced by ~0.048 seconds.
    """

    def __init__(self, max_calls_per_minute: int):
        self.interval = 60.0 / max_calls_per_minute
        self.last_call = 0.0
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_call
            wait_time = self.interval - elapsed
            if wait_time > 0:
                await asyncio.sleep(wait_time)
            self.last_call = time.monotonic()


@dataclass
class AsyncExecutor:
    """
    An asynchronous executor similar in usage to the one in the evaluate function.

    Attributes:
        desc: Description for the progress bar.
        show_progress: Whether to display a progress bar.
        raise_exceptions: Whether to propagate exceptions.
        max_calls_per_minute: API rate limit to enforce.
    """

    desc: str = "Evaluating"
    show_progress: bool = True
    raise_exceptions: bool = False
    max_calls_per_minute: int = 1250
    jobs: List[Tuple[Callable[..., Any], tuple, dict, int]] = field(
        default_factory=list, repr=False
    )
    job_counter: int = 0
    rate_limiter: RateLimiter = field(init=False)

    def __post_init__(self):
        self.rate_limiter = RateLimiter(self.max_calls_per_minute)

    def wrap_callable_with_index(
        self, func: Callable[..., Any], index: int
    ) -> Callable[..., Any]:
        """
        Wraps an asynchronous callable so that it enforces the rate limit,
        and if an error occurs, it waits for an increasing delay (fallback)
        before retrying the function call indefinitely.
        """
        async def wrapped(*args, **kwargs) -> Tuple[int, Any]:
            retry_delay = 10  # initial delay in seconds
            while True:
                try:
                    # Enforce the API rate limit before executing the function
                    await self.rate_limiter.acquire()
                    result = await func(*args, **kwargs)
                    return index, result
                except Exception as e:
                    if self.raise_exceptions:
                        raise e
                    else:
                        logger.error(
                            "Error in job %d: %s. Retrying in %d seconds...",
                            index, e, retry_delay
                        )
                        # Wait asynchronously before retrying
                        await asyncio.sleep(retry_delay)
                        retry_delay += 5  # Increase delay for subsequent retries
        return wrapped

    def submit(self, func: Callable[..., Any], *args, **kwargs):
        """
        Submit an asynchronous job to the executor.
        """
        wrapped_func = self.wrap_callable_with_index(func, self.job_counter)
        self.jobs.append((wrapped_func, args, kwargs, self.job_counter))
        self.job_counter += 1

    async def _run_jobs(self) -> List[Any]:
        tasks = []
        # Create asyncio tasks for each job
        for wrapped_func, args, kwargs, index in self.jobs:
            tasks.append(asyncio.create_task(wrapped_func(*args, **kwargs)))

        results = [None] * len(tasks)
        if self.show_progress:
            pbar = tqdm(total=len(tasks), desc=self.desc)
            for completed in asyncio.as_completed(tasks):
                index, result = await completed
                results[index] = result
                pbar.update(1)
            pbar.close()
        else:
            for completed in asyncio.as_completed(tasks):
                index, result = await completed
                results[index] = result
        return results

    def results(self) -> List[Any]:
        """
        Execute all submitted asynchronous jobs and return their results
        in the order they were submitted.

        Thanks to nest_asyncio, this method can be used inside a Jupyter Notebook.
        """
        # If an event loop is already running, nest_asyncio allows asyncio.run() to work.
        return asyncio.run(self._run_jobs())

from async_executor import AsyncExecutor

async def query_gemini_2(query_str: str, context_str: str):
    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
    response = await client.aio.models.generate_content(
        model="gemini-2.0-flash", contents=formatted_prompt
    )
    return response.text

# Create an instance of the asynchronous executor
executor = AsyncExecutor(
    desc="LLM Processing",
    show_progress=True,
    raise_exceptions=False,
)

for idx in range(processed_dataset.shape[0]):
    query = processed_dataset.iloc[idx]["question"]
    context = processed_dataset.iloc[idx]["full_text"]
    executor.submit(query_gemini_2, query, context)

processed_dataset["gemini_2_flash_responses"] = executor.results()
LLM Processing: 100%|██████████| 30/30 [00:04<00:00,  7.20it/s]

Gemini 1.5 Flash

from async_executor import AsyncExecutor

async def query_gemini_1_5(query_str: str, context_str: str):
    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
    response = await client.aio.models.generate_content(
        model="gemini-1.5-flash", contents=formatted_prompt
    )
    return response.text

# Create a new instance of the asynchronous executor
executor = AsyncExecutor(
    desc="LLM Processing",
    show_progress=True,
    raise_exceptions=False,
)

for idx in range(processed_dataset.shape[0]):
    query = processed_dataset.iloc[idx]["question"]
    context = processed_dataset.iloc[idx]["full_text"]
    executor.submit(query_gemini_1_5, query, context)

processed_dataset["gemini_1_5_flash_responses"] = executor.results()
LLM Processing: 100%|██████████| 30/30 [00:05<00:00,  5.94it/s]

processed_dataset.head()
unanswerable extractive_spans yes_no free_form_answer evidence highlighted_evidence question full_text golden response gemini_2_flash_responses gemini_1_5_flash_responses
0 False BIBREF19\nBIBREF20 NaN NaN [Tables TABREF19 and TABREF26 report the zero-sh... [We compare our approach with related multiling... Which multilingual approaches do they compare... ## Introduction\n\nAlthough neural machine transla... BIBREF19\nBIBREF20 The text mentions comparisons with multilingu... The paper compares its method with multilingu...
1 False pivoting\npivoting$_{\rm m}$ NaN NaN [Tables TABREF19 and TABREF26 report the zero-sh... [We compare our approach with related multiling... What are the pivot-based baselines? ## Introduction\n\nAlthough neural machine transla... pivoting\npivoting$_{\rm m}$ The pivot-based baselines are pivoting and piv... The provided text mentions two types of pivot-...
2 False Europarl\nMultiUN NaN NaN [We evaluate our cross-lingual pre-training ba... [We evaluate our cross-lingual pre-training ba... Which datasets did they experiment with? ## Introduction\n\nAlthough neural machine transla... Europarl\nMultiUN They experimented with the Europarl and MultiU... The experiments used two public datasets: Euro...
3 False NaN NaN De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... [For the MultiUN corpus, we use four languages: En... [For the MultiUN corpus, we use four languages: En... What language pairs are explored? ## Introduction\n\nAlthough neural machine transla... De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... The language pairs explored in this paper are... This paper explores the following language pa...
4 False Stanford NER\nspaCy 2.0 \nrecurrent model with... NaN NaN [In this section we describe a set of experim... [In this section we describe a set of experim... Which NER models were evaluated? ## Introduction\n\nNamed entity recognition is... Stanford NER\nspaCy 2.0 \nrecurrent model with... Based on the provided text, the following NER... Stanford NER, spaCy 2.0, and a recurrent mode...

Defining the Metrics for Evaluation

We are benchmarking a question-answering task, and we want to make sure each question is answered correctly and accurately. To do this, we use the following Ragas metrics; you can find the complete list of metrics available in Ragas here.

  • Answer Accuracy: Measures how well a response matches the reference answer.
  • Answer Correctness: Assesses the alignment between the generated answer and the reference answer.
  • Factual Correctness: Checks whether all statements in a response are supported by the reference answer.

For each question, we know whether it can be answered from the provided context, and we want to see whether the model identifies this correctly. For this, we define a custom binary metric using AspectCritic.

from ragas.metrics import AnswerAccuracy, AnswerCorrectness, FactualCorrectness, AspectCritic
import getpass
import os

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

aspect_critic = AspectCritic(
    name="unanswerable",
    definition="Return 1 if the query cannot be answered by the provided context, otherwise return 0.",
    llm=evaluator_llm,
)

metrics = [
    AnswerAccuracy(llm=evaluator_llm),
    AnswerCorrectness(llm=evaluator_llm, weights=[1, 0]),
    aspect_critic,
    FactualCorrectness(llm=evaluator_llm),
]
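
If you want to sanity-check a metric on a single hand-written example before running the full evaluation, a sketch like the one below should work. It assumes the SingleTurnSample / single_turn_ascore API of recent Ragas versions and an async context such as a notebook:

# Optional sanity check on a single hand-written sample (assumes ragas >= 0.2).
from ragas.dataset_schema import SingleTurnSample

toy_sample = SingleTurnSample(
    user_input="Which datasets did they experiment with?",
    response="They experimented with Europarl and MultiUN.",
    reference="Europarl\nMultiUN",
)
# Top-level await works in a notebook; in a script, wrap this in asyncio.run().
score = await AnswerAccuracy(llm=evaluator_llm).single_turn_ascore(toy_sample)
print(score)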

Benchmarking on Ragas Metrics

We format the processed data into a Ragas EvaluationDataset and then apply the metrics to assess model performance; more information can be found here. We construct the EvaluationDataset from the questions and golden answers in our processed dataset together with the responses generated by the Gemini models.

Gemini 2.0 Flash

We will create an EvaluationDataset for Gemini 2.0 Flash.

from ragas.dataset_schema import EvaluationDataset

dataset_list = []

for i in range(processed_dataset.shape[0]):
    sample = {
        "user_input": (
            "" if pd.isna(processed_dataset.iloc[i].get("question")) else processed_dataset.iloc[i].get("question")
        ),
        "reference": (
            ""
            if pd.isna(processed_dataset.iloc[i].get("golden response"))
            else processed_dataset.iloc[i].get("golden response")
        ),
        "response": (
            ""
            if pd.isna(processed_dataset["gemini_2_flash_responses"].iloc[i])
            else processed_dataset["gemini_2_flash_responses"].iloc[i]
        ),
    }
    dataset_list.append(sample)

gemini_2_dataset = EvaluationDataset.from_list(dataset_list)
gemini_2_dataset.to_pandas().head()
user_input response reference
0 Which multilingual approaches do they compare... The text mentions comparisons with multilingu... BIBREF19\nBIBREF20
1 What are the pivot-based baselines? The pivot-based baselines are pivoting and piv... pivoting\npivoting$_{\rm m}$
2 Which datasets did they experiment with? They experimented with the Europarl and MultiU... Europarl\nMultiUN
3 What language pairs are explored? The language pairs explored in this paper are... De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...
4 Which NER models were evaluated? Based on the provided text, the following NER... Stanford NER\nspaCy 2.0 \nrecurrent model with...

Now, let's evaluate the responses from Gemini 2.0 Flash.

from ragas import evaluate

gemini_2_flash_score = evaluate(dataset=gemini_2_dataset, metrics=metrics)
gemini_2_flash_score.to_pandas().head()
Evaluating: 100%|██████████| 120/120 [00:49<00:00,  2.44it/s]

user_input response reference nv_accuracy answer_correctness unanswerable factual_correctness(mode=f1)
0 Which multilingual approaches do they compare... The text mentions comparisons with multilingu... BIBREF19\nBIBREF20 0.25 0.400000 0 0.5
1 What are the pivot-based baselines? The pivot-based baselines are pivoting and piv... pivoting\npivoting$_{\rm m}$ 0.25 0.000000 0 0.0
2 Which datasets did they experiment with? They experimented with the Europarl and MultiU... Europarl\nMultiUN 1.00 1.000000 0 0.0
3 What language pairs are explored? The language pairs explored in this paper are... De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... 0.25 0.545455 0 0.0
4 Which NER models were evaluated? Based on the provided text, the following NER... Stanford NER\nspaCy 2.0 \nrecurrent model with... 0.50 0.600000 0 0.0

This step is entirely optional: if you want to upload the evaluation results to your Ragas app, you can run the command below. You can learn more about the Ragas app here.

gemini_2_flash_score.upload()

Gemini 1.5 Flash

Next, we follow similar steps for Gemini 1.5 Flash.

from ragas.dataset_schema import EvaluationDataset

dataset_list = []

for i in range(processed_dataset.shape[0]):
    sample = {
        "user_input": (
            "" if pd.isna(processed_dataset.iloc[i].get("question")) else processed_dataset.iloc[i].get("question")
        ),
        "reference": (
            ""
            if pd.isna(processed_dataset.iloc[i].get("golden response"))
            else processed_dataset.iloc[i].get("golden response")
        ),
        "response": (
            ""
            if pd.isna(processed_dataset["gemini_1_5_flash_responses"].iloc[i])
            else processed_dataset["gemini_1_5_flash_responses"].iloc[i]
        ),
    }
    dataset_list.append(sample)

gemini_1_5_dataset = EvaluationDataset.from_list(dataset_list)
gemini_1_5_dataset.to_pandas().head()
user_input response reference
0 Which multilingual approaches do they compare... The paper compares its method with multilingu... BIBREF19\nBIBREF20
1 What are the pivot-based baselines? The provided text mentions two types of pivot-... pivoting\npivoting$_{\rm m}$
2 Which datasets did they experiment with? The experiments used two public datasets: Euro... Europarl\nMultiUN
3 What language pairs are explored? This paper explores the following language pa... De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...
4 Which NER models were evaluated? Stanford NER, spaCy 2.0, and a recurrent mode... Stanford NER\nspaCy 2.0 \nrecurrent model with...

from ragas import evaluate

gemini_1_5_flash_score = evaluate(dataset=gemini_1_5_dataset, metrics=metrics)
gemini_1_5_flash_score.to_pandas().head()
Evaluating: 100%|██████████| 120/120 [01:02<00:00,  1.93it/s]

user_input response reference nv_accuracy answer_correctness unanswerable factual_correctness(mode=f1)
0 Which multilingual approaches do they compare... The paper compares its method with multilingu... BIBREF19\nBIBREF20 0.25 0.400000 0 0.00
1 What are the pivot-based baselines? The provided text mentions two types of pivot-... pivoting\npivoting$_{\rm m}$ 0.25 0.181818 0 0.18
2 Which datasets did they experiment with? The experiments used two public datasets: Euro... Europarl\nMultiUN 1.00 0.800000 0 0.00
3 What language pairs are explored? This paper explores the following language pa... De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E... 0.00 0.533333 0 0.00
4 Which NER models were evaluated? Stanford NER, spaCy 2.0, and a recurrent mode... Stanford NER\nspaCy 2.0 \nrecurrent model with... 0.50 0.571429 0 0.00

Comparing the Results

Now that the evaluations are complete, let's compare how the two models perform on academic question answering.

def print__results(result):
    result = result._repr_dict
    print("Response Accuracy:", result.get("nv_accuracy"))
    print("Answer Correctness:", result.get("answer_correctness"))
    print("Factual Correctness:", result.get("factual_correctness(mode=f1)"))

print__results(gemini_1_5_flash_score)
Output
Response Accuracy: 0.5416666666666666
Answer Correctness: 0.47723550201811066
Factual Correctness: 0.2533333333333333

print__results(gemini_2_flash_score)
Output
Response Accuracy: 0.5666666666666667
Answer Correctness: 0.48055486996663466
Factual Correctness: 0.23633333333333334

Overall, Gemini 2.0 Flash performs slightly better.
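
If you prefer to see the numbers side by side, one optional way is to put both score dictionaries into a DataFrame. This reuses the same private _repr_dict attribute as print__results above, so it may change between Ragas versions:

# Optional side-by-side view of the aggregate scores.
comparison_df = pd.DataFrame(
    {
        "gemini_2_flash": gemini_2_flash_score._repr_dict,
        "gemini_1_5_flash": gemini_1_5_flash_score._repr_dict,
    }
)
print(comparison_df)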

Let's now look at how well the models judge whether a given question can be answered using the provided text.

To do this, we use the results of the "unanswerable" metric and compare them against the original ground truth in the "unanswerable" column of our preprocessed dataset.

from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score


def print_metrics(actuals, preds, model_name="Model", zero_division_value=0):
    """
    Prints common classification metrics for a given set of actual and predicted values.

    Parameters:
        actuals (array-like): Ground truth labels.
        preds (array-like): Predicted labels.
        model_name (str): Name of the model for display purposes.
        zero_division_value (int or str): Sets the value to return when there is a zero division.
                                          Options: 0, 1, or "warn" (default is 0 here).
    """
    print(f"Metrics for {model_name}:")
    print("Accuracy:", accuracy_score(actuals, preds))
    print(
        "Precision:", precision_score(actuals, preds, zero_division=zero_division_value)
    )
    print("Recall:", recall_score(actuals, preds, zero_division=zero_division_value))
    print("F1 Score:", f1_score(actuals, preds, zero_division=zero_division_value))
    print("\nClassification Report:")
    print(classification_report(actuals, preds, zero_division=zero_division_value))

gemini_1_5_flash_prediction = gemini_1_5_flash_score["unanswerable"]
gemini_2_flash_prediction = gemini_2_flash_score["unanswerable"]
groundtruth = processed_dataset["unanswerable"].astype(int)

print_metrics(groundtruth, gemini_2_flash_prediction, model_name="Gemini 2 Flash")

Output

Metrics for Gemini 2 Flash:
Accuracy: 0.9333333333333333
Precision: 0.5
Recall: 1.0
F1 Score: 0.6666666666666666

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        28
           1       0.50      1.00      0.67         2

    accuracy                           0.93        30
   macro avg       0.75      0.96      0.81        30
weighted avg       0.97      0.93      0.94        30

print_metrics(groundtruth, gemini_1_5_flash_prediction, model_name="Gemini 1.5 Flash")
Output
Metrics for Gemini 1.5 Flash:
Accuracy: 0.9
Precision: 0.3333333333333333
Recall: 0.5
F1 Score: 0.4

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.93      0.95        28
           1       0.33      0.50      0.40         2

    accuracy                           0.90        30
   macro avg       0.65      0.71      0.67        30
weighted avg       0.92      0.90      0.91        30

Gemini 2.0 Flash also outperforms Gemini 1.5 Flash at identifying unanswerable questions.

What's Next

You can benchmark models on any dataset using Ragas metrics, as long as the dataset is formatted as a Ragas EvaluationDataset. Try benchmarking your models on various established benchmark datasets and more.
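
The general recipe stays the same for any dataset: collect questions, references, and model responses, wrap them in an EvaluationDataset, and call evaluate. A minimal sketch (my_df and its column names are placeholders for your own benchmark data):

# Minimal sketch of the general pattern; "my_df" and its columns are placeholders.
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset

samples = [
    {
        "user_input": row["question"],
        "reference": row["reference_answer"],
        "response": row["model_response"],
    }
    for _, row in my_df.iterrows()
]
eval_dataset = EvaluationDataset.from_list(samples)
result = evaluate(dataset=eval_dataset, metrics=metrics)
print(result)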