Evaluation

evaluate()

Perform the evaluation on the dataset with different metrics.

Parameters

dataset : Dataset, EvaluationDataset
    The dataset used by the metrics to evaluate the RAG pipeline. Required.

metrics : list[Metric], optional
    List of metrics to use for evaluation. If not provided, ragas will run the
    evaluation on the best set of metrics to give a complete view. Default is None.

llm : BaseRagasLLM, optional
    The language model (LLM) used to generate the scores for the metrics. If not
    provided, ragas will use the default language model for metrics that require an
    LLM. This can be overridden by the LLM specified at the metric level with
    `metric.llm`. Default is None.

embeddings : BaseRagasEmbeddings, optional
    The embeddings model to use for the metrics. If not provided, ragas will use the
    default embeddings for metrics that require embeddings. This can be overridden by
    the embeddings specified at the metric level with `metric.embeddings`. Default is
    None.

experiment_name : str, optional
    The name of the experiment to track. This is used to track the evaluation in the
    tracing tool. Default is None.

callbacks : Callbacks, optional
    Lifecycle Langchain callbacks to run during evaluation. Check the Langchain
    documentation for more information. Default is None.

run_config : RunConfig, optional
    Configuration for runtime settings such as timeouts and retries. If not provided,
    default values are used. Default is None.

token_usage_parser : TokenUsageParser, optional
    Parser to extract token usage from the LLM result. If not provided, the cost and
    total token count will not be calculated. Default is None.

raise_exceptions : bool, optional
    Whether to raise exceptions. If set to True, the evaluation will raise an
    exception if any of the metrics fail. If set to False, the evaluation will return
    `np.nan` for the rows that failed. Default is False.

column_map : dict[str, str], optional
    Mapping from the default column names to the column names used in the dataset. If
    the dataset's column names differ from the defaults, provide the mapping as a
    dictionary here. For example, if the dataset column is named `contexts_v1`, pass
    `{"contexts": "contexts_v1"}` as column_map (a sketch follows this parameter
    list). Default is None.

show_progress : bool, optional
    Whether to show the progress bar during evaluation. If set to False, the progress
    bar is disabled. Default is True.

batch_size : int, optional
    How large the batches should be. If set to None (the default), no batching is
    done.

return_executor : bool, optional
    If True, returns the Executor instance instead of running the evaluation. The
    returned executor can be used to cancel execution by calling executor.cancel();
    to get the results, call executor.results() (see the sketch after the Returns
    section). Default is False.

allow_nest_asyncio : bool, optional
    Whether to allow nest_asyncio patching for Jupyter compatibility. Set to False in
    production async applications to avoid event loop conflicts. Default is True.
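
The parameters above can be combined in a single call. The following is a minimal sketch, not output from the library's docs: `dataset` is assumed to be a prepared Dataset/EvaluationDataset whose contexts column is named `contexts_v1`, `evaluator_llm` stands for whatever LLM wrapper you pass as `llm`, and the metric objects are the ones exported by `ragas.metrics`.

```
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

result = evaluate(
    dataset,                                 # assumed: your evaluation dataset
    metrics=[faithfulness, answer_relevancy],
    llm=evaluator_llm,                       # assumed; overridden by metric.llm if set
    column_map={"contexts": "contexts_v1"},  # map the non-default column name
)
```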

Returns

EvaluationResult or Executor
    If return_executor is False, returns an EvaluationResult object containing the
    scores of each metric. If return_executor is True, returns the Executor instance
    for cancellable execution.
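
The cancellable-execution path can be used roughly as follows; a sketch, assuming a prepared `dataset`:

```
from ragas import evaluate

executor = evaluate(dataset, return_executor=True)

# Later: either collect the scores or abort the run.
result = executor.results()
# executor.cancel()
```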

Raises

ValueError
    If validation fails because the columns required for the metrics are missing or
    the columns are of the wrong format.

Examples

The basic usage is as follows:

from ragas import evaluate

>>> dataset
Dataset({
    features: ['question', 'ground_truth', 'answer', 'contexts'],
    num_rows: 30
})

>>> result = evaluate(dataset)
>>> print(result)
{'context_precision': 0.817,
'faithfulness': 0.892,
'answer_relevancy': 0.874}
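
Beyond printing the aggregate scores, the per-sample scores in the returned EvaluationResult can be inspected as a DataFrame. A sketch, assuming pandas is installed and `dataset` is the object shown above:

```
result = evaluate(dataset)

df = result.to_pandas()   # one row per sample, one column per metric
print(df.head())
```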

Source code in src/ragas/evaluation.py
@track_was_completed
def evaluate(
    dataset: t.Union[Dataset, EvaluationDataset],
    metrics: t.Optional[t.Sequence[Metric]] = None,
    llm: t.Optional[BaseRagasLLM | LangchainLLM] = None,
    embeddings: t.Optional[
        BaseRagasEmbeddings | BaseRagasEmbedding | LangchainEmbeddings
    ] = None,
    experiment_name: t.Optional[str] = None,
    callbacks: Callbacks = None,
    run_config: t.Optional[RunConfig] = None,
    token_usage_parser: t.Optional[TokenUsageParser] = None,
    raise_exceptions: bool = False,
    column_map: t.Optional[t.Dict[str, str]] = None,
    show_progress: bool = True,
    batch_size: t.Optional[int] = None,
    _run_id: t.Optional[UUID] = None,
    _pbar: t.Optional[tqdm] = None,
    return_executor: bool = False,
    allow_nest_asyncio: bool = True,
) -> t.Union[EvaluationResult, Executor]:
    """
    Perform the evaluation on the dataset with different metrics

    Parameters
    ----------
    dataset : Dataset, EvaluationDataset
        The dataset used by the metrics to evaluate the RAG pipeline.
    metrics : list[Metric], optional
        List of metrics to use for evaluation. If not provided, ragas will run
        the evaluation on the best set of metrics to give a complete view.
    llm : BaseRagasLLM, optional
        The language model (LLM) to use to generate the score for calculating the metrics.
        If not provided, ragas will use the default
        language model for metrics that require an LLM. This can be overridden by the LLM
        specified in the metric level with `metric.llm`.
    embeddings : BaseRagasEmbeddings, optional
        The embeddings model to use for the metrics.
        If not provided, ragas will use the default embeddings for metrics that require embeddings.
        This can be overridden by the embeddings specified in the metric level with `metric.embeddings`.
    experiment_name : str, optional
        The name of the experiment to track. This is used to track the evaluation in the tracing tool.
    callbacks : Callbacks, optional
        Lifecycle Langchain Callbacks to run during evaluation.
        Check the [Langchain documentation](https://python.langchain.com/docs/modules/callbacks/) for more information.
    run_config : RunConfig, optional
        Configuration for runtime settings like timeout and retries. If not provided, default values are used.
    token_usage_parser : TokenUsageParser, optional
        Parser to get the token usage from the LLM result.
        If not provided, the cost and total token count will not be calculated. Default is None.
    raise_exceptions : bool, optional
        Whether to raise exceptions or not. If set to True, the evaluation will raise an exception
        if any of the metrics fail. If set to False, the evaluation will return `np.nan` for the row that failed. Default is False.
    column_map : dict[str, str], optional
        The column names of the dataset to use for evaluation. If the column names of the dataset are different from the default ones,
        it is possible to provide the mapping as a dictionary here. Example: If the dataset column name is `contexts_v1`, it is possible to pass column_map as `{"contexts": "contexts_v1"}`.
    show_progress : bool, optional
        Whether to show the progress bar during evaluation. If set to False, the progress bar will be disabled. The default is True.
    batch_size : int, optional
        How large the batches should be. If set to None (default), no batching is done.
    return_executor : bool, optional
        If True, returns the Executor instance instead of running evaluation.
        The returned executor can be used to cancel execution by calling executor.cancel().
        To get results, call executor.results(). Default is False.
    allow_nest_asyncio : bool, optional
        Whether to allow nest_asyncio patching for Jupyter compatibility.
        Set to False in production async applications to avoid event loop conflicts. Default is True.

    Returns
    -------
    EvaluationResult or Executor
        If return_executor is False, returns EvaluationResult object containing the scores of each metric.
        If return_executor is True, returns the Executor instance for cancellable execution.

    Raises
    ------
    ValueError
        if validation fails because the columns required for the metrics are missing or
        if the columns are of the wrong format.

    Examples
    --------
    the basic usage is as follows:
    ```
    from ragas import evaluate

    >>> dataset
    Dataset({
        features: ['question', 'ground_truth', 'answer', 'contexts'],
        num_rows: 30
    })

    >>> result = evaluate(dataset)
    >>> print(result)
    {'context_precision': 0.817,
    'faithfulness': 0.892,
    'answer_relevancy': 0.874}
    ```
    """

    # Create async wrapper for aevaluate
    async def _async_wrapper():
        return await aevaluate(
            dataset=dataset,
            metrics=metrics,
            llm=llm,
            embeddings=embeddings,
            experiment_name=experiment_name,
            callbacks=callbacks,
            run_config=run_config,
            token_usage_parser=token_usage_parser,
            raise_exceptions=raise_exceptions,
            column_map=column_map,
            show_progress=show_progress,
            batch_size=batch_size,
            _run_id=_run_id,
            _pbar=_pbar,
            return_executor=return_executor,
        )

    if not allow_nest_asyncio:
        # Run without nest_asyncio - creates a new event loop
        import asyncio

        return asyncio.run(_async_wrapper())
    else:
        # Default behavior: use nest_asyncio for backward compatibility (Jupyter notebooks)
        from ragas.async_utils import run

        return run(_async_wrapper())
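
As the wrapper above shows, `evaluate()` ultimately delegates to the coroutine `aevaluate()`. In an application that already runs its own event loop, you can await that coroutine directly instead of going through the synchronous wrapper; the sketch below shows the standalone form with `asyncio.run`, and inside an already-running loop you would simply `await aevaluate(dataset)`. `dataset` is assumed to be a prepared EvaluationDataset, and the import path follows the module shown above (src/ragas/evaluation.py).

```
import asyncio

from ragas.evaluation import aevaluate  # the coroutine evaluate() wraps

async def main():
    result = await aevaluate(dataset)   # assumed: dataset prepared beforehand
    print(result)

asyncio.run(main())
```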