跳到内容

从您的文档创建自定义多跳查询

在本教程中,您将学习如何从您的文档创建自定义多跳查询。这是一个非常强大的功能,它允许您创建标准查询类型无法实现的查询。这也有助于您创建更符合您用例的特定查询。

加载示例文档

我正在使用来自 gitlab 手册示例 的文档。您可以通过运行以下命令下载它。

! git clone https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown
from langchain_community.document_loaders import DirectoryLoader, TextLoader

path = "Sample_Docs_Markdown/"
loader = DirectoryLoader(path, glob="**/*.md")
docs = loader.load()

创建 KG

使用文档创建基础知识图谱

from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType


kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata,
            },
        )
    )

设置 LLM 和 Embedding 模型

您可以使用您选择的任何模型,这里我使用 open-ai 的模型。

from ragas.llms.base import llm_factory
from ragas.embeddings.base import embedding_factory

llm = llm_factory()
embedding = embedding_factory()

设置提取器和关系构建器

要创建多跳查询,您需要理解可以用于此的文档集。Ragas 使用文档/节点之间的关系来筛选合格的节点以创建多跳查询。具体来说,如果节点 A 和节点 B 通过某种关系(例如实体或关键词重叠)连接,那么您可以在它们之间创建多跳查询。

这里我们使用 2 个提取器和 2 个关系构建器。- Headline 提取器:从文档中提取标题 - Keyphrase 提取器:从文档中提取关键词 - Headline 分割器:根据标题将文档分割成节点 - OverlapScore 构建器:根据关键词重叠构建节点之间的关系

from ragas.testset.transforms import Parallel, apply_transforms
from ragas.testset.transforms import (
    HeadlinesExtractor,
    HeadlineSplitter,
    KeyphrasesExtractor,
    OverlapScoreBuilder,
)


headline_extractor = HeadlinesExtractor(llm=llm)
headline_splitter = HeadlineSplitter(min_tokens=300, max_tokens=1000)
keyphrase_extractor = KeyphrasesExtractor(
    llm=llm, property_name="keyphrases", max_num=10
)
relation_builder = OverlapScoreBuilder(
    property_name="keyphrases",
    new_property_name="overlap_score",
    threshold=0.01,
    distance_threshold=0.9,
)

transforms = [
    headline_extractor,
    headline_splitter,
    keyphrase_extractor,
    relation_builder,
]

apply_transforms(kg, transforms=transforms)
输出
Applying KeyphrasesExtractor:   6%|██████▏                                                                                                         | 2/36 [00:01<00:17,  1.94it/s]Property 'keyphrases' already exists in node 'a2f389'. Skipping!
Applying KeyphrasesExtractor:  17%|██████████████████▋                                                                                             | 6/36 [00:01<00:04,  6.37it/s]Property 'keyphrases' already exists in node '3068c0'. Skipping!
Applying KeyphrasesExtractor:  53%|██████████████████████████████████████████████████████████▌                                                    | 19/36 [00:02<00:01,  8.88it/s]Property 'keyphrases' already exists in node '854bf7'. Skipping!
Applying KeyphrasesExtractor:  78%|██████████████████████████████████████████████████████████████████████████████████████▎                        | 28/36 [00:03<00:00,  9.73it/s]Property 'keyphrases' already exists in node '2eeb07'. Skipping!
Property 'keyphrases' already exists in node 'd68f83'. Skipping!
Applying KeyphrasesExtractor:  83%|████████████████████████████████████████████████████████████████████████████████████████████▌                  | 30/36 [00:03<00:00,  9.35it/s]Property 'keyphrases' already exists in node '8fdbea'. Skipping!
Applying KeyphrasesExtractor:  89%|██████████████████████████████████████████████████████████████████████████████████████████████████▋            | 32/36 [00:04<00:00,  7.76it/s]Property 'keyphrases' already exists in node 'ef6ae0'. Skipping!

配置角色

您也可以使用自动角色生成器自动执行此操作

from ragas.testset.persona import Persona

person1 = Persona(
    name="gitlab employee",
    role_description="A junior gitlab employee curious on workings on gitlab",
)
persona2 = Persona(
    name="Hiring manager at gitlab",
    role_description="A hiring manager at gitlab trying to underestand hiring policies in gitlab",
)
persona_list = [person1, persona2]

创建多跳查询

继承 MultiHopQuerySynthesizer 并修改用于生成查询创建场景的函数。

步骤:- 根据节点之间的关系找到合格的 (nodeA, relationship, nodeB) 集合 - 对于每个合格的集合 - 将关键词与一个或多个角色匹配。- 创建 (节点, 角色, 查询风格, 查询长度) 的所有可能组合 - 从组合中采样所需数量的查询

from dataclasses import dataclass
import typing as t
from ragas.testset.synthesizers.multi_hop.base import (
    MultiHopQuerySynthesizer,
    MultiHopScenario,
)
from ragas.testset.synthesizers.prompts import (
    ThemesPersonasInput,
    ThemesPersonasMatchingPrompt,
)


@dataclass
class MyMultiHopQuery(MultiHopQuerySynthesizer):

    theme_persona_matching_prompt = ThemesPersonasMatchingPrompt()

    async def _generate_scenarios(
        self,
        n: int,
        knowledge_graph,
        persona_list,
        callbacks,
    ) -> t.List[MultiHopScenario]:

        # query and get (node_a, rel, node_b) to create multi-hop queries
        results = kg.find_two_nodes_single_rel(
            relationship_condition=lambda rel: (
                True if rel.type == "keyphrases_overlap" else False
            )
        )

        num_sample_per_triplet = max(1, n // len(results))

        scenarios = []
        for triplet in results:
            if len(scenarios) < n:
                node_a, node_b = triplet[0], triplet[-1]
                overlapped_keywords = triplet[1].properties["overlapped_items"]
                if overlapped_keywords:

                    # match the keyword with a persona for query creation
                    themes = list(dict(overlapped_keywords).keys())
                    prompt_input = ThemesPersonasInput(
                        themes=themes, personas=persona_list
                    )
                    persona_concepts = (
                        await self.theme_persona_matching_prompt.generate(
                            data=prompt_input, llm=self.llm, callbacks=callbacks
                        )
                    )

                    overlapped_keywords = [list(item) for item in overlapped_keywords]

                    # prepare and sample possible combinations
                    base_scenarios = self.prepare_combinations(
                        [node_a, node_b],
                        overlapped_keywords,
                        personas=persona_list,
                        persona_item_mapping=persona_concepts.mapping,
                        property_name="keyphrases",
                    )

                    # get number of required samples from this triplet
                    base_scenarios = self.sample_diverse_combinations(
                        base_scenarios, num_sample_per_triplet
                    )

                    scenarios.extend(base_scenarios)

        return scenarios

query = MyMultiHopQuery(llm=llm)
scenarios = await query.generate_scenarios(
    n=10, knowledge_graph=kg, persona_list=persona_list
)

scenarios[4]
输出
MultiHopScenario(
nodes=2
combinations=['Diversity Inclusion & Belonging', 'Diversity, Inclusion & Belonging Goals']
style=Web search like queries
length=short
persona=name='Hiring manager at gitlab' role_description='A hiring manager at gitlab trying to underestand hiring policies in gitlab')

运行多跳查询

result = await query.generate_sample(scenario=scenarios[-1])
result.user_input

输出

'How does GitLab ensure that its DIB roundtables are effective in promoting diversity and inclusion?'

耶!您已经创建了一个多跳查询。现在您可以通过创建和探索文档之间的关系来创建任何此类查询。