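Create custom multi-hop queries from your documents
In this tutorial, you will learn how to create custom multi-hop queries from your documents. This lets you build queries that the standard query types cannot produce, and tailor them to your specific use case.
Load sample documents
I am using documents from the GitLab handbook sample. The code below loads them from a local Sample_Docs_Markdown/ directory, so download the sample into that directory first.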
from langchain_community.document_loaders import DirectoryLoader, TextLoader
path = "Sample_Docs_Markdown/"
loader = DirectoryLoader(path, glob="**/*.md")
docs = loader.load()
Create the knowledge graph (KG)
Create a base knowledge graph from the documents
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType
kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata,
            },
        )
    )
Set up the LLM and embedding model
You can use any model of your choice; here I am using models from OpenAI.
from openai import OpenAI
from ragas.llms import llm_factory
from ragas.embeddings import OpenAIEmbeddings
openai_client = OpenAI()
llm = llm_factory("gpt-4o-mini", client=openai_client)
embedding = OpenAIEmbeddings(client=openai_client)
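Set up extractors and relationship builders
To create multi-hop queries, you need to know which sets of documents can be used together. Ragas uses relationships between documents/nodes to qualify nodes for multi-hop query creation. Specifically, if node A and node B are connected by a relationship (for example, entity or keyphrase overlap), you can create a multi-hop query between them.
Here we use two extractors, one splitter, and one relationship builder:
- Headline extractor: extracts headlines from the documents
- Keyphrase extractor: extracts keyphrases from the documents
- Headline splitter: splits the documents into nodes based on headlines
- Overlap score builder: builds relationships between nodes based on keyphrase overlap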
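To make the idea of an overlap-based relationship concrete, here is a minimal pure-Python sketch. It is not how Ragas implements it: `overlap_score` and `build_overlap_relationships` are hypothetical helpers, exact case-insensitive matching stands in for whatever string-similarity measure the library applies, and the score definition (shared keyphrases divided by the size of the smaller set) is an assumption chosen for illustration.

```python
from itertools import combinations


def overlap_score(a, b):
    """Return (score, shared) for two keyphrase lists.

    Assumption: score = |shared| / size of the smaller set,
    with keyphrases compared case-insensitively.
    """
    set_a = {k.lower() for k in a}
    set_b = {k.lower() for k in b}
    if not set_a or not set_b:
        return 0.0, []
    shared = sorted(set_a & set_b)
    return len(shared) / min(len(set_a), len(set_b)), shared


def build_overlap_relationships(nodes, threshold=0.01):
    """Link every pair of nodes whose keyphrase overlap exceeds `threshold`."""
    relationships = []
    for (id_a, kp_a), (id_b, kp_b) in combinations(nodes.items(), 2):
        score, shared = overlap_score(kp_a, kp_b)
        if score > threshold:
            relationships.append(
                {
                    "source": id_a,
                    "target": id_b,
                    "type": "keyphrases_overlap",
                    "overlap_score": score,
                    "overlapped_items": shared,
                }
            )
    return relationships


# Toy nodes: node id -> extracted keyphrases (made-up data)
nodes = {
    "hiring": ["Hiring process", "Diversity", "Interview"],
    "dib": ["diversity", "Inclusion", "Belonging"],
    "ci": ["Pipelines", "Runners"],
}
relationships = build_overlap_relationships(nodes)
for rel in relationships:
    print(rel["source"], "->", rel["target"], rel["overlapped_items"])
```

Only the two nodes that share a keyphrase end up connected, so only that pair would be eligible for a multi-hop query.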
from ragas.testset.transforms import Parallel, apply_transforms
from ragas.testset.transforms import (
    HeadlinesExtractor,
    HeadlineSplitter,
    KeyphrasesExtractor,
    OverlapScoreBuilder,
)
headline_extractor = HeadlinesExtractor(llm=llm)
headline_splitter = HeadlineSplitter(min_tokens=300, max_tokens=1000)
keyphrase_extractor = KeyphrasesExtractor(
    llm=llm, property_name="keyphrases", max_num=10
)
relation_builder = OverlapScoreBuilder(
    property_name="keyphrases",
    new_property_name="overlap_score",
    threshold=0.01,
    distance_threshold=0.9,
)
transforms = [
    headline_extractor,
    headline_splitter,
    keyphrase_extractor,
    relation_builder,
]
apply_transforms(kg, transforms=transforms)
Property 'keyphrases' already exists in node 'a2f389'. Skipping!
Property 'keyphrases' already exists in node '3068c0'. Skipping!
Property 'keyphrases' already exists in node '854bf7'. Skipping!
Property 'keyphrases' already exists in node '2eeb07'. Skipping!
Property 'keyphrases' already exists in node 'd68f83'. Skipping!
Property 'keyphrases' already exists in node '8fdbea'. Skipping!
Property 'keyphrases' already exists in node 'ef6ae0'. Skipping!
Configure personas
You can also do this automatically using the automatic persona generator.
from ragas.testset.persona import Persona
persona1 = Persona(
    name="gitlab employee",
    role_description="A junior gitlab employee curious on workings on gitlab",
)
persona2 = Persona(
    name="Hiring manager at gitlab",
    role_description="A hiring manager at gitlab trying to underestand hiring policies in gitlab",
)
persona_list = [persona1, persona2]
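Create multi-hop queries
Subclass MultiHopQuerySynthesizer and override the function that generates scenarios for query creation.
Steps:
- Find the qualifying set of (node A, relationship, node B) triplets based on the relationships between nodes
- For each qualifying triplet:
  - Match the keyphrases to one or more personas
  - Create all possible combinations of (nodes, persona, query style, query length)
  - Sample the required number of queries from those combinations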
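The combination and sampling steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `prepare_scenarios` and `sample_scenarios` are hypothetical helpers, the persona-to-theme mapping is hard-coded instead of coming from an LLM, and seeded uniform random sampling stands in for the diversity-aware sampling that the method name `sample_diverse_combinations` suggests.

```python
import random
from itertools import product


def prepare_scenarios(nodes, themes, persona_themes, styles, lengths):
    """Enumerate every (persona, theme, style, length) combination for a node pair.

    `persona_themes` maps a persona name to the themes matched to that persona.
    """
    scenarios = []
    for persona, matched in persona_themes.items():
        # Keep only the overlapping themes that were matched to this persona
        usable = [theme for theme in themes if theme in matched]
        for theme, style, length in product(usable, styles, lengths):
            scenarios.append(
                {
                    "nodes": nodes,
                    "persona": persona,
                    "theme": theme,
                    "style": style,
                    "length": length,
                }
            )
    return scenarios


def sample_scenarios(scenarios, k, seed=0):
    """Pick at most `k` scenarios without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(scenarios, min(k, len(scenarios)))


all_scenarios = prepare_scenarios(
    nodes=("node_a", "node_b"),
    themes=["hiring", "diversity"],
    persona_themes={
        "Hiring manager": ["hiring", "diversity"],
        "Junior employee": ["diversity"],
    },
    styles=["web search", "conversational"],
    lengths=["short", "long"],
)
picked = sample_scenarios(all_scenarios, k=3)
print(len(all_scenarios), len(picked))
```

Enumerating first and sampling afterwards keeps the two concerns separate: the candidate space is fully determined by the graph and personas, while the sample size is controlled by the number of queries requested.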
from dataclasses import dataclass
import typing as t
from ragas.testset.synthesizers.multi_hop.base import (
    MultiHopQuerySynthesizer,
    MultiHopScenario,
)
from ragas.testset.synthesizers.prompts import (
    ThemesPersonasInput,
    ThemesPersonasMatchingPrompt,
)
@dataclass
class MyMultiHopQuery(MultiHopQuerySynthesizer):
    theme_persona_matching_prompt = ThemesPersonasMatchingPrompt()

    async def _generate_scenarios(
        self,
        n: int,
        knowledge_graph,
        persona_list,
        callbacks,
    ) -> t.List[MultiHopScenario]:
        # Query the graph passed in (not the global `kg`) for
        # (node_a, rel, node_b) triplets linked by keyphrase overlap
        results = knowledge_graph.find_two_nodes_single_rel(
            relationship_condition=lambda rel: rel.type == "keyphrases_overlap"
        )
        if not results:
            # No eligible triplets: avoid dividing by zero below
            return []
        num_sample_per_triplet = max(1, n // len(results))

        scenarios = []
        for triplet in results:
            if len(scenarios) >= n:
                break
            node_a, node_b = triplet[0], triplet[-1]
            overlapped_keywords = triplet[1].properties["overlapped_items"]
            if overlapped_keywords:
                # Match the keyphrases with personas for query creation
                themes = list(dict(overlapped_keywords).keys())
                prompt_input = ThemesPersonasInput(
                    themes=themes, personas=persona_list
                )
                persona_concepts = (
                    await self.theme_persona_matching_prompt.generate(
                        data=prompt_input, llm=self.llm, callbacks=callbacks
                    )
                )
                overlapped_keywords = [list(item) for item in overlapped_keywords]
                # Prepare all possible combinations for this triplet
                base_scenarios = self.prepare_combinations(
                    [node_a, node_b],
                    overlapped_keywords,
                    personas=persona_list,
                    persona_item_mapping=persona_concepts.mapping,
                    property_name="keyphrases",
                )
                # Sample the required number of scenarios from this triplet
                base_scenarios = self.sample_diverse_combinations(
                    base_scenarios, num_sample_per_triplet
                )
                scenarios.extend(base_scenarios)
        return scenarios
query = MyMultiHopQuery(llm=llm)
scenarios = await query.generate_scenarios(
    n=10, knowledge_graph=kg, persona_list=persona_list
)
scenarios[4]
MultiHopScenario(
nodes=2
combinations=['Diversity Inclusion & Belonging', 'Diversity, Inclusion & Belonging Goals']
style=Web search like queries
length=short
persona=name='Hiring manager at gitlab' role_description='A hiring manager at gitlab trying to underestand hiring policies in gitlab')
Run the multi-hop query
Output
'How does GitLab ensure that its DIB roundtables are effective in promoting diversity and inclusion?'
That's it! You have created a multi-hop query. You can now create any such query by building and exploring relationships between your documents.
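As a starting point for exploring other relationships, the sketch below filters (source, relationship, target) triplets with an arbitrary condition, in the spirit of the `relationship_condition` argument used earlier. It is a standalone illustration, not the Ragas API: the `Relationship` class, the `find_triplets` helper, the `entities_overlap` type, and the scores are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    """Hypothetical stand-in for an edge between two nodes."""

    source: str
    target: str
    type: str
    properties: dict = field(default_factory=dict)


def find_triplets(relationships, condition=lambda rel: True):
    """Return (source, relationship, target) triplets whose relationship passes `condition`."""
    return [(rel.source, rel, rel.target) for rel in relationships if condition(rel)]


# Made-up relationships between three documents
relationships = [
    Relationship("doc1", "doc2", "keyphrases_overlap", {"overlap_score": 0.4}),
    Relationship("doc1", "doc3", "entities_overlap", {"overlap_score": 0.2}),
    Relationship("doc2", "doc3", "keyphrases_overlap", {"overlap_score": 0.05}),
]

# Keep only keyphrase-overlap edges with a reasonably strong score
strong = find_triplets(
    relationships,
    condition=lambda rel: rel.type == "keyphrases_overlap"
    and rel.properties["overlap_score"] >= 0.1,
)
print([(a, b) for a, _, b in strong])
```

Changing the condition (a different relationship type, a stricter score cutoff) changes which document pairs can seed a query, which is the main lever for customizing the scenarios.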