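Create custom multi-hop queries from your documents
In this tutorial you will learn how to create custom multi-hop queries from your documents. This is a powerful feature: it lets you build queries that the standard query types cannot produce, and it helps you tailor queries to your own use case.
Load sample documents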
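I am using documents from the GitLab handbook sample. Download them first; the code below then loads every Markdown file from the local Sample_Docs_Markdown/ directory.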
from langchain_community.document_loaders import DirectoryLoader, TextLoader
path = "Sample_Docs_Markdown/"
loader = DirectoryLoader(path, glob="**/*.md")
docs = loader.load()
Create the KG
Create a base knowledge graph from the documents.
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType
kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata,
            },
        )
    )
Set up the LLM and embedding model
You can use any model of your choice; here I am using OpenAI models.
from ragas.llms.base import llm_factory
from ragas.embeddings.base import embedding_factory
llm = llm_factory()
embedding = embedding_factory()
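Set up the extractors and relationship builders
To create multi-hop queries, you need to know which sets of documents can be used for them. Ragas uses the relationships between documents/nodes to select qualified nodes for multi-hop query creation. Specifically, if node A and node B are connected by a relationship (for example, entity or keyphrase overlap), you can create a multi-hop query between them.
Here we use two extractors, one splitter, and one relationship builder:
- Headline extractor: extracts headlines from the documents
- Keyphrase extractor: extracts keyphrases from the documents
- Headline splitter: splits the documents into nodes based on headlines
- OverlapScore builder: builds relationships between nodes based on keyphrase overlap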
from ragas.testset.transforms import Parallel, apply_transforms
from ragas.testset.transforms import (
    HeadlinesExtractor,
    HeadlineSplitter,
    KeyphrasesExtractor,
    OverlapScoreBuilder,
)

headline_extractor = HeadlinesExtractor(llm=llm)
headline_splitter = HeadlineSplitter(min_tokens=300, max_tokens=1000)
keyphrase_extractor = KeyphrasesExtractor(
    llm=llm, property_name="keyphrases", max_num=10
)
relation_builder = OverlapScoreBuilder(
    property_name="keyphrases",
    new_property_name="overlap_score",
    threshold=0.01,
    distance_threshold=0.9,
)

transforms = [
    headline_extractor,
    headline_splitter,
    keyphrase_extractor,
    relation_builder,
]
apply_transforms(kg, transforms=transforms)
Property 'keyphrases' already exists in node 'a2f389'. Skipping!
Property 'keyphrases' already exists in node '3068c0'. Skipping!
Property 'keyphrases' already exists in node '854bf7'. Skipping!
Property 'keyphrases' already exists in node '2eeb07'. Skipping!
Property 'keyphrases' already exists in node 'd68f83'. Skipping!
Property 'keyphrases' already exists in node '8fdbea'. Skipping!
Property 'keyphrases' already exists in node 'ef6ae0'. Skipping!
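To make the idea of a keyphrase-overlap relationship concrete, here is a minimal sketch in plain Python. It is not the OverlapScoreBuilder implementation: it assumes a simple Jaccard score over exact string matches and ignores the distance_threshold parameter. The node names and keyphrases are hypothetical placeholders, not values extracted from the sample documents.

```python
from itertools import combinations


def overlap_score(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two keyphrase sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical keyphrases for three nodes (illustration only).
node_keyphrases = {
    "node_a": {"diversity", "inclusion", "hiring"},
    "node_b": {"hiring", "interview", "offer"},
    "node_c": {"ci pipeline", "runners"},
}

threshold = 0.01  # assumption: the tutorial's threshold applied to a Jaccard score

# Keep only the pairs whose score clears the threshold. These play the role of
# the (node A, relationship, node B) triplets used for multi-hop queries.
related_pairs = []
for (name_a, kp_a), (name_b, kp_b) in combinations(node_keyphrases.items(), 2):
    score = overlap_score(kp_a, kp_b)
    if score > threshold:
        related_pairs.append((name_a, name_b, sorted(kp_a & kp_b), score))

print(related_pairs)
```

Only node_a and node_b share a keyphrase here, so they form the single candidate pair; node_c stays unconnected and would not take part in any multi-hop query.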
Configure personas
You can also do this automatically using the automatic persona generator.
from ragas.testset.persona import Persona
persona1 = Persona(
    name="gitlab employee",
    role_description="A junior gitlab employee curious on workings on gitlab",
)
persona2 = Persona(
    name="Hiring manager at gitlab",
    role_description="A hiring manager at gitlab trying to underestand hiring policies in gitlab",
)
persona_list = [persona1, persona2]
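Create multi-hop queries
Subclass MultiHopQuerySynthesizer and override the function that generates scenarios for query creation.
Steps:
- Find the qualified (nodeA, relationship, nodeB) triplets based on the relationships between nodes
- For each qualified triplet:
    - Match the keyphrases with one or more personas
    - Create all possible combinations of (nodes, persona, query style, query length)
    - Sample the required number of queries from the combinations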
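The last two steps (enumerate combinations, then sample from them) can be sketched in plain Python before looking at the synthesizer itself. This is an illustration under stated assumptions, not what prepare_combinations or sample_diverse_combinations do internally: the persona-to-theme mapping, style names, and length names are hypothetical, and plain random sampling stands in for whatever diversity logic the library applies.

```python
import random
from itertools import product

# Hypothetical inputs (illustration only).
nodes = ("node_a", "node_b")
persona_to_themes = {
    "Hiring manager": ["hiring", "interview"],
    "Junior employee": ["onboarding"],
}
styles = ["web search", "conversational"]
lengths = ["short", "long"]

# Enumerate every (nodes, persona, theme, style, length) combination.
candidates = [
    {
        "nodes": nodes,
        "persona": persona,
        "theme": theme,
        "style": style,
        "length": length,
    }
    for persona, themes in persona_to_themes.items()
    for theme in themes
    for style, length in product(styles, lengths)
]

# Sample the required number of scenarios without replacement.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5
sampled = rng.sample(candidates, k=min(n, len(candidates)))

print(len(candidates), len(sampled))
```

Capping k at len(candidates) keeps the sketch from raising when fewer combinations exist than were requested.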
from dataclasses import dataclass
import typing as t

from ragas.testset.synthesizers.multi_hop.base import (
    MultiHopQuerySynthesizer,
    MultiHopScenario,
)
from ragas.testset.synthesizers.prompts import (
    ThemesPersonasInput,
    ThemesPersonasMatchingPrompt,
)


@dataclass
class MyMultiHopQuery(MultiHopQuerySynthesizer):
    theme_persona_matching_prompt = ThemesPersonasMatchingPrompt()

    async def _generate_scenarios(
        self,
        n: int,
        knowledge_graph,
        persona_list,
        callbacks,
    ) -> t.List[MultiHopScenario]:
        # query the graph that was passed in (not the global kg) for
        # (node_a, rel, node_b) triplets to create multi-hop queries
        results = knowledge_graph.find_two_nodes_single_rel(
            relationship_condition=lambda rel: rel.type == "keyphrases_overlap"
        )
        if not results:
            # no related nodes: nothing to build, and avoids dividing by zero
            return []
        num_sample_per_triplet = max(1, n // len(results))

        scenarios = []
        for triplet in results:
            if len(scenarios) < n:
                node_a, node_b = triplet[0], triplet[-1]
                overlapped_keywords = triplet[1].properties["overlapped_items"]
                if overlapped_keywords:
                    # match the keyword with a persona for query creation
                    themes = list(dict(overlapped_keywords).keys())
                    prompt_input = ThemesPersonasInput(
                        themes=themes, personas=persona_list
                    )
                    persona_concepts = (
                        await self.theme_persona_matching_prompt.generate(
                            data=prompt_input, llm=self.llm, callbacks=callbacks
                        )
                    )
                    overlapped_keywords = [list(item) for item in overlapped_keywords]

                    # prepare and sample possible combinations
                    base_scenarios = self.prepare_combinations(
                        [node_a, node_b],
                        overlapped_keywords,
                        personas=persona_list,
                        persona_item_mapping=persona_concepts.mapping,
                        property_name="keyphrases",
                    )

                    # get number of required samples from this triplet
                    base_scenarios = self.sample_diverse_combinations(
                        base_scenarios, num_sample_per_triplet
                    )
                    scenarios.extend(base_scenarios)

        return scenarios
query = MyMultiHopQuery(llm=llm)
scenarios = await query.generate_scenarios(
    n=10, knowledge_graph=kg, persona_list=persona_list
)
scenarios[4]
MultiHopScenario(
nodes=2
combinations=['Diversity Inclusion & Belonging', 'Diversity, Inclusion & Belonging Goals']
style=Web search like queries
length=short
persona=name='Hiring manager at gitlab' role_description='A hiring manager at gitlab trying to underestand hiring policies in gitlab')
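The expression max(1, n // len(results)) in _generate_scenarios decides how many scenarios each triplet may contribute. The small sketch below reproduces only that arithmetic, with hypothetical triplet counts, to show how the budget behaves; samples_per_triplet is a helper name introduced here for illustration.

```python
def samples_per_triplet(n: int, num_triplets: int) -> int:
    """Same arithmetic as max(1, n // len(results)) in _generate_scenarios."""
    return max(1, n // num_triplets)


# Hypothetical (n, number of triplets) cases.
budgets = {
    (n, num_triplets): samples_per_triplet(n, num_triplets)
    for n, num_triplets in [(10, 3), (10, 10), (10, 40)]
}

for (n, num_triplets), per_triplet in budgets.items():
    upper_bound = min(n, per_triplet * num_triplets)
    print(f"n={n}, triplets={num_triplets}: {per_triplet} per triplet, at most {upper_bound} total")
```

Because of the integer division, the total can come in under n (3 triplets with 3 samples each give at most 9 of the 10 requested), and triplets with no overlapping keyphrases contribute nothing, so it is worth checking len(scenarios) before indexing into the list.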
Run the multi-hop query
Output
'How does GitLab ensure that its DIB roundtables are effective in promoting diversity and inclusion?'
Yay! You have created a multi-hop query. You can now create any such query by building and exploring relationships between your documents.