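Create custom single-hop queries from your documents
Load sample documents
I am using sample documents from the GitLab handbook. The code below assumes they have already been downloaded to the local Sample_Docs_Markdown/ directory.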
from langchain_community.document_loaders import DirectoryLoader
path = "Sample_Docs_Markdown/"
loader = DirectoryLoader(path, glob="**/*.md")
docs = loader.load()
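Before loading, it can help to confirm which files a recursive `*.md` pattern will pick up. The sketch below uses only the standard library; `list_markdown_files` is a hypothetical helper, not part of LangChain, and it only approximates how the loader resolves its glob.

```python
from pathlib import Path


def list_markdown_files(root: str) -> list[str]:
    """Return sorted relative paths of all .md files under root, recursively."""
    base = Path(root)
    # A missing directory yields an empty list instead of raising.
    if not base.is_dir():
        return []
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*.md") if p.is_file()
    )
```

Calling `list_markdown_files("Sample_Docs_Markdown/")` and getting an empty list usually means the sample documents are not where the loader expects them.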
Create a knowledge graph
Create a base knowledge graph from the documents.
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType
kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata,
            },
        )
    )
Set up the LLM and embedding model
You can use any model of your choice; here I am using models from OpenAI.
from openai import OpenAI
from ragas.llms import llm_factory
from ragas.embeddings import OpenAIEmbeddings
openai_client = OpenAI()
llm = llm_factory("gpt-4o-mini", client=openai_client)
embedding = OpenAIEmbeddings(client=openai_client)
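Set up the transforms
Here we use two extractors and one splitter:
- Headline extractor: extracts headlines from the documents
- Keyphrase extractor: extracts keyphrases from the documents
- Headline splitter: splits the documents into nodes based on the headlines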
from ragas.testset.transforms import apply_transforms
from ragas.testset.transforms import (
    HeadlinesExtractor,
    HeadlineSplitter,
    KeyphrasesExtractor,
)
headline_extractor = HeadlinesExtractor(llm=llm)
headline_splitter = HeadlineSplitter(min_tokens=300, max_tokens=1000)
keyphrase_extractor = KeyphrasesExtractor(
    llm=llm, property_name="keyphrases", max_num=10
)
transforms = [
    headline_extractor,
    headline_splitter,
    keyphrase_extractor,
]
apply_transforms(kg, transforms=transforms)
Output
Property 'keyphrases' already exists in node '514fdc'. Skipping!
Property 'keyphrases' already exists in node '84a0f6'. Skipping!
Property 'keyphrases' already exists in node '93f19d'. Skipping!
Property 'keyphrases' already exists in node 'a126bf'. Skipping!
Property 'keyphrases' already exists in node 'c230df'. Skipping!
Property 'keyphrases' already exists in node '4f2765'. Skipping!
Property 'keyphrases' already exists in node '4a4777'. Skipping!
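To make the splitting step more concrete, here is a toy illustration of cutting a markdown document into chunks at known headlines. It is not the ragas implementation: `split_by_headlines` is a hypothetical helper, and it ignores the `min_tokens` and `max_tokens` limits that `HeadlineSplitter` accepts.

```python
def split_by_headlines(text: str, headlines: list[str]) -> list[str]:
    """Split text into chunks, starting a new chunk at each matching headline line."""
    wanted = {h.strip() for h in headlines}
    chunks: list[list[str]] = [[]]
    for line in text.splitlines():
        # Strip leading markdown heading markers before comparing.
        title = line.lstrip("# ").strip()
        # Only open a new chunk if the current one already has content.
        if title in wanted and chunks[-1]:
            chunks.append([])
        chunks[-1].append(line)
    return ["\n".join(chunk) for chunk in chunks if chunk]


doc = "# Intro\nhello\n## Hiring\npolicy\nmore"
for chunk in split_by_headlines(doc, ["Intro", "Hiring"]):
    print(repr(chunk))
```

Each resulting chunk corresponds roughly to what becomes a separate node after splitting, which is why the keyphrase extractor can then attach keyphrases per chunk.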
Configure personas
You can also do this automatically with the automatic persona generator.
from ragas.testset.persona import Persona
persona1 = Persona(
    name="gitlab employee",
    role_description="A junior gitlab employee curious on workings on gitlab",
)
persona2 = Persona(
    name="Hiring manager at gitlab",
    role_description="A hiring manager at gitlab trying to underestand hiring policies in gitlab",
)
persona_list = [persona1, persona2]
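Single-hop queries
Subclass SingleHopQuerySynthesizer and override the function that generates the scenarios used to create queries.
Steps:
- Find the qualified set of nodes for query creation. Here I select all nodes that have keyphrases extracted.
- For each qualified set:
    - Match the keyphrases with one or more personas.
    - Create all possible combinations of (node, persona, query style, query length).
    - Sample the required number of queries from those combinations.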
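The combination and sampling steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the ragas implementation: `ToyScenario`, `build_combinations`, and `sample_scenarios` are hypothetical names, and the hard-coded `persona_terms` mapping stands in for the persona matching that the LLM performs.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyScenario:
    node_id: str
    persona: str
    term: str
    style: str
    length: str


def build_combinations(node_id, persona_terms, styles, lengths):
    """Create every (node, persona, term, style, length) combination for one node."""
    combos = []
    for persona, terms in persona_terms.items():
        for term, style, length in itertools.product(terms, styles, lengths):
            combos.append(ToyScenario(node_id, persona, term, style, length))
    return combos


def sample_scenarios(combos, k, seed=0):
    """Draw at most k scenarios without replacement."""
    rng = random.Random(seed)
    return rng.sample(combos, min(k, len(combos)))


# Assumed matching result: which keyphrases each persona cares about.
persona_terms = {
    "gitlab employee": ["ally"],
    "Hiring manager at gitlab": ["ally", "hiring"],
}
combos = build_combinations(
    "node-1", persona_terms, ["web search", "formal"], ["short", "long"]
)
picked = sample_scenarios(combos, k=3)
print(len(combos), len(picked))  # 12 3
```

The number of combinations grows multiplicatively with terms, styles, and lengths, so sampling a fixed number per node keeps the final scenario count close to the requested n.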
from dataclasses import dataclass

from ragas.testset.synthesizers.prompts import (
    ThemesPersonasInput,
    ThemesPersonasMatchingPrompt,
)
from ragas.testset.synthesizers.single_hop import (
    SingleHopQuerySynthesizer,
    SingleHopScenario,
)


@dataclass
class MySingleHopScenario(SingleHopQuerySynthesizer):
    theme_persona_matching_prompt = ThemesPersonasMatchingPrompt()

    async def _generate_scenarios(self, n, knowledge_graph, persona_list, callbacks):
        property_name = "keyphrases"

        # Qualified nodes: chunks that already have keyphrases extracted.
        nodes = []
        for node in knowledge_graph.nodes:
            if node.type.name == "CHUNK" and node.get_property(property_name):
                nodes.append(node)
        if not nodes:
            # Avoid dividing by zero below when no node qualifies.
            raise ValueError(f"No CHUNK nodes with property '{property_name}' found.")

        number_of_samples_per_node = max(1, n // len(nodes))

        scenarios = []
        for node in nodes:
            if len(scenarios) >= n:
                break
            themes = node.properties.get(property_name, [""])
            # Ask the LLM which keyphrases are relevant to which persona.
            prompt_input = ThemesPersonasInput(themes=themes, personas=persona_list)
            persona_concepts = await self.theme_persona_matching_prompt.generate(
                data=prompt_input, llm=self.llm, callbacks=callbacks
            )
            # Build all combinations for this node, then sample from them.
            base_scenarios = self.prepare_combinations(
                node,
                themes,
                personas=persona_list,
                persona_concepts=persona_concepts.mapping,
            )
            scenarios.extend(
                self.sample_combinations(base_scenarios, number_of_samples_per_node)
            )

        return scenarios
query = MySingleHopScenario(llm=llm)
scenarios = await query.generate_scenarios(
    n=5, knowledge_graph=kg, persona_list=persona_list
)
scenarios[0]
SingleHopScenario(
nodes=1
term=what is an ally
persona=name='Hiring manager at gitlab' role_description='A hiring manager at gitlab trying to underestand hiring policies in gitlab'
style=Web search like queries
length=long)
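Modify the prompt to customize the query style
Here I replace the default prompt with an instruction that generates only Yes/No questions. This step is optional.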
instruction = """Generate a Yes/No query and answer based on the specified conditions (persona, term, style, length)
and the provided context. Ensure the answer is entirely faithful to the context, using only the information
directly from the provided context.
### Instructions:
1. **Generate a Yes/No Query**: Based on the context, persona, term, style, and length, create a question
that aligns with the persona's perspective, incorporates the term, and can be answered with 'Yes' or 'No'.
2. **Generate an Answer**: Using only the content from the provided context, provide a 'Yes' or 'No' answer
to the query. Do not add any information not included in or inferable from the context."""
prompt = query.get_prompts()["generate_query_reference_prompt"]
prompt.instruction = instruction
query.set_prompts(**{"generate_query_reference_prompt": prompt})
result = await query.generate_sample(scenario=scenarios[-1])
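After changing the instruction, it may be worth checking that a generated answer really is a Yes/No answer. The sketch below is a minimal check on a plain string; `normalize_yes_no` is a hypothetical helper, and passing it something like `result.reference` assumes your ragas version exposes the generated answer under that attribute.

```python
import re
from typing import Optional

# Matches a leading "yes" or "no" as a whole word, ignoring case and leading spaces.
_YES_NO = re.compile(r"\s*(yes|no)\b", re.IGNORECASE)


def normalize_yes_no(answer: str) -> Optional[str]:
    """Return 'yes' or 'no' if the answer starts with that word, otherwise None."""
    match = _YES_NO.match(answer)
    return match.group(1).lower() if match else None
```

A return value of None suggests the model drifted from the Yes/No format, in which case the instruction may need tightening.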