After graphRAG: A Comparative Introduction to triplex, itext2KG, and nano-graphRAG

Published: 2024-09-14 14:34:25

Author: 许同学说

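Introduction

Picking up where the last post left off: since Microsoft released graphRAG, RAG has split into two roads. One keeps explicit retrieval as its foundation and improves recall and long-term-memory modules to get as close as possible to summarizing and connecting information. The other uses a large model to extract KG tuples and their relations to improve performance on QFS (Query-Focused Summarization) tasks. The former is still flourishing, because the technical difficulty is lower and the road is fairly well paved; every month brings a dozen or so RAG methods that I have not read and do not want to read. The two I have found most interesting recently are HybridRAG, which mixes graph and vector retrieval, and MemoRAG, which augments retrieval with long-term memory. Both provide open-source code and models and are worth trying. The latter road currently consists of improvements on graphRAG, and I feel these are mostly about cutting cost and raising efficiency: graphRAG is simple to install, but the development effort is too large and the cost is frightening. For details, see the summary at the end of my previous post, an experiment log comparing a knowledge base built on graphRAG and gpt4omini with today's mainstream RAG. This post looks at some interesting projects mainly from that angle.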

INSPIRATION

nano-graphRAG

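Open-source repo: https://github.com/gusye1234/nano-graphrag

nano architecture design

This project claims to slim Microsoft's graphRAG down to only about 800 lines of code while keeping the core functionality and improving speed. Judging by the benchmark data the author provides, it does seem to work fairly well. Installation works the same way as Microsoft's: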

# version 1: Install from PyPi
pip install nano-graphrag

# version 2: Install from source
git clone https://github.com/gusye1234/nano-graphrag.git
cd nano-graphrag
pip install -e .

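I personally recommend installing from source, since the project seems to be updated very frequently. While I was writing up this module I went to check the git README and found that a pile of new things had appeared. According to the future roadmap, the author wants to integrate HybridRAG, so the project creates not only a GraphStorage but also a KVStorage:

The entry point here is essentially insert. The code has a Naive_rag parameter; when it is true, the traditional retrieval mode is enabled, meaning that besides extracting the Entities and Edges from each chunk, a chunks_vdb vector store is also kept. In addition, as the figure above shows, it uses asyncio through always_get_an_event_loop() to schedule and run the asynchronous tasks, which keeps things efficient.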
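To make the sync-entry-point-over-async-work idea concrete, here is a minimal sketch of the general pattern, not nano-graphrag's actual code. The names extract_entities, ainsert, and max_async are hypothetical, the "extraction" is a word count standing in for an LLM call, and the sketch creates its own event loop rather than reusing one the way always_get_an_event_loop() presumably does.

```python
import asyncio


async def extract_entities(chunk: str, limiter: asyncio.Semaphore) -> int:
    """Hypothetical stand-in for one async LLM extraction call on a chunk."""
    async with limiter:
        await asyncio.sleep(0)  # placeholder for the network round trip
        return len(chunk.split())  # pretend each word is an entity


async def ainsert(chunks: list[str], max_async: int = 16) -> list[int]:
    """Process all chunks concurrently, at most `max_async` at a time."""
    limiter = asyncio.Semaphore(max_async)
    tasks = [extract_entities(chunk, limiter) for chunk in chunks]
    return await asyncio.gather(*tasks)


def insert(chunks: list[str]) -> list[int]:
    """Synchronous entry point that drives the async pipeline on its own loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(ainsert(chunks))
    finally:
        loop.close()


counts = insert(["Cao Cao built Tongque Tai", "Da Qiao married Sun Ce"])
```

The semaphore is what caps the number of in-flight requests; the caller never has to touch asyncio directly.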

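For the rest of the write path, leaving aside the vector-related logic added recently, the main piece is the NetworkXStorage class. It can be thought of as the graph database, and it covers a series of operations such as the clustering algorithm over entities and relations and the report-generation logic:

As for the code behind graphRAG's Local Search and Global Search capabilities, see the _op.py file. There is too much code and I have not finished reading it; I drew the overall functionality roughly as the figure below:

When I have time, and when I need to build on top of it, I will come back and study it. Setting the individual functions aside, the official example call is very simple: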

from nano_graphrag import GraphRAG, QueryParam

graph_func = GraphRAG(working_dir="./dickens")

with open("./book.txt") as f:
    graph_func.insert(f.read())

# Perform global graphrag search
print(graph_func.query("What are the top themes in this story?"))

# Perform local graphrag search (I think is better and more scalable one)
print(graph_func.query("What are the top themes in this story?", param=QueryParam(mode="local")))

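What I most want to explain is the Leiden algorithm. Its flow is:

According to Wikipedia, the Leiden algorithm is a community detection algorithm proposed to address some limitations of the Louvain algorithm regarding community connectivity. Louvain can produce poorly connected communities, and even disconnected ones. Leiden introduces refinements that guarantee the resulting communities are internally connected, and it provides some explicit guarantees and bounds over the iterations. In nano, the code for this is roughly: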

def _cluster_data_to_subgraphs(self, cluster_data: dict[str, list[dict[str, str]]]):
    for node_id, clusters in cluster_data.items():
        self._graph.nodes[node_id]["clusters"] = json.dumps(clusters)

async def _leiden_clustering(self):
    from graspologic.partition import hierarchical_leiden

    graph = NetworkXStorage.stable_largest_connected_component(self._graph)
    community_mapping = hierarchical_leiden(
        graph,
        max_cluster_size=self.global_config["max_graph_cluster_size"],
        random_seed=self.global_config["graph_cluster_seed"],
    )

    node_communities: dict[str, list[dict[str, str]]] = defaultdict(list)
    __levels = defaultdict(set)
    for partition in community_mapping:
        level_key = partition.level
        cluster_id = partition.cluster
        node_communities[partition.node].append(
            {"level": level_key, "cluster": cluster_id}
        )
        __levels[level_key].add(cluster_id)
    node_communities = dict(node_communities)
    __levels = {k: len(v) for k, v in __levels.items()}
    logger.info(f"Each level has communities: {dict(__levels)}")
    self._cluster_data_to_subgraphs(node_communities)
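To see what the bookkeeping loop above produces, here is a standalone sketch that runs the same grouping on hand-made records. The Partition tuple and the sample data are hypothetical stand-ins for what hierarchical_leiden returns; only the node, cluster, and level fields used in the code above are modeled.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for one record of the hierarchical clustering result.
Partition = namedtuple("Partition", ["node", "cluster", "level"])


def group_partitions(partitions):
    """Collect, per node, the cluster it belongs to at every level,
    and count how many distinct clusters exist at each level."""
    node_communities = defaultdict(list)
    levels = defaultdict(set)
    for p in partitions:
        node_communities[p.node].append({"level": p.level, "cluster": p.cluster})
        levels[p.level].add(p.cluster)
    per_level = {level: len(ids) for level, ids in levels.items()}
    return dict(node_communities), per_level


sample = [
    Partition("CAO CAO", 0, 0),
    Partition("DA QIAO", 0, 0),
    Partition("SUN CE", 1, 0),
    Partition("CAO CAO", 2, 1),  # finer-grained cluster at the next level
    Partition("DA QIAO", 3, 1),
]
node_communities, communities_per_level = group_partitions(sample)
```

Each node ends up with one entry per hierarchy level, which is the list that gets serialized into the node's "clusters" attribute.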

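One issue here: when there are many clustered sub-communities, the LLM's input tokens are limited. As far as I can tell this part is not handled specially; the input is simply truncated, which will affect the results to some degree.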
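The concern can be illustrated with a small sketch. It contrasts plain truncation with splitting the reports into batches so nothing is dropped. Both functions are hypothetical helpers written for this post, not nano-graphrag functions, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
def count_words(text):
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def truncate_by_token_budget(reports, max_tokens, count_tokens=count_words):
    """Keep reports in order until the budget is hit; the rest are dropped."""
    kept, used = [], 0
    for report in reports:
        cost = count_tokens(report)
        if used + cost > max_tokens:
            break
        kept.append(report)
        used += cost
    return kept


def batch_by_token_budget(reports, max_tokens, count_tokens=count_words):
    """Split reports into consecutive batches that each fit the budget."""
    batches, current, used = [], [], 0
    for report in reports:
        cost = count_tokens(report)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(report)
        used += cost
    if current:
        batches.append(current)
    return batches


reports = ["a b c", "d e", "f g h i"]
truncated = truncate_by_token_budget(reports, max_tokens=5)
batched = batch_by_token_budget(reports, max_tokens=5)
```

With truncation the third report never reaches the model; with batching it costs one more LLM call instead.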

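nano comparison experiment

Finally, I had intended to run a comparison against the original implementation as in the previous post, but the author seems to have done that already and I have no particular need, so I will use the author's benchmark for illustration. The experimental environment is: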

Both GraphRAG and nano-graphrag use OpenAI Embedding and gpt-4o.
Not Cache for both.
On the same device and network connection.
GraphRAG Max Async API request: 25
nano-graphrag Max Async API request: 16

The Local Search results look mostly fine, but the Global Search answers diverge somewhat. The question is:

"The relationship between Da Qiao and Cao Cao"

graphRAG's result:

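### The relationship between Da Qiao and Cao Cao

Da Qiao (大乔) is the daughter of Qiao Gong and is known for her extraordinary beauty. This trait makes her an important figure in the community, especially against the background of Cao Cao's (曹操) desire for her [Data: Reports (41)].

#### Cao Cao's plan

Cao Cao plans to settle Da Qiao in Tongque Tai (铜雀台) after unifying Jiangnan. Tongque Tai is a grand terrace that Cao Cao built on the Zhang River in Ye County, symbolizing good fortune and celebration. Cao Cao intends to house beautiful women, Da Qiao included, in Tongque Tai [Data: Reports (269)]. This plan not only reflects Cao Cao's ambition and vision but also shows his particular attention to Da Qiao.

#### The relationship between Qiao Gong and Cao Cao

Qiao Gong is Da Qiao's father and also a friend of Cao Cao. This connection not only highlights Qiao Gong's influence but also suggests that Cao Cao's desire for Da Qiao could create potential leverage or conflict [Data: Reports (41)].

### Conclusion

In summary, the relationship between Da Qiao and Cao Cao is mainly reflected in Cao Cao's desire for her and his intention to settle her in Tongque Tai. In addition, the friendship between Qiao Gong and Cao Cao adds complexity to this relationship and may trigger further dynamics in the future [Data: Reports (41, 269)].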

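nano-graphRAG's result:

## The relationship between Cao Cao and Da Qiao

Cao Cao showed significant political and strategic interest in Da Qiao, partly because of her beauty and partly because of her strategic importance. In that period, beauty was not merely a matter of appearance; it was often tied to political resources and strategic power. Cao Cao believed that possessing Da Qiao would help his political ambitions and add leverage to the expansion of his power [Data: Reports (6)].

As the wife of Sun Ce, and given her status and beauty, Da Qiao became a key target in the power dynamics. Cao Cao valued Da Qiao not only for her personal appeal but even more for the symbol of power she represented. If he could bring Da Qiao into his own sphere of influence, Cao Cao might weaken the opposing forces in this way while strengthening his own influence [Data: Reports (6)].

In short, the relationship between Cao Cao and Da Qiao does not stop at the personal level; at a deeper level it is an expression of political strategy and power play. Cao Cao's interest in her reflects his shrewdness and ambition in the struggle for power.

So, was Cao Cao's interest in Da Qiao really political and strategic? Put differently, is Da Qiao some kind of strategic resource or strategic figure? Emmm.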

triplex

cookbook	https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph
blog	https://www.sciphi.ai/blog/triplex
⛳Demo	kg.sciphi.ai

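triplex introduction

If the previous section was about distilling the core logic of a huge project framework into nano, this section is about cutting cost. In nano-graphRAG's future plans, the author brings up triplex: "Add Sciphi Triplex as the entity extraction model." LLM frameworks such as fastChat and camel also have issues asking to add triplex, and so does Microsoft's graphRAG. I even saw the triplex author personally go to the graphRAG repo and open a feature request, only to have it shot down on the spot, emmm:

So what kind of approach is triplex? The figure below gives the idea:

triplex, in full the SciPhi Triplex model, is an open-source large language model focused on extracting knowledge graphs from unstructured data. Its name comes from its ability to extract entities, relationships, and entity types from text. The model is fine-tuned from a 3.8-billion-parameter language model (Phi-3) and aims to create high-quality knowledge graphs. The table below compares it with graphRAG: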

Advantage	Triplex	Microsoft GraphRAG
Cost	10x cheaper	High cost due to token generation
Performance	Comparable to GPT-4, better efficiency	High resource consumption
Flexibility	Supports diverse and complex inputs	Limited by high cost
Ease of Use	Immediate usage with R2R RAG engine and Neo4J	Complex and resource-intensive process
Training Data	Extensive, from DBPedia, Wikidata, synthetic datasets

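By this comparison it does look very strong, so I set out to test it.

triplex deployment

I did not go with an openai_key and a cloud LLM here, because mine had just expired and I did not feel like topping up for now, so I called a local deployment instead. The official docs recommend two ways: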

Docker：Install and run R2R with Docker
Local：Install and run R2R directly on your local system (not recommended).

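Although the docs advise against local, I looked at the toml config and the related services did not seem too many. Considering certain well-known issues with docker image registries, I went my own way anyway, and along the way I learned from the SciPhi-AI repo how to set up the neo4j image. In my earlier post, an experiment log on RAG enhanced with a knowledge graph and a document tree, I mentioned that neo4j.conf had to be changed inside the running container, but it also works to pass all the parameters directly at run time: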

docker run -d \
  --name neo4j \
  --restart unless-stopped \
  -p 7474:7474 \
  -p 7687:7687 \
  -v $(pwd)/neo4j/conf:/conf \
  -v $(pwd)/neo4j/data:/data \
  -v $(pwd)/neo4j/import:/import \
  -v $(pwd)/neo4j/logs:/logs \
  -v $(pwd)/neo4j/plugins:/plugins \
  -e NEO4J_dbms_memory_pagecache_size=1G \
  -e NEO4J_dbms_memory_heap_initial__size=1G \
  -e NEO4J_dbms_memory_heap_max__size=1G \
  -e NEO4J_AUTH=none \
  -e NEO4J_apoc_export_file_enabled=true \
  -e NEO4J_apoc_import_file_enabled=true \
  -e NEO4J_apoc_import_file_use__neo4j__config=true \
  -e NEO4JLABS_PLUGINS='["graph-data-science", "apoc"]' \
  -e NEO4J_dbms_security_procedures_allowlist="gds.*,apoc.*" \
  -e NEO4J_dbms_security_procedures_unrestricted="gds.*,apoc.*" \
  -e NEO4J_dbms_connector_bolt_listen__address=0.0.0.0:7687 \
  neo4j:4.4.5-community

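But that was only one of the hurdles with local R2R. The bigger disaster is that the official docs are vague about the local setup. Once I found that hardly any issues even dealt with running it as a local service, I went looking for a docker registry mirror that currently works and switched to a fully containerized setup, which kicked off the following pulling: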

R2R now runs on port 7272 by default!
Pulling Docker images...
[+] Pulling 91/39
 ✔ setup-token Skipped - Image is already being pulled by hatchet-setup-config                                                                                        0.0s
 ✔ hatchet-api Pulled                                                                                                                                               742.2s
 ⠧ neo4j [⣿⣿⣿⣿⡀⠀] Pulling                                                                                                                                          1168.8s
 ⠧ hatchet-engine [⣿⣿⣿⡀]  67.5MB / 83.69MB Pulling                                                                                                                 1168.8s
 ✔ hatchet-setup-config Pulled                                                                                                                                     1055.4s
 ✔ hatchet-migration Pulled                                                                                                                                         977.5s
 ✔ hatchet-rabbitmq Pulled                                                                                                                                         1014.5s
 ✔ traefik Pulled                                                                                                                                                  1087.8s
 ✔ hatchet-dashboard Pulled                                                                                                                                         135.0s
 ⠧ r2r [⣿⣿⣿⣿⣿⣿⣿⠀⣿⣿⣿⣿⣿] 466.1MB / 3.307GB Pulling                                                                                                                   1168.8s
 ✔ postgres Pulled                                                                                                                                                 1062.0s
 ✔ r2r-dashboard Pulled                                                                                                                                            1077.8s

If all 12 containers above report healthy, the ollama model service still needs to be set up:

  ollama pull sciphi/triplex
  ollama pull llama3.1
  ollama pull mxbai-embed-large
  ollama serve

If ollama is not on the local IP or not on port 11434, add an environment variable when starting the r2r image:

OLLAMA_API_BASE=http://your_host:11434 r2r serve --docker --config-path=/xxx/local_llm_neo4j_kg.toml

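triplex usage

If the whole stack starts normally, ports 7273 and 7274 on the server serve the web UIs. The email verification on 7274 seemed to have a problem with my NetEase mailbox, but 7273 was fine. As the figure below shows, it is a web interface that combines service monitoring, LLM chat, log collection, and a view of the configuration:

All the prompts used across the pipeline can also be viewed and edited here:

For real use you still need code or the terminal, and the official docs cover both. I feel the former suits customization better. My script, for example: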

from r2r import R2RClient

client = R2RClient("http://192.168.8.176:7272")
health_response = client.health()
client.login("admin@example.com", "change_me_immediately")

file_paths = ['1.txt', '2.txt']
# metadatas = [{'key1': 'value1'}, {'key2': 'value2'}]

# ingest_response = client.ingest_files(
#     file_paths=file_paths,
    # metadatas=metadatas,
    # optionally override chunking settings at runtime
    # chunking_config={
    #     "strategy": "auto",
    #     "chunking_strategy": "by_title",
    #     "new_after_n_chars": 256, # soft maximum
    #     "max_characters": 512, # hard maximum
    #     "combine_under_n_chars": 64, # hard minimum
    #     "overlap": 100,
    # },
# )

documents_overview = client.documents_overview()
print(documents_overview)

# document_id = "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"
# chunks = client.document_chunks(document_id)
# print(len(chunks))

# search_response = client.search("What was Uber's profit in 2020?")
# print(search_response)

hybrid_search_response = client.search(
    "What was Uber's profit in 2020?",
    vector_search_settings={
        "use_hybrid_search": True,
        "search_limit": 20,
        "hybrid_search_settings": {
            "full_text_weight": 1.0,
            "semantic_weight": 10.0,
            "full_text_limit": 200,
            "rrf_k": 25,
        },
    }
)

kg_search_response = client.search(
    "What is airbnb",
    vector_search_settings={"use_vector_search": False},
    kg_search_settings={
      "use_kg_search": True,
      "kg_search_type": "local",
      "kg_search_level": "0",
      "kg_search_generation_config": {
          "model": "gpt-4o-mini",
          "temperature": 0.7,
      },
      "local_search_limits": {
          "__Entity__": 20,
          "__Relationship__": 20,
          "__Community__": 20,
      },
      "max_community_description_length": 65536,
      "max_llm_queries_for_global_search": 250
    }
)
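The hybrid_search_settings in the script (full_text_weight, semantic_weight, rrf_k) look like the knobs of a weighted reciprocal rank fusion. The sketch below shows that technique in isolation. Whether R2R computes the score in exactly this way is an assumption, and weighted_rrf and the document ids are made up for illustration.

```python
def weighted_rrf(full_text_ranked, semantic_ranked,
                 full_text_weight=1.0, semantic_weight=10.0, rrf_k=25):
    """Fuse two ranked lists of document ids with weighted reciprocal rank fusion.

    Each list contributes weight / (rrf_k + rank) to a document's score,
    where rank starts at 1 for the top hit.
    """
    scores = {}
    weighted_lists = (
        (full_text_weight, full_text_ranked),
        (semantic_weight, semantic_ranked),
    )
    for weight, ranking in weighted_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


fused = weighted_rrf(
    full_text_ranked=["a", "b", "c"],
    semantic_ranked=["c", "a", "d"],
)
```

With semantic_weight ten times full_text_weight, as in the script, the semantic ranking dominates: a document that only the full-text search found ("b") ends up last.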

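triplex supports multiple text files as input: ingest_files feeds them straight into its whole pipeline, but the process is quite slow. I copied the official sample aristotle.txt twice, as 1.txt and 2.txt, but my A4000 card does not have much compute. The docs say a single file takes under 10 s, which is presumably when calling with an openai key; with a local llama3.1 on my hardware, multiply that by about 20, emmm.

After that I followed the API examples in its docs and added calls step by step. The hybrid_search_response experiment seemed to work without problems, but kg_search_response kept failing, with an exception telling me to run KG enrichment before kg search. I could not find a Python interface for that in the docs, only the command line, and the command line had problems too: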

r2r create-graph --document-ids=9fbe403b-c11c-5aae-8ade-ef22980c3ad1
"""
Time taken: 0.24 seconds
{
  "results": {
    "message": "Graph creation task queued successfully. Please check http://<your-hatchet-gui-url> for completion status.",
    "task_id": "965a5ef4-2476-4f56-b98a-93459f34cb9a"
  }
}
"""

r2r inspect-knowledge-graph
"""
Time taken: 0.23 seconds

== Graph Statistics ==
Number of nodes: 0
Number of edges: 0
Number of connected components: 0

== Most Central Nodes ==

"""

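So my final verdict: the docs have a lot of room for improvement, the pipeline is overcomplicated, and there are too many auxiliary services. For future attempts I suggest using only the model SciPhi published on huggingface.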

itext2KG

Paper	iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models
Link	https://arxiv.org/abs/2409.03284
⛳Official repo	https://github.com/AuvaLab/itext2kg

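itext2KG approach

According to the official docs, triplex from the previous section can match graphRAG's results, but its model is not zero-shot. Running it locally without the pipeline r2r provides, writing your own and getting good results, would also take a fair amount of work, I think. itext2KG, introduced in this section, is an approach that builds a consistent knowledge graph from raw documents without a predefined ontology or large amounts of supervised learning.

The paper lists these current problems:

Most available data is unstructured, which makes it challenging to extract valuable information.
Traditional natural language processing (NLP) methods such as named entity recognition and relation extraction are essential to information retrieval but have limitations, including reliance on predefined entity types and the need for supervised learning.
Current research uses the capabilities of large language models, but unresolved problems remain in entity and relation resolution, such as semantic duplicates and inconsistent graphs, which call for extensive post-processing.
Many existing methods are topic-dependent, which limits how well they generalize across domains.

The paper's solution:

Automatically extract and build knowledge graphs from large amounts of natural language text, structuring the data and making it accessible.
Provide a method for building knowledge graphs incrementally with no post-processing; the method is topic-independent and applies across multiple domains.
Improve the quality of entity and relation resolution and reduce redundancy and ambiguity through four modules (Document Distiller, Incremental Entities Extractor, Incremental Relations Extractor, and Graph Integrator).
Use the zero-shot capability of large language models to provide a plug-and-play solution that reduces the need for extensive training or fine-tuning.

The interesting part here is modules 1 to 3, which effectively define a set of rules, or offer a paradigm. I happen to want to build something in this direction.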

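Document Distiller

The Document Distiller module is a key component of the iText2KG method. It converts raw documents into structured semantic blocks for the subsequent knowledge graph construction. The module uses a large language model to extract the relevant information from a document according to a user-supplied JSON schema (the blueprint). This not only makes the information better organized but also gives the graph construction clear guidance.

What I mainly wanted to look at is the Schema specification under the Distiller module:

The effect is to steer the language model toward specific categories while keeping flexibility for the others. Its main goals are:

(a) improve the signal-to-noise ratio by reducing noise that could pollute the graph with redundant information;
(b) use the schema to guide the graph construction process, especially for concept keys.

Point (a) needs no explanation. To illustrate point (b) with an example: suppose we have a scientific article and want to build a knowledge graph that includes its title, authors, publication year, keywords, and so on. We can define the following JSON schema: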

{
  "title": "",
  "authors": [],
  "publication_year": "",
  "keywords": []
}

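When the Document Distiller module processes this article, it uses a large language model to recognize the content and fill in the JSON schema. Drawing on what it learned in training, the model identifies the text in the document associated with each concept key and fills it into the corresponding slot. For example, it might extract the title "Deep Learning Approaches" from the document and place it under the title key.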
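As a rough illustration of "filling the blueprint", the sketch below merges an LLM reply into the schema: keys outside the schema are discarded and anything missing keeps its empty placeholder. fill_schema and the sample reply are hypothetical and written for this post; this is not the itext2kg implementation, and the LLM call itself is left out.

```python
import copy
import json

SCHEMA = {
    "title": "",
    "authors": [],
    "publication_year": "",
    "keywords": [],
}


def fill_schema(schema, llm_json):
    """Return a copy of `schema` filled from the model's JSON reply.

    Only keys defined in the schema are kept, a value must have the same
    type as the schema default, and missing values stay empty.
    """
    extracted = json.loads(llm_json)
    filled = copy.deepcopy(schema)
    for key, default in schema.items():
        value = extracted.get(key)
        if isinstance(value, type(default)) and value:
            filled[key] = value
    return filled


# Made-up model reply: one extra key ("venue") and two schema keys missing.
llm_reply = (
    '{"title": "Deep Learning Approaches", '
    '"authors": ["A. Author"], "venue": "unknown"}'
)
article = fill_schema(SCHEMA, llm_reply)
```

Dropping the unexpected "venue" key is goal (a) in miniature: information outside the blueprint never reaches the graph.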

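Incremental Entities Extractor and Incremental Relations Extractor

The authors summarize the process with two algorithm flowcharts:

As the overall four-module flowchart at the start of this section shows, the two are very similar; the pipeline only gains a few steps where module 3 hands over to module 4. The Incremental Entities Extractor works as follows:

Global entity set initialization: first, entities are extracted from the first document (or semantic block) with a large language model (LLM) to form the global entity set. These entities are assumed to be distinct from one another.
Local entity extraction: for each subsequent document, a local entity set is extracted.
Entity matching: the local entities are matched against the global entity set. If a matching entity is found, it is added to the matched set. If no match is found but the local entity is similar enough to some entity in the global set (based on a predefined similarity threshold), the most similar global entity is added to the matched set.
Global entity set update: the matched set is merged into the global entity set.
Iteration: the steps above are repeated until all documents have been processed, yielding a comprehensive global entity set.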
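The steps above can be sketched in a few lines. This is a simplified illustration, not the paper's code: string similarity from difflib stands in for the embedding-based similarity, the LLM extraction step is replaced by ready-made entity lists, and treating an entity with no sufficiently similar counterpart as a new entity is an assumption of this sketch. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for embedding cosine similarity (an assumption of this sketch)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_entities(local_entities, global_entities, threshold=0.8):
    """Map each local entity to an exact match, a similar global entity, or itself."""
    matched = []
    for entity in local_entities:
        if entity in global_entities:
            matched.append(entity)  # exact match
            continue
        best = max(global_entities, key=lambda g: similarity(entity, g), default=None)
        if best is not None and similarity(entity, best) >= threshold:
            matched.append(best)  # reuse the most similar global entity
        else:
            matched.append(entity)  # assumption: treat as a new entity
    return matched


def build_global_entities(entities_per_document, threshold=0.8):
    """Process documents in order, merging each matched set into the global set."""
    global_entities = []
    for local_entities in entities_per_document:
        for entity in match_entities(local_entities, global_entities, threshold):
            if entity not in global_entities:
                global_entities.append(entity)
    return global_entities


documents = [
    ["Cao Cao", "Da Qiao"],      # first document seeds the global set
    ["cao cao", "Sun Ce"],       # "cao cao" resolves to "Cao Cao"
    ["Da Qiao", "Tongque Tai"],
]
global_entities = build_global_entities(documents)
```

The threshold decides how aggressively near-duplicates are merged: set it too low and distinct entities collapse into one, too high and the semantic duplicates the paper wants to avoid come back.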

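itext2KG usage and comparison

After presenting the whole framework, the authors move on to comparison experiments on test sets of résumés, scientific articles, and websites. I will not go through the table data here; the main visualization is in the figure below, and I think the results look decent:

Installing and using itext2KG is also very simple. Like graphRAG it only needs a pip install. To use it, first set up the DocumentDistiller: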


from itext2kg import DocumentDistiller
# You can define a schema or upload some predefined ones.
from itext2kg.utils import Article

# Define your OpenAI API key.
OPENAI_API_KEY = "####"

# Initialize the DocumentDistiller with the OpenAI API key.
document_distiller = DocumentDistiller(openai_api_key=OPENAI_API_KEY)

# List of documents to be distilled.
documents = ["doc1", "doc2", "doc3"]

# Information extraction query.
IE_query = '''
# DIRECTIVES : 
- Act like an experienced information extractor. 
- You have a chunk of a scientific paper.
- If you do not find the right information, keep its place empty.
'''

# Distill the documents using the defined query and output data structure.
distilled_doc = document_distiller.distill(documents=documents, IE_query=IE_query, output_data_structure=Article)

Then run the extraction:

from itext2kg import iText2KG

# Initialize iText2KG with the OpenAI API key.
itext2kg = iText2KG(openai_api_key=OPENAI_API_KEY)

# Format the distilled document into semantic sections.
semantic_blocks = [f"{key} - {value}".replace("{", "[").replace("}", "]") for key, value in distilled_doc.items()]

# Build the knowledge graph using the semantic sections.
global_ent, global_rel = itext2kg.build_graph(sections=semantic_blocks)

Finally, visualize:

from itext2kg.graph_integration import GraphIntegrator

URI = "bolt://localhost:####"
USERNAME = "####"
PASSWORD = "####"


new_graph = {}
new_graph["nodes"] = global_ent
new_graph["relationships"] = global_rel

GraphIntegrator(uri=URI, username=USERNAME, password=PASSWORD).visualize_graph(json_graph=new_graph)

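That is as far as I got. Anyone with an openai key can give it a try. In the very first issue the author says that support for LLM APIs available through langchain, such as chatglm, may be added in the future, but for now it is GPT-centric. So I have not run the code above; I only copied it from the github repo. The idea behind it is still very worthwhile.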
