GPT-4o Mini与GraphRAG打造‘病案信息学’知识图谱

发布日期：2024-07-21 10:37:45 浏览次数： 3384

前情提要

微软发布了RAG知识库方案 GraphRAG，目前已获得 12.1K 的 Star！RAG 技术可以显著提升大模型生成内容的质量和实用性。
OpenAI 突然推出了新模型 GPT-4o mini，其价格相比 GPT-3.5 Turbo 降低了 **60%**！

本项目采用了 GraphRAG 与 Neo4j 构建知识图谱。其中，选择的 LLM 是 GPT-4o mini，而 Embedding 模型则使用了 text-embedding-ada-002。

如需快速启动 GraphRAG，请访问官方链接：GraphRAG 快速启动

建议使用 Python 3.11

# 安装 graphrag 库
pip install graphrag

# 创建一个文件夹，用于存放需要进行 RAG 的知识库文件
mkdir -p ./ragtest/input

# 微软官方推荐使用狄更斯的《圣诞颂歌》（约 189K），但我们可以选择现代网络小说，以便更容易获取到实体。
# 目前只支持 utf-8 格式的 txt 文件和 csv 文件。
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt

# 初始化项目后，会自动创建 .env 和 .settings.yaml 两个配置文件。
# 在 .env 中填入 API 密钥。
# 在 settings.yaml 中修改 LLM 的 model 和 api_base，以及 embeddings 的 model 和 api_base。
python -m graphrag.index --init --root ./ragtest

# 准备工作完成后，便可以启动索引器
python -m graphrag.index --root ./ragtest

# 使用全局范围搜索提出问题的示例：
python -m graphrag.query --root ./ragtest --method global "这本书主要讲了什么内容？"

# 使用局部搜索提出问题的示例：
python -m graphrag.query --root ./ragtest --method local "常见的主导词有哪些?"

# 注意：该步骤相当耗时，具体时间取决于数据量、所选模型及文件块的大小。

完成快速启动后，项目目录下将生成以下文件：

./ragtest/output/<timestamp>/artifacts# 包含一系列 parquet 文件

分别使用global、local模式进行提问：


> python -m graphrag.query --root ./ragtest --method global "这本书主要讲了什么内容？"
SUCCESS: Global Search Response: ## 书籍内容概述

这本书主要探讨了国际疾病分类（ICD）系统，特别是ICD-10和ICD-11在全球健康管理中的重要性。它强调了世界卫生组织（WHO）在这些分类系统的发展和维护中的关键角色，指出ICD-10的标准化对于公共卫生管理至关重要，使各国能够有效共享和比较健康数据 [Data: Reports (180)]。

### 肿瘤管理与医疗记录

书中还详细讨论了肿瘤的分类和治疗方法，强调了ICD-10在肿瘤管理中的重要性，以确保医疗记录的一致性和患者获得适当的护理 [Data: Reports (186)]。此外，书中提到ICD-9-CM-3和ICD-11的演变，讨论了这些分类系统在医疗诊断和程序中的基础性作用，以及它们如何影响医疗数据的管理和公共卫生策略 [Data: Reports (165, 191)]。

### 医疗信息管理与编码

书籍还涉及医疗信息管理，特别是与ICD-11分类系统相关的内容，强调了有效出版在传播医疗信息中的重要性 [Data: Reports (83)]。它探讨了外科手术技术和程序，包括泌尿科和男性生殖手术，强调了这些手术在治疗相关健康问题中的重要性 [Data: Reports (150, 162)]。

### 特定疾病与标准化

此外，书中讨论了特定医疗条件的分类，如脑出血和低血糖，强调了这些条件在ICD-10系统中的分类及其对母婴健康和患者管理的影响 [Data: Reports (182)]。书中还探讨了感染性疾病及其并发症的相互关系，强调了标准化疾病分类的重要性 [Data: Reports (78)]。

### 结论

总体而言，这本书提供了对ICD系统及其在全球健康管理中的应用的深入分析，强调了标准化和准确性在医疗记录和患者护理中的重要性。通过对不同疾病和治疗方法的分类，书籍为医疗专业人员提供了宝贵的参考资料，以改善医疗结果和公共卫生策略 [Data: Reports (10, 172, +more)]。

> python -m graphrag.query --root ./ragtest --method local "常见的主导词有哪些?"
SUCCESS: Local Search Response: ## 常见的主导词

在医学术语和手术操作的分类中，主导词的选择至关重要。主导词是指在查询中最重要的表达词汇，代表各类手法操作的核心内容。以下是一些常见的主导词及其相关信息：

### 1. 切开术
切开术是一个广泛使用的主导词，通常用于多种手法类型的分类。它可以作为其他手法的先行步骤，具有重要的治疗意义。与切开术相关的手法包括引流术、探查术、减压术等，这些手法在手术过程中相互关联，形成一个完整的治疗方案 [Data: Sources (297, 312); Entities (1031, 1247, 1248)]。

### 2. 主导词与修饰词
主导词下通常会有一级修饰词，用于进一步细化和描述主导词的含义。例如，主导词“主导词”下的一级修饰词可以是“一级修饰词”，这有助于在手术操作分类中提供更具体的信息 [Data: Relationships (1209, 1210)]。

### 3. 其他相关主导词
除了切开术，其他常见的主导词还包括“异常”、“损伤”、“白内障”等。这些词汇在医学名词标准化工作中被广泛使用，帮助医疗专业人员在进行编码和分类时保持一致性和准确性 [Data: Entities (985, 244, 481)]。

### 4. 主导词的选择原则
选择主导词时，需掌握一定的转换规则，以确保在查找手术操作编码时的准确性。主导词的选择不仅影响编码的准确性，还直接关系到医疗记录的完整性和医疗活动的规划 [Data: Sources (312, 21)]。

### 总结
主导词在医学术语和手术操作的分类中起着关键作用。通过合理选择和使用主导词，医疗专业人员能够更有效地进行手术操作的记录和分类，从而提高医疗服务的质量和效率。这些主导词的标准化和系统化是医学名词审定委员会和相关机构持续努力的方向 [Data: Entities (23, 22); Relationships (52, 56)]。

我使用的 txt 文件大小约为 289Kb，llm 和 embeddings 的费用总计约为￥4.3。而官方案例中使用 GPT-4 的费用大约为 $11。

将 Parquet 文件转换为 CSV 并导入到 Neo4j

1. 转换 .parquet 文件为 .csv 文件

下面是将 Parquet 文件转换为 CSV 文件的 Python 脚本：

import os
import pandas as pd
import csv

# 定义 Parquet 文件目录
parquet_dir = './ragtest/output/20240720-140425/artifacts'
csv_dir = 'neo4j-community-5.21.2/import'

# 清理和格式化字符串字段的函数
def clean_quotes(value):
    if isinstance(value, str):
        # 去除多余的引号和空白
        value = value.strip().replace('""', '"').replace('"', '')
        # 为包含逗号或引号的字段添加正确的引号
        if ',' in value or '"' in value:
            value = f'"{value}"'
    return value

# 转换所有 Parquet 文件为 CSV
for file_name in os.listdir(parquet_dir):
    if file_name.endswith('.parquet'):
        parquet_file = os.path.join(parquet_dir, file_name)
        csv_file = os.path.join(csv_dir, file_name.replace('.parquet', '.csv'))
        
        # 加载 Parquet 文件
        df = pd.read_parquet(parquet_file)
        
        # 清理字符串字段
        for column in df.select_dtypes(include=['object']).columns:
            df[column] = df[column].apply(clean_quotes)
        
        # 保存为 CSV 文件
        df.to_csv(csv_file, index=False, quoting=csv.QUOTE_NONNUMERIC)
        print(f"Converted {parquet_file} to {csv_file} successfully.")

print("All Parquet files have been converted to CSV.")

2. 在 Neo4j 中导入 CSV 数据

将转换好的 CSV 文件放在 Neo4j 的 import 文件夹后，使用以下 Cypher 命令导入数据：

// 1. 导入文档
LOAD CSV WITH HEADERS FROM 'file:///create_final_documents.csv' AS row
CREATE (d:Document {
    id: row.id,
    title: row.title,
    raw_content: row.raw_content,
    text_unit_ids: row.text_unit_ids
});

// 2. 导入文本单元
LOAD CSV WITH HEADERS FROM 'file:///create_final_text_units.csv' AS row
CREATE (t:TextUnit {
    id: row.id,
    text: row.text,
    n_tokens: toFloat(row.n_tokens),
    document_ids: row.document_ids,
    entity_ids: row.entity_ids,
    relationship_ids: row.relationship_ids
});

// 3. 导入实体
LOAD CSV WITH HEADERS FROM 'file:///create_final_entities.csv' AS row
CREATE (e:Entity {
    id: row.id,
    name: row.name,
    type: row.type,
    description: row.description,
    human_readable_id: toInteger(row.human_readable_id),
    text_unit_ids: row.text_unit_ids
});

// 4. 导入关系
LOAD CSV WITH HEADERS FROM 'file:///create_final_relationships.csv' AS row
CREATE (r:Relationship {
    source: row.source,
    target: row.target,
    weight: toFloat(row.weight),
    description: row.description,
    id: row.id,
    human_readable_id: row.human_readable_id,
    source_degree: toInteger(row.source_degree),
    target_degree: toInteger(row.target_degree),
    rank: toInteger(row.rank),
    text_unit_ids: row.text_unit_ids
});

// 5. 导入节点
LOAD CSV WITH HEADERS FROM 'file:///create_final_nodes.csv' AS row
CREATE (n:Node {
    id: row.id,
    level: toInteger(row.level),
    title: row.title,
    type: row.type,
    description: row.description,
    source_id: row.source_id,
    community: row.community,
    degree: toInteger(row.degree),
    human_readable_id: toInteger(row.human_readable_id),
    size: toInteger(row.size),
    entity_type: row.entity_type,
    top_level_node_id: row.top_level_node_id,
    x: toInteger(row.x),
    y: toInteger(row.y)
});

// 6. 导入社区
LOAD CSV WITH HEADERS FROM 'file:///create_final_communities.csv' AS row
CREATE (c:Community {
    id: row.id,
    title: row.title,
    level: toInteger(row.level),
    raw_community: row.raw_community,
    relationship_ids: row.relationship_ids,
    text_unit_ids: row.text_unit_ids
});

// 7. 导入社区报告
LOAD CSV WITH HEADERS FROM 'file:///create_final_community_reports.csv' AS row
CREATE (cr:CommunityReport {
    id: row.id,
    community: row.community,
    full_content: row.full_content,
    level: toInteger(row.level),
    rank: toFloat(row.rank),
    title: row.title,
    rank_explanation: row.rank_explanation,
    summary: row.summary,
    findings: row.findings,
    full_content_json: row.full_content_json
});

// 8. 创建索引以提高性能
CREATE INDEX FOR (d:Document) ON (d.id);
CREATE INDEX FOR (t:TextUnit) ON (t.id);
CREATE INDEX FOR (e:Entity) ON (e.id);
CREATE INDEX FOR (r:Relationship) ON (r.id);
CREATE INDEX FOR (n:Node) ON (n.id);
CREATE INDEX FOR (c:Community) ON (c.id);
CREATE INDEX FOR (cr:CommunityReport) ON (cr.id);

// 9. 在所有节点导入后创建关系
MATCH (d:Document)
UNWIND split(d.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (d)-[:HAS_TEXT_UNIT]->(t);

MATCH (t:TextUnit)
UNWIND split(t.document_ids, ',') AS docId
MATCH (d:Document {id: trim(docId)})
CREATE (t)-[:BELONGS_TO]->(d);

MATCH (t:TextUnit)
UNWIND split(t.entity_ids, ',') AS entityId
MATCH (e:Entity {id: trim(entityId)})
CREATE (t)-[:HAS_ENTITY]->(e);

MATCH (t:TextUnit)
UNWIND split(t.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (t)-[:HAS_RELATIONSHIP]->(r);

MATCH (e:Entity)
UNWIND split(e.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (e)-[:MENTIONED_IN]->(t);

MATCH (r:Relationship)
MATCH (source:Entity {name: r.source})
MATCH (target:Entity {name: r.target})
CREATE (source)-[:RELATES_TO]->(target);

MATCH (r:Relationship)
UNWIND split(r.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (r)-[:MENTIONED_IN]->(t);

MATCH (c:Community)
UNWIND split(c.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (c)-[:HAS_RELATIONSHIP]->(r);

MATCH (c:Community)
UNWIND split(c.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (c)-[:HAS_TEXT_UNIT]->(t);

MATCH (cr:CommunityReport)
MATCH (c:Community {id: cr.community})
CREATE (cr)-[:REPORTS_ON]->(c);

3. 完成数据导入

已经成功完成了从 Parquet 文件转换到 CSV，并导入到 Neo4j 的所有步骤！

接下来可以对数据库进行查询了。以下是一些Neo4j 查询语言 Cypher 编写的查询示例。

// 1. Visualize Document to TextUnit relationships
MATCH (d:Document)-[r:HAS_TEXT_UNIT]->(t:TextUnit)
RETURN d, r, t
LIMIT 50;

// 2. Visualize Entity to TextUnit relationships
MATCH (e:Entity)-[r:MENTIONED_IN]->(t:TextUnit)
RETURN e, r, t
LIMIT 50;

// 3. Visualize Relationships between Entities
MATCH (e1:Entity)-[r:RELATES_TO]->(e2:Entity)
RETURN e1, r, e2
LIMIT 50;

// 4. Visualize Community to Relationship connections
MATCH (c:Community)-[r:HAS_RELATIONSHIP]->(rel:Relationship)
RETURN c, r, rel
LIMIT 50;

// 5. Visualize Community Reports and their Communities
MATCH (cr:CommunityReport)-[r:REPORTS_ON]->(c:Community)
RETURN cr, r, c
LIMIT 50;

// 6. Visualize the most connected Entities (Updated)
MATCH (e:Entity)
WITH e, COUNT{(e)-[:RELATES_TO]->(:Entity)} AS degree
ORDER BY degree DESC
LIMIT 10
MATCH (e)-[r:RELATES_TO]->(other:Entity)
RETURN e, r, other;

// 7. Visualize TextUnits and their connections to Entities and Relationships
MATCH (t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
MATCH (t)-[:HAS_RELATIONSHIP]->(r:Relationship)
RETURN t, e, r
LIMIT 50;

// 8. Visualize Documents and their associated Entities (via TextUnits)
MATCH (d:Document)-[:HAS_TEXT_UNIT]->(t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
RETURN d, t, e
LIMIT 50;

// 9. Visualize Communities and their TextUnits
MATCH (c:Community)-[:HAS_TEXT_UNIT]->(t:TextUnit)
RETURN c, t
LIMIT 50;

// 10. Visualize Relationships and their associated TextUnits
MATCH (r:Relationship)-[:MENTIONED_IN]->(t:TextUnit)
RETURN r, t
LIMIT 50;

// 11. Visualize Entities of different types and their relationships
MATCH (e1:Entity)-[r:RELATES_TO]->(e2:Entity)
WHERE e1.type <> e2.type
RETURN e1, r, e2
LIMIT 50;

// 12. Visualize the distribution of Entity types
MATCH (e:Entity)
RETURN e.type AS EntityType, COUNT(e) AS Count
ORDER BY Count DESC;

// 13. Visualize the most frequently occurring relationships
MATCH ()-[r:RELATES_TO]->()
RETURN TYPE(r) AS RelationshipType, COUNT(r) AS Count
ORDER BY Count DESC
LIMIT 10;

// 14. Visualize the path from Document to Entity
MATCH path = (d:Document)-[:HAS_TEXT_UNIT]->(t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
RETURN path
LIMIT 25;