Building a Knowledge Graph from Biomedical Text Using GPT-4

Published: 2024-07-14 11:24:32

Author: 知识图谱科技

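Summary:

This article explores how retrieval-augmented generation (RAG) systems can perform exhaustive retrieval, particularly in biomedical research, and walks through the nuances of comprehensive search methods such as building a knowledge graph with an advanced AI model.

Key points:

- RAG systems play a key role in exhaustive retrieval on a specific research topic.

- Building a knowledge graph with an advanced AI model such as GPT-4 can improve information retrieval in fields such as biomedical research.

- The article stresses the importance of in-depth search for comprehensive data extraction, especially in fields that need exhaustive information, such as clinical trials in biomedical research.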

Source:

[Read the original](https://medium.com/@venkat.ramrao/building-a-knowledge-graph-for-biomedical-texts-using-gpt-4-4c6deb2c0864)

Main text

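Organizing and retrieving unstructured text data can be challenging.

The simplest approach is to extract metadata from the text (for example author, topic, date written) and build an index on that metadata. For search, a plain keyword search works for smaller datasets. More sophisticated search methods can use sparse (e.g. BM25) or dense (e.g. embedding model) vector representations of the text. These let you capture term frequency and semantic meaning, respectively.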
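To make the sparse option concrete, here is a minimal sketch of BM25 scoring in plain Python. It is an illustration rather than anything from the original post: the function name `bm25_scores`, the whitespace tokenizer, the toy documents, and the default `k1` and `b` values are all assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents, made up for this illustration
docs = [
    "yeast strains isolated from the human gut",
    "clinical trials of a new therapy",
    "gut yeast and Crohn disease",
]
scores = bm25_scores("gut yeast", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # the shortest document containing both query terms
```

A dense approach would replace the term counts with embedding vectors and compare them by cosine similarity; the ranking step stays the same.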

And then there are knowledge graphs...

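A Distinct Branch of RAG Systems

An interesting use case for retrieval-augmented generation (RAG) systems is designing a system that supports research by finding papers and publications related to a topic. A traditional RAG system searches a text corpus using various semantic and non-semantic search methods and derives an answer from the closest matches in the retrieved results.

By contrast, a RAG system designed for research needs to search thoroughly and retrieve all the information on a topic. For example, a biomedical researcher may want to retrieve every clinical trial of a particular therapy, or every paper covering a topic.

Knowledge graphs offer a way to organize the knowledge in text to support queries like the ones above without resorting to overly time-consuming search algorithms. Graph databases such as Neo4j, Neptune, and NebulaGraph can do this at scale and retrieve text using query languages of roughly the same complexity as SQL.

For example, consider the following text (source: bioRxiv.org, the preprint server for biology):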

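The search for yeast's ecological niches has ranged from wineries to oak trees and, more recently, to the gut of Crabro wasps. Here we propose a role for the human gut in shaping yeast evolution, showing the previously unknown genetic structure of a yeast population associated with Crohn's disease and providing evidence for clonal expansion within the human gut. To understand the role of immune function in the human-yeast interaction, we classified strains according to their immunomodulatory properties and found a set of genetically homogeneous isolates capable of inducing anti-inflammatory signals through regulatory T cell proliferation; conversely, there was a positive association between strain heterozygosity and the ability to induce IL-17-driven inflammatory immune responses. An approach integrating genomics with immune phenotyping shows that selection on genes involved in sporulation and cell wall remodeling plays an important role in the evolution of S. cerevisiae Crohn's strains from passengers to commensals to potential pathogens.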

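One way to represent the text above is as a graph. The graph covers the whole text: the scientific terms are extracted and standardized to MeSH terms, and the relationships between them are preserved. It is a directed graph (that is, the relationships have a direction).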

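Using GPT-4

OpenAI's GPT-4 offers a straightforward way to create a graph like the one above.

The code I used to create the graph above can be found here.

In my tests I focused on biomedical data, specifically abstracts from bioRxiv.org. With some modification, you could do something similar with the PubMed API, the ClinicalTrials.gov API, or other similar sources.

The key design consideration is knowing which elements you want to capture in the nodes and edges. My sense is that biomedical researchers will be interested in scientific terms mentioned in the text, such as specific genes, proteins, medical treatments/therapies, side effects, diseases, and symptoms.

It is also very easy to get a huge knowledge graph out of a small piece of text, so it makes sense to limit the items of interest.

The following prompt and calling code worked well for me. You will need to modify system_text to focus on the specific items you want as nodes.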

# First prompt and OpenAI call to extract node-relation triplets
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

system_text = """You are an expert Knowledge Graph developer with deep knowledge of Biology.
Your job is to take a piece of text given to you and extract all the nodes and edges within that text.

The nodes include items such as scientific terms, study methods, scientific studies, chemicals, specific genes mentioned, proteins mentioned, medical treatments, side effects, diseases, any symptoms, mechanisms of action etc.
You MUST create the response in the form of a triplet. For example: <NODE>--<RELATIONSHIP>--<NODE>

Here is an example of a input text and output.

INPUT TEXT:
Apicomplexan parasites are thought to actively invade the host cell by gliding motility.
Recent studies demonstrated that Toxoplasma gondii can invade the host cell in the absence of several core components of the invasion machinery, such as the motor protein myosin A (MyoA), the microneme proteins MIC2 and AMA1 and actin, indicating the presence of alternative invasion mechanisms.
Here the roles of MyoA, MLC1, GAP45 and Act1, core components of the gliding machinery, are re-dissected in detail.

OUTPUT:
<Apicomplexan parasites>--<thought to actively invade by gliding motility>--<host cell>
<Apicomplexan parasites>--<uses>--<gliding motility>
<host cells>--<effected by>--<gliding motility>
<Toxoplasma gondii>--<type of>--<Apicomplexan parasites>
<MyoA>--<type of>--<motor protein>
<MIC2>--<type of>--<microneme protein>
<AMA1>--<type of>--<microneme protein>
<actin>--<type of>--<microneme protein>
<MyoA>--<component of >--<gliding machinery>
<MLC1>--<component of >--<gliding machinery>
<GAP45>--<component of >--<gliding machinery>
<Act1>--<component of >--<gliding machinery>
"""

# input_text must already hold the text (e.g. an abstract) to extract triplets from
user_text = f"""
INPUT TEXT:
{input_text}

OUTPUT:
"""
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ],
    temperature=0,
)

# The triplets come back as plain text, one per line
resp = response.choices[0].message.content

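The code above returns <NODE>--<EDGE/RELATIONSHIP>--<NODE> triplets. The nodes are not standardized, however. For example, in my first test set I saw nodes such as Saccharomyces cerevisiae and S. cerevisiae, which are clearly duplicates. Some deduplication algorithm could help here.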
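The post does not show how the raw response is turned into a node list, so the following is only a sketch of one way to do it. The helper names `parse_triplets` and `collect_unique_nodes` and the sample text are hypothetical; the triplet format is the one requested in the prompt. The deduplication here is deliberately naive (case-insensitive only) and would not merge abbreviations such as S. cerevisiae with the full species name.

```python
import re

# Matches <NODE>--<RELATIONSHIP>--<NODE>, the format requested in the prompt
TRIPLET_RE = re.compile(r"<([^<>]+)>--<([^<>]+)>--<([^<>]+)>")

def parse_triplets(text):
    """Return (node, relationship, node) tuples for every well-formed line."""
    triplets = []
    for line in text.splitlines():
        match = TRIPLET_RE.fullmatch(line.strip())
        if match:  # silently skip blank or malformed lines
            triplets.append(tuple(part.strip() for part in match.groups()))
    return triplets

def collect_unique_nodes(triplets):
    """Collect node names in first-seen order, ignoring case differences."""
    seen = {}
    for head, _, tail in triplets:
        for node in (head, tail):
            seen.setdefault(node.lower(), node)
    return list(seen.values())

# Sample text standing in for the model response (illustrative only)
sample = """<MyoA>--<type of>--<motor protein>
<MyoA>--<component of >--<gliding machinery>
not a triplet
<myoa>--<type of>--<Motor Protein>"""

triplets = parse_triplets(sample)
unique_nodes = collect_unique_nodes(triplets)
print(unique_nodes)  # ['MyoA', 'motor protein', 'gliding machinery']
```

In the pipeline described here, `resp` from the first call would take the place of `sample`.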

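In my case I decided to standardize the terms using MeSH (Medical Subject Headings) terms. It is not perfect, but it works. Fortunately, GPT-4 seems to know how to convert scientific terms to MeSH terms.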

system_text = """You are an expert Medical Researcher with deep knowledge on BioMedical and Health related concepts and terms.
Your job is to take a list of medical terms and provide the list of MeSH(Medical Subject Headings) terms tied to them. If a term does not have a MeSH term return "NONE"

You MUST create the response in the form of a triplet. For example: INPUT TERM<--->MeSH TERM

Here is an example of a input text and output.

INPUT TEXT:
Cancer
Human Gut
Bob Smith
injury to the esophagus caused by acid reflux

OUTPUT:
Cancer<--->Neoplasm
Human Gut<--->Gastrointestinal Tract
Bob Smith<--->NONE
injury to the esophagus caused by acid reflux<--->Reflux Esophagitis
"""

# unique_nodes: the distinct node names collected from the triplets returned by the first call
input_text = "\n".join(unique_nodes)

user_text = f"""
INPUT TEXT:
{input_text}

OUTPUT:
"""
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ],
    temperature=0,
)

# One "INPUT TERM<--->MeSH TERM" mapping per line
standardized_nodes = response.choices[0].message.content

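I did lose some information in the process of standardizing the nodes. For example, "tumorigenesis", "918 cancer samples", and "132 cancer types" were all translated to "Neoplasms". However, I expect to use this knowledge graph as a supplement to a RAG system and will still keep the original text, so for me this is an acceptable trade-off.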
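The graph-building code further down looks terms up in a dictionary called `stnd_dict`, but the post does not show how it is built. A minimal sketch, assuming the response follows the `INPUT TERM<--->MeSH TERM` format from the prompt, might look like this. The helper name `build_stnd_dict` is hypothetical, and keeping a term under its original name when the model returns "NONE" is an assumption, not necessarily what the author did.

```python
def build_stnd_dict(mapping_text, original_terms):
    """Map each original term to its MeSH term, keeping the term itself when no match exists."""
    stnd_dict = {term: term for term in original_terms}  # default: leave the term unchanged
    for line in mapping_text.splitlines():
        if "<--->" not in line:
            continue  # skip blank or malformed lines
        term, mesh = (part.strip() for part in line.split("<--->", 1))
        if term in stnd_dict and mesh and mesh.upper() != "NONE":
            stnd_dict[term] = mesh
    return stnd_dict

# Sample response reusing the example lines from the prompt above
sample_response = """Cancer<--->Neoplasm
Human Gut<--->Gastrointestinal Tract
Bob Smith<--->NONE"""

stnd_dict = build_stnd_dict(sample_response, ["Cancer", "Human Gut", "Bob Smith", "IL-17"])
print(stnd_dict)
```

Terms the model skipped entirely (here "IL-17") also fall back to their original names, so later lookups never fail on a missing key.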

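Visualization

For a huge knowledge graph you will need a graph database. There are many options, from paid databases such as Neo4j, Neptune, and Cosmos to open-source ones such as Nebula Graph.

For smaller use cases you can probably get by with storing the nodes and edges in memory. I decided to use Python dictionaries to speed up data access.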

edges = []
labels = []
edge_labels = {}           # stores a list of labels for each edge
edge_labels_with_doi = {}  # stores a list of labels for each edge; the label is meant to include the DOI
node_edge_dict = {}        # holds the list of edges tied to each node; helps with searching later

# `lines` holds the triplet lines returned by the first call (e.g. resp.splitlines());
# `stnd_dict` maps each raw node name to its standardized MeSH term.
for line in lines:
    nodes = line.strip().split(">--<")
    if len(nodes) != 3:
        continue  # skip blank or malformed lines
    node1 = nodes[0].replace("<", "").strip()
    node1 = stnd_dict.get(node1, node1)  # fall back to the raw name if no mapping exists
    node2 = nodes[2].replace(">", "").strip()
    node2 = stnd_dict.get(node2, node2)
    edge = (node1, node2)
    label = nodes[1]
    edges.append(edge)
    labels.append(label)

    # Append the label to the edge. setdefault creates the list the first time an edge is seen,
    # which also avoids a KeyError on edge_labels_with_doi.
    edge_labels.setdefault(edge, []).append(label)
    edge_labels_with_doi.setdefault(edge, []).append(label)  # no DOI is available in this snippet

    # Append the edge to the node dictionary. This eases retrieval of edges for a node in later steps.
    node_edge_dict.setdefault(node1, []).append(edge)
    node_edge_dict.setdefault(node2, []).append(edge)

Using dictionaries lets you quickly access the edges associated with a node without resorting to a graph database.

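TEXT_OF_INTEREST = "Alternative Splicing"
if TEXT_OF_INTEREST in node_edge_dict:
    # All edges that touch the node of interest
    print(node_edge_dict[TEXT_OF_INTEREST])
    # The relationship labels recorded for each of those edges
    for i in node_edge_dict[TEXT_OF_INTEREST]:
        print(edge_labels[i])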

You should also be able to easily extend outward from the connected nodes to build a subgraph to a specified depth.
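One way to do that expansion is a breadth-first walk over the node-to-edges dictionary. The sketch below is an illustration under stated assumptions: the function name `subgraph_edges` and the toy dictionary are hypothetical, the dictionary is assumed to have the same shape as `node_edge_dict` above (node name mapped to a list of `(node1, node2)` tuples), and edges are followed in both directions.

```python
from collections import deque

def subgraph_edges(node_edge_dict, start, max_depth):
    """Breadth-first walk that returns every edge within max_depth hops of start."""
    if start not in node_edge_dict:
        return set()
    visited = {start}
    found = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for edge in node_edge_dict.get(node, []):
            found.add(edge)
            # Follow the edge in either direction to reach the neighbour
            neighbour = edge[1] if edge[0] == node else edge[0]
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return found

# Toy chain A -> B -> C -> D in the same shape as node_edge_dict
toy = {
    "A": [("A", "B")],
    "B": [("A", "B"), ("B", "C")],
    "C": [("B", "C"), ("C", "D")],
    "D": [("C", "D")],
}
print(sorted(subgraph_edges(toy, "A", 2)))  # [('A', 'B'), ('B', 'C')]
```

The returned edge set can be used to look up labels in `edge_labels` or to build a smaller graph for plotting.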

For ease of access and use, you can use the NetworkX library. The following code is a basic way to create a directed multigraph.

import networkx as nx
import matplotlib.pyplot as plt

# Create an empty directed multigraph
G = nx.MultiDiGraph()

# Add nodes and edges; each edge carries its list of labels
for e in edge_labels:
    G.add_edge(e[0], e[1], radius=0.15, label=edge_labels[e])

pos = nx.spring_layout(G)  # choose a layout
f = plt.figure()

# Adjust the figure size
f.set_figwidth(15)
f.set_figheight(15)

# SIMPLE DRAW
# nx.draw(G, pos, with_labels=True)

# COMPLEX DRAW
# Draw nodes and labels
nx.draw_networkx_nodes(G, pos)
nx.draw_networkx_labels(G, pos)

# Draw curved edges, one at a time, using each edge's stored radius
for u, v, attrs in G.edges(data=True):
    radius = attrs['radius']
    nx.draw_networkx_edges(G, pos, edgelist=[(u, v)], connectionstyle=f'arc3, rad={radius}', label=attrs['label'])

# Draw edge labels (kept under a separate name so the edge_labels dictionary built earlier is not overwritten)
drawn_edge_labels = {(u, v): attrs['label'] for u, v, attrs in G.edges(data=True)}
nx.draw_networkx_edge_labels(G, pos, drawn_edge_labels)

# Display the graph
plt.show()

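Conclusion

With that, you have a directed knowledge graph built from a set of texts.

This is obviously a simple example, and the code above will need to be modified for a larger corpus. It takes some thought to choose the right data to form the nodes and to fine-tune node standardization. At some point you will need to decide whether you need a graph database.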