Retrieving Tabular Data with High-Precision RAG

Published: 2024-09-01 15:41:53

Author: 大数据技术体系


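Why does RAG perform poorly on table-heavy documents?

Retrieval-augmented generation (RAG) has been advancing rapidly for some time, but the road has not been smooth, especially when it comes to non-text elements such as images and tables. One problem that has long bothered me is that accuracy drops whenever a RAG workflow is asked to extract a specific value from a table. It gets worse when a document contains several tables on related topics, as in an earnings report. So I set out to improve table retrieval in my RAG pipeline...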

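Main challenges:

Retrieval inconsistency: vector search algorithms often struggle to locate the correct table, especially in documents that contain several similar tables.
Generation inaccuracy: large language models (LLMs) frequently misread or misidentify values in tables, especially in complex tables with nested columns. My hypothesis is that inconsistent formatting is the cause.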

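Solution:

I applied four key concepts to address this problem:

Precise extraction: cleanly extract every table from the document.
Contextual enrichment: use an LLM to generate a robust, context-aware description of each table by analyzing the extracted table together with the surrounding document content.
Format standardization: use an LLM to convert each table to a uniform Markdown format, which improves both embedding efficiency and LLM comprehension.
Unified embedding: combine the contextual description with the Markdown-formatted table into a single "table chunk" that is optimized for storage and retrieval in a vector database.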
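As a minimal sketch of the "table chunk" idea, the snippet below joins a description and a Markdown table into one string that can be embedded as a single unit. The helper name `build_table_chunk` and the sample description are hypothetical and used only for illustration; the two figures are taken from the segment table shown later in this article.

```python
def build_table_chunk(description: str, markdown_table: str) -> str:
    """Join an LLM-written description and a Markdown table into one chunk."""
    # Strip stray whitespace so every chunk has the same layout before embedding.
    return f"{description.strip()}\n\n{markdown_table.strip()}"


# Hypothetical inputs, for illustration only.
description = "Segment revenue for Family of Apps and Reality Labs."
markdown_table = """
| Segment | Q2 2024 |
|---|---|
| Family of Apps | $38,718 |
| Reality Labs | $353 |
"""

chunk = build_table_chunk(description, markdown_table)
print(chunk)
```

Keeping the description and the table in one string means a single embedding vector carries both the topic of the table and its values.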

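Implementation

Goal: build a RAG pipeline over Meta's earnings report data^[1] that retrieves from, and answers questions about, both the document text and its many tables.

See the full notebook on Google Colab^[2]. This article walks through building a scalable application with contextualized table chunks; the full notebook also includes a comparison against non-contextualized table chunks.

Step 1: Precise Extraction

First, we need to extract the text and tables from the document, for which we will use Unstructured.io^[3]. Let's install and import all dependencies: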

!apt-get -qq install poppler-utils tesseract-ocr
%pip install -q --user --upgrade pillow
%pip install -q --upgrade unstructured["all-docs"]
%pip install kdbai_client
%pip install langchain-openai
%pip install langchain
%pip install langchain-community
%pip install pymupdf
%pip install --upgrade nltk

import os
from getpass import getpass
import openai
from openai import OpenAI
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.auto import partition
from langchain_openai import OpenAIEmbeddings
import kdbai_client as kdbai
from langchain_community.vectorstores import KDBAI
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
import fitz  # PyMuPDF
import nltk  # must be imported before nltk.download is called

nltk.download('punkt')

Set the OpenAI API key:

# Configure the OpenAI API key
if "OPENAI_API_KEY" in os.environ:
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
else:
    # Prompt the user for the API key
    OPENAI_API_KEY = getpass("OPENAI API key: ")
    # Store the API key as an environment variable for the current session
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

Download Meta's Q2 2024 earnings report PDF^[4] (it contains lots of tables!):

!wget 'https://s21.q4cdn.com/399680738/files/doc_news/Meta-Reports-Second-Quarter-2024-Results-2024.pdf' -O './doc1.pdf'

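We will use Unstructured's 'partition_pdf^[5]' function with the 'hi_res' partitioning strategy to extract text and table elements from the earnings report PDF.

A few parameters can be set during partitioning to make sure tables are extracted from the PDF accurately.

strategy = "hi_res": identifies the document layout; suited to use cases that are sensitive to correct element classification, such as table elements.
chunking_strategy = "by_title": the 'by_title' chunking strategy preserves section boundaries by starting a new chunk whenever a 'Title' element is encountered, even if the current chunk still has room, so that text from different sections never ends up in the same chunk. You can also control chunk size with max_characters and new_after_n_chars.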
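To make the chunking rules above concrete, here is a toy re-implementation of title-based chunking. It is a simplified illustration and not Unstructured's actual algorithm: the `Element` class and `chunk_by_title` function are hypothetical, `new_after_n_chars` is treated as a soft limit and `max_characters` as a hard limit, and an element that is longer than `max_characters` on its own is not split.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # "Title" or "Text" in this toy model
    text: str


def chunk_by_title(elements, max_characters=2500, new_after_n_chars=2300):
    """Group elements into chunks, starting a new chunk at every title."""
    chunks, current = [], ""
    for el in elements:
        candidate = f"{current}\n{el.text}" if current else el.text
        start_new = bool(current) and (
            el.kind == "Title"                      # section boundary
            or len(current) >= new_after_n_chars    # soft limit already reached
            or len(candidate) > max_characters      # hard limit would be exceeded
        )
        if start_new:
            chunks.append(current)
            current = el.text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


elements = [
    Element("Title", "Revenue"),
    Element("Text", "Revenue grew year over year."),
    Element("Title", "Expenses"),
    Element("Text", "Costs and expenses also increased."),
]
chunks = chunk_by_title(elements)
print(len(chunks))
```

Each section ends up in its own chunk even though both would easily fit within the character limits, which is the behavior that keeps text from different sections apart.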

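elements = partition_pdf(
    './doc1.pdf',
    strategy="hi_res",
    chunking_strategy="by_title",
    max_characters=2500,
    new_after_n_chars=2300,
)

Let's see which elements were extracted:

from collections import Counter

# display() is available in notebook environments such as Colab
display(Counter(element.__class__ for element in elements))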

>>> Counter({unstructured.documents.elements.CompositeElement: 17, unstructured.documents.elements.Table: 10})

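We extracted 17 CompositeElement elements, which are essentially text chunks, and 10 Table elements, which are the extracted tables.

At this point, we have extracted the text chunks and tables from the document.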

Step 2 & Step 3: Table Contextual Enrichment and Format Standardization

Let's look at one Table element to see why it might cause problems in a RAG pipeline. The second-to-last element is a Table element:

print(elements[-2])

>>> Foreign exchange effect on 2024 revenue using 2023 rates Revenue excluding foreign exchange effect GAAP revenue year-over-year change % Revenue excluding foreign exchange effect year-over-year change % GAAP advertising revenue Foreign exchange effect on 2024 advertising revenue using 2023 rates Advertising revenue excluding foreign exchange effect 2024 $ 39,071 371 $ 39,442 22 % 23 % $ 38,329 367 $ 38,696 22 % 2023 $ 31,999 $ 31,498 2024 $ 75,527 265 $ 75,792 25 % 25 % $ 73,965 261 $ 74,226 24 % 2023 GAAP advertising revenue year-over-year change % Advertising revenue excluding foreign exchange effect year-over-year change % 23 % 25 % Net cash provided by operating activities Purchases of property and equipment, net Principal payments on finance leases $ 19,370 (8,173) (299) $ 10,898 $ 17,309 (6,134) (220) $ 10,955 $ 38,616 (14,573) (614) $ 23,429

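Notice that the table is represented as one long string that mixes natural language and numbers. If we ingested this string as-is as the table chunk in a RAG pipeline, it is easy to see how hard it would be to decide whether this table should be retrieved.

We need to add context to each table and format it as markdown.

To do this, we first extract the full text of the PDF document to use as context:

def extract_text_from_pdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

pdf_path = './doc1.pdf'
document_content = extract_text_from_pdf(pdf_path)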

Next, create a function that takes the context of the entire document (from the code above) together with the extracted text of a specific table, and returns a new description containing a comprehensive description of the table plus the table itself converted to markdown format:

# Initialize the OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def get_table_description(table_content, document_context):
    prompt = f"""
    Given the following table and its context from the original document,
    provide a detailed description of the table. Then, include the table in markdown format.

    Original Document Context:
    {document_context}

    Table Content:
    {table_content}

    Please provide:
    1. A comprehensive description of the table.
    2. The table in markdown format.
    """

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that describes tables and formats them in markdown."},
            {"role": "user", "content": prompt},
        ],
    )

    return response.choices[0].message.content

Now tie everything together by applying this function to every Table element and replacing each original Table element's text with the new description (the contextual description of the table plus the markdown-formatted table):

# Process each Table element
for element in elements:
    if element.to_dict()['type'] == 'Table':
        table_content = element.to_dict()['text']

        # Get description and markdown table from GPT-4o
        result = get_table_description(table_content, document_content)
        # Replace the Table element's text with the new description
        element.text = result

print("Processing complete.")

Example: an enhanced table chunk/element (shown in markdown format for readability):

### Detailed Description of the Table

The table presents segment information from Meta Platforms, Inc. for both revenue and income (loss) from operations. The data is organized into two main sections:

1. **Revenue**: This section is subdivided into two categories: "Advertising" and "Other revenue". The total revenue generated from these subcategories is then summed up for two segments: "Family of Apps" and "Reality Labs". The table provides the revenue figures for three months and six months ended June 30, for the years 2024 and 2023.
2. **Income (loss) from operations**: This section shows the income or loss from operations for the "Family of Apps" and "Reality Labs" segments, again for the same time periods.

The table allows for a comparison between the two segments of Meta's business over time, illustrating the performance of each segment in terms of revenue and operational income or loss.

### The Table in Markdown Format

```markdown
### Segment Information (In millions, Unaudited)

| | Three Months Ended June 30, 2024 | Three Months Ended June 30, 2023 | Six Months Ended June 30, 2024 | Six Months Ended June 30, 2023 |
|---|---|---|---|---|
| **Revenue:** | | | | |
| Advertising | $38,329 | $31,498 | $73,965 | $59,599 |
| Other revenue | $389 | $225 | $769 | $430 |
| **Family of Apps** | $38,718 | $31,723 | $74,734 | $60,029 |
| Reality Labs | $353 | $276 | $793 | $616 |
| **Total revenue** | $39,071 | $31,999 | $75,527 | $60,645 |
| | | | | |
| **Income (loss) from operations:** | | | | |
| Family of Apps | $19,335 | $13,131 | $36,999 | $24,351 |
| Reality Labs | $(4,488) | $(3,739) | $(8,334) | $(7,732) |
| **Total income from operations** | $14,847 | $9,392 | $28,665 | $16,619 |
```

This markdown table provides a concise presentation of the financial data, making it easy to read and comprehend in a digital format.

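As you can see, this provides far more context than the raw text of the Table element, which will significantly improve the performance of our RAG pipeline. Now that we have fully contextualized table chunks, we can prepare them for retrieval by embedding them and storing them in our vector database.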
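Because the markdown conversion is produced by an LLM, it may be worth checking that no value was changed along the way. The sketch below is not part of the original notebook: the helper `missing_numbers` is hypothetical, and it only checks that every number in the raw table text also appears somewhere in the enriched chunk, which catches altered digits but not values placed in the wrong cell.

```python
import re

# Matches integers and decimals with optional thousands separators, e.g. 38,329
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def missing_numbers(raw_table_text: str, enriched_text: str) -> set:
    """Return numbers present in the raw table text but absent from the enriched chunk."""
    raw = set(NUMBER.findall(raw_table_text))
    enriched = set(NUMBER.findall(enriched_text))
    return raw - enriched


raw = "Advertising $ 38,329 $ 31,498 Other revenue 389 225"
good = "| Advertising | $38,329 | $31,498 |\n| Other revenue | $389 | $225 |"
bad = "| Advertising | $38,392 | $31,498 |\n| Other revenue | $389 | $225 |"

print(missing_numbers(raw, good))
print(missing_numbers(raw, bad))
```

An empty result means every raw number survived the conversion; any non-empty result flags a table chunk that should be regenerated or reviewed by hand.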

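Step 4: Unified Embedding... Getting Ready for RAG

Now that every element carries the context needed for high-quality retrieval and generation, we will take these elements, embed them, and store them in the KDB.AI^[6] vector database.

First, we create an embedding for each element. An embedding is simply a numerical representation of an element's semantic meaning:

from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

embedding_encoder = OpenAIEmbeddingEncoder(
    config=OpenAIEmbeddingConfig(
        api_key=os.getenv("OPENAI_API_KEY"),
        model_name="text-embedding-3-small",
    )
)
elements = embedding_encoder.embed_documents(elements=elements)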

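Next, create a Pandas DataFrame to hold our elements. The DataFrame has one column per attribute that Unstructured extracted for each element. For example, Unstructured creates an ID, text (which we have already rewritten for Table elements), metadata, and an embedding (created above) for every element. We store this data in a DataFrame because that format is easily accepted by the KDB.AI vector database.

import pandas as pd

data = []

for c in elements:
    row = {}
    row['id'] = c.id
    row['text'] = c.text
    row['metadata'] = c.metadata.to_dict()
    row['embedding'] = c.embeddings
    data.append(row)

df = pd.DataFrame(data)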

Set up KDB.AI Cloud:

You can get a KDB.AI API key and endpoint for free at: https://trykdb.kx.com/kdbai/signup/^[7]

KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

Now that you are connected to the vector database instance, the next step is to define the schema of the table you will create in KDB.AI:

schema = {
    'columns': [
        {'name': 'id', 'pytype': 'str'},
        {'name': 'text', 'pytype': 'str'},
        {'name': 'metadata', 'pytype': 'dict'},
        {
            'name': 'embedding',
            'vectorIndex': {'dims': 1536, 'type': 'flat', 'metric': 'L2'},
        },
    ]
}

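The schema has one column for each column of the DataFrame created earlier (id, text, metadata, embedding). The embedding column is special because it is where the vectorIndex is defined and where vector search is performed to retrieve data. Several parameters are defined here:

dims: the number of dimensions of each embedding, determined by the embedding model in use. In this case, OpenAI's 'text-embedding-3-small' outputs 1536-dimensional embeddings.
type: the index type. Here we simply use a flat index, but qFlat (an on-disk flat index), HNSW, IVF, and IVFPQ are also available.
metric: the metric used for vector search. L2 is Euclidean distance; other options include cosine similarity and dot product.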
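For intuition about the index type and metric parameters, the sketch below implements the three metrics and an exhaustive ("flat") search in plain Python. It is a conceptual illustration on tiny 2-dimensional vectors, not how KDB.AI implements its indexes; the function names are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    """Dot product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def flat_search(query, vectors, k=2):
    """Flat index: compare the query against every stored vector, keep the k nearest by L2."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return ranked[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
print(flat_search(query, vectors))
```

A flat index is exact because it scores every vector, which is affordable for the few dozen elements in this example; approximate indexes such as HNSW trade some accuracy for speed on much larger collections.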

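Create the table from the schema above:

KDBAI_TABLE_NAME = "Table_RAG"

# First ensure the table does not already exist
if KDBAI_TABLE_NAME in session.list():
    session.table(KDBAI_TABLE_NAME).drop()

# Create the table
table = session.create_table(KDBAI_TABLE_NAME, schema)

Insert the DataFrame into the KDB.AI table:

# Insert the elements into the KDB.AI table
table.insert(df)

All elements are now stored in the vector database and ready to be queried for retrieval.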

Run RAG with LangChain and KDB.AI!

Basic setup with LangChain^[8]:

# Define OpenAI embedding model for LangChain to embed the query
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Use KDBAI as vector store
vecdb_kdbai = KDBAI(table, embeddings)

Define a RAG chain that uses KDB.AI as the retriever and gpt-4o as the generator:

# Define a question/answer LangChain chain
qabot = RetrievalQA.from_chain_type(
    chain_type="stuff",
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vecdb_kdbai.as_retriever(search_kwargs=dict(k=5)),
    return_source_documents=True,
)
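The "stuff" chain type places all retrieved chunks (here k=5) into a single prompt for the LLM. The sketch below shows that idea in plain Python; the function `stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template, and the two sample chunks are shortened for illustration.

```python
def stuff_prompt(question: str, documents: list) -> str:
    """Concatenate all retrieved chunks into one prompt (the idea behind 'stuff')."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Shortened sample chunks; real chunks would be the enriched elements stored earlier.
retrieved = [
    "Total revenue for Q2 2024 was $39,071 million.",
    "Total revenue for Q2 2023 was $31,999 million.",
]
prompt = stuff_prompt("What was total revenue in Q2 2024?", retrieved)
print(prompt)
```

Since every retrieved chunk lands in the same prompt, the value of k and the size of each enriched table chunk together determine how much of the model's context window a query consumes.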

A helper function to run RAG:

# Helper function to run RAG
def RAG(query):
    print(query)
    print("-----")
    return qabot.invoke(dict(query=query))["result"]

Results

Example 1:

# Query the RAG chain!
RAG("what is the 2024 GAAP advertising Revenue in the three months  ended June 30th? What about net cash by operating activies")

Result: For the period ended June 30, 2024:

GAAP advertising revenue was $38.329 billion.
Net cash provided by operating activities was $19.370 billion.

Example 2:

# Query the RAG chain!
RAG("what is the three month costs and expensis for 2023?")

Result:

In the second quarter of 2023, the three-month costs and expenses for Meta Platforms, Inc. were $22.607 billion.

Example 3:

# Query the RAG chain!
RAG("At the end of 2023, what was the value of Meta's Goodwill assets?")

Result: At the end of 2023, the value of Meta's goodwill assets was $20.654 billion.

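Example 4:

# Query the RAG chain!
RAG("What is the research and development costs for six months ended in June 2024")

Result: Research and development costs for the six months ended June 2024 were $20.515 billion.

!! Note: this is an example that can go wrong if you use non-contextualized table chunks. It suggests that the more complex the table, the more the added context and formatting help.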

Example 5:

# Query the RAG chain!
RAG("Given a sentiment score between 1 and 10 for the outlook? Explain your reasoning")

Result:

I would give the sentiment score for Meta Platforms, Inc.'s outlook an **8 out of 10**. Here's the reasoning behind this score:

### Positive Indicators:
1. **Significant Increase in Earnings Per Share (EPS)**:
   - **Basic EPS** has risen from $3.03 in Q2 2023 to $5.31 in Q2 2024, and from $5.24 in H1 2023 to $10.17 in H1 2024.
   - **Diluted EPS** has also shown substantial growth, from $2.98 in Q2 2023 to $5.16 in Q2 2024, and from $5.18 in H1 2023 to $9.86 in H1 2024.
2. **Revenue Growth**:
   - Revenue increased by 22% from $31.999 billion in Q2 2023 to $39.071 billion in Q2 2024.
3. **Improved Income from Operations**:
   - Income from operations rose by 58%, from $9.392 billion in Q2 2023 to $14.847 billion in Q2 2024.
   - Operating margin improved from 29% in Q2 2023 to 38% in Q2 2024.
4. **Net Income Growth**:
   - Net income for Q2 2024 was $13.465 billion, marking a 73% increase from $7.788 billion in Q2 2023.
5. **Effective Tax Rate**:
   - The effective tax rate decreased from 16% in Q2 2023 to 11% in Q2 2024, benefiting overall profitability.

### Negative or Neutral Indicators:
1. **Increase in Costs and Expenses**:
   - Total costs and expenses increased by 7%, from $22.607 billion in Q2 2023 to $24.224 billion in Q2 2024.
2. **Decrease in Retained Earnings**:
   - Retained earnings slightly decreased from $82,070 million at the end of 2023 to $81,188 million by June 30, 2024.

### Conclusion:
The significant improvements in EPS, revenue, income from operations, and net income indicate strong financial performance and a positive outlook for Meta Platforms, Inc. The increase in costs and expenses and a slight decrease in retained earnings are areas to watch, but they don't outweigh the overall positive momentum. Hence, the sentiment score of 8 reflects a strong outlook with some room for careful monitoring of expenses.

Notice that the LLM is able to use numbers from the embedded tables to justify the sentiment score it generated.
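The percentages quoted in the generated answer can be cross-checked against the dollar figures it cites. The short sketch below recomputes them; `pct_change` is a hypothetical helper, and all inputs are the Q2 2023 and Q2 2024 values (in billions of USD) quoted in the answer above.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Figures in billions of USD, copied from the generated answer.
revenue = pct_change(31.999, 39.071)
operating_income = pct_change(9.392, 14.847)
net_income = pct_change(7.788, 13.465)
costs = pct_change(22.607, 24.224)

# Operating margin = income from operations / revenue
margin_2023 = 9.392 / 31.999 * 100
margin_2024 = 14.847 / 39.071 * 100

print(round(revenue), round(operating_income), round(net_income), round(costs))
print(round(margin_2023), round(margin_2024))
```

The rounded results agree with the 22%, 58%, 73%, and 7% changes and the 29% and 38% operating margins stated in the answer, so the figures it cites are internally consistent.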

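Considerations

While adding extra context is likely to improve the results of a table-heavy RAG pipeline, it is a more expensive approach because it requires additional LLM calls to gather and create that context. It may also be unnecessary for datasets with only a few simple tables. My experiments show that non-contextualized table chunks work quite well for simple tables. However, as tables become more complex, for example with the nested columns seen in "Example 4", non-contextualized table chunks start to fall short.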
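To get a feel for that extra cost, the sketch below estimates the input tokens consumed when the full document text is sent along with every table, as the enrichment function in this article does. The function name, the document and table sizes, and the 4-characters-per-token ratio are all assumptions for illustration; the estimate ignores the prompt instructions and the output tokens.

```python
def estimate_enrichment_tokens(document_chars, table_chars, chars_per_token=4):
    """Rough input-token estimate when the whole document accompanies each table."""
    per_table = [(document_chars + t) / chars_per_token for t in table_chars]
    return round(sum(per_table))


# Hypothetical sizes: a 60,000-character document with 10 extracted tables.
tokens = estimate_enrichment_tokens(60_000, [1_500] * 10)
print(tokens)
```

The cost grows with the number of tables multiplied by the document length, so for long documents it may be worth passing only the text surrounding each table instead of the whole document.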

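Summary

Accurate retrieval-augmented generation (RAG) over table-heavy documents calls for a systematic approach that tackles both retrieval inconsistency and generation inaccuracy. By implementing a strategy of precise extraction, contextual enrichment, format standardization, and unified embedding, we can significantly improve RAG pipeline performance on complex tables.

The results from our Meta earnings report example highlight the quality of the responses generated with these enriched table chunks. As RAG continues to evolve, incorporating these techniques may become a valuable tool for ensuring reliable and precise results, especially on table-heavy datasets.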
