Advanced RAG 02: Unveiling PDF Parsing

Published: 2024-06-25 19:54:39

Author: AI自习室

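For RAG (Retrieval-Augmented Generation), extracting information from documents is an unavoidable step. How well content is extracted from the source documents largely determines the quality of the final output.

This step should not be underestimated when implementing RAG. If parsing extracts information poorly, the system's ability to understand and use what the PDF files contain is limited from the start.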

Figure 1 shows where parsing sits in a RAG pipeline:

Figure 1

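In practice, unstructured data is far more abundant than structured data. If this large volume of data cannot be parsed, its value cannot be realized.

PDF files account for a large share of unstructured data. Handling PDF files well also helps considerably with managing other kinds of unstructured documents.

This article focuses on methods for parsing PDF files. It presents algorithms and recommendations for parsing PDF documents effectively and extracting as much useful information as possible.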

Challenges of Parsing PDFs

PDF documents are typical of unstructured documents, yet extracting information from them is challenging.

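Describing PDF as a data format is not quite accurate; it is better described as a collection of printing instructions. A PDF file consists of a series of instructions that tell a PDF reader or printer where and how to render characters on the screen or on paper. This differs from formats such as HTML and docx, which use tags to organize logical structure, as shown in Figure 2.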

Figure 2
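To make the "printing instructions" point concrete, here is a minimal sketch. The content stream below is hand-written and simplified for illustration (it is an assumption, not taken from any real file); BT, ET, Tf, Td, and Tj are standard PDF text operators. The regular expression only handles this simplified form.

```python
import re

# A hand-written, simplified PDF content stream (illustrative only).
# BT/ET delimit a text object, Tf selects font and size,
# Td moves the text position by an offset, Tj paints a string.
content_stream = """
BT
/F1 12 Tf
72 712 Td
(Layer Type) Tj
200 0 Td
(Complexity per Layer) Tj
ET
"""

# Collect each painted string together with the Td offset that precedes it.
pattern = re.compile(r"(-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?) Td\s*\((.*?)\) Tj")
fragments = [(float(x), float(y), s) for x, y, s in pattern.findall(content_stream)]

for x, y, s in fragments:
    print(f"offset=({x}, {y}) text={s!r}")
```

Nothing in the stream says that these two strings are header cells of the same table; a parser has to infer that from the positions alone.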

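The difficulty in parsing PDF documents lies mainly in accurately capturing the overall page layout and converting its content, including tables, headings, paragraphs, and images, into a textual representation of the document. This involves dealing with inaccurate text extraction, image recognition problems, and confused row-column relationships in tables.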

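How to Parse PDF Documents

In general, there are three approaches to parsing PDFs:

Rule-based approach: the style and content of each part are determined from the document's organizational features. However, because PDF types and layouts vary so widely, this approach generalizes poorly; predefined rules can hardly cover every case.
Deep-learning-based approach: for example, a popular solution combines object detection with OCR (optical character recognition) models.
Multimodal-LLM-based approach: a multimodal large model is used to parse complex structures in PDFs or to extract key information.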

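Rule-Based Approach

One of the most representative tools is pypdf, a widely used rule-based parser. It is a standard option for parsing PDF files in both LangChain and LlamaIndex.

Below is an attempt to parse page 6 of the paper "Attention Is All You Need" with pypdf. The original page is shown in Figure 3.

Figure 3

The code is as follows: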

from pypdf import PdfReader

filename = "/Users/Florian/Downloads/1706.03762.pdf"

# Open the file in binary mode; the context manager closes it afterwards.
with open(filename, "rb") as pdf_file:
    reader = PdfReader(pdf_file)

    # Pages are zero-indexed, so index 5 is page 6 of the paper.
    page_num = 5
    page = reader.pages[page_num]
    text = page.extract_text()

print('--------------------------------------------------')
print(text)

The output is as follows (the rest is omitted for brevity):

(py) Florian:~ Florian$ pip list | grep pypdf  
pypdf 3.17.4  
pypdfium2 4.26.0  
  
(py) Florian:~ Florian$ python /Users/Florian/Downloads/pypdf_test.py  
--------------------------------------------------  
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations  
for different layer types. nis the sequence length, dis the representation dimension, kis the kernel  
size of convolutions and rthe size of the neighborhood in restricted self-attention.  
Layer Type Complexity per Layer Sequential Maximum Path Length  
Operations  
Self-Attention O(n2·d) O(1) O(1)  
Recurrent O(n·d2) O(n) O(n)  
Convolutional O(k·n·d2) O(1) O(logk(n))  
Self-Attention (restricted) O(r·n·d) O(1) O(n/r)  
3.5 Positional Encoding  
Since our model contains no recurrence and no convolution, in order for the model to make use of the  
order of the sequence, we must inject some information about the relative or absolute position of the  
tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the  
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel  
as the embeddings, so that the two can be summed. There are many choices of positional encodings,  
learned and fixed [9].  
In this work, we use sine and cosine functions of different frequencies:  
PE(pos,2i)=sin(pos/100002i/d model)  
PE(pos,2i+1)=cos(pos/100002i/d model)  
where posis the position and iis the dimension. That is, each dimension of the positional encoding  
corresponds to a sinusoid. The wavelengths form a geometric progression from 2πto10000 ·2π. We  
chose this function because we hypothesized it would allow the model to easily learn to attend by  
relative positions, since for any fixed offset k,PEpos+kcan be represented as a linear function of  
PEpos.  
...  
...  
...

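From pypdf's output, we can see that it serializes the characters in the PDF into one long sequence without preserving structural information. In other words, it treats the document as a sequence of lines separated by the newline character "\n", which makes it hard to identify paragraphs or tables accurately.

This limitation is inherent in rule-based approaches.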

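Deep-Learning-Based Approach

The advantage of this approach is that it can accurately identify the layout of the whole document, including tables and paragraphs. It can even understand the structure inside a table. This means it can divide a document into well-defined, information-complete units while preserving their intended meaning and structure.

However, there are limitations. The object detection and OCR stages can be time-consuming, so it is advisable to use a GPU or another accelerator and to process documents with multiple processes and threads.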
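As a rough illustration of the concurrency advice, the sketch below fans a parsing function out over several files using the standard library. The names parse_many, parse_one, and fake_parse are hypothetical; fake_parse stands in for a real parser call so the sketch runs without any model installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def parse_many(
    paths: Iterable[str],
    parse_one: Callable[[str], object],
    max_workers: int = 4,
) -> List[object]:
    """Apply parse_one to every path concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_one, paths))


def fake_parse(path: str) -> str:
    """Placeholder for a real (slow) PDF parsing call."""
    return f"parsed:{path}"


results = parse_many(["a.pdf", "b.pdf", "c.pdf"], fake_parse, max_workers=2)
print(results)
```

Threads help mainly when the heavy work happens outside the Python interpreter lock; for pure-Python work, concurrent.futures.ProcessPoolExecutor offers the same map interface, provided parse_one is defined at module level so it can be pickled.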

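This approach involves object detection and OCR models. I have tested several representative open-source frameworks:

Unstructured: it has been integrated into LangChain. Table recognition works well under the hi_res strategy with infer_table_structure=True. The fast strategy, however, performs poorly: because it does not use an object detection model, it misidentifies many images and tables.
Layout-parser: if you need to recognize PDFs with complex structure, the largest model is recommended for higher accuracy, although it may be somewhat slower. In addition, Layout-parser's models do not appear to have been updated in the past two years.
PP-StructureV2: it combines several models for document analysis and performs better than average. Its architecture is shown in Figure 4: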

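Figure 4

Besides open-source tools, there are paid tools such as ChatDOC, which parse PDF documents using layout-based recognition plus OCR.

Below, we show how to parse PDFs with the open-source unstructured framework and discuss three key challenges.

Challenge 1: How to extract data from tables and images

Here we use the unstructured framework as an example. Detected table data can be exported directly as HTML. The code is as follows: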

from unstructured.partition.pdf import partition_pdf

filename = "/Users/Florian/Downloads/Attention_Is_All_You_Need.pdf"

# Request the hi_res strategy explicitly instead of relying on
# infer_table_structure=True to select it implicitly.
elements = partition_pdf(filename=filename, strategy="hi_res", infer_table_structure=True)
tables = [el for el in elements if el.category == "Table"]

# Plain-text view of the first table, then its HTML reconstruction.
print(tables[0].text)
print('--------------------------------------------------')
print(tables[0].metadata.text_as_html)

I traced the internal execution of the partition_pdf function in detail. Figure 5 shows its basic flow.

Figure 5

The output of the code is as follows:

Layer Type Self-Attention Recurrent Convolutional Self-Attention (restricted) Complexity per Layer O(n2 · d) O(n · d2) O(k · n · d2) O(r · n · d) Sequential Maximum Path Length Operations O(1) O(n) O(1) O(1) O(1) O(n) O(logk(n)) O(n/r)
--------------------------------------------------
<table><thead><th>Layer Type</th><th>Complexity per Layer</th><th>Sequential Operations</th><th>Maximum Path Length</th></thead><tr><td>Self-Attention</td><td>O(n? - d)</td><td>O(1)</td><td>O(1)</td></tr><tr><td>Recurrent</td><td>O(n- d?)</td><td>O(n)</td><td>O(n)</td></tr><tr><td>Convolutional</td><td>O(k-n-d?)</td><td>O(1)</td><td>O(logy(n))</td></tr><tr><td>Self-Attention (restricted)</td><td>O(r-n-d)</td><td>ol)</td><td>O(n/r)</td></tr></table>

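Copy the HTML tags and save them as an HTML file. Then open it in Chrome; the result is shown in Figure 6:

Figure 6

We can see that unstructured essentially recovers the whole table: the header, rows, and columns are all in place, although the OCR step garbles some symbols (for example, the superscript in O(n2 · d) comes out as O(n? - d)).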
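Besides viewing the HTML in a browser, it can be turned back into rows for downstream use. The following is a minimal sketch using only the standard library; TableRows and html_table_to_rows are hypothetical helper names, and the sample string is an abbreviated version of the output above, not the full table.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def _flush_row(self) -> None:
        # Store the current row if it has any cells, then start a new one.
        if self._row:
            self.rows.append(self._row)
        self._row = []

    def handle_starttag(self, tag, attrs):
        # The output above puts <th> directly inside <thead>, so treat
        # both <thead> and <tr> as row boundaries.
        if tag in ("thead", "tr"):
            self._flush_row()
        elif tag in ("th", "td"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag in ("thead", "tr"):
            self._flush_row()

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Abbreviated sample in the same shape as text_as_html above.
sample = (
    "<table><thead><th>Layer Type</th><th>Sequential Operations</th></thead>"
    "<tr><td>Self-Attention</td><td>O(1)</td></tr>"
    "<tr><td>Recurrent</td><td>O(n)</td></tr></table>"
)
rows = html_table_to_rows(sample)
print(rows)
```

Keeping the table as rows makes it easy to store alongside the text chunks or to re-serialize in another format.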

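Challenge 2: How to rearrange the detected blocks, especially in double-column PDFs

For double-column PDFs, take the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" as an example. The reading order is indicated by the red arrows:

Figure 7

Once the layout is determined, the unstructured framework divides each page into several rectangular blocks, as shown in Figure 8.

Figure 8

The details of each rectangular block can be obtained in the following format: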

[
LayoutElement(bbox=Rectangle(x1=851.1539916992188, y1=181.15073777777613, x2=1467.844970703125, y2=587.8204599999975), text='These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9519357085227966, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=196.5296173095703, y1=181.1507377777777, x2=815.468994140625, y2=512.548237777777), text='word based only on its context. Unlike left-to- right language model pre-training, the MLM ob- jective enables the representation to fuse the left and the right context, which allows us to pre- In addi- train a deep bidirectional Transformer. tion to the masked language model, we also use a “next sentence prediction” task that jointly pre- trains text-pair representations. The contributions of our paper are as follows: ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9517233967781067, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=200.22352600097656, y1=539.1451822222216, x2=825.0242919921875, y2=870.542682222221), text='• We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9414362907409668, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=851.8727416992188, y1=599.8257377777753, x2=1468.0499267578125, y2=1420.4982377777742), text='ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-speciﬁc architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.938507616519928, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=199.3734130859375, y1=900.5257377777765, x2=824.69873046875, y2=1156.648237777776), text='• We show that pre-trained representations reduce the need for many heavily-engineered task- speciﬁc architectures. BERT is the ﬁrst ﬁne- tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-speciﬁc architectures. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9461237788200378, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=195.5695343017578, y1=1185.526123046875, x2=815.9393920898438, y2=1330.3272705078125), text='• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained mod- els are available at https://github.com/ google-research/bert. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9213815927505493, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=195.33956909179688, y1=1360.7886962890625, x2=447.47264000000007, y2=1397.038330078125), text='2 Related Work ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8663332462310791, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=197.7477264404297, y1=1419.3353271484375, x2=817.3308715820312, y2=1527.54443359375), text='There is a long history of pre-training general lan- guage representations, and we brieﬂy review the most widely-used approaches in this section. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.928022563457489, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=851.0028686523438, y1=1468.341394166663, x2=1420.4693603515625, y2=1498.6444497222187), text='2.2 Unsupervised Fine-tuning Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8346447348594666, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=853.5444444444446, y1=1526.3701822222185, x2=1470.989990234375, y2=1669.5843488888852), text='As with the feature-based approaches, the ﬁrst works in this direction only pre-trained word em- (Col- bedding parameters from unlabeled text lobert and Weston, 2008). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9344717860221863, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=200.00000000000009, y1=1556.2037353515625, x2=799.1743774414062, y2=1588.031982421875), text='2.1 Unsupervised Feature-based Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8317819237709045, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=198.64227294921875, y1=1606.3146266666645, x2=815.2886352539062, y2=2125.895459999998), text='Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering signiﬁcant improvements over embeddings learned from scratch (Turian et al., 2010). To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9450697302818298, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=853.4905395507812, y1=1681.5868488888855, x2=1467.8729248046875, y2=2125.8954599999965), text='More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and ﬁne-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre- viously state-of-the-art results on many sentence- level tasks from the GLUE benchmark (Wang language model- Left-to-right et al., 2018a). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9476840496063232, image_path=None, parent=None)
]

Here (x1, y1) is the coordinate of the top-left vertex and (x2, y2) is the coordinate of the bottom-right vertex:

(x1, y1) +------------+
         |            |
         |            |
         |            |
         +------------+ (x2, y2)

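At this point you can rearrange the reading order of the page. Unstructured ships with a built-in sorting algorithm, but I found its results unsatisfactory for double-column layouts.

So an algorithm needs to be designed. The simplest approach is to sort by the horizontal coordinate of the top-left vertex first and, when the horizontal coordinates are equal, by the vertical coordinate. In pseudocode: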

layout.sort(key=lambda z: (z.bbox.x1, z.bbox.y1, z.bbox.x2, z.bbox.y2))

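However, even blocks in the same column can differ in their horizontal coordinates. As Figure 9 shows, the bbox.x1 of the block outlined in purple lies further to the left than that of the block outlined in green. Sorting therefore places it before the green block, which clearly violates the reading order.

Figure 9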
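The problem is easy to reproduce with a few of the left-column blocks from the LayoutElement dump shown earlier. This is a small sketch: Box is a hypothetical stand-in type, the labels are shortened, and the coordinates are rounded.

```python
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x1: float
    y1: float


# Four left-column blocks, listed in true reading order (top to bottom).
left_column = [
    Box("word based", 196.53, 181.15),
    Box("We demonstrate", 200.22, 539.15),
    Box("BERT advances", 195.57, 1185.53),
    Box("2 Related Work", 195.34, 1360.79),
]

# Naive sort: x1 first, then y1. Sub-pixel differences in x1 decide the order.
naive = sorted(left_column, key=lambda b: (b.x1, b.y1))
naive_labels = [b.label for b in naive]
print(naive_labels)
```

The section header near the bottom of the column ends up first, simply because its x1 is a fraction of a pixel smaller than the others.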

In this case, one workable algorithm is as follows:

First, find the smallest top-left x coordinate among all blocks (x1_min).
Then find the largest bottom-right x coordinate among all blocks (x2_max).
Next, compute the x coordinate of the page's center line:

x1_min = min([el.bbox.x1 for el in layout])
x2_max = max([el.bbox.x2 for el in layout])
mid_line_x_coordinate = (x2_max + x1_min) / 2

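Next, if bbox.x1 < mid_line_x_coordinate, the block is classified as part of the left column; otherwise it is treated as part of the right column.

Once classification is complete, sort the blocks in each column by their y coordinate. Finally, append the right column after the left column.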

left_column = []
right_column = []
for el in layout:
    if el.bbox.x1 < mid_line_x_coordinate:
        left_column.append(el)
    else:
        right_column.append(el)

left_column.sort(key=lambda z: z.bbox.y1)
right_column.sort(key=lambda z: z.bbox.y1)
sorted_layout = left_column + right_column
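The steps above can be put together into one self-contained sketch. Block and sort_two_columns are hypothetical names; Block is a stand-in that keeps only the fields the sort needs, and the sample coordinates are rounded from the LayoutElement dump shown earlier. The sketch assumes a page with exactly two columns and no full-width elements such as a title or a wide table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Stand-in for a layout element: label plus the coordinates used here."""
    label: str
    x1: float
    y1: float
    x2: float


def sort_two_columns(blocks: List[Block]) -> List[Block]:
    """Order blocks left column first, then right column, each top to bottom."""
    if not blocks:
        return []
    x1_min = min(b.x1 for b in blocks)
    x2_max = max(b.x2 for b in blocks)
    mid_x = (x1_min + x2_max) / 2  # x coordinate of the page's center line
    left = sorted((b for b in blocks if b.x1 < mid_x), key=lambda b: b.y1)
    right = sorted((b for b in blocks if b.x1 >= mid_x), key=lambda b: b.y1)
    return left + right


# Blocks in detection order, as they appear in the dump.
page = [
    Block("These approaches", 851.15, 181.15, 1467.84),
    Block("word based", 196.53, 181.15, 815.47),
    Block("We demonstrate", 200.22, 539.15, 825.02),
    Block("ELMo", 851.87, 599.83, 1468.05),
    Block("BERT advances", 195.57, 1185.53, 815.94),
    Block("2 Related Work", 195.34, 1360.79, 447.47),
]
ordered = [b.label for b in sort_two_columns(page)]
print(ordered)
```

Because column membership is decided by the center line rather than by exact x1 values, small horizontal offsets within a column no longer affect the order.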