改进向量搜索-使用PostgresML和LlamaIndex重新排名

发布日期：2024-07-27 13:09:52 浏览次数： 2679

作者：大数据杂货铺

微信搜一搜，关注“大数据杂货铺”

搜索和重新排名：提高结果相关性

搜索系统通常采用两种主要方法：关键字和语义。关键字搜索将精确的查询词与索引数据库内容匹配，而语义搜索使用 NLP 和机器学习来理解查询上下文和意图。许多有效的系统结合了这两种方法以获得最佳结果。

初始检索后，重新排序可以进一步提高结果相关性。传统的重新排序依赖于历史用户交互数据，但这种方法难以处理新内容，并且需要大量数据才能有效训练。一种先进的替代方法是使用交叉编码器，它直接比较查询结果对的相似性。

交叉编码器直接比较两段文本并计算相似度得分。与传统的语义搜索方法不同，我们无法预先计算交叉编码器的嵌入并在以后重复使用它们。相反，我们必须对每一对想要比较的文本运行交叉编码器，这使得这种方法在计算上非常昂贵，并且不适用于大规模搜索。但是，它对于重新排序我们数据集的子集非常有效，因为它擅长评估新的、未见过的数据，而无需大量用户交互数据进行微调。

交叉编码器弥补了传统重排序系统在深度文本分析方面的局限性，尤其是针对新颖或高度特定内容。它们不依赖大量用户交互数据集进行训练（尽管此类数据仍然很有用），并且擅长处理新数据和以前从未见过的数据。这使得交叉编码器成为在重排序环境中增强搜索结果相关性的绝佳选择。

实施重新排名

我们将使用 LlamaIndex 和 PostgresML 托管索引实现一个简单的重新排名示例。有关 PostgresML 托管索引的更多信息。查看我们关于 LlamaIndex 的公告：使用 LlamaIndex + PostgresML 简化您的 RAG 应用程序架构。

安装所需的依赖项以开始使用：

pip install llama_index llama-index-indices-managed-postgresml

我们将使用 Paul Graham 数据集，可以通过 curl 下载：

mkdir data

curl -o data/paul_graham_essay.txt https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt

PostgresML 托管索引将处理存储、拆分、嵌入和查询我们的文档。我们需要的只是一个数据库连接字符串。如果您还没有，请创建您的 PostgresML 帐户。完成您的个人资料后，您将获得 100 美元的免费积分。

设置 PGML_DATABASE_URL 环境变量：

export PGML_DATABASE_URL="{YOUR_CONNCECTION_STRING}"

让我们创建索引：

from llama_index.core.readers import SimpleDirectoryReader
from llama_index.indices.managed.postgresml import PostgresMLIndex


documents = SimpleDirectoryReader("data").load_data()
index = PostgresMLIndex.from_documents(
    documents, collection_name="llama-index-rerank-example"
)

请注意，collection_name 用于唯一标识您正在使用的索引。

这里我们使用 SimpleDirectoryReader 来加载文档，然后从这些文档构造 PostgresMLIndex。

此工作流程不需要文档预处理。相反，文档会直接发送到 PostgresML，并根据管道规范进行存储、拆分和嵌入。这是使用 PostgresML 托管索引的独特品质。

现在让我们搜索一下！我们可以执行语义搜索，并通过从索引中创建检索器来获取前 2 个结果。

retriever = index.as_retriever(limit=2)
docs = retriever.retrieve("What did the author do as a child?")
for doc in docs:
    print("---------")
    print(f"Id: {doc.id_}")
    print(f"Score: {doc.score}")
    print(f"Text: {doc.text}")

这样做我们得到：

---------

Id: de01b7e1-95f8-4aa0-b4ec-45ef64816e0e

Score: 0.7793415653313153

Text: Wow, I thought, there's an audience. If I write something and put it on the web, anyone can read it. That may seem obvious now, but it was surprising then. In the print era there was a narrow channel to readers, guarded by fierce monsters known as editors. The only way to get an audience for anything you wrote was to get it published as a book, or in a newspaper or magazine. Now anyone could publish anything.



This had been possible in principle since 1993, but not many people had realized it yet. I had been intimately involved with building the infrastructure of the web for most of that time, and a writer as well, and it had taken me 8 years to realize it. Even then it took me several years to understand the implications. It meant there would be a whole new generation of essays. [11]



In the print era, the channel for publishing essays had been vanishingly small. Except for a few officially anointed thinkers who went to the right parties in New York, the only people allowed to publish essays were specialists writing about their specialties. There were so many essays that had never been written, because there had been no way to publish them. Now they could be, and I was going to write them. [12]



I've worked on several different things, but to the extent there was a turning point where I figured out what to work on, it was when I started publishing essays online. From then on I knew that whatever else I did, I'd always write essays too.



---------

Id: de01b7e1-95f8-4aa0-b4ec-45ef64816e0e

Score: 0.7770352826735559

Text: Asterix comics begin by zooming in on a tiny corner of Roman Gaul that turns out not to be controlled by the Romans. You can do something similar on a map of New York City: if you zoom in on the Upper East Side, there's a tiny corner that's not rich, or at least wasn't in 1993. It's called Yorkville, and that was my new home. Now I was a New York artist — in the strictly technical sense of making paintings and living in New York.



I was nervous about money, because I could sense that Interleaf was on the way down. Freelance Lisp hacking work was very rare, and I didn't want to have to program in another language, which in those days would have meant C++ if I was lucky. So with my unerring nose for financial opportunity, I decided to write another book on Lisp. This would be a popular book, the sort of book that could be used as a textbook. I imagined myself living frugally off the royalties and spending all my time painting. (The painting on the cover of this book, ANSI Common Lisp, is one that I painted around this time.)



The best thing about New York for me was the presence of Idelle and Julian Weber. Idelle Weber was a painter, one of the early photorealists, and I'd taken her painting class at Harvard. I've never known a teacher more beloved by her students. Large numbers of former students kept in touch with her, including me. After I moved to New York I became her de facto studio assistant.

这些结果还不错，但并不完美。让我们尝试使用交叉编码器进行重新排序。

retriever = index.as_retriever(
    limit=2,
    rerank={
        "model": "mixedbread-ai/mxbai-rerank-base-v1",
        "num_documents_to_rerank": 100
    }
)
docs = retriever.retrieve("What did the author do as a child?")
for doc in docs:
    print("---------")
    print(f"Id: {doc.id_}")
    print(f"Score: {doc.score}")
    print(f"Text: {doc.text}")

在这里，我们将检索器配置为返回排名前两个的文档，但这次，我们添加了一个重新排名参数以使用mixedbread-ai/mxbai-rerank-base-v1模型。这意味着我们的初始语义搜索将返回 100 个结果，然后由mixedbread-ai/mxbai-rerank-base-v1模型对这些结果进行重新排名，并且仅显示排名前两个的结果。

运行此输出：

Id: de01b7e1-95f8-4aa0-b4ec-45ef64816e0e
Score: 0.17803585529327393
Text: What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.


---------
Id: de01b7e1-95f8-4aa0-b4ec-45ef64816e0e
Score: 0.1057136133313179
Text: I wanted not just to build things, but to build things that would last.

In this dissatisfied state I went in 1988 to visit Rich Draves at CMU, where he was in grad school. One day I went to visit the Carnegie Institute, where I'd spent a lot of time as a kid. While looking at a painting there I realized something that might seem obvious, but was a big surprise to me. There, right on the wall, was something you could make that would last. Paintings didn't become obsolete. Some of the best ones were hundreds of years old.

And moreover this was something you could make a living doing. Not as easily as you could by writing software, of course, but I thought if you were really industrious and lived really cheaply, it had to be possible to make enough to survive. And as an artist you could be truly independent. You wouldn't have a boss, or even need to get research funding.

I had always liked looking at paintings. Could I make them? I had no idea. I'd never imagined it was even possible. I knew intellectually that people made art — that it didn't just appear spontaneously — but it was as if the people who made it were a different species. They either lived long ago or were mysterious geniuses doing strange things in profiles in Life magazine. The idea of actually being able to make art, to put that verb before that noun, seemed almost miraculous.

这些结果好多了！我们可以看到，最上面的文档回答了用户的问题。请注意，我们不必指定用于重新排名的第三方 API。再一次，PostgresML 使用数据库中的交叉编码器来处理重新排名。

我们可以在 RAG 中直接使用重新排序：

query_engine = index.as_query_engine(
    streaming=True,
    vector_search_limit=2,
    vector_search_rerank={
        "model": "mixedbread-ai/mxbai-rerank-base-v1",
        "num_documents_to_rerank": 100,
    },
)
results = query_engine.query("What did the author do as a child?")
for text in results.response_gen:
    print(text, end="", flush=True)

运行此输出：

Based on the context information, as a child, the author worked on writing (writing short stories) and programming (on the IBM 1401 using Fortran) outside of school.

这正是我们想要的答案！

重新排序可带来更好的结果

搜索可能很复杂。使用交叉编码器进行重新排序可以通过比较文本对并有效处理新数据来改进搜索。使用 LlamaIndex 和 PostgresML 实现重新排序可以改进搜索结果，在检索增强生成应用程序中提供更精确的答案。

要开始使用 PostgresML 和 LlamaIndex，您可以按照 PostgresML 入门指南设置您的帐户，并将上述示例与您自己的数据一起使用。

53AI，企业落地大模型首选服务商

产品：场景落地咨询+大模型应用平台+行业解决方案

承诺：免费POC验证，效果达标后再合作。零风险落地应用大模型，已交付160+中大型企业