
How to retrieve using multiple vectors per document

It can often be useful to store multiple vectors per document, and there are multiple use cases where this is beneficial. For example, we can embed multiple chunks of a document and associate those embeddings with the parent document, allowing a retriever hit on a chunk to return the larger document.

LangChain implements a base MultiVectorRetriever, which simplifies this process. Much of the complexity lies instead in how to create the multiple vectors per document. This notebook covers some of the common ways to create those vectors and how to use the MultiVectorRetriever.

Methods to create multiple vectors per document include:

  • Smaller chunks: split a document into smaller chunks, and embed those (this is what the ParentDocumentRetriever does).
  • Summary: create a summary for each document, and embed that along with (or instead of) the document.
  • Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.

Note that this also enables another method of adding embeddings: adding them manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control. A sketch of manual addition follows below.
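As a minimal sketch of what manual addition can look like (assuming the retriever, doc_ids, and "doc_id" key constructed in the examples below; the query string itself is hypothetical):

# A sketch of manually adding an embedded query (hypothetical example):
# embed a hand-written query and tag it with the ID of an existing parent
# document, so a vector-store hit on the query returns that full document.
from langchain_core.documents import Document

manual_query = Document(
    page_content="What did the president say about Justice Breyer?",
    metadata={"doc_id": doc_ids[0]},  # ID of a document already in the docstore
)
retriever.vectorstore.add_documents([manual_query])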

Below we walk through an example. First we instantiate some documents. We will index them in an (in-memory) Chroma vector store using OpenAI embeddings, but any LangChain vector store or embeddings model will suffice.

%pip install --upgrade --quiet  langchain-chroma langchain langchain-openai > /dev/null
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("paul_graham_essay.txt"),
    TextLoader("state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

Smaller chunks

It can often be useful to retrieve larger chunks of information, but embed smaller chunks. This allows embeddings to capture the semantic meaning as closely as possible, while passing as much context as possible downstream. Note that this is what the ParentDocumentRetriever does; here we show what is going on under the hood.

We will make a distinction between the vector store, which indexes embeddings of the (sub-)documents, and the document store, which houses the "parent" documents and associates them with an identifier.

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]

We next generate the "sub" documents by splitting the original documents. Note that we store the document identifier in the metadata of the corresponding Document object.

# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

Finally, we index the documents in our vector store and document store:

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

The vector store alone will retrieve small chunks:

retriever.vectorstore.similarity_search("justice breyer")[0]
Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '064eca46-a4c4-4789-8e3b-583f9597e54f', 'source': 'state_of_the_union.txt'})

Whereas the retriever will return the larger parent document:

len(retriever.invoke("justice breyer")[0].page_content)
9875
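
Under the hood, the retriever roughly performs the following steps. This is a simplified sketch of the logic, not the library's actual source:

# 1. Search the vector store for small (sub-document) chunks
matched = retriever.vectorstore.similarity_search("justice breyer")

# 2. Collect the unique parent IDs stored under id_key in each hit's metadata
parent_ids = []
for d in matched:
    if d.metadata[id_key] not in parent_ids:
        parent_ids.append(d.metadata[id_key])

# 3. Fetch the full parent documents from the document store
parent_docs = retriever.docstore.mget(parent_ids)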

The default search type the retriever performs on the vector database is a similarity search. LangChain vector stores also support searching via Max Marginal Relevance, which can be controlled via the search_type parameter of the retriever:

from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr

len(retriever.invoke("justice breyer")[0].page_content)
9875

Associating summaries with a document for retrieval

A summary may be able to distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries, and then embed those.

We construct a simple chain that will receive as input a Document object and generate a summary using an LLM.

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

Note that we can batch the chain across documents:

summaries = chain.batch(docs, {"max_concurrency": 5})

We can then initialize a MultiVectorRetriever as before, indexing the summaries in our vector store and retaining the original documents in our document store:

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
# # We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)

Querying the vector store will return summaries:

sub_docs = retriever.vectorstore.similarity_search("justice breyer")

sub_docs[0]
Document(page_content="President Biden recently nominated Judge Ketanji Brown Jackson to serve on the United States Supreme Court, emphasizing her qualifications and broad support. The President also outlined a plan to secure the border, fix the immigration system, protect women's rights, support LGBTQ+ Americans, and advance mental health services. He highlighted the importance of bipartisan unity in passing legislation, such as the Violence Against Women Act. The President also addressed supporting veterans, particularly those impacted by exposure to burn pits, and announced plans to expand benefits for veterans with respiratory cancers. Additionally, he proposed a plan to end cancer as we know it through the Cancer Moonshot initiative. President Biden expressed optimism about the future of America and emphasized the strength of the American people in overcoming challenges.", metadata={'doc_id': '84015b1b-980e-400a-94d8-cf95d7e079bd'})

Whereas the retriever will return the larger source document:

retrieved_docs = retriever.invoke("justice breyer")

len(retrieved_docs[0].page_content)
9194

Hypothetical queries

An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document, and which might bear close semantic similarity to relevant queries in a RAG application. These questions can then be embedded and associated with the documents to improve retrieval.

Below, we use the with_structured_output method to structure the LLM output as a list of strings.

from typing import List

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class HypotheticalQuestions(BaseModel):
    """Generate hypothetical questions."""

    questions: List[str] = Field(..., description="List of questions")


chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o").with_structured_output(
        HypotheticalQuestions
    )
    | (lambda x: x.questions)
)

Invoking the chain on a single document demonstrates that it outputs a list of questions:

chain.invoke(docs[0])
["What impact did the IBM 1401 have on the author's early programming experiences?",
"How did the transition from using the IBM 1401 to microcomputers influence the author's programming journey?",
"What role did Lisp play in shaping the author's understanding and approach to AI?"]

We can batch the chain over all documents and assemble our vector store and document store as before:

# Batch chain over documents to generate hypothetical questions
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})


# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]


# Generate Document objects from hypothetical questions
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )


retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

Note that querying the underlying vector store will retrieve hypothetical questions that are semantically similar to the input query:

sub_docs = retriever.vectorstore.similarity_search("justice breyer")

sub_docs
[Document(page_content='What might be the potential benefits of nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to the United States Supreme Court?', metadata={'doc_id': '43292b74-d1b8-4200-8a8b-ea0cb57fbcdb'}),
 Document(page_content='How might the Bipartisan Infrastructure Law impact the economic competition between the U.S. and China?', metadata={'doc_id': '66174780-d00c-4166-9791-f0069846e734'}),
 Document(page_content='What factors led to the creation of Y Combinator?', metadata={'doc_id': '72003c4e-4cc9-4f09-a787-0b541a65b38c'}),
 Document(page_content='How did the ability to publish essays online change the landscape for writers and thinkers?', metadata={'doc_id': 'e8d2c648-f245-4bcc-b8d3-14e64a164b64'})]

And invoking the retriever will return the corresponding document:

retrieved_docs = retriever.invoke("justice breyer")
len(retrieved_docs[0].page_content)
9194