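How to use the LangChain indexing API

Here, we will look at a basic indexing workflow using the LangChain indexing API.

The indexing API lets you load documents from any source and keep them in sync with a vector store. Specifically, it helps:

Avoid writing duplicated content into the vector store
Avoid re-writing unchanged content
Avoid re-computing embeddings over unchanged content

All of this should save you time and money, as well as improve your vector search results.

Crucially, the indexing API keeps working with respect to the original source documents even when the documents have gone through several transformation steps (e.g., text chunking).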

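How it works

LangChain indexing uses a record manager (RecordManager) that keeps track of document writes into the vector store.

When indexing content, a hash is computed for each document, and the following information is stored in the record manager:

the document hash (a hash of both page content and metadata)
the write time
the source ID: each document should include information in its metadata that lets us determine the ultimate source of the document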
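As a rough illustration of the three stored fields, the sketch below hashes page content together with metadata and builds a record from the result. The `Record` dataclass, the `hash_document` and `make_record` helpers, and the choice of SHA-256 over a JSON payload are all assumptions made for this example; they are not LangChain's actual schema or hashing scheme.

```python
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical shape of one record-manager row (not the library's schema)."""

    doc_hash: str
    written_at: float
    source_id: str


def hash_document(page_content: str, metadata: dict) -> str:
    """Hash content and metadata together so a change in either is detected."""
    # sort_keys makes the hash independent of metadata key order.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_record(page_content: str, metadata: dict, source_id_key: str = "source") -> Record:
    """Build a record holding the hash, the write time and the source ID."""
    return Record(
        doc_hash=hash_document(page_content, metadata),
        written_at=time.time(),
        source_id=metadata[source_id_key],
    )


a = make_record("kitty", {"source": "kitty.txt"})
b = make_record("kitty", {"source": "kitty.txt"})
c = make_record("kitty!", {"source": "kitty.txt"})
print(a.doc_hash == b.doc_hash)  # True: identical documents share a hash
print(a.doc_hash == c.doc_hash)  # False: a content change yields a new hash
```

Because identical documents map to the same hash, a record manager built along these lines can recognize content it has already written and skip it.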

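Deletion modes

When indexing documents into a vector store, it is possible that some existing documents in the vector store should be deleted. In certain situations you may want to remove any existing documents that are derived from the same sources as the new documents being indexed. In others you may want to delete all existing documents wholesale. The indexing API deletion modes let you pick the behavior you want: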

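Cleanup Mode	De-Duplicates Content	Parallelizable	Cleans Up Deleted Source Docs	Cleans Up Mutations of Source Docs and/or Derived Docs	Clean Up Timing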
None	✅	✅	❌	❌	-
Incremental	✅	✅	❌	✅	Continuously
Full	✅	❌	✅	✅	At end of indexing
Scoped_Full	✅	✅	❌	✅	At end of indexing

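None does not do any automatic clean up, leaving the user to clean up old content manually.

incremental, full and scoped_full offer the following automated clean up:

If the content of the source document or derived documents has changed, all three modes clean up (delete) previous versions of the content.
If the source document has been deleted (meaning it is not included in the documents currently being indexed), the full cleanup mode deletes it from the vector store correctly, but the incremental and scoped_full modes do not.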
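The difference between the modes can be mimicked with a toy in-memory model. The `toy_index` function below is a hypothetical stand-in written for this guide, not LangChain's implementation: it covers only None, "incremental" and "full", ignores embeddings, and does not model when the clean up happens.

```python
import hashlib


def toy_index(docs, store, cleanup=None):
    """Toy model of the cleanup modes; `store` maps doc hash -> source.

    `docs` is a list of (content, source) pairs.
    """
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}

    # Hash every document; the dict also de-duplicates the batch.
    seen = {}
    for content, source in docs:
        key = hashlib.sha256(f"{source}\x00{content}".encode()).hexdigest()
        seen[key] = source

    # Write only what the store does not already hold.
    for key, source in seen.items():
        if key in store:
            stats["num_skipped"] += 1
        else:
            store[key] = source
            stats["num_added"] += 1

    if cleanup == "incremental":
        # Delete stale entries only for sources seen in this run.
        touched = set(seen.values())
        stale = [k for k, s in store.items() if s in touched and k not in seen]
    elif cleanup == "full":
        # Delete everything that was not passed in this run.
        stale = [k for k in store if k not in seen]
    else:
        stale = []

    for key in stale:
        del store[key]
    stats["num_deleted"] = len(stale)
    return stats


store = {}
toy_index([("kitty", "kitty.txt"), ("doggy", "doggy.txt")], store, "incremental")

# doggy.txt changed: incremental replaces its old version, kitty.txt is untouched.
r1 = toy_index([("puppy", "doggy.txt")], store, "incremental")
print(r1)  # {'num_added': 1, 'num_skipped': 0, 'num_deleted': 1}

# kitty.txt is absent from this run: only full mode removes it.
r2 = toy_index([("puppy", "doggy.txt")], store, "full")
print(r2)  # {'num_added': 0, 'num_skipped': 1, 'num_deleted': 1}
```

In this model, incremental clean up is scoped to the sources present in the current batch, which is why a deleted source goes unnoticed unless full mode is used.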

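When content is mutated (e.g., the source PDF file was revised), there is a period during indexing when both the new and old versions may be returned to the user. This happens after the new content has been written but before the old version has been deleted.

incremental indexing minimizes this period, since it is able to clean up continuously as it writes.
full and scoped_full modes do the clean up after all batches have been written.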

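Requirements

Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that those records were inserted previously.
Only works with LangChain vector stores that support:
- document addition by ID (the add_documents method with the ids argument)
- deletion by ID (the delete method with the ids argument)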
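A quick pre-flight check along these lines can catch an incompatible store early. `supports_id_operations` and `ToyStore` are hypothetical names invented for this sketch; the check only confirms that the two methods exist and cannot verify that they really honor the `ids` argument.

```python
def supports_id_operations(store: object) -> bool:
    """True if `store` exposes callable add_documents and delete attributes."""
    return all(
        callable(getattr(store, name, None)) for name in ("add_documents", "delete")
    )


class ToyStore:
    """Hypothetical in-memory store keyed by ID, for illustration only."""

    def __init__(self):
        self.docs = {}

    def add_documents(self, documents, ids):
        # Writing under caller-supplied IDs is what makes later deletion possible.
        self.docs.update(zip(ids, documents))
        return list(ids)

    def delete(self, ids):
        for doc_id in ids:
            self.docs.pop(doc_id, None)


store = ToyStore()
store.add_documents(["kitty", "doggy"], ids=["a", "b"])
store.delete(ids=["a"])
print(supports_id_operations(store), sorted(store.docs))  # True ['b']
```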

Compatible vector stores: Aerospike, AnalyticDB, AstraDB, AwaDB, AzureCosmosDBNoSqlVectorSearch, AzureCosmosDBVectorSearch, AzureSearch, Bagel, Cassandra, Chroma, CouchbaseVectorStore, DashVector, DatabricksVectorSearch, DeepLake, Dingo, ElasticVectorSearch, ElasticsearchStore, FAISS, HanaDB, Milvus, MongoDBAtlasVectorSearch, MyScale, OpenSearchVectorSearch, PGVector, Pinecone, Qdrant, Redis, Rockset, ScaNN, SingleStoreDB, SupabaseVectorStore, SurrealDBStore, TimescaleVector, Vald, VDMS, Vearch, VespaStore, Weaviate, Yellowbrick, ZepVectorStore, TencentVectorDB.

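Caution

The record manager relies on a time-based mechanism to determine what content can be cleaned up (when using the full, incremental or scoped_full cleanup modes).

If two tasks run back-to-back, and the first task finishes before the clock time changes, then the second task may not be able to clean up content.

This is unlikely to be an issue in practice, for the following reasons:

The RecordManager uses higher-resolution timestamps.
The data would need to change between the first and the second task runs, which becomes unlikely if the time interval between the tasks is small.
Indexing tasks typically take more than a few milliseconds.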
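To see why the clock matters, consider a simplified model in which a record counts as stale when its last write is strictly older than the start of the current run. That comparison rule and the `stale_keys` helper are assumptions made for this sketch; the text above only states that the mechanism is time-based.

```python
def stale_keys(records: dict, run_started_at: float) -> list:
    """Keys whose last write is strictly older than the current run's start."""
    return sorted(k for k, written_at in records.items() if written_at < run_started_at)


# Run 1 writes two records at t = 100.0.
records = {"doc-a": 100.0, "doc-b": 100.0}

# Run 2 starts at t = 100.5 and re-writes only doc-a, so doc-b looks stale.
records["doc-a"] = 100.5
print(stale_keys(records, run_started_at=100.5))  # ['doc-b']

# If the clock had not advanced between the two runs, nothing looks stale.
records = {"doc-a": 100.0, "doc-b": 100.0}
print(stale_keys(records, run_started_at=100.0))  # []
```

Under this model the second case is the failure described above: both runs observe the same timestamp, so the second run has no way to tell old writes from its own.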

Quickstart

from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai import OpenAIEmbeddings


Initialize a vector store and set up the embeddings:

collection_name = "test_index"

embedding = OpenAIEmbeddings()

vectorstore = ElasticsearchStore(
    es_url="http://localhost:9200", index_name="test_index", embedding=embedding
)

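Initialize a record manager with an appropriate namespace.

Suggestion: use a namespace that takes into account both the vector store and the collection name within the vector store; e.g., 'redis/my_docs', 'chromadb/my_docs' or 'postgres/my_docs'.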

namespace = f"elasticsearch/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)

Create a schema before using the record manager.

record_manager.create_schema()

Let's index some test documents:

doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})

Indexing into an empty vector store:

def _clear():
    """Hacky helper method to clear content. See the `full` mode section to understand why it works."""
    index([], record_manager, vectorstore, cleanup="full", source_id_key="source")

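`None` deletion mode

This mode does not automatically clean up old versions of content; however, it still takes care of content de-duplication.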

_clear()

index(
    [doc1, doc1, doc1, doc1, doc1],
    record_manager,
    vectorstore,
    cleanup=None,
    source_id_key="source",
)

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

_clear()

index([doc1, doc2], record_manager, vectorstore, cleanup=None, source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

From the second run onward, all content is skipped:

index([doc1, doc2], record_manager, vectorstore, cleanup=None, source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

`"incremental"` deletion mode

_clear()

index(
    [doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

Indexing again should result in both documents being skipped, which also skips the embedding operation!

index(
    [doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

If we provide no documents in incremental indexing mode, nothing changes.

index([], record_manager, vectorstore, cleanup="incremental", source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

If we mutate a document, the new version is written and all old versions sharing the same source are deleted.

changed_doc_2 = Document(page_content="puppy", metadata={"source": "doggy.txt"})

index(
    [changed_doc_2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}

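`"full"` deletion mode

In full mode, the user should pass the full universe of content that should be indexed into the indexing function.

Any documents that are not passed into the indexing function but are present in the vector store will be deleted!

This behavior is useful for handling deletions of source documents.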

_clear()

all_docs = [doc1, doc2]

index(all_docs, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

Say someone deleted the first document:

del all_docs[0]

all_docs

[Document(page_content='doggy', metadata={'source': 'doggy.txt'})]

Using full mode cleans up the deleted content as well.

index(all_docs, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}

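Source

The metadata attribute contains a field called source. This source should point to the ultimate provenance associated with the given document.

For example, if these documents represent chunks of some parent document, the source for both documents should be the same and reference the parent document.

In general, source should always be specified. Only use None if you never intend to use incremental mode and for some reason cannot specify the source field correctly.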

from langchain_text_splitters import CharacterTextSplitter


doc1 = Document(
    page_content="kitty kitty kitty kitty kitty", metadata={"source": "kitty.txt"}
)
doc2 = Document(page_content="doggy doggy the doggy", metadata={"source": "doggy.txt"})

new_docs = CharacterTextSplitter(
    separator="t", keep_separator=True, chunk_size=12, chunk_overlap=2
).split_documents([doc1, doc2])
new_docs

[Document(page_content='kitty kit', metadata={'source': 'kitty.txt'}),
 Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),
 Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),
 Document(page_content='doggy doggy', metadata={'source': 'doggy.txt'}),
 Document(page_content='the doggy', metadata={'source': 'doggy.txt'})]

_clear()

index(
    new_docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 5, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

changed_doggy_docs = [
    Document(page_content="woof woof", metadata={"source": "doggy.txt"}),
    Document(page_content="woof woof woof", metadata={"source": "doggy.txt"}),
]

This should delete the old versions of documents associated with the doggy.txt source and replace them with the new versions.

index(
    changed_doggy_docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 2}

vectorstore.similarity_search("dog", k=30)

[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),
 Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),
 Document(page_content='kitty kit', metadata={'source': 'kitty.txt'})]

Using with loaders

Indexing can accept either an iterable of documents or any loader.

Attention: the loader must set source keys correctly.
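One way to guard against a loader that forgets the source key is to inspect its output before indexing. The helper below is a hypothetical sketch: `docs_missing_source` is not part of LangChain, and `SimpleNamespace` objects stand in for documents because only a `.metadata` attribute is read.

```python
from types import SimpleNamespace


def docs_missing_source(docs, source_id_key="source"):
    """Return positions of documents whose metadata has no non-empty source value."""
    missing = []
    for position, doc in enumerate(docs):
        metadata = getattr(doc, "metadata", None) or {}
        if not metadata.get(source_id_key):
            missing.append(position)
    return missing


# SimpleNamespace stands in for a Document here; only `.metadata` is read.
docs = [
    SimpleNamespace(page_content="woof woof", metadata={"source": "doggy.txt"}),
    SimpleNamespace(page_content="meow", metadata={}),
]
print(docs_missing_source(docs))  # [1]
```

Running a check like this on the loader's output before indexing makes a missing source visible up front rather than during clean up.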

from langchain_core.document_loaders import BaseLoader


class MyCustomLoader(BaseLoader):
    def lazy_load(self):
        text_splitter = CharacterTextSplitter(
            separator="t", keep_separator=True, chunk_size=12, chunk_overlap=2
        )
        docs = [
            Document(page_content="woof woof", metadata={"source": "doggy.txt"}),
            Document(page_content="woof woof woof", metadata={"source": "doggy.txt"}),
        ]
        yield from text_splitter.split_documents(docs)

    def load(self):
        return list(self.lazy_load())


_clear()

loader = MyCustomLoader()

loader.load()

[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]

index(loader, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

vectorstore.similarity_search("dog", k=30)

[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]
