MongoDB Atlas

This notebook covers how to use MongoDB Atlas vector search in LangChain, using the langchain-mongodb package.

MongoDB Atlas is a fully-managed cloud database available in AWS, Azure, and GCP. It supports native Vector Search, full text search (BM25), and hybrid search on your MongoDB document data.

MongoDB Atlas Vector Search lets you store your embeddings in MongoDB documents, create a vector search index, and perform KNN search with an approximate nearest neighbor algorithm (Hierarchical Navigable Small Worlds). It uses the $vectorSearch MQL stage.
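To make the $vectorSearch stage more concrete, here is a minimal sketch of the aggregation pipeline that a vector query boils down to. The helper name `build_vector_search_pipeline` and the `10 * k` default for `numCandidates` are assumptions made for this illustration, not part of langchain-mongodb; the stage fields (`index`, `path`, `queryVector`, `numCandidates`, `limit`, `filter`) are those of the $vectorSearch stage.

```python
from typing import Any, Optional


def build_vector_search_pipeline(
    query_vector: list[float],
    index_name: str,
    path: str = "embedding",
    k: int = 4,
    num_candidates: Optional[int] = None,
    pre_filter: Optional[dict[str, Any]] = None,
) -> list[dict[str, Any]]:
    """Build an aggregation pipeline whose first stage is $vectorSearch."""
    stage: dict[str, Any] = {
        "index": index_name,
        "path": path,
        "queryVector": query_vector,
        # numCandidates is how many neighbors the ANN search considers;
        # 10 * k is an illustrative default chosen for this sketch.
        "numCandidates": num_candidates if num_candidates is not None else k * 10,
        "limit": k,
    }
    if pre_filter:
        stage["filter"] = pre_filter
    return [
        {"$vectorSearch": stage},
        # Expose the similarity score next to each returned document.
        {"$set": {"score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = build_vector_search_pipeline(
    query_vector=[0.1, 0.2, 0.3],  # placeholder; a real query uses the embedding model's output
    index_name="langchain-test-index-vectorstores",
    k=2,
)
print(pipeline[0]["$vectorSearch"]["limit"])
```

With a live cluster, a pipeline like this would be passed to pymongo's `collection.aggregate(pipeline)`. The vector store class used in this notebook builds and runs the query for you, so this is only meant to show what happens underneath.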

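Setup

Prerequisite: an Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs).

To use MongoDB Atlas, you must first deploy a cluster. We offer a free cluster tier, and you can deploy it on any of the supported clouds. To get started, head over to the Atlas quick start guide.

You need to install langchain-mongodb and pymongo to use this integration.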

pip install -qU langchain-mongodb pymongo

Credentials

For this notebook you will need your MongoDB cluster URI.

For information on finding your cluster URI, read through this guide.

import getpass

MONGODB_ATLAS_CLUSTER_URI = getpass.getpass("MongoDB Atlas Cluster URI:")

If you want best-in-class automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Initialization

pip install -qU langchain-openai

import getpass
import os

# Prompt for the OpenAI key only if it is not already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
from langchain_mongodb import MongoDBAtlasVectorSearch
from pymongo import MongoClient

# initialize MongoDB python client
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)

DB_NAME = "langchain_test_db"
COLLECTION_NAME = "langchain_test_vectorstores"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "langchain-test-index-vectorstores"

MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

vector_store = MongoDBAtlasVectorSearch(
    collection=MONGODB_COLLECTION,
    embedding=embeddings,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    relevance_score_fn="cosine",
)

# Create vector search index on the collection
# The index dimensions must match the embedding size: 1536 fits text-embedding-ada-002 and
# text-embedding-3-small, while text-embedding-3-large returns 3072 by default. If you keep the
# model above, use dimensions=3072 here or pass dimensions=1536 to OpenAIEmbeddings.
vector_store.create_vector_search_index(dimensions=1536)

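[Optional] As an alternative to the vector_store.create_vector_search_index command above, you can create the vector search index through the Atlas UI with the following index definition: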

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
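Because the index only works when numDimensions equals the length of the vectors you store, it can help to generate the definition from one place and check a sample embedding against it. The sketch below is one way to do that; `vector_index_definition` and `check_dimensions` are hypothetical helper names introduced here, and the sample vector is a placeholder for what `embeddings.embed_query(...)` would return.

```python
import json
from typing import Any, Sequence


def vector_index_definition(
    num_dimensions: int,
    path: str = "embedding",
    similarity: str = "cosine",
    filter_paths: Sequence[str] = (),
) -> dict[str, Any]:
    """Return an Atlas Vector Search index definition as a plain dict."""
    fields: list[dict[str, Any]] = [
        {
            "type": "vector",
            "path": path,
            "numDimensions": num_dimensions,
            "similarity": similarity,
        }
    ]
    # Optional filter fields enable pre-filtering on metadata.
    fields.extend({"type": "filter", "path": p} for p in filter_paths)
    return {"fields": fields}


def check_dimensions(vector: Sequence[float], definition: dict[str, Any]) -> None:
    """Raise ValueError if the vector length differs from the index definition."""
    vector_field = next(f for f in definition["fields"] if f["type"] == "vector")
    expected = vector_field["numDimensions"]
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, index expects {expected}"
        )


definition = vector_index_definition(1536)
# Placeholder vector; in the notebook this would come from embeddings.embed_query("test").
sample_vector = [0.0] * 1536
check_dimensions(sample_vector, definition)
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the Atlas UI. Running the check against one real embedding before creating the index catches a model and index mismatch early.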

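Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

Add items to vector store

You can add items to the vector store by using the add_documents function.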

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)
API Reference: Document
['03ad81e8-32a0-46f0-b7d8-f5b977a6b52a',
'8396a68d-f4a3-4176-a581-a1a8c303eea4',
'e7d95150-67f6-499f-b611-84367c50fa60',
'8c31b84e-2636-48b6-8b99-9fccb47f7051',
'aa02e8a2-a811-446a-9785-8cea0faba7a9',
'19bd72ff-9766-4c3b-b1fd-195c732c562b',
'642d6f2f-3e34-4efa-a1ed-c4ba4ef0da8d',
'7614bb54-4eb5-4b3b-990c-00e35cb31f99',
'69e18c67-bf1b-43e5-8a6e-64fb3f240e52',
'30d599a7-4a1a-47a9-bbf8-6ed393e2e33c']

Delete items from vector store

vector_store.delete(ids=[uuids[-1]])
True

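Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

A simple similarity search can be performed as follows: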

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy", k=2
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
* Building an exciting new project with LangChain - come check it out! [{'_id': 'e7d95150-67f6-499f-b611-84367c50fa60', 'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'_id': '7614bb54-4eb5-4b3b-990c-00e35cb31f99', 'source': 'tweet'}]

Similarity search with score

You can also run a search that returns the similarity score of each result:

results = vector_store.similarity_search_with_score("Will it be hot tomorrow?", k=1)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
* [SIM=0.784560] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'_id': '8396a68d-f4a3-4176-a581-a1a8c303eea4', 'source': 'news'}]
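To help interpret the score, here is a minimal sketch of how a cosine-based score in the range 0 to 1 can be derived from two vectors. It assumes the normalization score = (1 + cosine) / 2 that MongoDB documents for cosine similarity in Atlas Vector Search; treat that formula as an assumption to verify against the current Atlas documentation. The function names and the tiny 2-dimensional vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed Atlas normalization)."""
    return (1 + cosine_similarity(a, b)) / 2


same_direction = normalized_score([1.0, 0.0], [1.0, 0.0])
orthogonal = normalized_score([1.0, 0.0], [0.0, 1.0])
opposite = normalized_score([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Under this assumption, a score near 1 means the vectors point in almost the same direction, 0.5 means they are unrelated, and values near 0 mean they point in opposite directions.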

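Atlas Vector Search supports pre-filtering using MQL operators. Below is an example index and query on the same data loaded above that lets you filter on the "source" metadata field. You can update your existing index to define the filter field and then pre-filter during vector search.

To enable pre-filtering, you need to update the index definition to include a filter field. In this example, we use the source field as the filter field.

This can be done programmatically with the MongoDBAtlasVectorSearch.create_vector_search_index method.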

# update=True replaces the existing index definition instead of creating a new index
vector_store.create_vector_search_index(
    dimensions=1536,
    filters=[{"type": "filter", "path": "source"}],
    update=True,
)

Alternatively, you can update the index through the Atlas UI with the following index definition:

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "source"
    }
  ]
}

You can then run a query with a filter as follows:

results = vector_store.similarity_search(
    query="foo", k=1, pre_filter={"source": {"$eq": "https://example.com"}}
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
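The pre_filter argument is an ordinary MQL filter document, so it can be built programmatically. The sketch below shows one possible helper for the source field used in this notebook; `source_pre_filter` is a hypothetical name introduced for illustration. It uses `$eq` for a single value and `$in` for several, both of which are operators that Atlas Vector Search accepts in filters on fields indexed with type "filter".

```python
from typing import Any


def source_pre_filter(*sources: str) -> dict[str, Any]:
    """Build an MQL pre-filter that matches documents from the given sources."""
    if not sources:
        raise ValueError("At least one source value is required")
    if len(sources) == 1:
        # Single value: exact match
        return {"source": {"$eq": sources[0]}}
    # Several values: match any of them
    return {"source": {"$in": list(sources)}}


tweets_only = source_pre_filter("tweet")
tweets_or_news = source_pre_filter("tweet", "news")
print(tweets_only)
print(tweets_or_news)
```

A filter built this way would be passed as, for example, `vector_store.similarity_search("some query", k=2, pre_filter=tweets_or_news)`. With the sample documents loaded earlier, filtering on "tweet" or "news" returns results, whereas the URL value in the example above matches none of them.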

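Other search methods

There are other search methods not covered in this notebook, such as MMR search or searching by vector. For the full list of search capabilities available for MongoDBAtlasVectorSearch, see the API reference.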
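For readers unfamiliar with MMR (maximal marginal relevance), the following is a small, self-contained sketch of the textbook algorithm: it greedily picks results that are relevant to the query while penalizing similarity to results already chosen. This is not the langchain-mongodb implementation; the function names, the `lambda_mult` default, and the 2-dimensional toy vectors are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def unit(degrees: float) -> list[float]:
    """2-D unit vector at the given angle, used as a toy embedding."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> list[int]:
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already picked (0 when nothing is picked yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = unit(30)
candidates = [unit(20), unit(21), unit(60)]  # the first two are near-duplicates
diverse = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the selection reduces to plain relevance ranking and returns both near-duplicates; lowering it trades some relevance for diversity. In LangChain, the corresponding vector store method is `max_marginal_relevance_search`; check the API reference for the parameters this integration accepts.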

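Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.

Here is how to transform your vector store into a retriever and then invoke the retriever with a simple query and a score threshold.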

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 1, "score_threshold": 0.2},
)
retriever.invoke("Stealing from the bank is a crime")
[Document(metadata={'_id': '8c31b84e-2636-48b6-8b99-9fccb47f7051', 'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]
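The similarity_score_threshold search type can be thought of as a filter over scored results: anything whose relevance score falls below score_threshold is dropped, and at most k results are returned. The sketch below illustrates that behavior on made-up (text, score) pairs; `filter_by_score` and the scores are hypothetical and only stand in for what the retriever computes internally.

```python
from typing import Sequence


def filter_by_score(
    scored: Sequence[tuple[str, float]], k: int, score_threshold: float
) -> list[tuple[str, float]]:
    """Keep results scoring at least score_threshold, best first, at most k of them."""
    kept = [(text, score) for text, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Made-up scores for illustration; real scores come from the vector store.
scored_results = [
    ("Robbers broke into the city bank and stole $1 million in cash.", 0.81),
    ("The stock market is down 500 points today due to fears of a recession.", 0.62),
    ("The top 10 soccer players in the world right now.", 0.15),
]
top = filter_by_score(scored_results, k=1, score_threshold=0.2)
print(top)
```

A higher threshold makes the retriever return fewer, more confident matches, and it can return an empty list when nothing clears the bar.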

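Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the RAG sections of the LangChain documentation.

Other notes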

  • More documentation can be found at MongoDB's LangChain Docs site
  • This feature is Generally Available and ready for production deployments.
  • LangChain version 0.0.305 (release notes) introduced support for the $vectorSearch MQL stage, which is available with MongoDB Atlas 6.0.11 and 7.0.2. Users on earlier versions of MongoDB Atlas need to pin their LangChain version to <=0.0.304.

API reference

For detailed documentation of all MongoDBAtlasVectorSearch features and configurations, head to the API reference: https://python.langchain.com/api_reference/mongodb/index.html