Couchbase

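Couchbase is an award-winning distributed NoSQL cloud database that delivers unmatched versatility, performance, scalability, and financial value for all of your cloud, mobile, AI, and edge computing applications. Couchbase embraces AI with coding assistance for developers and vector search for their applications.

Vector Search is a part of the Full Text Search Service (Search Service) in Couchbase.

This tutorial explains how to use Vector Search in Couchbase. You can work with either Couchbase Capella or a self-managed Couchbase Server.

Setup

To access CouchbaseSearchVectorStore, you first need to install the langchain-couchbase partner package: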

pip install -qU langchain-couchbase

Credentials

Head over to the Couchbase website and create a new connection, and make sure to save your database username and password:

import getpass

COUCHBASE_CONNECTION_STRING = getpass.getpass(
"Enter the connection string for the Couchbase cluster: "
)
DB_USERNAME = getpass.getpass("Enter the username for the Couchbase cluster: ")
DB_PASSWORD = getpass.getpass("Enter the password for the Couchbase cluster: ")

If you want best-in-class automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# import os
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

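Initialization

Before instantiating the vector store, we need to create a connection.

Create Couchbase Connection Object

We first create a connection to the Couchbase cluster and then pass the cluster object to the vector store.

Here, we connect using the username and password entered above. You can also connect to your cluster in any other supported way.

For more information on connecting to the Couchbase cluster, please check the documentation.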

from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)

# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))
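
A connection string typed into a prompt can easily pick up stray whitespace or lose its scheme. The following is a minimal sketch of a check you could run before creating the Cluster object. It assumes the usual couchbase:// (plain) and couchbases:// (TLS, as used by Capella) schemes; check_connection_string is a hypothetical helper introduced here, not part of the Couchbase SDK.

```python
def check_connection_string(conn_str: str) -> str:
    """Return the stripped connection string, or raise ValueError on an unexpected scheme."""
    cleaned = conn_str.strip()
    # Assumption: only these two schemes are accepted for SDK connections.
    if not cleaned.startswith(("couchbase://", "couchbases://")):
        raise ValueError(
            "Connection string should start with couchbase:// or couchbases://"
        )
    return cleaned


# Example with a made-up local address.
conn = check_connection_string("  couchbase://localhost  ")
```

The cleaned value can then be passed to Cluster in place of the raw input.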

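We will now set the names of the bucket, scope, and collection in the Couchbase cluster that we want to use for Vector Search.

For this example, we use the default scope and collection.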

BUCKET_NAME = "langchain_bucket"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "langchain-test-index"

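For details on how to create a Search index with support for vector fields, please refer to the documentation.

Simple Instantiation

Below, we create the vector store object with the cluster information and the Search index name.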

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore

vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
)

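Specify the Text & Embeddings Field

You can optionally specify the text and embeddings fields of the document using the text_key and embedding_key parameters.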

vector_store_specific = CouchbaseSearchVectorStore(
cluster=cluster,
bucket_name=BUCKET_NAME,
scope_name=SCOPE_NAME,
collection_name=COLLECTION_NAME,
embedding=embeddings,
index_name=SEARCH_INDEX_NAME,
text_key="text",
embedding_key="embedding",
)

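Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

Add items to vector store

We can add items to our vector store by using the add_documents function.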

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
metadata={"source": "tweet"},
)

document_2 = Document(
page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
metadata={"source": "news"},
)

document_3 = Document(
page_content="Building an exciting new project with LangChain - come check it out!",
metadata={"source": "tweet"},
)

document_4 = Document(
page_content="Robbers broke into the city bank and stole $1 million in cash.",
metadata={"source": "news"},
)

document_5 = Document(
page_content="Wow! That was an amazing movie. I can't wait to see it again.",
metadata={"source": "tweet"},
)

document_6 = Document(
page_content="Is the new iPhone worth the price? Read this review to find out.",
metadata={"source": "website"},
)

document_7 = Document(
page_content="The top 10 soccer players in the world right now.",
metadata={"source": "website"},
)

document_8 = Document(
page_content="LangGraph is the best framework for building stateful, agentic applications!",
metadata={"source": "tweet"},
)

document_9 = Document(
page_content="The stock market is down 500 points today due to fears of a recession.",
metadata={"source": "news"},
)

document_10 = Document(
page_content="I have a bad feeling I am going to get deleted :(",
metadata={"source": "tweet"},
)

documents = [
document_1,
document_2,
document_3,
document_4,
document_5,
document_6,
document_7,
document_8,
document_9,
document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)
API Reference: Document
['4a6b5252-24ca-4e48-97a9-c33211fc7736',
'594a413d-761a-44f1-8f0c-6418700b198d',
'fdd8461c-f4e3-4c85-af8e-7782ce4d2311',
'3f6a82b2-7464-4eee-b209-cbca5a236a8a',
'df8b87ad-464e-4f83-a007-ccf5a8fa4ff5',
'aa18502e-6fb4-4578-9c63-b9a299259b01',
'8c55a17d-5fa7-4c30-a55d-7ded0d39bf46',
'41b68c5a-ebf5-4d7a-a079-5e32926ca484',
'146ac3e0-474a-422a-b0ac-c9fee718396b',
'e44941e9-fb3a-4090-88a0-9ffecee3e80e']
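
The uuid4 IDs above are random, so they change on every run. If you would rather derive the ID from the document itself, so that the same text always maps to the same key, uuid5 is one option. This is a minimal sketch: stable_id and the choice of NAMESPACE_URL as the namespace are illustrative assumptions, not something the vector store requires.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_id(text: str) -> str:
    """Derive a deterministic UUID string from the document text."""
    return str(uuid5(NAMESPACE_URL, text))


texts = [
    "The top 10 soccer players in the world right now.",
    "The stock market is down 500 points today due to fears of a recession.",
]
# One ID per text; rerunning this yields the same IDs.
stable_ids = [stable_id(t) for t in texts]
```

A list built this way could be passed as ids to add_documents in place of uuids.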

Delete items from vector store

vector_store.delete(ids=[uuids[-1]])
True

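Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while your chain or agent is running.

Query directly

A simple similarity search can be performed as follows: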

results = vector_store.similarity_search(
"LangChain provides abstractions to make working with LLMs easy",
k=2,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]

Similarity search with Score

You can also fetch the scores for the results by calling the similarity_search_with_score method.

results = vector_store.similarity_search_with_score("Will it be hot tomorrow?", k=1)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
* [SIM=0.553145] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
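
Once scores are available, a common next step is to drop weak matches. The sketch below works on the (document, score) pairs that similarity_search_with_score returns. It assumes that a higher score means a closer match, as the SIM label above suggests; filter_by_score is a hypothetical helper, and the second sample score is made up for illustration.

```python
from typing import Any, List, Tuple


def filter_by_score(
    results: List[Tuple[Any, float]], min_score: float
) -> List[Tuple[Any, float]]:
    """Keep only (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Stand-in results; in practice these come from similarity_search_with_score.
sample = [("weather forecast", 0.553145), ("stock market", 0.21)]
kept = filter_by_score(sample, min_score=0.5)
```

A sensible threshold depends on the similarity metric configured in your Search index, so treat 0.5 as a placeholder.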

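Specifying Fields to Return

You can specify which fields to return from the document using the fields parameter of the search methods. These fields are returned as part of the metadata of the returned Document objects. You can fetch any field that is stored in the Search index. The field named by text_key is returned as the document's page_content.

If you do not specify any fields to fetch, all fields stored in the index are returned.

If you want to fetch one of the fields inside the metadata, you need to specify it using dot (.) notation.

For example, to fetch the source field in the metadata, you need to specify metadata.source.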

query = "What did I eat for breakfast today?"
results = vector_store.similarity_search(query, fields=["metadata.source"])
print(results[0])
page_content='I had chocolate chip pancakes and scrambled eggs for breakfast this morning.' metadata={'source': 'tweet'}
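
When several metadata keys are needed, building the dotted names by hand gets repetitive. A minimal sketch, assuming the metadata lives under a top-level metadata object as in this tutorial; metadata_fields is a helper name introduced here for illustration:

```python
from typing import Iterable, List


def metadata_fields(names: Iterable[str], prefix: str = "metadata") -> List[str]:
    """Prefix each metadata key so it can be passed in the `fields` argument."""
    return [f"{prefix}.{name}" for name in names]


fields = metadata_fields(["source", "author"])
# fields == ["metadata.source", "metadata.author"]
```

The resulting list can be passed as the fields argument of similarity_search.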

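Hybrid Queries

Couchbase allows you to perform hybrid searches by combining Vector Search results with searches on non-vector fields of the document, such as the metadata object.

The results are based on the combination of the results of both Vector Search and the searches supported by the Search Service. The scores of the component searches are added up to give the total score of each result.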
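
To make the additive scoring concrete, here is a toy illustration in plain Python. It does not call Couchbase: the document IDs and per-component scores are made-up numbers, and combine_scores is a hypothetical helper that only mirrors the rule that component scores are added together.

```python
from collections import defaultdict
from typing import Dict


def combine_scores(*components: Dict[str, float]) -> Dict[str, float]:
    """Sum the per-document scores from each component search."""
    totals: Dict[str, float] = defaultdict(float)
    for component in components:
        for doc_id, score in component.items():
            totals[doc_id] += score
    return dict(totals)


vector_scores = {"doc-a": 0.5, "doc-b": 0.25}  # made-up vector search scores
text_scores = {"doc-a": 0.125, "doc-b": 0.5}   # made-up text match scores

total = combine_scores(vector_scores, text_scores)
# Highest combined score first.
ranked = sorted(total, key=total.get, reverse=True)
```

In this example doc-b trails on the vector score alone but ranks first once the text match score is added, which is the effect a hybrid query is meant to produce.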

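To perform a hybrid search, pass the optional search_options parameter, which is accepted by all of the similarity search methods.
The query types that search_options supports are described in the Couchbase Search documentation.

To simulate hybrid search, let us attach some synthetic metadata to the existing documents. We add three fields to the metadata in round-robin order: date cycles through the years 2010 to 2019, rating cycles through 1 to 5, and author alternates between John Doe and Jane Doe.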

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Adding metadata to documents
for i, doc in enumerate(docs):
    doc.metadata["date"] = f"{range(2010, 2020)[i % 10]}-01-01"
    doc.metadata["rating"] = range(1, 6)[i % 5]
    doc.metadata["author"] = ["John Doe", "Jane Doe"][i % 2]

vector_store.add_documents(docs)

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(query)
print(results[0].metadata)
{'author': 'John Doe', 'date': '2016-01-01', 'rating': 2, 'source': '../../how_to/state_of_the_union.txt'}
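
Because the fields are assigned by chunk index rather than at random, the metadata of any chunk is predictable. The sketch below repeats the same expressions in a standalone function (synthetic_metadata is a name introduced here for illustration), so you can inspect the cycle without a cluster.

```python
def synthetic_metadata(i: int) -> dict:
    """Reproduce the date, rating, and author assigned to chunk number i."""
    return {
        "date": f"{range(2010, 2020)[i % 10]}-01-01",
        "rating": range(1, 6)[i % 5],
        "author": ["John Doe", "Jane Doe"][i % 2],
    }


for i in (0, 1, 6, 11):
    print(i, synthetic_metadata(i))
```

Any chunk whose index ends in 6 gets the date 2016-01-01, rating 2, and author John Doe, which matches the metadata printed above (apart from the source field added by the loader).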

Query by Exact Value

We can search for exact matches on a textual field, such as the author in the metadata object.

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
query,
search_options={"query": {"field": "metadata.author", "match": "John Doe"}},
fields=["metadata.author"],
)
print(results[0])
page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.' metadata={'author': 'John Doe'}

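Query by Partial Match

We can search for partial matches by specifying a fuzziness for the search. This is useful when you want to match slight variations or misspellings of a search term.

Here, "Jae" is close (fuzziness of 1) to "Jane".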

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
query,
search_options={
"query": {"field": "metadata.author", "match": "Jae", "fuzziness": 1}
},
fields=["metadata.author"],
)
print(results[0])
page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.' metadata={'author': 'Jane Doe'}
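
A fuzziness of 1 corresponds to an edit distance of one character. The sketch below is a plain-Python Levenshtein distance that shows why "Jae" is within reach of "Jane". It only illustrates the distance measure. The Search Service applies its own analysis to terms (for example lowercasing), so treat this as an approximation of the idea rather than the service's implementation; edit_distance is a name introduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("Jae", "Jane"))  # 1: insert "n"
print(edit_distance("Jae", "John"))  # 3
```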

Query by Date Range

We can search for documents whose date field, such as metadata.date, falls within a given date range.

query = "Any mention about independence?"
results = vector_store.similarity_search(
query,
search_options={
"query": {
"start": "2016-12-31",
"end": "2017-01-02",
"inclusive_start": True,
"inclusive_end": False,
"field": "metadata.date",
}
},
)
print(results[0])
page_content='We are cutting off Russia’s largest banks from the international financial system.  

Preventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless.

We are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come.

Tonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more.' metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}
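
If you issue date range queries often, a small builder keeps the dictionary shape in one place. This is a sketch under stated assumptions: date_range_options is a hypothetical helper, and its defaults simply mirror the example above (start inclusive, end exclusive).

```python
from typing import Any, Dict


def date_range_options(
    field: str,
    start: str,
    end: str,
    inclusive_start: bool = True,
    inclusive_end: bool = False,
) -> Dict[str, Any]:
    """Build a search_options dict holding a date range query on `field`."""
    return {
        "query": {
            "start": start,
            "end": end,
            "inclusive_start": inclusive_start,
            "inclusive_end": inclusive_end,
            "field": field,
        }
    }


options = date_range_options("metadata.date", "2016-12-31", "2017-01-02")
```

The result has the same shape as the search_options dictionary used in the query above.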

Query by Numeric Range

We can search for documents whose numeric field, such as metadata.rating, falls within a given range.

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
query,
search_options={
"query": {
"min": 3,
"max": 5,
"inclusive_min": True,
"inclusive_max": True,
"field": "metadata.rating",
}
},
)
print(results[0])
(Document(id='8616f24425b94a52af3d32d20e6ffb4b', metadata={'author': 'John Doe', 'date': '2014-01-01', 'rating': 5, 'source': '../../how_to/state_of_the_union.txt'}, page_content='In this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. \n\nWe have fought for freedom, expanded liberty, defeated totalitarianism and terror. \n\nAnd built the strongest, freest, and most prosperous nation the world has ever known. \n\nNow is the hour. \n\nOur moment of responsibility. \n\nOur test of resolve and conscience, of history itself.'), 0.361933544533826)

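Combining Multiple Search Queries

Different search queries can be combined using AND (conjuncts) or OR (disjuncts) operators.

In this example, we look for documents with a rating between 3 and 4 and a date between 2016-12-31 and 2017-01-02.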

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
query,
search_options={
"query": {
"conjuncts": [
{"min": 3, "max": 4, "inclusive_max": True, "field": "metadata.rating"},
{"start": "2016-12-31", "end": "2017-01-02", "field": "metadata.date"},
]
}
},
)
print(results[0])
(Document(id='d9b36ef70b8942dda4db63563f51cf0f', metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}, page_content='We are cutting off Russia’s largest banks from the international financial system.  \n\nPreventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless.   \n\nWe are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come.  \n\nTonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more.'), 0.7107075545629284)
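
The same building blocks work for OR queries. The sketch below assembles a disjunction from two sub-queries. The helper names (numeric_range, all_of, any_of) are hypothetical; the conjuncts key comes from the example above, and disjuncts is the corresponding key in the Search query syntax for OR.

```python
from typing import Any, Dict


def numeric_range(field: str, min_value: float, max_value: float) -> Dict[str, Any]:
    """Numeric range query with both bounds inclusive."""
    return {
        "min": min_value,
        "max": max_value,
        "inclusive_min": True,
        "inclusive_max": True,
        "field": field,
    }


def all_of(*queries: Dict[str, Any]) -> Dict[str, Any]:
    """AND: every sub-query must match."""
    return {"conjuncts": list(queries)}


def any_of(*queries: Dict[str, Any]) -> Dict[str, Any]:
    """OR: at least one sub-query must match."""
    return {"disjuncts": list(queries)}


date_query = {"start": "2016-12-31", "end": "2017-01-02", "field": "metadata.date"}
search_options = {"query": any_of(numeric_range("metadata.rating", 4, 5), date_query)}
```

Swapping any_of for all_of produces a conjunction of the same two sub-queries.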

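Other Queries

Similarly, you can use any of the supported query methods, such as Geo Distance, Polygon Search, Wildcard, and Regular Expressions, in the search_options parameter. Please refer to the documentation for more details on the available query methods and their syntax.

Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.

Here is how to transform your vector store into a retriever and then invoke it with a simple query and filter.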

retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={"k": 1, "score_threshold": 0.5},
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})
[Document(id='3f6a82b2-7464-4eee-b209-cbca5a236a8a', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]
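
In a chain, the retriever's output is usually flattened into a single context string before it reaches the model. Below is a minimal sketch of that step, assuming each retrieved item exposes page_content and metadata attributes as the Document objects above do. format_docs and the SimpleDoc stand-in are names introduced here so that the snippet runs without a cluster.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class SimpleDoc:
    """Stand-in with the two attributes this sketch reads from a Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: Iterable) -> str:
    """Join retrieved documents into one context string, tagging each with its source."""
    return "\n\n".join(
        f"[{doc.metadata.get('source', 'unknown')}] {doc.page_content}" for doc in docs
    )


context = format_docs(
    [
        SimpleDoc(
            "Robbers broke into the city bank and stole $1 million in cash.",
            {"source": "news"},
        )
    ]
)
print(context)
```

With a live store, the list returned by retriever.invoke could be passed to format_docs in the same way.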

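Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the RAG tutorials and how-to guides in the LangChain documentation.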

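Frequently Asked Questions

Question: Should I create the Search index before creating the CouchbaseSearchVectorStore object?

Yes, currently you need to create the Search index before creating the CouchbaseSearchVectorStore object.

Question: I am not seeing all the fields that I specified in my search results.

In Couchbase, we can only return the fields stored in the Search index. Please make sure that the field you are trying to access in the search results is part of the Search index.

One way to handle this is to index and store a document's fields dynamically in the index.

  • In Capella, you need to go to "Advanced Mode", then under the "General Settings" chevron you can check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields".
  • In Couchbase Server, in the Index Editor (not the Quick Editor), under the "Advanced" chevron you can check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields".

Note that these options will increase the size of the index.

For more details on dynamic mappings, please refer to the documentation.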
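
If you export the index definition as JSON, these two settings can be checked programmatically. The sketch below assumes they appear as store_dynamic and index_dynamic under params.mapping. That layout is an assumption about the exported definition, so verify it against your own index. dynamic_field_flags is a hypothetical helper, and the sample definition is trimmed and made up.

```python
from typing import Any, Dict


def dynamic_field_flags(index_definition: Dict[str, Any]) -> Dict[str, bool]:
    """Read the dynamic-field flags from an index definition dict.

    Assumes the flags live under params -> mapping; missing keys count as False.
    """
    mapping = index_definition.get("params", {}).get("mapping", {})
    return {
        "store_dynamic": bool(mapping.get("store_dynamic", False)),
        "index_dynamic": bool(mapping.get("index_dynamic", False)),
    }


# Trimmed, made-up definition used only to exercise the helper.
example_definition = {
    "name": "langchain-test-index",
    "params": {"mapping": {"index_dynamic": True, "store_dynamic": False}},
}
flags = dynamic_field_flags(example_definition)
```

In this sample the fields would be searchable but not returned in results, because store_dynamic is off.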

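Question: I am unable to see the metadata object in my search results.

This is most likely because the metadata field in the document is not indexed and/or stored by the Couchbase Search index. To index the metadata field in the document, you need to add it to the index as a child mapping.

If you choose to map all the fields in the mapping, you will be able to search by all metadata fields. Alternatively, to optimize the index, you can select specific fields inside the metadata object to be indexed. You can refer to the documentation to learn more about indexing child mappings.

Creating Child Mappings

API reference

For detailed documentation of all CouchbaseSearchVectorStore features and configurations, head to the API reference: https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html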