Activeloop Deep Lake
Activeloop Deep Lake is a multi-modal vector store that stores embeddings and their metadata, including text, JSONs, images, audio, video, and more. It saves the data locally, in your cloud, or on Activeloop storage. It performs hybrid search over both embeddings and their attributes.
This notebook showcases basic functionality related to Activeloop Deep Lake. While Deep Lake can store embeddings, it is capable of storing any type of data. It is a serverless data lake with version control, a query engine, and streaming dataloaders for deep learning frameworks.
For more information, please see the Deep Lake documentation
Setup
%pip install --upgrade --quiet langchain-openai langchain-deeplake tiktoken
Example provided by Activeloop
Deep Lake locally
from langchain_deeplake.vectorstores import DeeplakeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
if "ACTIVELOOP_TOKEN" not in os.environ:
os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("activeloop token:")
from langchain_community.document_loaders import TextLoader
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
Create a local dataset
Create a dataset locally at ./my_deeplake/, then run similarity search. The Deeplake+LangChain integration uses Deep Lake datasets under the hood, so dataset and vector store are used interchangeably. To create a dataset in your own cloud, or in Deep Lake storage, adjust the path accordingly.
db = DeeplakeVectorStore(
dataset_path="./my_deeplake/", embedding_function=embeddings, overwrite=True
)
db.add_documents(docs)
# or shorter
# db = DeeplakeVectorStore.from_documents(docs, embeddings, dataset_path="./my_deeplake/", overwrite=True)
Query dataset
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
Later, you can reload the dataset without recomputing the embeddings.
db = DeeplakeVectorStore(
dataset_path="./my_deeplake/", embedding_function=embeddings, read_only=True
)
docs = db.similarity_search(query)
Setting read_only=True prevents accidental modification of the vector store when updates are not needed. This ensures the data remains unchanged unless explicitly intended. It is generally a good practice to specify this argument to avoid unintended updates.
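As a quick sanity check (a minimal sketch; the exact exception type raised depends on the installed deeplake version), a write against the read-only store should fail rather than silently modify the data:
# Writes should be rejected on a store opened with read_only=True
try:
    db.add_documents(docs)
except Exception as e:
    print(f"Write rejected as expected: {e}")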
Retrieval Question/Answering
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
qa = RetrievalQA.from_chain_type(
llm=ChatOpenAI(model="gpt-3.5-turbo"),
chain_type="stuff",
retriever=db.as_retriever(),
)
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)
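Note that Chain.run is deprecated on recent LangChain releases; the equivalent call through the Runnable interface passes a dict with a "query" key and reads the "result" key from the returned dict:
# Equivalent call on newer LangChain versions where Chain.run is deprecated
response = qa.invoke({"query": query})
print(response["result"])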
Attribute based filtering in metadata
Let's create another vector store containing metadata with the year the documents were created.
import random
for d in docs:
d.metadata["year"] = random.randint(2012, 2014)
db = DeeplakeVectorStore.from_documents(
docs, embeddings, dataset_path="./my_deeplake/", overwrite=True
)
db.similarity_search(
"What did the president say about Ketanji Brown Jackson",
filter={"metadata": {"year": 2013}},
)
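Most LangChain vector store integrations also implement similarity_search_with_score, which returns each hit together with its score. A hedged sketch, assuming this integration follows that standard API:
# Pair each hit with its score and keep only the top 2 matches from 2013
results = db.similarity_search_with_score(
    "What did the president say about Ketanji Brown Jackson",
    k=2,
    filter={"metadata": {"year": 2013}},
)
for doc, score in results:
    print(score, doc.metadata["year"])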
Choosing distance function
Distance function: L2 for Euclidean, cos for cosine similarity
db.similarity_search(
"What did the president say about Ketanji Brown Jackson?", distance_metric="l2"
)
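And the same query using cosine similarity instead of Euclidean distance, per the options listed above:
db.similarity_search(
    "What did the president say about Ketanji Brown Jackson?", distance_metric="cos"
)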
Maximal Marginal Relevance
Using maximal marginal relevance
db.max_marginal_relevance_search(
"What did the president say about Ketanji Brown Jackson?"
)
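The MMR call also accepts the standard LangChain tuning parameters (a sketch showing the base-class defaults): k results are selected out of fetch_k candidates, and lambda_mult trades relevance (1.0) off against diversity (0.0).
# Explicitly tuned MMR: pick 4 diverse results out of 20 candidates
db.max_marginal_relevance_search(
    "What did the president say about Ketanji Brown Jackson?",
    k=4,
    fetch_k=20,
    lambda_mult=0.5,
)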
Delete dataset
db.delete_dataset()
Deep Lake datasets on cloud (Activeloop, AWS, GCS, etc.) or in memory
By default, Deep Lake datasets are stored locally. To store them in memory, in the Deep Lake Managed DB, or in any object storage, you can provide the corresponding path and credentials when creating the vector store. Some paths require registration with Activeloop and creation of an API token that can be retrieved here
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing_python" # could be also ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://path/to/dataset, etc.
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore(
dataset_path=dataset_path, embedding_function=embeddings, overwrite=True
)
ids = db.add_documents(docs)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing"
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore(
dataset_path=dataset_path,
embedding_function=embeddings,
overwrite=True,
)
ids = db.add_documents(docs)
TQL Search
Furthermore, execution of queries is also supported within the similarity_search method, whereby the query can be specified using Deep Lake's Tensor Query Language (TQL).
search_id = db.dataset["ids"][0]
docs = db.similarity_search(
query=None,
tql=f"SELECT * WHERE ids == '{search_id}'",
)
db.dataset.summary()
Creating vector stores on AWS S3
dataset_path = "s3://BUCKET/langchain_test" # could be also ./local/path (much faster locally), hub://bucket/path/to/dataset, gcs://path/to/dataset, etc.
embedding = OpenAIEmbeddings()
db = DeeplakeVectorStore.from_documents(
docs,
dataset_path=dataset_path,
embedding=embeddings,
overwrite=True,
creds={
"aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
"aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
"aws_session_token": os.environ["AWS_SESSION_TOKEN"], # Optional
},
)
Deep Lake API
You can access the underlying Deep Lake dataset at db.dataset
# get structure of the dataset
db.dataset.summary()
# get embeddings numpy array
embeds = db.dataset["embeddings"][:]
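With the raw embeddings exposed as a numpy array, you can run ad-hoc computations outside the query engine. A minimal sketch (assuming one embedding row per stored document, as above):
import numpy as np

# Embed the query with the same model used for the documents
query_vec = np.array(embeddings.embed_query(query))

# Cosine similarity of the query against every stored embedding
embeds = np.asarray(embeds)
scores = embeds @ query_vec / (
    np.linalg.norm(embeds, axis=1) * np.linalg.norm(query_vec)
)
print("closest row:", int(scores.argmax()))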
Transfer local dataset to cloud
Copy an already created dataset to the cloud. You can also transfer from cloud to local.
import deeplake
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
source = f"hub://{username}/langchain_testing" # could be local, s3, gcs, etc.
destination = f"hub://{username}/langchain_test_copy" # could be local, s3, gcs, etc.
deeplake.copy(src=source, dst=destination)
db = DeeplakeVectorStore(dataset_path=destination, embedding_function=embeddings)
db.add_documents(docs)
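When you are done experimenting, the copy can be removed (hedged: deeplake.delete is available in recent deeplake releases, but verify against the version you have installed):
# Optional cleanup of the copied dataset
deeplake.delete(destination)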