
Pebblo Safe DocumentLoader

Pebblo enables developers to safely load data and promote their Gen AI app to deployment without worrying about the organization’s compliance and security requirements. The project identifies semantic topics and entities found in the loaded data and summarizes them in the UI or in a PDF report.

Pebblo has two components.

  1. Pebblo Safe DocumentLoader for LangChain
  2. Pebblo Server

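This document describes how to augment your existing LangChain DocumentLoader with Pebblo Safe DocumentLoader to gain deep visibility into the types of topics and entities ingested into a Gen-AI LangChain application. For details on Pebblo Server, see the pebblo server document.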

Pebblo SafeLoader enables safe data ingestion for a LangChain DocumentLoader. This is done by wrapping the document loader call with Pebblo Safe DocumentLoader.
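
The wrapping idea can be sketched in plain Python. This is only an illustration of the wrapper pattern, not Pebblo's implementation: `RecordingLoader` and `ListLoader` are hypothetical names, and the only behavior assumed is that the inner loader exposes a `load()` method.

```python
class RecordingLoader:
    """Illustrative wrapper: delegates load() to an inner loader and
    keeps a count of the documents that passed through it."""

    def __init__(self, inner, name):
        if not name:
            raise ValueError("name is required")
        self.inner = inner
        self.name = name
        self.loaded_count = 0

    def load(self):
        docs = self.inner.load()  # the wrapped loader does the real work
        self.loaded_count += len(docs)  # the wrapper only observes
        return docs


class ListLoader:
    """Hypothetical stand-in for a real document loader."""

    def __init__(self, docs):
        self.docs = docs

    def load(self):
        return list(self.docs)


wrapped = RecordingLoader(ListLoader(["row 1", "row 2"]), name="acme-corp-rag-1")
documents = wrapped.load()
```

Because the wrapper exposes the same `load()` method as the loader it wraps, the calling code does not need to change.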

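Note: To run the Pebblo server at a URL other than Pebblo's default (localhost:8000), put the correct URL in the PEBBLO_CLASSIFIER_URL environment variable. The same setting can also be passed through the classifier_url keyword argument. Ref: server-configurations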
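
A minimal sketch of both options follows. The host name is a placeholder, `build_loader` is a hypothetical helper, and the LangChain import is deferred into the function so the snippet can be read and imported without a running Pebblo server.

```python
import os

# Option 1: set the environment variable before the loader is constructed.
# The host below is a placeholder, not a real Pebblo deployment.
os.environ["PEBBLO_CLASSIFIER_URL"] = "http://pebblo.internal.example:8000"


def build_loader(csv_path: str):
    """Hypothetical helper that builds a loader pointing at a custom server."""
    # Imported lazily: constructing the loader requires langchain_community.
    from langchain_community.document_loaders import CSVLoader, PebbloSafeLoader

    # Option 2: pass the URL explicitly through the classifier_url keyword.
    return PebbloSafeLoader(
        CSVLoader(csv_path),
        name="acme-corp-rag-1",
        classifier_url=os.environ["PEBBLO_CLASSIFIER_URL"],
    )
```

Either option alone should be sufficient; the sketch shows both only to put them side by side.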

How to enable Pebblo for document loading

Assume a LangChain RAG application snippet that uses CSVLoader to read a CSV document for inference.

Here is the document-loading snippet using CSVLoader.

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("data/corp_sens_data.csv")
documents = loader.load()
print(documents)

Pebblo SafeLoader can be enabled with a few lines of change to the snippet above.

from langchain_community.document_loaders import CSVLoader, PebbloSafeLoader

loader = PebbloSafeLoader(
    CSVLoader("data/corp_sens_data.csv"),
    name="acme-corp-rag-1",  # App name (Mandatory)
    owner="Joe Smith",  # Owner (Optional)
    description="Support productivity RAG application",  # Description (Optional)
)
documents = loader.load()
print(documents)

Send semantic topics and identities to the Pebblo cloud server

To send semantic data to pebblo-cloud, pass the API key to PebbloSafeLoader as the api_key argument, or put it in the PEBBLO_API_KEY environment variable.

from langchain_community.document_loaders import CSVLoader, PebbloSafeLoader

loader = PebbloSafeLoader(
    CSVLoader("data/corp_sens_data.csv"),
    name="acme-corp-rag-1",  # App name (Mandatory)
    owner="Joe Smith",  # Owner (Optional)
    description="Support productivity RAG application",  # Description (Optional)
    api_key="my-api-key",  # API key (Optional, can be set in the environment variable PEBBLO_API_KEY)
)
documents = loader.load()
print(documents)
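
When the key comes from the environment rather than from source code, a fail-fast check can catch a missing variable before any documents are loaded. The helper below is a hypothetical sketch and not part of Pebblo or LangChain; it only reads PEBBLO_API_KEY.

```python
import os


def require_pebblo_api_key() -> str:
    """Hypothetical helper: return PEBBLO_API_KEY or raise if it is missing."""
    key = os.environ.get("PEBBLO_API_KEY", "").strip()
    if not key:
        raise RuntimeError("PEBBLO_API_KEY is not set")
    return key
```

The returned value can be passed as `api_key=require_pebblo_api_key()`, which keeps the key out of version control.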

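Add semantic topics and identities to loaded metadata

To add semantic topics and semantic entities to the metadata of loaded documents, set the load_semantic argument to True, or define the environment variable PEBBLO_LOAD_SEMANTIC and set it to True.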

from langchain_community.document_loaders import CSVLoader, PebbloSafeLoader

loader = PebbloSafeLoader(
    CSVLoader("data/corp_sens_data.csv"),
    name="acme-corp-rag-1",  # App name (Mandatory)
    owner="Joe Smith",  # Owner (Optional)
    description="Support productivity RAG application",  # Description (Optional)
    api_key="my-api-key",  # API key (Optional, can be set in the environment variable PEBBLO_API_KEY)
    load_semantic=True,  # Load semantic data (Optional, default is False, can be set in the environment variable PEBBLO_LOAD_SEMANTIC)
)
documents = loader.load()
print(documents[0].metadata)
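
To look only at the semantic entries in the metadata, the keys can be filtered by prefix. The sketch below assumes those keys start with `pebblo_semantic_`; that prefix, the helper name `semantic_metadata`, and the sample values are assumptions for illustration, so check the actual metadata printed above before relying on them.

```python
from types import SimpleNamespace


def semantic_metadata(documents, prefix="pebblo_semantic_"):
    """Return, per document, only the metadata entries whose key starts with prefix."""
    return [
        {key: value for key, value in doc.metadata.items() if key.startswith(prefix)}
        for doc in documents
    ]


# Stand-in for a loaded document; the values are invented for illustration
# and are not real Pebblo output.
sample = SimpleNamespace(
    metadata={
        "source": "data/corp_sens_data.csv",
        "row": 0,
        "pebblo_semantic_topics": ["example-topic"],
    }
)
print(semantic_metadata([sample]))
```

The helper works on any object with a `metadata` dictionary, so it can be applied directly to the list returned by `loader.load()`.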

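Anonymize the snippets to redact all PII details

Set anonymize_snippets to True to anonymize all personally identifiable information (PII) in the snippets going into the vector database and in the generated reports.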

Note: The Pebblo Entity Classifier identifies personally identifiable information (PII) and is continuously evolving. Its recall is not yet 100%, but it is steadily improving. For more details, please refer to the Pebblo Entity Classifier docs.

from langchain_community.document_loaders import CSVLoader, PebbloSafeLoader

loader = PebbloSafeLoader(
    CSVLoader("data/corp_sens_data.csv"),
    name="acme-corp-rag-1",  # App name (Mandatory)
    owner="Joe Smith",  # Owner (Optional)
    description="Support productivity RAG application",  # Description (Optional)
    anonymize_snippets=True,  # Whether to anonymize entities in the PDF Report (Optional, default=False)
)
documents = loader.load()
print(documents[0].metadata)