Apify Dataset

Apify Dataset is a scalable, append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, and then exporting them to formats like JSON, CSV, or Excel. Datasets are mainly used to save the results of Apify Actors, which are serverless cloud programs for web scraping, crawling, and data extraction use cases.

This notebook shows how to load Apify datasets into LangChain.

Prerequisites

You need an existing dataset on the Apify platform. This example shows how to load a dataset produced by the Website Content Crawler.

%pip install --upgrade --quiet langchain langchain-apify langchain-openai

First, import ApifyDatasetLoader into your source code:

from langchain_apify import ApifyDatasetLoader
from langchain_core.documents import Document
API Reference: Document

Find your Apify API token and OpenAI API key, and set them as environment variables:

import os

os.environ["APIFY_API_TOKEN"] = "your-apify-api-token"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
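
Placeholder values are easy to leave in by accident, so it can help to confirm that both variables are actually set before constructing the loader. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and is not part of LangChain or Apify.

```python
import os

# The two variables this notebook relies on.
REQUIRED_VARS = ("APIFY_API_TOKEN", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS):
    """Return the names of variables that are unset or empty."""
    return [name for name in names if not os.environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Returning the list of missing names, rather than raising immediately, lets the caller decide whether to prompt for a value or stop.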

Then provide a function that maps the fields of an Apify dataset record to the LangChain Document format.

For example, if your dataset items are structured like this:

{
    "url": "https://apify.com",
    "text": "Apify is the best web scraping and automation platform."
}

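The mapping function in the code below converts them to the LangChain Document format, so that you can use them with any large language model (LLM), for example for question answering.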

loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda dataset_item: Document(
        page_content=dataset_item["text"], metadata={"source": dataset_item["url"]}
    ),
)
data = loader.load()
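
Real dataset items do not always contain every field, and a lambda that indexes `dataset_item["text"]` directly raises a `KeyError` when the key is absent. As a rough sketch, a named mapping function can supply defaults instead. The function name `item_to_document` and the second sample item are hypothetical, and the fallback `Document` class is only a stand-in so the snippet runs without LangChain installed; it is not LangChain's implementation.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in exposing the two attributes used here (illustration only).
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def item_to_document(dataset_item: dict) -> Document:
    """Map one dataset item to a Document, tolerating missing or null fields."""
    return Document(
        page_content=dataset_item.get("text") or "",
        metadata={"source": dataset_item.get("url") or ""},
    )


items = [
    {"url": "https://apify.com", "text": "Apify is the best web scraping and automation platform."},
    {"url": "https://apify.com/store", "text": None},  # hypothetical item with no text
]
docs = [item_to_document(item) for item in items]
print(docs[0].metadata["source"])
print(repr(docs[1].page_content))
```

A function like this can be passed as `dataset_mapping_function=item_to_document` in place of the lambda.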

An example with question answering

In this example, we use data from a dataset to answer a question.

from langchain.indexes import VectorstoreIndexCreator
from langchain_apify import ApifyDatasetLoader
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings

# Load the dataset; `or ""` guards against items whose text is None.
loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)

# Build an in-memory vector index from the loaded documents.
index = VectorstoreIndexCreator(
    vectorstore_cls=InMemoryVectorStore, embedding=OpenAIEmbeddings()
).from_loaders([loader])

llm = ChatOpenAI(model="gpt-4o-mini")

query = "What is Apify?"
result = index.query_with_sources(query, llm=llm)

print(result["answer"])
print(result["sources"])
 Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.

https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples
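
To give a rough sense of why the answer comes back with source URLs, the toy sketch below ranks documents by word overlap with the query and reports the `source` metadata of the best matches. This is only an illustration with plain dictionaries: the real index relies on embeddings for similarity search and on an LLM to compose the answer. The helpers `tokenize` and `top_sources`, as well as the sample texts, are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_sources(query: str, docs: list, k: int = 2) -> list:
    """Return the sources of the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    scored = [
        (len(query_words & tokenize(doc["page_content"])), doc["metadata"]["source"])
        for doc in docs
    ]
    # Keep only documents with at least one shared word, best score first.
    # sorted() is stable, so ties keep their original order.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [source for _, source in ranked[:k]]


# Hypothetical documents shaped like the output of the mapping function.
docs = [
    {"page_content": "Apify is a platform for web scraping and automation.",
     "metadata": {"source": "https://docs.apify.com/platform/actors"}},
    {"page_content": "Security practices on the platform.",
     "metadata": {"source": "https://docs.apify.com/platform/security"}},
    {"page_content": "Unrelated text about cooking.",
     "metadata": {"source": "https://example.com/cooking"}},
]
print(top_sources("What is Apify?", docs))
```

Keeping the URL in `metadata["source"]` during mapping is what makes this kind of attribution possible, whichever retrieval method is used.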