
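AirbyteLoader

Airbyte is a data integration platform for ELT pipelines from APIs, databases, and files to data warehouses and lakes. It has the largest catalog of ELT connectors to data warehouses and databases.

This page covers how to load any source from Airbyte into LangChain documents.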

Installation

To use AirbyteLoader, you need to install the langchain-airbyte integration package.

%pip install -qU langchain-airbyte

Note: Currently, the airbyte library does not support Pydantic v2. Please downgrade to Pydantic v1 to use this package.

Note: This package also currently requires Python 3.10+.
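
Both requirements can be checked before importing the loader. The following is a minimal sketch using only the standard library; the helper names `python_ok` and `pydantic_major` are hypothetical and are not part of langchain-airbyte.

```python
import sys
from importlib import metadata


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


def pydantic_major(version=None):
    """Return the major version of Pydantic, or None if it is not installed."""
    if version is None:
        try:
            version = metadata.version("pydantic")
        except metadata.PackageNotFoundError:
            return None
    return int(version.split(".")[0])


if not python_ok():
    print("langchain-airbyte needs Python 3.10+")
if pydantic_major() not in (None, 1):
    print("Pydantic v2 detected; this package currently expects Pydantic v1")
```

Passing an explicit version tuple or version string to either helper makes it easy to try the checks without changing the environment.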

Loading Documents

By default, AirbyteLoader loads any structured data from a stream and outputs YAML-formatted documents.

from langchain_airbyte import AirbyteLoader

loader = AirbyteLoader(
    source="source-faker",
    stream="users",
    config={"count": 10},
)
docs = loader.load()
print(docs[0].page_content[:500])
\`\`\`yaml
academic_degree: PhD
address:
  city: Lauderdale Lakes
  country_code: FI
  postal_code: '75466'
  province: New Jersey
  state: Hawaii
  street_name: Stoneyford
  street_number: '1112'
age: 44
blood_type: "O\u2212"
created_at: '2004-04-02T13:05:27+00:00'
email: bread2099+1@outlook.com
gender: Fluid
height: '1.62'
id: 1
language: Belarusian
name: Moses
nationality: Dutch
occupation: Track Worker
telephone: 1-467-194-2318
title: M.Sc.Tech.
updated_at: '2024-02-27T16:41:01+00:00'
weight: 6

You can also specify a custom prompt template for formatting documents:

from langchain_core.prompts import PromptTemplate

loader_templated = AirbyteLoader(
    source="source-faker",
    stream="users",
    config={"count": 10},
    template=PromptTemplate.from_template(
        "My name is {name} and I am {height} meters tall."
    ),
)
docs_templated = loader_templated.load()
print(docs_templated[0].page_content)
API Reference: PromptTemplate
My name is Verdie and I am 1.73 meters tall.
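
The placeholders in the template (`name` and `height` here) are filled from the fields of each record. As a rough illustration of that mapping, the sketch below uses plain `str.format` and `string.Formatter` from the standard library as a stand-in for `PromptTemplate`. The sample `record` dict and the helper `template_fields` are hypothetical, and this is not necessarily how langchain-airbyte formats documents internally.

```python
from string import Formatter

TEMPLATE = "My name is {name} and I am {height} meters tall."


def template_fields(template):
    """List the placeholder names used in an f-string-style template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


# Hypothetical record with a subset of the fields shown in the YAML output above.
record = {"name": "Moses", "height": "1.62", "age": 44}

# Check up front that every placeholder has a matching field.
missing = [f for f in template_fields(TEMPLATE) if f not in record]
if missing:
    raise KeyError(f"record is missing fields: {missing}")

# Extra keys such as "age" are simply ignored by str.format.
text = TEMPLATE.format(**record)
print(text)
```

Listing the placeholders first is a cheap way to catch a typo in a field name before running a long sync.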

Lazily Loading Documents

One of the powerful features of AirbyteLoader is its ability to load large documents from upstream sources. When working with large datasets, the default .load() behavior can be slow and memory-intensive. To avoid this, you can use the .lazy_load() method to load documents in a more memory-efficient manner.

import time

loader = AirbyteLoader(
    source="source-faker",
    stream="users",
    config={"count": 3},
    template=PromptTemplate.from_template(
        "My name is {name} and I am {height} meters tall."
    ),
)

start_time = time.time()
my_iterator = loader.lazy_load()
print(
    f"Just calling lazy load is quick! This took {time.time() - start_time:.4f} seconds"
)
Just calling lazy load is quick! This took 0.0001 seconds

You can then iterate over the documents as they are yielded:

for doc in my_iterator:
    print(doc.page_content)
My name is Andera and I am 1.91 meters tall.
My name is Jody and I am 1.85 meters tall.
My name is Zonia and I am 1.53 meters tall.
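
The near-zero timing above is consistent with how Python generators behave in general: creating the iterator does no work, and records are produced only as the loop asks for them. The stand-in sketch below shows the same pattern with a hypothetical `fake_records` generator in place of an Airbyte source, including how to stop early without reading the whole stream.

```python
from itertools import islice

produced = []  # tracks which items were actually generated


def fake_records(count):
    """Hypothetical stand-in for a stream: yields one record at a time."""
    for i in range(count):
        produced.append(i)
        yield {"id": i, "name": f"user-{i}"}


iterator = fake_records(1_000_000)  # returns immediately; nothing is produced yet
assert produced == []

# Take only the first three records; the rest of the stream is never generated.
first_three = list(islice(iterator, 3))
print([r["name"] for r in first_three])
print(len(produced))
```

The same `islice` call should work on any iterator of documents, which makes it a convenient way to sample a large stream.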

You can also lazily load documents asynchronously with .alazy_load():

loader = AirbyteLoader(
    source="source-faker",
    stream="users",
    config={"count": 3},
    template=PromptTemplate.from_template(
        "My name is {name} and I am {height} meters tall."
    ),
)

my_async_iterator = loader.alazy_load()

async for doc in my_async_iterator:
    print(doc.page_content)
My name is Carmelina and I am 1.74 meters tall.
My name is Ali and I am 1.90 meters tall.
My name is Rochell and I am 1.83 meters tall.
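
A top-level `async for` like the one above works in a notebook, where an event loop is already running. In a plain script the loop has to live inside a coroutine. The sketch below shows that shape with a hypothetical async generator `fake_alazy_load` standing in for `loader.alazy_load()`, so it runs without any Airbyte source installed.

```python
import asyncio


async def fake_alazy_load(names):
    """Hypothetical stand-in for loader.alazy_load(): yields items asynchronously."""
    for name in names:
        await asyncio.sleep(0)  # give control back to the event loop
        yield f"My name is {name}."


async def collect(async_iterator):
    """Consume an async iterator inside a coroutine and return the items."""
    return [item async for item in async_iterator]


results = asyncio.run(collect(fake_alazy_load(["Ada", "Grace"])))
for line in results:
    print(line)
```

Collecting into a list gives up the memory benefit of lazy loading, so for large streams it is usually better to process each document inside the `async for` loop instead.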

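Configuration

AirbyteLoader can be configured with the following options:

  • source (str, required): The name of the Airbyte source to load from.
  • stream (str, required): The name of the stream to load from (Airbyte sources can return multiple streams).
  • config (dict, required): The configuration for the Airbyte source.
  • template (PromptTemplate, optional): A custom prompt template for formatting documents.
  • include_metadata (bool, optional, default True): Whether to include all fields as metadata in the output documents.

Most of the configuration lives in config. The specific options for each source are listed in the "Config field reference" section for that source in the Airbyte documentation.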
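
Since `source`, `stream`, and `config` are all required, a quick check before constructing the loader can catch a missing option early. The helper `missing_required` below is hypothetical and not part of langchain-airbyte. The option names come from the list above, and `source-faker` with `count` is the example source used throughout this page.

```python
REQUIRED = ("source", "stream", "config")
OPTIONAL = ("template", "include_metadata")


def missing_required(options):
    """Return the required AirbyteLoader options absent from `options`."""
    return [key for key in REQUIRED if key not in options]


options = {
    "source": "source-faker",
    "stream": "users",
    "config": {"count": 10},
    "include_metadata": False,  # optional; defaults to True per the list above
}

# Flag keys that are not among the documented options (likely typos).
unknown = [key for key in options if key not in REQUIRED + OPTIONAL]
print(missing_required(options), unknown)

# With langchain-airbyte installed, the same dict can be unpacked into the loader:
# loader = AirbyteLoader(**options)
```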