Alibaba Cloud MaxCompute

Alibaba Cloud MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data import solutions and distributed computing models, enabling users to query massive datasets effectively, reduce production costs, and ensure data security.

MaxComputeLoader lets you execute a MaxCompute SQL query and load the results as one document per row.

%pip install --upgrade --quiet  pyodps

Basic Usage

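To instantiate the loader, you need a SQL query to execute, your MaxCompute endpoint and project name, and your access ID and secret access key. The access ID and secret access key can either be passed in directly via the access_id and secret_access_key parameters, or be set as the environment variables MAX_COMPUTE_ACCESS_ID and MAX_COMPUTE_SECRET_ACCESS_KEY.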

from langchain_community.document_loaders import MaxComputeLoader
API Reference: MaxComputeLoader
base_query = """
SELECT *
FROM (
    SELECT 1 AS id, 'content1' AS content, 'meta_info1' AS meta_info
    UNION ALL
    SELECT 2 AS id, 'content2' AS content, 'meta_info2' AS meta_info
    UNION ALL
    SELECT 3 AS id, 'content3' AS content, 'meta_info3' AS meta_info
) mydata;
"""
endpoint = "<ENDPOINT>"
project = "<PROJECT>"
ACCESS_ID = "<ACCESS ID>"
SECRET_ACCESS_KEY = "<SECRET ACCESS KEY>"
loader = MaxComputeLoader.from_params(
    base_query,
    endpoint,
    project,
    access_id=ACCESS_ID,
    secret_access_key=SECRET_ACCESS_KEY,
)
data = loader.load()
print(data)
[Document(page_content='id: 1\ncontent: content1\nmeta_info: meta_info1', metadata={}), Document(page_content='id: 2\ncontent: content2\nmeta_info: meta_info2', metadata={}), Document(page_content='id: 3\ncontent: content3\nmeta_info: meta_info3', metadata={})]
print(data[0].page_content)
id: 1
content: content1
meta_info: meta_info1
print(data[0].metadata)
{}
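
The environment-variable route described earlier in this section can be sketched as follows. This is a minimal sketch under a few assumptions: `build_loader` is a hypothetical helper name (not part of langchain_community), the placeholder values must be replaced with real credentials (normally you would export them in your shell rather than set them in code), and the loader import is deferred into the function so the snippet itself runs even where pyodps is not installed.

```python
import os

# Placeholders only; in practice export these in your shell or secrets manager.
# setdefault keeps any value that is already present in the environment.
os.environ.setdefault("MAX_COMPUTE_ACCESS_ID", "<ACCESS ID>")
os.environ.setdefault("MAX_COMPUTE_SECRET_ACCESS_KEY", "<SECRET ACCESS KEY>")


def build_loader(query, endpoint, project):
    """Hypothetical helper: build a loader without passing credentials explicitly.

    With access_id and secret_access_key omitted, the loader is expected to
    fall back to the two environment variables set above.
    """
    # Deferred import so this module can be imported without pyodps installed.
    from langchain_community.document_loaders import MaxComputeLoader

    return MaxComputeLoader.from_params(query, endpoint, project)
```

Calling `build_loader(base_query, endpoint, project).load()` should then behave like the explicit-credentials example above, provided the environment variables hold valid values.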

Specifying Which Columns are Content vs Metadata

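You can configure which subset of columns is loaded as the Document's page content and which as its metadata using the page_content_columns and metadata_columns parameters.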

loader = MaxComputeLoader.from_params(
    base_query,
    endpoint,
    project,
    page_content_columns=["content"],  # Specify Document page content
    metadata_columns=["id", "meta_info"],  # Specify Document metadata
    access_id=ACCESS_ID,
    secret_access_key=SECRET_ACCESS_KEY,
)
data = loader.load()
print(data[0].page_content)
content: content1
print(data[0].metadata)
{'id': 1, 'meta_info': 'meta_info1'}
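
To make the mapping from a result row to a document concrete, here is a small pure-Python sketch that reproduces the outputs shown on this page. It is an illustration inferred from those outputs, not the library's actual code: `SimpleDocument` and `row_to_document` are hypothetical names, and rows are assumed to arrive as plain dictionaries keyed by column name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Sequence


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document, for illustration only."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def row_to_document(
    row: Dict[str, Any],
    page_content_columns: Optional[Sequence[str]] = None,
    metadata_columns: Optional[Sequence[str]] = None,
) -> SimpleDocument:
    """Turn one query result row into a document (assumed behavior)."""
    # With no page_content_columns, every column goes into the page content.
    if page_content_columns is None:
        content = dict(row)
    else:
        content = {k: v for k, v in row.items() if k in page_content_columns}
    # Page content is rendered as one "column: value" line per column.
    page_content = "\n".join(f"{k}: {v}" for k, v in content.items())
    # With no metadata_columns, the metadata stays empty.
    wanted = metadata_columns or []
    metadata = {k: v for k, v in row.items() if k in wanted}
    return SimpleDocument(page_content=page_content, metadata=metadata)


row = {"id": 1, "content": "content1", "meta_info": "meta_info1"}
default_doc = row_to_document(row)
split_doc = row_to_document(
    row,
    page_content_columns=["content"],
    metadata_columns=["id", "meta_info"],
)
```

Here `default_doc` mirrors the Basic Usage output (all columns in the page content, empty metadata), while `split_doc` mirrors the output of this section.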