How to create a custom document loader

Overview

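Applications based on large language models (LLMs) frequently involve extracting data from databases or files, such as PDFs, and converting it into a format that LLMs can use. In LangChain, this usually means creating Document objects, which encapsulate the extracted text (page_content) together with metadata: a dictionary containing details about the document, such as the author's name or the date of publication.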

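Document objects are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the Document to generate a desired response (e.g., summarizing the document). Documents can be used immediately or indexed into a vector store for future retrieval and use.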
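To make the prompt-formatting step concrete, here is a minimal sketch. It deliberately uses a hypothetical stand-in dataclass, `Doc`, that mirrors only the `page_content` and `metadata` fields instead of LangChain's `Document`; the `build_summary_prompt` helper and the prompt wording are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in mirroring Document's two fields."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def build_summary_prompt(docs: List[Doc]) -> str:
    """Join each document's text, tagged with its source, into one prompt."""
    parts = [
        f"[source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    ]
    return "Summarize the following documents:\n\n" + "\n\n".join(parts)


docs = [
    Doc("Cats purr when content.", {"source": "cats.txt", "author": "A. Writer"}),
    Doc("Dogs wag their tails."),  # no metadata supplied
]
prompt = build_summary_prompt(docs)
print(prompt)
```

The metadata travels with the text, so the prompt can cite where each passage came from even when some documents carry no metadata at all.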

The main abstractions for document loading are:

Component | Description
--- | ---
Document | Contains text and metadata
BaseLoader | Used to convert raw data into Documents
Blob | A representation of binary data that's located either in a file or in memory
BaseBlobParser | Logic to parse a Blob to yield Document objects

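This guide demonstrates how to write custom document loading and file parsing logic; specifically, we'll see how to:

  1. Create a standard document loader by subclassing BaseLoader.
  2. Create a parser using BaseBlobParser and use it in conjunction with Blob and BlobLoaders. This is useful primarily when working with files.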

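Standard Document Loader

A document loader can be implemented by subclassing BaseLoader, which provides a standard interface for loading documents.

Interface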

Method name | Explanation
--- | ---
lazy_load | Used to load documents one by one lazily. Use for production code.
alazy_load | Async variant of lazy_load
load | Used to load all the documents into memory eagerly. Use for prototyping or interactive work.
aload | Async variant of load: loads all the documents into memory eagerly. Use for prototyping or interactive work. Added in 2024-04 to LangChain.
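  • The load method is a convenience method meant solely for prototyping work; it just invokes list(self.lazy_load()).
  • alazy_load has a default implementation that delegates to lazy_load. If you're using async, we recommend overriding the default implementation and providing a native async implementation.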
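The relationship between these methods can be sketched with a small stand-in class. `SketchLoader` and `ThreeItems` below are hypothetical names and this is not LangChain's code; the library's actual default `alazy_load` may differ in details (for example, by running the synchronous iterator off the event loop). The sketch only shows the delegation pattern described above.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class SketchLoader:
    """Hypothetical stand-in for a loader base class (not LangChain code)."""

    def lazy_load(self) -> Iterator[str]:
        # Subclasses implement this as a generator.
        raise NotImplementedError

    def load(self) -> List[str]:
        # Eager convenience method: drain the lazy generator into a list.
        return list(self.lazy_load())

    async def alazy_load(self) -> AsyncIterator[str]:
        # Simplified default async variant: re-yield from the sync generator.
        # The underlying I/O still blocks, which is why a native async
        # override is preferable in async applications.
        for item in self.lazy_load():
            yield item


class ThreeItems(SketchLoader):
    def lazy_load(self) -> Iterator[str]:
        for i in range(3):
            yield f"item-{i}"


async def collect(loader: SketchLoader) -> List[str]:
    return [item async for item in loader.alazy_load()]


eager = ThreeItems().load()
via_async = asyncio.run(collect(ThreeItems()))
print(eager, via_async)
```

A subclass only has to implement `lazy_load`; the eager and async entry points come for free, at the cost of the async path not being truly non-blocking.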
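Important

When implementing a document loader, do NOT provide parameters via the lazy_load or alazy_load methods.

All configuration is expected to be passed through the initializer (__init__). This was a design choice made by LangChain to make sure that once a document loader has been instantiated, it has all the information needed to load documents.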
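As an illustration of this convention, the sketch below fixes every option at construction time and keeps `lazy_load` argument-free. `LineLoaderSketch` and its `skip_blank` option are hypothetical and do not subclass any LangChain class; they only demonstrate the pattern.

```python
import os
import tempfile
from typing import Iterator


class LineLoaderSketch:
    """Hypothetical stand-in loader; all options are fixed at construction."""

    def __init__(
        self, file_path: str, encoding: str = "utf-8", skip_blank: bool = False
    ) -> None:
        self.file_path = file_path
        self.encoding = encoding
        self.skip_blank = skip_blank

    def lazy_load(self) -> Iterator[str]:  # takes no arguments by design
        with open(self.file_path, encoding=self.encoding) as f:
            for line in f:
                if self.skip_blank and not line.strip():
                    continue
                yield line


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\n\nsecond\n")
    all_lines = list(LineLoaderSketch(path).lazy_load())
    non_blank = list(LineLoaderSketch(path, skip_blank=True).lazy_load())

print(all_lines, non_blank)
```

Because the behavior is determined entirely by the instance, any code that receives a loader can call `lazy_load()` without knowing how it was configured.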

Implementation

Let's create an example of a standard document loader that loads a file and creates a document from each line in the file.

from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

    # alazy_load is OPTIONAL.
    # If you leave out the implementation, a default implementation which delegates to lazy_load will be used!
    async def alazy_load(
        self,
    ) -> AsyncIterator[Document]:  # <-- Does not take any arguments
        """An async lazy loader that reads a file line by line."""
        # Requires aiofiles (install with pip)
        # https://github.com/Tinche/aiofiles
        import aiofiles

        async with aiofiles.open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            async for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

Test 🧪

To test out the document loader, we need a file with some quality content.

with open("./meow.txt", "w", encoding="utf-8") as f:
    quality_content = "meow meow🐱 \n meow meow🐱 \n meow😻😻"
    f.write(quality_content)

loader = CustomDocumentLoader("./meow.txt")
## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)

<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱 \n' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱 \n' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}
## Test out the async implementation
async for doc in loader.alazy_load():
    print()
    print(type(doc))
    print(doc)

<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱 \n' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱 \n' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}
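The top-level `async for` shown above relies on an environment that already has a running event loop, such as a Jupyter notebook. In a regular Python script, the loop has to live inside a coroutine that is driven by `asyncio.run`. The sketch below shows that shape using a hypothetical async generator, `fake_alazy_load`, as a stand-in for `loader.alazy_load()` so that it runs without any extra dependencies.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_alazy_load() -> AsyncIterator[str]:
    """Hypothetical stand-in for loader.alazy_load()."""
    for text in ("meow meow", "meow"):
        await asyncio.sleep(0)  # yield control, as real async I/O would
        yield text


async def main() -> List[str]:
    results = []
    # `async for` is only valid inside a coroutine in a plain script.
    async for doc in fake_alazy_load():
        results.append(doc)
    return results


collected = asyncio.run(main())
print(collected)
```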
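Tip

load() can be helpful in an interactive environment such as a Jupyter notebook.

Avoid using it for production code, since eager loading assumes that all the content can fit into memory, which is not always the case, especially for enterprise data.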

loader.load()
[Document(page_content='meow meow🐱 \n', metadata={'line_number': 0, 'source': './meow.txt'}),
Document(page_content=' meow meow🐱 \n', metadata={'line_number': 1, 'source': './meow.txt'}),
Document(page_content=' meow😻😻', metadata={'line_number': 2, 'source': './meow.txt'})]
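In contrast to the eager call above, a lazy generator only does as much work as the consumer asks for. The following sketch uses a hypothetical generator, `lazy_numbers`, as a stand-in for `lazy_load()` and records how many items were actually produced when only the first three are requested.

```python
from itertools import islice
from typing import Iterator, List

produced: List[int] = []  # records which items the generator produced


def lazy_numbers(limit: int) -> Iterator[int]:
    """Hypothetical stand-in for lazy_load(): yields items one at a time."""
    for i in range(limit):
        produced.append(i)
        yield i


# Take only the first three items; the remaining work is never performed.
first_three = list(islice(lazy_numbers(1_000_000), 3))
print(first_three, len(produced))
```

Calling `list()` on the full generator would have materialized a million items, which is the memory behavior the tip above warns about.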

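Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed rather than how the file is loaded. For example, you can use open to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.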
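The idea can be sketched with two plain functions before any LangChain abstractions are involved. `parse_lines` and `load_from_file` are hypothetical helpers: one knows only about the data format, the other only about where the bytes live, so the same parser works on bytes from a file or from memory.

```python
import os
import tempfile
from typing import Iterator


def parse_lines(data: bytes) -> Iterator[str]:
    """Parsing logic: knows nothing about where the bytes came from."""
    for line in data.decode("utf-8").splitlines():
        yield line


def load_from_file(path: str) -> bytes:
    """Loading logic: knows nothing about the data format."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "wb") as f:
        f.write(b"alpha\nbeta")
    from_file = list(parse_lines(load_from_file(path)))

from_memory = list(parse_lines(b"alpha\nbeta"))
print(from_file, from_memory)
```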

BaseBlobParser

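A BaseBlobParser is an interface that accepts a blob and outputs Document objects (lazy_parse yields them one at a time). A blob is a representation of data that lives either in memory or in a file. LangChain Python has a Blob primitive which is inspired by the Blob WebAPI spec.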

from typing import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document


class MyParser(BaseBlobParser):
    """A simple parser that creates a document from each line."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a blob into a document line by line."""
        line_number = 0
        with blob.as_bytes_io() as f:
            for line in f:
                line_number += 1
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": blob.source},
                )

blob = Blob.from_path("./meow.txt")
parser = MyParser()
list(parser.lazy_parse(blob))
[Document(page_content='meow meow🐱 \n', metadata={'line_number': 1, 'source': './meow.txt'}),
Document(page_content=' meow meow🐱 \n', metadata={'line_number': 2, 'source': './meow.txt'}),
Document(page_content=' meow😻😻', metadata={'line_number': 3, 'source': './meow.txt'})]

Using the blob API also allows you to load content directly from memory without having to read it from a file!

blob = Blob(data=b"some data from memory\nmeow")
list(parser.lazy_parse(blob))
[Document(page_content='some data from memory\n', metadata={'line_number': 1, 'source': None}),
Document(page_content='meow', metadata={'line_number': 2, 'source': None})]

Blob

Let's take a quick look through some of the Blob API.

blob = Blob.from_path("./meow.txt", metadata={"foo": "bar"})
blob.encoding
'utf-8'
blob.as_bytes()
b'meow meow\xf0\x9f\x90\xb1 \n meow meow\xf0\x9f\x90\xb1 \n meow\xf0\x9f\x98\xbb\xf0\x9f\x98\xbb'
blob.as_string()
'meow meow🐱 \n meow meow🐱 \n meow😻😻'
blob.as_bytes_io()
<contextlib._GeneratorContextManager at 0x743f34324450>
blob.metadata
{'foo': 'bar'}
blob.source
'./meow.txt'

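Blob Loaders

While a parser encapsulates the logic needed to parse binary data into documents, blob loaders encapsulate the logic that's necessary to load blobs from a given storage location.

At the moment, LangChain only supports FileSystemBlobLoader.

You can use the FileSystemBlobLoader to load blobs and then use the parser to parse them.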

from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader

blob_loader = FileSystemBlobLoader(path=".", glob="*.mdx", show_progress=True)
parser = MyParser()
for blob in blob_loader.yield_blobs():
    for doc in parser.lazy_parse(blob):
        print(doc)
        break
  0%|          | 0/8 [00:00<?, ?it/s]
page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='# Markdown\n' metadata={'line_number': 1, 'source': 'markdown.mdx'}
page_content='# JSON\n' metadata={'line_number': 1, 'source': 'json.mdx'}
page_content='---\n' metadata={'line_number': 1, 'source': 'pdf.mdx'}
page_content='---\n' metadata={'line_number': 1, 'source': 'index.mdx'}
page_content='# File Directory\n' metadata={'line_number': 1, 'source': 'file_directory.mdx'}
page_content='# CSV\n' metadata={'line_number': 1, 'source': 'csv.mdx'}
page_content='# HTML\n' metadata={'line_number': 1, 'source': 'html.mdx'}

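Generic Loader

LangChain has a GenericLoader abstraction which composes a BlobLoader with a BaseBlobParser.

GenericLoader is meant to provide standardized classmethods that make it easy to use existing BlobLoader implementations. At the moment, only the FileSystemBlobLoader is supported.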

from langchain_community.document_loaders.generic import GenericLoader

loader = GenericLoader.from_filesystem(
    path=".", glob="*.mdx", show_progress=True, parser=MyParser()
)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")
API Reference: GenericLoader
  0%|          | 0/8 [00:00<?, ?it/s]
page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 2, 'source': 'office_file.mdx'}
page_content='>[The Microsoft Office](https://www.office.com/) suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.\n' metadata={'line_number': 3, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 4, 'source': 'office_file.mdx'}
page_content='This covers how to load commonly used file formats including `DOCX`, `XLSX` and `PPTX` documents into a document format that we can use downstream.\n' metadata={'line_number': 5, 'source': 'office_file.mdx'}
... output truncated for demo purposes
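Conceptually, this composition amounts to parsing every blob that the blob loader yields. The sketch below illustrates that with plain functions; `BlobSketch`, `in_memory_blobs`, `line_parser`, and `generic_lazy_load` are all hypothetical stand-ins, and the real GenericLoader may do more than this.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A "blob" here is just (source, raw bytes); this is a simplification.
BlobSketch = Tuple[str, bytes]


def in_memory_blobs() -> Iterator[BlobSketch]:
    """Hypothetical blob loader: yields raw data plus where it came from."""
    yield ("a.txt", b"one\ntwo")
    yield ("b.txt", b"three")


def line_parser(blob: BlobSketch) -> Iterator[dict]:
    """Hypothetical parser: one record per line, tagged with its source."""
    source, data = blob
    for number, line in enumerate(data.decode("utf-8").splitlines(), start=1):
        yield {"page_content": line, "line_number": number, "source": source}


def generic_lazy_load(
    blobs: Iterable[BlobSketch],
    parser: Callable[[BlobSketch], Iterator[dict]],
) -> Iterator[dict]:
    """Compose the two: lazily parse every blob the loader yields."""
    for blob in blobs:
        yield from parser(blob)


records = list(generic_lazy_load(in_memory_blobs(), line_parser))
print(records)
```

Swapping in a different blob source or a different parser requires no change to the composing function, which is the reuse that the loader/parser split is meant to provide.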

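Custom Generic Loader

If you really like creating classes, you can subclass GenericLoader and create a class that encapsulates the logic.

You can subclass this class to load content using an existing loader.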

from typing import Any


class MyCustomLoader(GenericLoader):
    @staticmethod
    def get_parser(**kwargs: Any) -> BaseBlobParser:
        """Override this method to associate a default parser with the class."""
        return MyParser()

loader = MyCustomLoader.from_filesystem(path=".", glob="*.mdx", show_progress=True)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")
  0%|          | 0/8 [00:00<?, ?it/s]
page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 2, 'source': 'office_file.mdx'}
page_content='>[The Microsoft Office](https://www.office.com/) suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.\n' metadata={'line_number': 3, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 4, 'source': 'office_file.mdx'}
page_content='This covers how to load commonly used file formats including `DOCX`, `XLSX` and `PPTX` documents into a document format that we can use downstream.\n' metadata={'line_number': 5, 'source': 'office_file.mdx'}
... output truncated for demo purposes
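The reason overriding `get_parser` is enough can be sketched with a stand-in class: a factory classmethod falls back to a class-level default parser when none is passed explicitly. `LoaderSketch`, `from_bytes`, `UpperLoader`, and `upper_parser` are hypothetical names, and this is only an approximation of the pattern, not GenericLoader's actual implementation.

```python
from typing import Callable, Iterator, Optional

Parser = Callable[[bytes], Iterator[str]]


def upper_parser(data: bytes) -> Iterator[str]:
    """Hypothetical parser that upper-cases each line."""
    for line in data.decode("utf-8").splitlines():
        yield line.upper()


class LoaderSketch:
    """Hypothetical stand-in showing a class-level default parser hook."""

    def __init__(self, data: bytes, parser: Parser) -> None:
        self.data = data
        self.parser = parser

    @staticmethod
    def get_parser() -> Parser:
        raise NotImplementedError("subclasses may supply a default parser")

    @classmethod
    def from_bytes(
        cls, data: bytes, parser: Optional[Parser] = None
    ) -> "LoaderSketch":
        # An explicit parser wins; otherwise fall back to the class default.
        return cls(data, parser if parser is not None else cls.get_parser())

    def lazy_load(self) -> Iterator[str]:
        yield from self.parser(self.data)


class UpperLoader(LoaderSketch):
    @staticmethod
    def get_parser() -> Parser:
        return upper_parser


shouted = list(UpperLoader.from_bytes(b"meow\npurr").lazy_load())
print(shouted)
```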