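How to split HTML
Splitting HTML documents into manageable chunks is essential for many text-processing tasks, such as natural language processing and search indexing. This guide covers the three text splitters that LangChain provides for splitting HTML content: HTMLHeaderTextSplitter, HTMLSectionSplitter, and HTMLSemanticPreservingSplitter.
Each of these splitters has its own characteristics and use cases. This guide explains how they differ, why you might choose one over the others, and how to use each effectively.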
%pip install -qU langchain-text-splitters
Overview of the splitters
HTMLHeaderTextSplitter
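Use this when you want to preserve the hierarchical structure of a document based on its headers.
Description: Splits HTML text at header tags (for example <h1>, <h2>, <h3>) and adds metadata to each chunk for every header relevant to it.
Capabilities:
- Splits text at the HTML element level.
- Preserves the context-rich information encoded in the document structure.
- Can return chunks element by element or combine elements that share the same metadata.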
HTMLSectionSplitter
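Useful when you want to split an HTML document into larger sections, such as <section>, <div>, or custom-defined sections.
Description: Similar to HTMLHeaderTextSplitter, but focused on splitting HTML into sections based on specified tags.
Capabilities:
- Uses XSLT transformations to detect and split sections.
- Uses RecursiveCharacterTextSplitter internally for large sections.
- Considers font sizes to determine sections.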
HTMLSemanticPreservingSplitter
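Ideal when you need to ensure that structured elements are not split across chunks, so that their contextual relevance is preserved.
Description: Splits HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other HTML components.
Capabilities:
- Preserves tables, lists, and other specified HTML elements.
- Allows custom handlers for specific HTML tags.
- Ensures that the semantic meaning of the document is maintained.
- Built-in normalization and stopword removal.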
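Choosing the right splitter
- Use HTMLHeaderTextSplitter when: you need to split an HTML document based on its header hierarchy and keep metadata about the headers.
- Use HTMLSectionSplitter when: you need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
- Use HTMLSemanticPreservingSplitter when: you need to split the document into chunks while preserving semantic elements such as tables and lists, ensuring that they are not split and that their context is maintained.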
| Feature | HTMLHeaderTextSplitter | HTMLSectionSplitter | HTMLSemanticPreservingSplitter |
|---|---|---|---|
| Splits based on headers | Yes | Yes | Yes |
| Preserves semantic elements (tables, lists) | No | No | Yes |
| Adds metadata for headers | Yes | Yes | Yes |
| Custom handlers for HTML tags | No | No | Yes |
| Preserves media (images, videos) | No | No | Yes |
| Considers font sizes | No | Yes | No |
| Uses XSLT transformations | No | Yes | No |
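The comparison table can also be read as a lookup from required features to candidate splitters. The sketch below only transcribes the table into a dictionary; the helper name `splitters_with` is hypothetical and is not part of LangChain.

```python
# Feature table transcribed from the comparison above.
# `splitters_with` is a hypothetical helper, not a LangChain API.
FEATURES = {
    "HTMLHeaderTextSplitter": {
        "Splits based on headers",
        "Adds metadata for headers",
    },
    "HTMLSectionSplitter": {
        "Splits based on headers",
        "Adds metadata for headers",
        "Considers font sizes",
        "Uses XSLT transformations",
    },
    "HTMLSemanticPreservingSplitter": {
        "Splits based on headers",
        "Preserves semantic elements (tables, lists)",
        "Adds metadata for headers",
        "Custom handlers for HTML tags",
        "Preserves media (images, videos)",
    },
}


def splitters_with(*required: str) -> list[str]:
    """Return the splitters that support every required feature."""
    return [
        name for name, features in FEATURES.items()
        if set(required) <= features
    ]


print(splitters_with("Considers font sizes"))
print(splitters_with("Adds metadata for headers", "Custom handlers for HTML tags"))
```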
Example HTML document
Let's use the following HTML document as an example:
html_string = """
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='UTF-8'>
<meta name='viewport' content='width=device-width, initial-scale=1.0'>
<title>Fancy Example HTML Page</title>
</head>
<body>
<h1>Main Title</h1>
<p>This is an introductory paragraph with some basic content.</p>
<h2>Section 1: Introduction</h2>
<p>This section introduces the topic. Below is a list:</p>
<ul>
<li>First item</li>
<li>Second item</li>
<li>Third item with <strong>bold text</strong> and <a href='#'>a link</a></li>
</ul>
<h3>Subsection 1.1: Details</h3>
<p>This subsection provides additional details. Here's a table:</p>
<table border='1'>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row 1, Cell 1</td>
<td>Row 1, Cell 2</td>
<td>Row 1, Cell 3</td>
</tr>
<tr>
<td>Row 2, Cell 1</td>
<td>Row 2, Cell 2</td>
<td>Row 2, Cell 3</td>
</tr>
</tbody>
</table>
<h2>Section 2: Media Content</h2>
<p>This section contains an image and a video:</p>
<img src='example_image_link.mp4' alt='Example Image'>
<video controls width='250' src='example_video_link.mp4' type='video/mp4'>
Your browser does not support the video tag.
</video>
<h2>Section 3: Code Example</h2>
<p>This section contains a code block:</p>
<pre><code data-lang="html">
<div>
<p>This is a paragraph inside a div.</p>
</div>
</code></pre>
<h2>Conclusion</h2>
<p>This is the conclusion of the document.</p>
</body>
</html>
"""
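Using HTMLHeaderTextSplitter
HTMLHeaderTextSplitter is a "structure-aware" text splitter that splits text at the HTML element level and adds metadata to each chunk for every header "relevant" to it. It can return chunks element by element or combine elements that share the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving the context-rich information encoded in the document structure. It can be used with other text splitters as part of a chunking pipeline.
It is analogous to the MarkdownHeaderTextSplitter for Markdown files.
To specify which headers to split on, pass headers_to_split_on when instantiating HTMLHeaderTextSplitter, as shown below.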
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list: \nFirst item Second item Third item with bold text and a link'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content="This subsection provides additional details. Here's a table:"),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'),
Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
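To make the header metadata in this output less opaque, here is a minimal sketch of the general idea using only the standard library: keep track of the headers currently in scope and tag each run of text with them. This is an illustration under simplifying assumptions (flat markup, no `<head>` section), not LangChain's implementation, and the class name `HeaderTracker` is made up for this example.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Collect text chunks, tagging each with the headers currently in scope."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.levels = {tag: int(tag[1]) for tag in self.names}
        self.active = {}       # header tag -> header text currently in scope
        self.in_header = None  # header tag being read, if any
        self.buffer = []       # text gathered since the last header
        self.chunks = []       # (metadata, text) pairs

    def _flush(self):
        text = " ".join(" ".join(self.buffer).split())
        if text:
            metadata = {
                self.names[tag]: " ".join(value.split())
                for tag, value in sorted(self.active.items())
            }
            self.chunks.append((metadata, text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self._flush()
            # A new header closes every header at the same or a deeper level.
            self.active = {
                t: v for t, v in self.active.items()
                if self.levels[t] < self.levels[tag]
            }
            self.active[tag] = ""
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.active[self.in_header] += data
        else:
            self.buffer.append(data)

    def close(self):
        super().close()
        self._flush()


sample = """
<h1>Main Title</h1>
<p>Intro paragraph.</p>
<h2>Section 1</h2>
<p>Section text.</p>
<h3>Subsection 1.1</h3>
<p>Details.</p>
<h2>Section 2</h2>
<p>More text.</p>
"""
tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")])
tracker.feed(sample)
tracker.close()
for metadata, text in tracker.chunks:
    print(metadata, text)
```

Note how "Section 2" drops the "Header 3" entry: a new h2 replaces the previous h2 and everything nested beneath it, which mirrors the pattern in the metadata above.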
To return each element together with its associated headers, specify return_each_element=True when instantiating HTMLHeaderTextSplitter:
html_splitter = HTMLHeaderTextSplitter(
headers_to_split_on,
return_each_element=True,
)
html_header_splits_elements = html_splitter.split_text(html_string)
Compare with the result above, where elements are aggregated by their headers:
for element in html_header_splits[:2]:
    print(element)
page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:
First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
Now each element is returned as a distinct Document:
for element in html_header_splits_elements[:3]:
    print(element)
page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
page_content='First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
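The relationship between the two modes can be sketched in plain Python: merging consecutive elements that carry identical metadata yields the aggregated view. The tuples below are copied from the per-element output above; the helper `merge_by_metadata` and the newline joiner are assumptions made for illustration, not the library's internals.

```python
from itertools import groupby

# (page_content, metadata) pairs copied from the per-element output above.
elements = [
    ("This is an introductory paragraph with some basic content.",
     {"Header 1": "Main Title"}),
    ("This section introduces the topic. Below is a list:",
     {"Header 1": "Main Title", "Header 2": "Section 1: Introduction"}),
    ("First item Second item Third item with bold text and a link",
     {"Header 1": "Main Title", "Header 2": "Section 1: Introduction"}),
]


def merge_by_metadata(pairs, joiner="\n"):
    """Merge consecutive elements whose metadata dictionaries are equal."""
    merged = []
    for metadata, group in groupby(pairs, key=lambda pair: pair[1]):
        merged.append((joiner.join(text for text, _ in group), metadata))
    return merged


for text, metadata in merge_by_metadata(elements):
    print(metadata, repr(text))
```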
How to split from a URL or an HTML file:
To read directly from a URL, pass the URL string to the split_text_from_url method.
Similarly, a local HTML file can be passed to the split_text_from_file method.
url = "https://plato.stanford.edu/entries/goedel/"
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
("h4", "Header 4"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)
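How to constrain chunk size:
HTMLHeaderTextSplitter, which splits on HTML headers, can be composed with another splitter that constrains chunks by character length, such as RecursiveCharacterTextSplitter.
This can be done with the .split_documents method of the second splitter: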
from langchain_text_splitters import RecursiveCharacterTextSplitter
chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]
[Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth'),
Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.'),
Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='This account of Gödel’s discovery was told to Hao Wang very much after the fact; but in Gödel’s contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians'),
Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödel’s publication of that theorem.'),
Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'}, page_content='We now describe the proof of the two theorems, formulating Gödel’s results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödel’s notation.')]
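The repeated words at the boundaries of consecutive chunks come from chunk_overlap. As a rough illustration of what chunk_size and chunk_overlap mean, here is a simplified word-based chunker. It is a sketch only: RecursiveCharacterTextSplitter works through a list of separators and is more careful than this, and the function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Greedy word-based chunking with a character budget and trailing overlap.

    A simplified illustration only; it assumes chunk_overlap is much smaller
    than chunk_size and that no single word exceeds chunk_size.
    """
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing words that fit within the overlap budget.
            carried = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "This is an introductory paragraph with some basic content."
for chunk in chunk_words(text, chunk_size=50, chunk_overlap=5):
    print(len(chunk), repr(chunk))
```

The word "basic" ends the first chunk and starts the second, which is the same kind of overlap visible in the Gödel excerpts above.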
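Limitations
Structure can vary considerably from one HTML document to another, and although HTMLHeaderTextSplitter attempts to attach all "relevant" headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes "above" the associated text, that is, prior siblings, ancestors, or combinations thereof. In the following news article (as of the writing of this document), the document is structured so that the text of the top-level headline, although tagged "h1", sits in a subtree distinct from the text elements we would expect it to be "above". As a result, the "h1" element and its text do not show up in the chunk metadata (although, where applicable, "h2" and its text do):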
url = "https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html"
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
print(html_header_splits[1].page_content[:500])
No two El Niño winters are the same, but many have temperature and precipitation trends in common.
Average conditions during an El Niño winter across the continental US.
One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA.
Because the jet stream is essentially a river of air that storms flow through, they c
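One way to notice this kind of gap is to check which chunks lack an expected metadata key. The sketch below uses plain dictionaries with made-up values as stand-ins for Document.metadata; with real results you could build the list as `[doc.metadata for doc in html_header_splits]`. The helper name `chunks_missing` is hypothetical.

```python
# Made-up metadata standing in for Document.metadata; these values are
# illustrative and were not taken from the article.
metadatas = [
    {},
    {"Header 2": "Some section"},
    {"Header 1": "Some title", "Header 2": "Another section"},
]


def chunks_missing(metadatas, key):
    """Return the indexes of chunks whose metadata lacks `key`."""
    return [i for i, metadata in enumerate(metadatas) if key not in metadata]


print(chunks_missing(metadatas, "Header 1"))
print(chunks_missing(metadatas, "Header 2"))
```

If most or all chunks are missing "Header 1", the page probably places its headline outside the subtree that holds the body text.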
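Using HTMLSectionSplitter
Similar in concept to HTMLHeaderTextSplitter, HTMLSectionSplitter is a "structure-aware" text splitter that splits text at the element level and adds metadata to each chunk for every header "relevant" to it. It lets you split HTML by sections.
It can return chunks element by element or combine elements that share the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving the context-rich information encoded in the document structure.
Use xslt_path to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. By default, the converting_to_header.xslt file in the data_connection/document_transformers directory is used. The transformation converts the HTML into a format or layout in which sections are easier to detect. For example, a span can be converted to a header tag based on its font size so that it is detected as a section.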
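The span-to-header idea can be illustrated without XSLT. The sketch below promotes spans whose inline font size meets a threshold to header tags using a regular expression. It is a conceptual illustration only: it does not reproduce the rules of the bundled XSLT file, and the 20px threshold, the function name `promote_large_spans`, and the exact `style` format it matches are all assumptions.

```python
import re

# Matches only the simple inline form <span style="font-size: NNpx">...</span>.
SPAN = re.compile(
    r'<span\s+style="font-size:\s*(\d+)px;?">(.*?)</span>',
    flags=re.IGNORECASE | re.DOTALL,
)


def promote_large_spans(html: str, threshold_px: int = 20, tag: str = "h1") -> str:
    """Rewrite spans at or above the size threshold as header tags."""

    def replace(match: re.Match) -> str:
        size, inner = int(match.group(1)), match.group(2)
        if size >= threshold_px:
            return f"<{tag}>{inner}</{tag}>"
        return match.group(0)  # leave smaller spans untouched

    return SPAN.sub(replace, html)


html = (
    '<span style="font-size: 24px">Big Title</span>'
    '<span style="font-size: 12px">body</span>'
)
print(promote_large_spans(html))
```

After such a transformation, a header-based splitter can treat the promoted text as a section boundary.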
How to split an HTML string:
from langchain_text_splitters import HTMLSectionSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
]
html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content="Section 1: Introduction \n This section introduces the topic. Below is a list: \n \n First item \n Second item \n Third item with bold text and a link \n \n \n Subsection 1.1: Details \n This subsection provides additional details. Here's a table: \n \n \n \n Header 1 \n Header 2 \n Header 3 \n \n \n \n \n Row 1, Cell 1 \n Row 1, Cell 2 \n Row 1, Cell 3 \n \n \n Row 2, Cell 1 \n Row 2, Cell 2 \n Row 2, Cell 3"),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \n This section contains an image and a video: \n \n \n Your browser does not support the video tag.'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \n This section contains a code block: \n \n <div>\n <p>This is a paragraph inside a div.</p>\n </div>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \n This is the conclusion of the document.')]
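How to constrain chunk size:
HTMLSectionSplitter can be used with other text splitters as part of a chunking pipeline. Internally, it uses RecursiveCharacterTextSplitter when a section is larger than the chunk size. It also considers the font size of the text, relative to a determined font size threshold, to decide whether the text is a section.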
from langchain_text_splitters import RecursiveCharacterTextSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
chunk_size = 50
chunk_overlap = 5
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split
splits = text_splitter.split_documents(html_header_splits)
splits
[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title'),
Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some'),
Document(metadata={'Header 1': 'Main Title'}, page_content='some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Section 1: Introduction'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='is a list:'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='First item \n Second item'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Third item with bold text and a link'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Subsection 1.1: Details'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='This subsection provides additional details.'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content="Here's a table:"),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Header 1 \n Header 2 \n Header 3'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 1 \n Row 1, Cell 2'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 3 \n \n \n Row 2, Cell 1'),
Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 2, Cell 2 \n Row 2, Cell 3'),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content'),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Your browser does not support the video'),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='tag.'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: \n \n <div>'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='<p>This is a paragraph inside a div.</p>'),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='</div>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
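Using HTMLSemanticPreservingSplitter
HTMLSemanticPreservingSplitter is designed to split HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other HTML components. This ensures that such elements are not split across chunks, which would otherwise lose contextual relevance such as table headers and list headers.
The splitter is designed, at its core, to create contextually relevant chunks. General recursive splitting with HTMLHeaderTextSplitter can cut tables, lists, and other structured elements in the middle, losing significant context and producing poor chunks.
HTMLSemanticPreservingSplitter is essential for splitting HTML content that includes structured elements such as tables and lists, especially when these elements must be kept intact. In addition, its ability to define custom handlers for specific HTML tags makes it a versatile tool for processing complex HTML documents.
Important: max_chunk_size is not a hard upper bound on chunk size. The size calculation is performed while the preserved content is not part of the chunk, so that this content is never split. When the preserved content is added back into the chunk, the chunk may exceed max_chunk_size. This is essential for keeping the structure of the original document.
Notes:
- We have defined a custom handler that reformats the contents of code blocks.
- We have defined a deny list for specific HTML elements, so that they and their contents are decomposed during preprocessing.
- We have intentionally set a small chunk size to demonstrate that elements are not split.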
# BeautifulSoup is required to use the custom handlers
from bs4 import Tag
from langchain_text_splitters import HTMLSemanticPreservingSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
]
def code_handler(element: Tag) -> str:
    data_lang = element.get("data-lang")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"
    return code_format
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
separators=["\n\n", "\n", ". ", "! ", "? "],
max_chunk_size=50,
preserve_images=True,
preserve_videos=True,
elements_to_preserve=["table", "ul", "ol", "code"],
denylist_tags=["script", "style", "head"],
custom_handlers={"code": code_handler},
)
documents = splitter.split_text(html_string)
documents
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),
Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"),
Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:  '),
Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),
Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
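Preserving tables and lists
In this example, we demonstrate how HTMLSemanticPreservingSplitter preserves a table and a large list within an HTML document. The chunk size is set to 50 characters to show how the splitter keeps these elements whole even when they exceed the defined maximum chunk size.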
from langchain_text_splitters import HTMLSemanticPreservingSplitter
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section 1</h1>
<p>This section contains an important table and list that should not be split across chunks.</p>
<table>
<tr>
<th>Item</th>
<th>Quantity</th>
<th>Price</th>
</tr>
<tr>
<td>Apples</td>
<td>10</td>
<td>$1.00</td>
</tr>
<tr>
<td>Oranges</td>
<td>5</td>
<td>$0.50</td>
</tr>
<tr>
<td>Bananas</td>
<td>50</td>
<td>$1.50</td>
</tr>
</table>
<h2>Subsection 1.1</h2>
<p>Additional text in subsection 1.1 that is separated from the table and list.</p>
<p>Here is a detailed list:</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
elements_to_preserve=["table", "ul"],
)
documents = splitter.split_text(html_string)
print(documents)
[Document(metadata={'Header 1': 'Section 1'}, page_content='This section contains an important table and list'), Document(metadata={'Header 1': 'Section 1'}, page_content='that should not be split across chunks.'), Document(metadata={'Header 1': 'Section 1'}, page_content='Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='Additional text in subsection 1.1 that is'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='separated from the table and list. Here is a'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content="detailed list: Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
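Explanation:
In this example, HTMLSemanticPreservingSplitter keeps the entire table and the unordered list (<ul>) intact within their respective chunks. Even though the chunk size is set to 50 characters, the splitter recognizes that these elements should not be split and keeps them whole.
This is especially important when working with data tables or lists, where splitting the content could lose or confuse the context. The resulting Document objects retain the full structure of these elements, so the contextual relevance of the information is maintained.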
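Why a chunk can end up larger than max_chunk_size is easier to see in a toy version of the approach: swap each preserved element for a short placeholder, split while the placeholders are in place, then put the originals back. The sketch below is a simplified illustration with a hypothetical function name; it ignores nested tags of the same name and is not the library's implementation.

```python
import re


def split_preserving(html_text: str, preserve_tags, max_chunk_size: int):
    """Split on spaces within a size budget while keeping chosen elements whole."""
    preserved = {}

    def stash(match):
        key = f"__PRESERVED_{len(preserved)}__"
        preserved[key] = match.group(0)
        return f" {key} "

    # 1. Swap each preserved element for a short placeholder.
    pattern = "|".join(rf"<{tag}\b.*?</{tag}>" for tag in preserve_tags)
    text = re.sub(pattern, stash, html_text, flags=re.DOTALL)

    # 2. Greedy split; sizes are measured while placeholders are in place.
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)

    # 3. Restore the preserved elements; chunks may now exceed the budget.
    restored = []
    for chunk in chunks:
        for key, original in preserved.items():
            chunk = chunk.replace(key, original)
        restored.append(chunk)
    return restored


html_text = (
    "Intro text before the list. "
    "<ul><li>Item 1: a fairly long description.</li>"
    "<li>Item 2: another long description.</li></ul> Closing text."
)
for chunk in split_preserving(html_text, ["table", "ul"], max_chunk_size=40):
    print(len(chunk), repr(chunk))
```

The second chunk is longer than 40 characters because the list was measured as a placeholder and only restored afterwards.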
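Using a custom handler
HTMLSemanticPreservingSplitter lets you define custom handlers for specific HTML elements. Some platforms use custom HTML tags that BeautifulSoup does not parse natively; when that happens, you can use a custom handler to add formatting logic easily.
This is particularly useful for elements that need special processing, such as <iframe> tags or specific 'data-' elements. In this example, we create a custom handler for iframe tags that converts them into Markdown-like links.
def custom_iframe_extractor(iframe_tag):
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"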
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
separators=["\n\n", "\n", ". "],
elements_to_preserve=["table", "ul", "ol"],
custom_handlers={"iframe": custom_iframe_extractor},
)
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section with Iframe</h1>
<iframe src="https://example.com/embed"></iframe>
<p>Some text after the iframe.</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
documents = splitter.split_text(html_string)
print(documents)
[Document(metadata={'Header 1': 'Section with Iframe'}, page_content='[iframe:https://example.com/embed](https://example.com/embed) Some text after the iframe'), Document(metadata={'Header 1': 'Section with Iframe'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
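Explanation:
In this example, we defined a custom handler for iframe tags that converts them into Markdown-like links. When the splitter processes the HTML content, it uses this handler to transform the iframe tags while preserving other elements such as tables and lists. The resulting Document objects show how the iframe is handled according to the custom logic you provided.
Important: When preserving items such as links, take care not to include a bare . in your separators and not to leave separators blank. RecursiveCharacterTextSplitter splits on the full stop, which cuts links in half. Make sure to provide a separator list that uses ". " (a period followed by a space) instead.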
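The difference between the two separators is easy to see with plain string splitting. This short sketch uses `str.split` rather than the splitter itself, purely to show which positions each separator matches.

```python
text = (
    "See [iframe:https://example.com/embed](https://example.com/embed). "
    "More text follows."
)

# A bare "." also matches the dots inside the URL, so the link is cut apart.
by_dot = text.split(".")
print(by_dot)

# ". " only matches a period followed by a space, so the link stays whole.
by_dot_space = text.split(". ")
print(by_dot_space)
```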
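Using a custom handler to analyze an image with an LLM
With custom handlers, we can also override the default processing for any element. A good example is inserting semantic analysis of an image within a document directly into the chunking flow.
Because our function is called when the tag is discovered, we can override the <img> tag and turn off preserve_images to insert any content we would like to embed in our chunks.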
"""This example assumes you have helper methods `load_image_from_url` and an LLM agent `llm` that can process image data."""
from langchain.agents import AgentExecutor
# This example needs to be replaced with your own agent
llm = AgentExecutor(...)
# This method is a placeholder for loading image data from a URL and is not implemented here
def load_image_from_url(image_url: str) -> bytes:
    # Assuming this method fetches the image data from the URL
    return b"image_data"
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section with Image and Link</h1>
<p>
<img src="https://example.com/image.jpg" alt="An example image" />
Some text after the image.
</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
def custom_image_handler(img_tag) -> str:
    img_src = img_tag.get("src", "")
    img_alt = img_tag.get("alt", "No alt text provided")
    image_data = load_image_from_url(img_src)
    semantic_meaning = llm.invoke(image_data)
    markdown_text = f"[Image Alt Text: {img_alt} | Image Source: {img_src} | Image Semantic Meaning: {semantic_meaning}]"
    return markdown_text
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
separators=["\n\n", "\n", ". "],
elements_to_preserve=["ul"],
preserve_images=False,
custom_handlers={"img": custom_image_handler},
)
documents = splitter.split_text(html_string)
print(documents)
[Document(metadata={'Header 1': 'Section with Image and Link'}, page_content='[Image Alt Text: An example image | Image Source: https://example.com/image.jpg | Image Semantic Meaning: semantic-meaning] Some text after the image'),
Document(metadata={'Header 1': 'Section with Image and Link'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]
Explanation:
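After writing a custom handler that extracts specific fields from the <img> element in the HTML, we can further process the data with our agent and insert the result directly into our chunk. It is important to set preserve_images to False; otherwise the default processing of <img> fields takes place.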
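Because a handler is just a function from a tag to a string, it can be checked in isolation before an agent is wired in. The sketch below is one possible arrangement, not part of the library: `make_image_handler` is a hypothetical factory, the lambda stands in for the real image analysis, and a plain dict stands in for the parsed tag because the handler only relies on `.get(name, default)`.

```python
def make_image_handler(describe):
    """Build an <img> handler around any callable that maps a URL to a description."""

    def handler(img_tag) -> str:
        img_src = img_tag.get("src", "")
        img_alt = img_tag.get("alt", "No alt text provided")
        return (
            f"[Image Alt Text: {img_alt} | Image Source: {img_src} | "
            f"Image Semantic Meaning: {describe(img_src)}]"
        )

    return handler


# A dict offers the same .get(name, default) lookup the handler relies on,
# so it can stand in for a parsed tag in a quick check.
fake_tag = {"src": "https://example.com/image.jpg", "alt": "An example image"}
handler = make_image_handler(lambda url: "semantic-meaning")
print(handler(fake_tag))
```

The resulting function can then be passed as `custom_handlers={"img": handler}` in the same way as the handler shown earlier.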