How to split text by tokens
Language models have a token limit. You should not exceed that token limit. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text, you should use the same tokenizer as used in the language model.
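For example, for an OpenAI model you can count tokens with the encoding that model actually uses (a minimal sketch; the model name and sample string are illustrative):
import tiktoken
# Pick the encoding that matches the target model.
enc = tiktoken.encoding_for_model("gpt-4")
num_tokens = len(enc.encode("Language models have a token limit."))
print(num_tokens)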
tiktoken
tiktoken is a fast BPE tokenizer created by OpenAI.
We can use tiktoken to estimate the number of tokens used. It will probably be more accurate for OpenAI models.
- How the text is split: by character passed in.
- How the chunk size is measured: by the tiktoken tokenizer.
CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter can be used with tiktoken directly.
%pip install --upgrade --quiet langchain-text-splitters tiktoken
from langchain_text_splitters import CharacterTextSplitter
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
To split with a CharacterTextSplitter and then merge chunks with tiktoken, call its .from_tiktoken_encoder() method. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer.
The .from_tiktoken_encoder() method takes either encoding_name as an argument (e.g. cl100k_base), or model_name (e.g. gpt-4). All additional arguments like chunk_size, chunk_overlap, and separators are used to instantiate the CharacterTextSplitter:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
To implement a hard constraint on the chunk size, we can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, where each split will be recursively split if it is larger than the limit:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
model_name="gpt-4",
chunk_size=100,
chunk_overlap=0,
)
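A quick way to verify the hard cap (a sketch; it assumes text_splitter and state_of_the_union from above are in scope, and that the gpt-4 encoding is what the splitter measures with):
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
texts = text_splitter.split_text(state_of_the_union)
# Every chunk should now measure at most 100 tokens.
print(max(len(enc.encode(t)) for t in texts))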
We can also load a TokenTextSplitter, which works with tiktoken directly and will ensure each split is smaller than the chunk size.
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our
Some written languages (e.g. Chinese and Japanese) have characters which encode to 2 or more tokens. Using the TokenTextSplitter directly can split the tokens for a character between two chunks, causing malformed Unicode characters. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure chunks contain valid Unicode strings.
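You can inspect how many tokens individual characters occupy (a sketch; the sample string is illustrative, and which characters need multiple tokens depends on the encoding):
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
for ch in "こんにちは":
    # Characters whose encoding is longer than one token are the ones
    # TokenTextSplitter could cut in half.
    print(ch, enc.encode(ch))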
spaCy
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
LangChain implements splitters based on the spaCy tokenizer.
- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.
%pip install --upgrade --quiet spacy
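In a fresh environment you may also need spaCy's default English pipeline before the splitter can run (an assumption about your setup; SpacyTextSplitter defaults to the en_core_web_sm pipeline):
!python -m spacy download en_core_web_sm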
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
SentenceTransformers
The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with sentence-transformer models. Its default behaviour is to split the text into chunks that fit the token window of the sentence-transformer model that you would like to use.
To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a SentenceTransformersTokenTextSplitter. You can optionally specify:
- chunk_overlap: integer count of token overlap;
- model_name: sentence-transformer model name, defaulting to "sentence-transformers/all-mpnet-base-v2";
- tokens_per_chunk: desired token count per chunk.
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1
# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier
print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[1])
lorem
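All three parameters can also be set explicitly (a sketch; the tokens_per_chunk and chunk_overlap values are illustrative and must fit within the model's maximum sequence length):
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=256,
    chunk_overlap=16,
)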
NLTK
Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.
# pip install nltk
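# NLTK's sentence tokenizer also needs the "punkt" models on first use
# (an assumption about a fresh environment; skip if already downloaded):
# import nltk; nltk.download("punkt")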
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
Groups of citizens blocking tanks with their bodies.
KoNLPY
KoNLPy: Korean NLP in Python is a Python package for natural language processing (NLP) of the Korean language.
Token splitting involves the segmentation of text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation marks. The effectiveness of token splitting largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens. Since tokenizers designed for the English language are not equipped to understand the unique semantic structures of other languages, such as Korean, they cannot be effectively used for Korean language processing.
Token splitting for Korean with KoNLPy's Kkma Analyzer
In case of Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text. It breaks down sentences into words and words into their respective morphemes, identifying parts of speech for each token. It can segment a block of text into individual sentences, which is particularly useful for processing long texts.
Usage considerations
While Kkma is renowned for its detailed analysis, it is important to note that this precision may impact processing speed. Thus, Kkma is best suited for applications where analytical depth is prioritized over rapid text processing.
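For reference, this is roughly what Kkma exposes when used directly through konlpy (a sketch, requiring the konlpy install below; the splitter in this section relies on the sentence segmentation):
from konlpy.tag import Kkma
kkma = Kkma()
# Segment a block of text into individual sentences.
print(kkma.sentences("네, 안녕하세요. 반갑습니다."))
# Break words down into morphemes with part-of-speech tags.
print(kkma.pos("안녕하세요"))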
# pip install konlpy
# This is a long Korean document that we want to split up into its component sentences.
with open("./your_korean_doc.txt") as f:
korean_document = f.read()
from langchain_text_splitters import KonlpyTextSplitter
text_splitter = KonlpyTextSplitter()
texts = text_splitter.split_text(korean_document)
# The sentences are split with "\n\n" characters.
print(texts[0])
춘향전 옛날에 남원에 이 도령이라는 벼슬아치 아들이 있었다.
그의 외모는 빛나는 달처럼 잘생겼고, 그의 학식과 기예는 남보다 뛰어났다.
한편, 이 마을에는 춘향이라는 절세 가인이 살고 있었다.
춘 향의 아름다움은 꽃과 같아 마을 사람들 로부터 많은 사랑을 받았다.
어느 봄날, 도령은 친구들과 놀러 나갔다가 춘 향을 만 나 첫 눈에 반하고 말았다.
두 사람은 서로 사랑하게 되었고, 이내 비밀스러운 사랑의 맹세를 나누었다.
하지만 좋은 날들은 오래가지 않았다.
도령의 아버지가 다른 곳으로 전근을 가게 되어 도령도 떠나 야만 했다.
이별의 아픔 속에서도, 두 사람은 재회를 기약하며 서로를 믿고 기다리기로 했다.
그러나 새로 부임한 관아의 사또가 춘 향의 아름다움에 욕심을 내 어 그녀에게 강요를 시작했다.
춘 향 은 도령에 대한 자신의 사랑을 지키기 위해, 사또의 요구를 단호히 거절했다.
이에 분노한 사또는 춘 향을 감옥에 가두고 혹독한 형벌을 내렸다.
이야기는 이 도령이 고위 관직에 오른 후, 춘 향을 구해 내는 것으로 끝난다.
두 사람은 오랜 시련 끝에 다시 만나게 되고, 그들의 사랑은 온 세상에 전해 지며 후세에까지 이어진다.
- 춘향전 (The Tale of Chunhyang)
Hugging Face tokenizer
Hugging Face has many tokenizers.
We use the Hugging Face tokenizer GPT2TokenizerFast to count the text length in tokens.
- How the text is split: by character passed in.
- How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
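As with the tiktoken variants above, chunks can end up larger than chunk_size here, because CharacterTextSplitter only cuts at its separator. You can check the actual token counts with the same tokenizer (a sketch assuming tokenizer and texts from above are in scope):
print([len(tokenizer.encode(t)) for t in texts[:3]])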