text_chunk_mapper¶
Split input text into chunks based on specified criteria.
Splits the input text into multiple chunks using a specified maximum length and a split pattern.
- If `max_len` is provided, the text is split into chunks with a maximum length of `max_len`.
- If `split_pattern` is provided, the text is split at occurrences of the pattern. If the length exceeds `max_len`, a cut is forced.
- The `overlap_len` parameter specifies the overlap length between consecutive chunks if the split does not occur at the pattern.
- Uses a Hugging Face tokenizer to calculate the text length in tokens if a tokenizer name is provided; otherwise, it uses the string length.
- Caches the following stats: 'chunk_count' (number of chunks generated for each sample).
- Raises a `ValueError` if both `max_len` and `split_pattern` are `None`, or if `overlap_len` is greater than or equal to `max_len`.
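The two error conditions described above can be restated as a small check. This is only a sketch: the helper name `validate_chunk_params` is hypothetical and not part of the operator, and it assumes the overlap check applies only when `max_len` is set.

```python
from typing import Optional


def validate_chunk_params(max_len: Optional[int],
                          split_pattern: Optional[str],
                          overlap_len: int = 0) -> None:
    """Hypothetical restatement of the documented ValueError rules."""
    # At least one splitting criterion must be given.
    if max_len is None and split_pattern is None:
        raise ValueError("max_len and split_pattern cannot both be None")
    # Assumption: the overlap check is only meaningful when max_len is set.
    if max_len is not None and overlap_len >= max_len:
        raise ValueError("overlap_len must be smaller than max_len")


validate_chunk_params(max_len=20, split_pattern=None)    # accepted
validate_chunk_params(max_len=None, split_pattern="\n")  # accepted
```

Calling it with both criteria set to `None`, or with `overlap_len` equal to `max_len`, raises `ValueError` in this sketch.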
Type: mapper
Tags: cpu, api, text
🔧 Parameter Configuration¶
| name | type | default | desc |
|---|---|---|---|
| `max_len` | `typing.Optional[typing.Annotated[int, Gt(gt=0)]]` | | If not None, split the text into multiple texts of at most this length. |
| `split_pattern` | `typing.Optional[str]` | | If not None, make sure the text is split at this pattern, and force a cut if the length exceeds `max_len`. |
| `overlap_len` | `typing.Annotated[int, Ge(ge=0)]` | | Overlap length of the split texts when the split does not occur at the split pattern. |
| `tokenizer` | `typing.Optional[str]` | | The tokenizer name of Hugging Face tokenizers. If provided, the text length is calculated as the number of tokens; otherwise, the text length equals the string length. Supports tiktoken tokenizers (such as gpt-4o), dashscope tokenizers (such as qwen2.5-72b-instruct), and Hugging Face tokenizers. |
| `trust_remote_code` | `<class 'bool'>` | | Whether to trust the remote code of HF models. |
| `args` | | | extra args |
| `kwargs` | | | extra args |
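The choice between token length and string length can be pictured as selecting a length function once and using it everywhere. The sketch below is illustrative only: `make_length_fn` is a hypothetical helper, and a plain whitespace split stands in for a real tokenizer, so no tokenizer library is loaded.

```python
from typing import Callable, List, Optional


def make_length_fn(
    tokenize: Optional[Callable[[str], List[str]]] = None,
) -> Callable[[str], int]:
    """Return a function that measures text length (hypothetical helper)."""
    if tokenize is None:
        # No tokenizer given: fall back to the string length.
        return len
    # Tokenizer given: the length is the number of tokens it produces.
    return lambda text: len(tokenize(text))


char_len = make_length_fn()
token_len = make_length_fn(str.split)  # whitespace split as a stand-in tokenizer

print(char_len("Today is Sunday"))   # 15
print(token_len("Today is Sunday"))  # 3
```

With a real tokenizer the token count would generally differ from a whitespace split; only the selection logic is the point here.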
📊 Effect demonstration¶
test_naive_text_chunk¶
TextChunkMapper(split_pattern='\n')
📥 input data¶
Today is Sunday and it's a happy day!
Sur la plateforme MT4, plusieurs manières d'accéder à ces fonctionnalités sont conçues simultanément.
欢迎来到阿里巴巴!
📤 output data¶
Today is Sunday and it's a happy day!
Sur la plateforme MT4, plusieurs manières d'accéder à
ces fonctionnalités sont conçues simultanément.
欢迎来到阿里巴巴!
✨ explanation¶
This example shows how the operator splits the input text into chunks based on a specified split pattern. Here the split pattern is '\n', so the text is split at each newline character. Only the second sample contains a newline (between "accéder à" and "ces", as the output shows), so it is split into two parts. The other samples contain no newline and remain unchanged.
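The pattern-based behaviour in this example can be approximated with the standard `re` module. This is a simplified sketch, not the operator's implementation: `split_on_pattern` is a hypothetical helper, it assumes the pattern is treated as a regular expression and that the matched delimiter is dropped, and the position of the newline in the second sample is inferred from the output shown.

```python
import re
from typing import List


def split_on_pattern(text: str, split_pattern: str) -> List[str]:
    """Split text at every match of split_pattern, dropping the delimiter."""
    return re.split(split_pattern, text)


samples = [
    "Today is Sunday and it's a happy day!",
    # Assumption: the newline sits between "à " and "ces".
    "Sur la plateforme MT4, plusieurs manières d'accéder à \n"
    "ces fonctionnalités sont conçues simultanément.",
    "欢迎来到阿里巴巴!",
]

for sample in samples:
    chunks = split_on_pattern(sample, r"\n")
    # len(chunks) plays the role of the 'chunk_count' stat for this sample.
    print(len(chunks), chunks)
```

In this sketch the first and third samples yield one chunk each and the second yields two, matching the chunk boundaries in the output data above.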
test_max_len_text_chunk¶
TextChunkMapper(max_len=20, split_pattern=None)
📥 input data¶
Today is Sunday and it's a happy day!
Sur la plateforme MT4, plusieurs manières d'accéder à ces fonctionnalités sont conçues simultanément.
欢迎来到阿里巴巴!
📤 output data¶
Today is Sunday and
it's a happy day!
Sur la plateforme MT
4, plusieurs manière
s d'accéder à ces fo
nctionnalités sont c
onçues simultanément
.
欢迎来到阿里巴巴!
✨ explanation¶
This example demonstrates how the operator splits the input text into chunks with a maximum length of 20 characters. No tokenizer is given, so length is measured in characters, and with `split_pattern=None` the cuts fall at fixed positions regardless of word boundaries. Words can therefore be broken in the middle, as in "MT" / "4" and "manière" / "s" in the second sample. The first sample happens to be cut right after a space, and the third sample is shorter than 20 characters, so it remains a single chunk.
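The output above is consistent with fixed-width slicing on character positions. A minimal sketch of that idea follows, assuming character-based length and a simple reading of `overlap_len` in which each chunk re-reads the last `overlap_len` characters of the previous one. The helper `chunk_by_length` is hypothetical and is not the operator's code.

```python
from typing import List


def chunk_by_length(text: str, max_len: int, overlap_len: int = 0) -> List[str]:
    """Hypothetical fixed-width chunker; lengths are counted in characters."""
    if max_len <= 0 or not 0 <= overlap_len < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap_len < max_len")
    chunks = []
    start = 0
    while start < len(text):
        end = start + max_len
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the whole text has been covered
        # Assumption: the next chunk starts overlap_len characters early.
        start = end - overlap_len
    return chunks


for chunk in chunk_by_length("Today is Sunday and it's a happy day!", max_len=20):
    print(repr(chunk))
# 'Today is Sunday and '
# "it's a happy day!"
```

With an overlap, `chunk_by_length("abcdefghij", max_len=4, overlap_len=1)` returns `['abcd', 'defg', 'ghij']` in this sketch; how the real operator applies the overlap may differ.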