text_chunk_mapper

Split input text into chunks based on specified criteria.

  • Splits the input text into multiple chunks using a specified maximum length and a split pattern.

  • If max_len is provided, the text is split into chunks with a maximum length of max_len.

  • If split_pattern is provided, the text is split at occurrences of the pattern. If a resulting chunk still exceeds max_len, it is cut forcibly at max_len.

  • The overlap_len parameter sets how much text consecutive chunks share when the boundary is a forced cut rather than a pattern match.

  • Measures text length in tokens if a tokenizer name is provided (Hugging Face, tiktoken, and DashScope tokenizers are supported); otherwise, it uses the string length.

  • Caches the following stats: 'chunk_count' (number of chunks generated for each sample).

  • Raises a ValueError if both max_len and split_pattern are None or if overlap_len is greater than or equal to max_len.

根据指定的标准将输入文本拆分成多个块。

  • 使用指定的最大长度和拆分模式将输入文本拆分成多个块。

  • 如果提供了max_len,则将文本拆分成最大长度为max_len的块。

  • 如果提供了split_pattern,则在模式出现处拆分文本。如果长度超过max_len,则会强制切割。

  • overlap_len参数指定连续块之间的重叠长度,如果拆分不在模式处发生。

  • 如果提供了tokenizer名称,则使用Hugging Face tokenizer计算token长度;否则,使用字符串长度。

  • 缓存以下统计信息:'chunk_count'(为每个样本生成的块数)。

  • 如果max_len和split_pattern都为None,或者overlap_len大于或等于max_len,则引发ValueError。
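
The rules above can be sketched in plain Python. This is a simplified illustration, not the operator's actual implementation: the function name chunk_text is hypothetical, lengths are measured in characters only, and the text is split at every pattern occurrence before any forced cut (the real operator may combine the two steps differently).

```python
import re
from typing import List, Optional


def chunk_text(text: str,
               max_len: Optional[int] = None,
               split_pattern: Optional[str] = r"\n\n",
               overlap_len: int = 0) -> List[str]:
    """Illustrative chunker (hypothetical helper); lengths are in characters."""
    # Validation described in the bullet list above.
    if max_len is None and split_pattern is None:
        raise ValueError("max_len and split_pattern cannot both be None")
    if max_len is not None and overlap_len >= max_len:
        raise ValueError("overlap_len must be less than max_len")

    # Step 1: split at the pattern when one is given; drop empty pieces.
    if split_pattern is not None:
        pieces = [p for p in re.split(split_pattern, text) if p]
    else:
        pieces = [text] if text else []

    if max_len is None:
        return pieces

    # Step 2: force-cut pieces longer than max_len. Overlap is applied
    # only between forced cuts, never across a pattern boundary.
    chunks = []
    step = max_len - overlap_len
    for piece in pieces:
        start = 0
        while True:
            chunks.append(piece[start:start + max_len])
            if start + max_len >= len(piece):
                break
            start += step
    return chunks


overlapped = chunk_text("abcdefghij", max_len=4, split_pattern=None, overlap_len=1)
mixed = chunk_text("aaaaaa\n\nbb", max_len=4)
```

In this sketch, overlapped holds three 4-character chunks that each share one character with their neighbor, while mixed shows a forced cut inside the first paragraph and no overlap across the pattern boundary.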

Type 算子类型: mapper

Tags 标签: cpu, api, text

🔧 Parameter Configuration 参数配置

| name 参数名 | type 类型 | default 默认值 | desc 说明 |
|---|---|---|---|
| max_len | typing.Optional[typing.Annotated[int, Gt(gt=0)]] | None | If not None, split the text into chunks of at most this length. |
| split_pattern | typing.Optional[str] | '\n\n' | If not None, split the text at this pattern, and force a cut when the length exceeds max_len. |
| overlap_len | typing.Annotated[int, Ge(ge=0)] | 0 | Overlap length between consecutive chunks when the split does not occur at the split pattern. |
| tokenizer | typing.Optional[str] | None | The tokenizer name. If provided, the text length is calculated as the number of tokens; otherwise, it equals the string length. Supports tiktoken tokenizers (such as gpt-4o), DashScope tokenizers (such as qwen2.5-72b-instruct), and Hugging Face tokenizers. |
| trust_remote_code | bool | False | Whether to trust the remote code of HF models. |
| args | | '' | extra args |
| kwargs | | '' | extra keyword args |
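
To make the effect of the tokenizer parameter concrete, here is a minimal sketch of measuring length in tokens versus characters. The helpers text_length and whitespace_tokenize are hypothetical, and the whitespace tokenizer is only a stand-in for illustration; the operator itself loads a real tiktoken, DashScope, or Hugging Face tokenizer by name.

```python
from typing import Callable, List, Optional


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def text_length(text: str,
                tokenize: Optional[Callable[[str], List[str]]] = None) -> int:
    """Return the token count if a tokenizer is given, else the string length."""
    if tokenize is None:
        return len(text)
    return len(tokenize(text))


sentence = "Today is Sunday and it's a happy day!"
char_len = text_length(sentence)                        # 37 characters
token_len = text_length(sentence, whitespace_tokenize)  # 8 stand-in tokens
```

With max_len=20, this sentence would need a forced cut when measured in characters but not when measured with the stand-in tokenizer, so the same max_len can produce very different chunks depending on whether a tokenizer is set.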

📊 Effect demonstration 效果演示

test_naive_text_chunk

TextChunkMapper(split_pattern='\n')

📥 input data 输入数据

Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Sur la plateforme MT4, plusieurs manières d'accéder à 
ces fonctionnalités sont conçues simultanément.
Sample 3: text
欢迎来到阿里巴巴!

📤 output data 输出数据

Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Sur la plateforme MT4, plusieurs manières d'accéder à 
Sample 3: text
ces fonctionnalités sont conçues simultanément.
Sample 4: text
欢迎来到阿里巴巴!

✨ explanation 解释

This example shows how the operator splits the input text into chunks based on a specified split pattern. Here, the split pattern is '\n', which means the text will be split at each newline character. In this case, only the second sample contains a newline, so it is split into two parts. The other samples do not contain newlines and remain unchanged. 这个例子展示了算子如何根据指定的分割模式将输入文本分割成多个块。这里,分割模式是'\n',意味着文本会在每个换行符处被分割。在这种情况下,只有第二个样本包含换行符,因此它被分成两部分。其他样本不包含换行符,所以保持不变。
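
The behavior in this example can be approximated with re.split from the standard library. This is a sketch under assumptions, not the operator's code: split_samples is a hypothetical helper, and chunk_counts only imitates the per-sample 'chunk_count' stat.

```python
import re
from typing import List, Tuple

samples = [
    "Today is Sunday and it's a happy day!",
    "Sur la plateforme MT4, plusieurs manières d'accéder à \n"
    "ces fonctionnalités sont conçues simultanément.",
    "欢迎来到阿里巴巴!",
]


def split_samples(texts: List[str], split_pattern: str) -> Tuple[List[str], List[int]]:
    """Split each text at the pattern; return all chunks and per-sample counts."""
    chunks, counts = [], []
    for text in texts:
        parts = [p for p in re.split(split_pattern, text) if p]
        chunks.extend(parts)        # each chunk becomes its own output sample
        counts.append(len(parts))   # imitates the 'chunk_count' stat
    return chunks, counts


chunks, chunk_counts = split_samples(samples, "\n")
```

Three input samples become four output samples, because only the second one contains a newline.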

test_max_len_text_chunk

TextChunkMapper(max_len=20, split_pattern=None)

📥 input data 输入数据

Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Sur la plateforme MT4, plusieurs manières d'accéder à ces fonctionnalités sont conçues simultanément.
Sample 3: text
欢迎来到阿里巴巴!

📤 output data 输出数据

Sample 1: text
Today is Sunday and 
Sample 2: text
it's a happy day!
Sample 3: text
Sur la plateforme MT
Sample 4: text
4, plusieurs manière
Sample 5: text
s d'accéder à ces fo
Sample 6: text
nctionnalités sont c
Sample 7: text
onçues simultanément
Sample 8: text
.
Sample 9: text
欢迎来到阿里巴巴!

✨ explanation 解释

This example shows the operator splitting the input text into chunks of at most 20 characters. Because split_pattern is None, cuts fall strictly every 20 characters, so words can be broken in the middle (for example, "manière" and "s" end up in output Samples 4 and 5). The first input sample becomes two chunks, the second becomes six, and the third, at only 9 characters, is left unchanged. 这个例子展示了算子如何将输入文本分割成最大长度为20个字符的多个块。由于split_pattern为None,算子严格按每20个字符切分,因此单词可能在中间被截断(例如“manière”和“s”分别落在输出样本4和5中)。第一个输入样本被分成两个块,第二个被分成六个块,第三个只有9个字符,因此保持不变。
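
A fixed-width slice reproduces the cuts shown in this example. The sketch below assumes character-based length and no overlap; fixed_width_chunks is a hypothetical helper, not part of the operator.

```python
from typing import List


def fixed_width_chunks(text: str, max_len: int) -> List[str]:
    """Cut text every max_len characters, regardless of word boundaries."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


english = fixed_width_chunks("Today is Sunday and it's a happy day!", 20)
# english == ["Today is Sunday and ", "it's a happy day!"]

french_text = ("Sur la plateforme MT4, plusieurs manières d'accéder à "
               "ces fonctionnalités sont conçues simultanément.")
french = fixed_width_chunks(french_text, 20)

chinese = fixed_width_chunks("欢迎来到阿里巴巴!", 20)  # shorter than 20, one chunk
```

Joining the chunks of any sample gives back the original text, which confirms that nothing is dropped or moved to keep words intact.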