data_juicer.ops.mapper.text_chunk_mapper module

class data_juicer.ops.mapper.text_chunk_mapper.TextChunkMapper(max_len: Annotated[int, Gt(gt=0)] | None = None, split_pattern: str | None = '\\n\\n', overlap_len: Annotated[int, Ge(ge=0)] = 0, tokenizer: str | None = None, trust_remote_code: bool = False, *args, **kwargs)[source]

Bases: Mapper

Split input text into chunks based on specified criteria.

  • Splits the input text into multiple chunks using a specified maximum length and a split pattern.

  • If max_len is provided, the text is split into chunks with a maximum length of max_len.

  • If split_pattern is provided, the text is split at occurrences of the pattern; if a resulting piece still exceeds max_len, a cut is forced.

  • The overlap_len parameter specifies the overlap length between consecutive chunks if the split does not occur at the pattern.

  • Uses a Hugging Face tokenizer to calculate the text length in tokens if a tokenizer name is provided; otherwise, it uses the string length.

  • Caches the following stats: 'chunk_count' (number of chunks generated for each sample).

  • Raises a ValueError if both max_len and split_pattern are None or if overlap_len is greater than or equal to max_len.
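The rules above can be sketched in plain Python. This is an illustrative, character-based re-implementation (no tokenizer), not the operator's exact algorithm; the `chunk_text` helper and its cut-placement heuristic are assumptions for demonstration:

```python
import re


def chunk_text(text, max_len=None, split_pattern="\n\n", overlap_len=0):
    """Hypothetical sketch of the documented chunking rules,
    measuring length in characters rather than tokens."""
    # Validation mirrors the documented ValueError conditions.
    if max_len is None and split_pattern is None:
        raise ValueError("max_len and split_pattern cannot both be None")
    if max_len is not None and overlap_len >= max_len:
        raise ValueError("overlap_len must be less than max_len")
    # Pattern-only mode: cut at every occurrence of the pattern.
    if max_len is None:
        return re.split(split_pattern, text)
    chunks = []
    rest = text
    while len(rest) > max_len:
        window = rest[:max_len]
        matches = list(re.finditer(split_pattern, window)) if split_pattern else []
        if matches:
            # Prefer to cut at the last pattern occurrence inside the window.
            chunks.append(rest[:matches[-1].start()])
            rest = rest[matches[-1].end():]
        else:
            # No pattern in range: force a cut, keeping overlap_len
            # trailing characters as context for the next chunk.
            chunks.append(window)
            rest = rest[max_len - overlap_len:]
    chunks.append(rest)
    return chunks
```

For example, `chunk_text("abcdef", max_len=4, split_pattern=None, overlap_len=1)` force-cuts with a one-character overlap, while a text containing `"\n\n"` is cut at the pattern whenever one falls within the window.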

__init__(max_len: Annotated[int, Gt(gt=0)] | None = None, split_pattern: str | None = '\\n\\n', overlap_len: Annotated[int, Ge(ge=0)] = 0, tokenizer: str | None = None, trust_remote_code: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • max_len -- If not None, split the text into chunks of at most this length.

  • split_pattern -- If not None, split at occurrences of this pattern, forcing a cut when a piece's length still exceeds max_len.

  • overlap_len -- Overlap length between consecutive chunks when a chunk is not split at the pattern.

  • tokenizer -- The name of a tokenizer. If provided, text length is measured as the number of tokens; otherwise it equals the string length. Supports tiktoken tokenizers (such as gpt-4o), dashscope tokenizers (such as qwen2.5-72b-instruct), and Hugging Face tokenizers.

  • trust_remote_code -- Whether to trust remote code from Hugging Face models.

  • args -- extra positional args

  • kwargs -- extra keyword args

recursively_chunk(text)[source]
get_text_chunks(text, rank=None)[source]
process_batched(samples, rank=None)[source]
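A batched mapper of this kind typically receives a dict of columns (equal-length lists) and expands each input text into one output sample per chunk. The following sketch shows one plausible shape of that contract; the `process_batched` signature here takes an explicit `chunk_fn` and is a hypothetical stand-in, not the library's actual method body:

```python
def process_batched(samples, chunk_fn):
    """Hypothetical sketch: expand each text into one output sample
    per chunk, duplicating all other columns alongside it."""
    out = {key: [] for key in samples}
    for i, text in enumerate(samples["text"]):
        for chunk in chunk_fn(text):
            out["text"].append(chunk)
            # Non-text columns are repeated for every chunk of sample i.
            for key in samples:
                if key != "text":
                    out[key].append(samples[key][i])
    return out
```

So a batch `{"text": ["a\n\nb"], "id": [7]}` split on `"\n\n"` would expand into two samples sharing `id` 7.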