data_juicer.ops.filter.text_length_filter module

class data_juicer.ops.filter.text_length_filter.TextLengthFilter(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)

Bases: Filter

Filter to keep samples with total text length within a specific range.

This operator keeps samples whose total text length lies between the specified minimum and maximum lengths (inclusive) and filters out the rest. The text length is the number of characters in the sample’s text. If the ‘text_len’ key is already present in the sample’s stats, that value is reused; otherwise, it is computed. The operator processes samples in batches for efficiency.
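The stat-reuse behaviour described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the function name `compute_text_len_batched`, the column-oriented batch layout (a dict of lists), and the `"text"` and `"stats"` keys are assumptions made for the example; only the ‘text_len’ stat name comes from the description.

```python
from typing import Any, Dict, List


def compute_text_len_batched(
    samples: Dict[str, List[Any]],
    text_key: str = "text",    # assumed name of the text column
    stats_key: str = "stats",  # assumed name of the per-sample stats column
) -> Dict[str, List[Any]]:
    """Fill in 'text_len' for every sample in a column-oriented batch."""
    for text, stats in zip(samples[text_key], samples[stats_key]):
        # Reuse the stat if an earlier step already stored it.
        if "text_len" not in stats:
            # Text length is the number of characters in the text.
            stats["text_len"] = len(text)
    return samples


batch = {
    "text": ["hello world", "hi"],
    "stats": [{}, {"text_len": 2}],  # second sample already has the stat
}
compute_text_len_batched(batch)
```

After the call, the first sample's stats hold a freshly computed ‘text_len’, while the second sample's existing entry is left untouched.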

__init__(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)

Initialization method.

Parameters:
  • min_len – The minimum text length. Samples whose text length is below this value are filtered out.

  • max_len – The maximum text length. Samples whose text length exceeds this value are filtered out. The default, 9223372036854775807 (2^63 − 1), effectively leaves the length unbounded above.

  • args – extra positional args

  • kwargs – extra keyword args
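How the two bounds interact can be shown with a small standalone predicate. This is a hedged sketch rather than the library's code: `keep_by_length` is a hypothetical helper, and the inclusive comparison is inferred from the parameter descriptions (samples are dropped only when the length is below min_len or exceeds max_len). `sys.maxsize` stands in for the default upper bound; it equals 9223372036854775807 on 64-bit CPython builds.

```python
import sys
from typing import List


def keep_by_length(
    text_lens: List[int],
    min_len: int = 10,
    max_len: int = sys.maxsize,
) -> List[bool]:
    """Return one keep/drop flag per text length (True means keep)."""
    # Both bounds are inclusive: a length equal to min_len or max_len is kept.
    return [min_len <= n <= max_len for n in text_lens]


# Lengths on either side of each bound.
mask = keep_by_length([9, 10, 50, 51], min_len=10, max_len=50)
# mask == [False, True, True, False]
```

With the defaults, only the lower bound has a practical effect, so any text of at least 10 characters is kept.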

compute_stats_batched(samples)
process_batched(samples)