data_juicer.ops.filter.text_length_filter module

class data_juicer.ops.filter.text_length_filter.TextLengthFilter(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with total text length within a specific range.

__init__(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_len – The minimum text length. Samples are filtered out if their text length is below this value.

  • max_len – The maximum text length. Samples are filtered out if their text length exceeds this value. The default, 9223372036854775807 (2^63 - 1), effectively imposes no upper bound.

  • args – extra positional arguments

  • kwargs – extra keyword arguments
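
The parameter descriptions imply an inclusive range check: a sample is kept when min_len <= len(text) <= max_len. The following is a minimal standalone sketch of that rule, not the library's implementation. The helper name keep_by_text_length is hypothetical, and sys.maxsize stands in for the default max_len (it equals 9223372036854775807 on 64-bit builds).

```python
import sys


def keep_by_text_length(text: str, min_len: int = 10,
                        max_len: int = sys.maxsize) -> bool:
    """Hypothetical helper: True if the text length is within [min_len, max_len]."""
    # Below min_len or above max_len means the sample is filtered out.
    return min_len <= len(text) <= max_len


samples = ["short", "exactly 10", "a considerably longer piece of text"]

# Keep only texts whose length is between 10 and 20 characters, inclusive.
kept = [s for s in samples if keep_by_text_length(s, min_len=10, max_len=20)]
```

With these bounds, a text of exactly 10 characters is kept, because only lengths strictly below min_len or strictly above max_len are filtered.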

compute_stats_batched(samples)[source]

process_batched(samples)[source]
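
The two method names suggest a two-step flow: first compute a length statistic for each sample in a batch, then decide per sample whether to keep it. The sketch below illustrates that pattern with a stand-in class and is not the Data-Juicer implementation. The batch layout (a dict of columns) and the "text", "stats", and "text_len" keys are assumptions made for illustration.

```python
import sys


class LengthFilterSketch:
    """Hypothetical stand-in, not the Data-Juicer class."""

    def __init__(self, min_len: int = 10, max_len: int = sys.maxsize):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats_batched(self, samples: dict) -> dict:
        # Record each text's length, keeping any value that is already stored.
        # The "text", "stats", and "text_len" keys are assumed names.
        for text, stat in zip(samples["text"], samples["stats"]):
            stat.setdefault("text_len", len(text))
        return samples

    def process_batched(self, samples: dict) -> list:
        # One keep/drop flag per sample, based only on the stored statistic.
        return [self.min_len <= stat["text_len"] <= self.max_len
                for stat in samples["stats"]]


batch = {"text": ["too short", "long enough to keep"], "stats": [{}, {}]}
op = LengthFilterSketch(min_len=10, max_len=50)
keep_flags = op.process_batched(op.compute_stats_batched(batch))
```

Separating the statistic from the decision means the lengths can be computed once and reused if the thresholds change later.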