data_juicer.ops.filter.character_repetition_filter module

class data_juicer.ops.filter.character_repetition_filter.CharacterRepetitionFilter(rep_len: Annotated[int, Gt(gt=0)] = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)

Bases: Filter

Filter to keep samples with character-level n-gram repetition ratio within a specific range.

This operator computes the character-level n-gram repetition ratio of each sample and drops samples whose ratio falls outside the configured range. The ratio is derived from the frequencies of the character n-grams in the text. The metric is cached in the stats field under the key ‘char_rep_ratio’. A sample is kept when its ‘char_rep_ratio’ is at least min_ratio and at most max_ratio. The n-gram length (rep_len) and both bounds are configurable.
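To make the idea of a character-level n-gram repetition ratio concrete, here is a minimal sketch. It assumes one simple definition: the share of n-gram occurrences that belong to an n-gram seen more than once. The operator's actual formula may differ, and the function name char_rep_ratio_sketch is hypothetical.

```python
from collections import Counter


def char_rep_ratio_sketch(text: str, rep_len: int = 10) -> float:
    """Illustrative ratio only; not necessarily the operator's exact formula."""
    if rep_len <= 0:
        raise ValueError("rep_len must be greater than 0")
    # Slide a window of rep_len characters across the text.
    ngrams = [text[i:i + rep_len] for i in range(len(text) - rep_len + 1)]
    if not ngrams:
        # Text shorter than rep_len has no n-grams, so nothing can repeat.
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that appears at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

Under this assumed definition, a text with no repeated n-grams scores 0.0 and a text made of a single repeated character scores 1.0.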

__init__(rep_len: Annotated[int, Gt(gt=0)] = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)

Initialization method.

Parameters:
  • rep_len – Length, in characters, of the char-level n-grams. Must be greater than 0.

  • min_ratio – The minimum ratio accepted by this op. Samples are filtered out if their char-level n-gram repetition ratio is below this value.

  • max_ratio – The maximum ratio accepted by this op. Samples are filtered out if their char-level n-gram repetition ratio exceeds this value.

  • args – extra positional args

  • kwargs – extra keyword args
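The parameter descriptions imply that both bounds are inclusive: a sample is dropped only when its ratio is below min_ratio or above max_ratio. A minimal sketch of that keep decision follows. The helper name keep_sample is hypothetical, and the defaults are copied from the signature above.

```python
def keep_sample(ratio: float, min_ratio: float = 0.0, max_ratio: float = 0.5) -> bool:
    """Return True when the ratio lies inside the inclusive [min_ratio, max_ratio] range."""
    # Dropped if ratio < min_ratio or ratio > max_ratio; kept otherwise.
    return min_ratio <= ratio <= max_ratio
```

With the defaults, a sample whose ratio is exactly 0.5 is kept, while one at 0.51 is filtered out.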

compute_stats_batched(samples)
process_batched(samples)
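The two method names suggest a two-step flow: first compute and cache ‘char_rep_ratio’ for a batch, then decide which samples to keep. The sketch below illustrates that flow under stated assumptions. The batch layout (a dict of lists with "text" and "stats" fields), the function names, and the ratio definition are all hypothetical and are not taken from the library.

```python
from collections import Counter

TEXT_KEY = "text"    # hypothetical field name for the sample text
STATS_KEY = "stats"  # hypothetical field name for the per-sample stats dict
RATIO_KEY = "char_rep_ratio"


def _ratio(text, rep_len):
    # Assumed definition: share of n-gram occurrences whose n-gram repeats.
    ngrams = [text[i:i + rep_len] for i in range(len(text) - rep_len + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(c for c in counts.values() if c > 1) / len(ngrams)


def compute_stats_sketch(samples, rep_len=10):
    """Fill in the ratio for each sample, reusing any cached value."""
    for text, stats in zip(samples[TEXT_KEY], samples[STATS_KEY]):
        if RATIO_KEY not in stats:
            stats[RATIO_KEY] = _ratio(text, rep_len)
    return samples


def process_sketch(samples, min_ratio=0.0, max_ratio=0.5):
    """Return one keep/drop boolean per sample in the batch."""
    return [min_ratio <= s[RATIO_KEY] <= max_ratio for s in samples[STATS_KEY]]


batch = {TEXT_KEY: ["abcdefgh", "aaaaaaaa"], STATS_KEY: [{}, {}]}
batch = compute_stats_sketch(batch, rep_len=2)
keep_flags = process_sketch(batch)
```

Separating the two steps means the ratio is computed once and can be reused if the range is tightened or loosened later.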