data_juicer.ops.filter.special_characters_filter module

class data_juicer.ops.filter.special_characters_filter.SpecialCharactersFilter(min_ratio: float = 0.0, max_ratio: float = 0.25, *args, **kwargs)[source]

Bases: Filter

Filter that keeps samples whose special-character ratio falls within a specified range.

__init__(min_ratio: float = 0.0, max_ratio: float = 0.25, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_ratio – The minimum special-char ratio for this op. Samples whose special-char ratio is below this value are filtered out.

  • max_ratio – The maximum special-char ratio for this op. Samples whose special-char ratio exceeds this value are filtered out.

  • args – extra positional args

  • kwargs – extra keyword args
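
The range check described by min_ratio and max_ratio can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: SPECIAL_CHARS, special_char_ratio and keep_sample are hypothetical names, and the character set used here (ASCII punctuation, digits and whitespace) is an assumption that may differ from the set the library uses.

```python
import string

# Hypothetical stand-in for the library's special-character set (an assumption).
SPECIAL_CHARS = set(string.punctuation + string.digits + string.whitespace)


def special_char_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are in SPECIAL_CHARS."""
    if not text:
        # Treat empty text as having a ratio of 0.0 to avoid division by zero.
        return 0.0
    return sum(ch in SPECIAL_CHARS for ch in text) / len(text)


def keep_sample(text: str, min_ratio: float = 0.0, max_ratio: float = 0.25) -> bool:
    """Keep the sample only if its ratio lies within [min_ratio, max_ratio]."""
    return min_ratio <= special_char_ratio(text) <= max_ratio


samples = ["Hello world", "$$$ !!! ###", ""]
kept = [s for s in samples if keep_sample(s)]
```

With the default bounds, "Hello world" has a ratio of 1/11 and is kept, while "$$$ !!! ###" has a ratio of 1.0 and is dropped. Raising min_ratio above 0.0 would also drop text that contains no special characters at all.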

compute_stats_batched(samples)[source]

process_batched(samples)[source]