data_juicer.ops.filter.alphanumeric_filter module

class data_juicer.ops.filter.alphanumeric_filter.AlphanumericFilter(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose alphanumeric ratio falls within a specified range.

__init__(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – Whether to compute the ratio of alphanumeric content to the total number of tokens. If tokenization=False, the ratio is computed against the total number of characters instead.

  • min_ratio – The minimum alphanumeric ratio. Samples are filtered out if their alphanumeric ratio is below this value.

  • max_ratio – The maximum alphanumeric ratio. Samples are filtered out if their alphanumeric ratio exceeds this value. The default is large enough that no upper bound applies in practice.

  • args – extra positional args

  • kwargs – extra keyword args
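
The character-level case (tokenization=False) described above can be illustrated with a minimal standalone sketch. It is not the library's implementation: the helper names alnum_ratio and keep_sample are hypothetical, the use of str.isalnum() and the 0.0 ratio for empty text are assumptions, and the token-level case is not covered.

```python
def alnum_ratio(text: str) -> float:
    """Fraction of characters in `text` that are alphanumeric.

    Hypothetical helper for illustration only.
    """
    if not text:
        return 0.0  # assumption: empty text is given a ratio of 0.0
    alnum_count = sum(1 for ch in text if ch.isalnum())
    return alnum_count / len(text)


def keep_sample(text: str,
                min_ratio: float = 0.25,
                max_ratio: float = float("inf")) -> bool:
    """Keep a sample unless its ratio is below min_ratio or exceeds max_ratio."""
    return min_ratio <= alnum_ratio(text) <= max_ratio


print(alnum_ratio("abc!"))   # 3 of 4 characters are alphanumeric: 0.75
print(keep_sample("abc!"))   # True
print(keep_sample("!!!?"))   # False, the ratio 0.0 is below min_ratio
```

Both bounds are treated as inclusive here, which matches the parameter descriptions: a sample is dropped only when its ratio is below min_ratio or exceeds max_ratio.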

compute_stats_batched(samples)[source]
process_batched(samples)[source]