data_juicer.ops.filter.alphanumeric_filter module

class data_juicer.ops.filter.alphanumeric_filter.AlphanumericFilter(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)

Bases: Filter

Filter to keep samples whose alphanumeric ratio is within a specific range.

This operator filters samples based on the ratio of alphanumeric characters or tokens. It keeps a sample when the ratio of alphanumeric characters (or tokens) to the total number of characters (or tokens) falls within the specified range. The ratio is computed over characters or over tokens, depending on the tokenization parameter. If tokenization is True, a Hugging Face tokenizer is used to count tokens. The metric used for filtering is ‘alpha_token_ratio’ if tokenization is enabled, otherwise ‘alnum_ratio’. The operator caches the metric in the stats field of each sample.
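The character-based ratio described above can be sketched in a few lines. This is a minimal illustration, not the operator's own code: the helper name alnum_ratio is hypothetical, Python's str.isalnum is assumed as the definition of "alphanumeric", and scoring empty text as 0.0 is an assumption.

```python
def alnum_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are alphanumeric.

    Hypothetical helper for illustration; not part of data_juicer.
    """
    if not text:
        # Assumption: empty text gets a ratio of 0.0 to avoid dividing by zero.
        return 0.0
    # Count characters for which str.isalnum() is true (letters and digits).
    alnum_count = sum(1 for ch in text if ch.isalnum())
    return alnum_count / len(text)


# Whitespace and punctuation count toward the total but not the numerator.
ratio_clean = alnum_ratio("abc123")
ratio_noisy = alnum_ratio("ab!!")
```

Text dominated by symbols, markup, or whitespace yields a low ratio under this definition, which is the kind of sample a lower bound such as min_ratio is meant to catch.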

__init__(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)

Initialization method.

Parameters:
  • tokenization – Whether to compute the ratio of alphanumeric content over the total number of tokens. If tokenization=False, the ratio is computed over the total number of characters instead.

  • min_ratio – The minimum ratio to keep a sample. Samples are filtered out if their alphanumeric ratio is below this value.

  • max_ratio – The maximum ratio to keep a sample. Samples are filtered out if their alphanumeric ratio exceeds this value.

  • args – extra positional args

  • kwargs – extra keyword args
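How min_ratio and max_ratio bound the kept samples can be sketched as a simple range check. This is a sketch under stated assumptions: keep_sample is a hypothetical helper, the defaults are copied from the signature above, and treating both bounds as inclusive is an assumption that follows the "below" / "exceeds" wording of the parameter descriptions.

```python
DEFAULT_MIN_RATIO = 0.25
DEFAULT_MAX_RATIO = 9223372036854775807  # default from the documented signature


def keep_sample(ratio: float,
                min_ratio: float = DEFAULT_MIN_RATIO,
                max_ratio: float = DEFAULT_MAX_RATIO) -> bool:
    """Return True if `ratio` lies within [min_ratio, max_ratio].

    Hypothetical helper for illustration; inclusive bounds are an assumption.
    """
    return min_ratio <= ratio <= max_ratio


# Example: keep only the ratios that fall inside a custom range.
ratios = [0.10, 0.25, 0.80, 0.95]
kept = [r for r in ratios if keep_sample(r, min_ratio=0.25, max_ratio=0.9)]
```

With the default max_ratio, the upper bound is effectively disabled, so only min_ratio removes samples unless a tighter maximum is passed.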

compute_stats_batched(samples)
process_batched(samples)