data_juicer.ops.filter.alphanumeric_filter module

class data_juicer.ops.filter.alphanumeric_filter.AlphanumericFilter(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter that keeps samples whose alphanumeric ratio falls within a specified range.

__init__(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization -- Whether to compute the ratio of alphanumeric characters to the total number of tokens. If tokenization=False, the ratio is computed against the total number of characters instead.

  • min_ratio -- The minimum alphanumeric ratio. Samples whose ratio is below this value are filtered out.

  • max_ratio -- The maximum alphanumeric ratio. Samples whose ratio exceeds this value are filtered out. The default, 9223372036854775807, is large enough to act as no upper limit.

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.
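
The parameters above can be illustrated with a minimal standalone sketch. It does not import data_juicer and is not the library's implementation: `alnum_ratio` and `keep_sample` are hypothetical helper names, and whitespace splitting is only an assumed stand-in for whatever tokenizer the operator uses when tokenization=True.

```python
def alnum_ratio(text: str, tokenization: bool = False) -> float:
    """Hypothetical helper: alphanumeric ratio of `text`."""
    # Count characters that are letters or digits.
    alnum_count = sum(1 for ch in text if ch.isalnum())
    if tokenization:
        # Assumption: whitespace splitting stands in for the real tokenizer.
        total = len(text.split())
    else:
        total = len(text)
    # Treat empty input as ratio 0.0 to avoid division by zero.
    return alnum_count / total if total else 0.0


def keep_sample(
    text: str,
    min_ratio: float = 0.25,
    max_ratio: float = float("inf"),
    tokenization: bool = False,
) -> bool:
    """Hypothetical helper: True if the sample would be kept."""
    # A sample is dropped when its ratio is below min_ratio or above max_ratio.
    return min_ratio <= alnum_ratio(text, tokenization) <= max_ratio


if __name__ == "__main__":
    print(keep_sample("abc 123"))    # mostly alphanumeric characters
    print(keep_sample("!!!! ????"))  # no alphanumeric characters
```

Under this sketch, the token-based ratio divides a character count by a token count, so it can exceed 1. That is one reason an upper bound such as max_ratio can be meaningful when tokenization=True.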

compute_stats_batched(samples)[source]
process_batched(samples)[source]