data_juicer.ops.filter.alphanumeric_filter module
- class data_juicer.ops.filter.alphanumeric_filter.AlphanumericFilter(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)
Bases: Filter
Filter that keeps samples whose alphanumeric ratio falls within a specified range.
- __init__(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)
Initialization method.
- Parameters:
tokenization – Whether to compute the ratio of alphanumeric characters to the total number of tokens. If tokenization=False, the ratio of alphanumeric characters to the total number of characters is computed instead.
min_ratio – The minimum alphanumeric ratio. Samples whose ratio is below this value are filtered out.
max_ratio – The maximum alphanumeric ratio. Samples whose ratio exceeds this value are filtered out.
args – extra positional arguments
kwargs – extra keyword arguments
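The character-level check described above (tokenization=False) can be illustrated with a small standalone sketch. This is not the library's implementation: the helper names alnum_ratio and keep_sample are hypothetical, the use of str.isalnum to decide what counts as alphanumeric is an assumption, and so is the choice to treat empty text as having a ratio of 0. The defaults mirror the documented signature, where 9223372036854775807 equals sys.maxsize on a 64-bit platform.

```python
import sys


def alnum_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are alphanumeric."""
    if not text:
        # Assumption: an empty sample is treated as having ratio 0.
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_sample(text: str,
                min_ratio: float = 0.25,
                max_ratio: float = sys.maxsize) -> bool:
    """Keep the sample only if its ratio lies in [min_ratio, max_ratio]."""
    return min_ratio <= alnum_ratio(text) <= max_ratio


# 13 of the 15 characters are letters or digits, so the sample is kept.
print(keep_sample("hello world 123"))
# No alphanumeric characters at all, so the ratio is below min_ratio.
print(keep_sample("!!! ??? ..."))
```

Both bounds are treated as inclusive here, since the parameter descriptions only filter samples that fall below min_ratio or exceed max_ratio. The token-based mode is not sketched because it depends on a tokenizer that this page does not describe.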