data_juicer.ops.filter.special_characters_filter module
- class data_juicer.ops.filter.special_characters_filter.SpecialCharactersFilter(min_ratio: float = 0.0, max_ratio: float = 0.25, *args, **kwargs)
Bases: Filter
Filter to keep samples with special-character ratio within a specific range.
This operator computes the special-character ratio of each sample's text, defined as the number of special characters divided by the total number of characters, and keeps only samples whose ratio lies within the specified minimum and maximum thresholds. If ‘special_char_ratio’ is already cached in the sample's stats, the cached value is reused. Otherwise, the ratio is computed and stored in the ‘special_char_ratio’ field.
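The ratio computation and stats caching described above can be sketched roughly as follows. This is an illustrative sketch, not the operator's actual implementation: the `SPECIAL_CHARS` set (ASCII punctuation only), the helper names `special_char_ratio` and `compute_stats`, the `"text"` and `"stats"` sample keys, and the handling of empty text are all assumptions made for the example.

```python
import string

# Assumption: ASCII punctuation stands in for the operator's real
# special-character set, which this page does not list.
SPECIAL_CHARS = set(string.punctuation)


def special_char_ratio(text: str) -> float:
    """Number of special characters divided by total characters."""
    if not text:
        # Assumption: treat empty text as ratio 0.0 to avoid dividing by zero.
        return 0.0
    special = sum(1 for ch in text if ch in SPECIAL_CHARS)
    return special / len(text)


def compute_stats(sample: dict) -> dict:
    """Reuse a cached 'special_char_ratio' if present, else compute and store it."""
    stats = sample.setdefault("stats", {})
    if "special_char_ratio" not in stats:
        stats["special_char_ratio"] = special_char_ratio(sample["text"])
    return sample
```

For example, under these assumptions `special_char_ratio("ab!!")` counts two special characters out of four, and a sample that already carries a cached ratio is returned unchanged.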
- __init__(min_ratio: float = 0.0, max_ratio: float = 0.25, *args, **kwargs)
Initialization method.
- Parameters:
min_ratio – The minimum special-character ratio (default 0.0). Samples whose ratio is below this value are filtered out.
max_ratio – The maximum special-character ratio (default 0.25). Samples whose ratio exceeds this value are filtered out.
args – extra positional args
kwargs – extra keyword args
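The keep/drop decision implied by the two thresholds can be sketched as below. The function name `keep_sample` is hypothetical and is not part of the data_juicer API; the inclusive bounds follow the parameter descriptions, which filter a sample only when its ratio is below min_ratio or exceeds max_ratio.

```python
def keep_sample(ratio: float, min_ratio: float = 0.0, max_ratio: float = 0.25) -> bool:
    """Return True when the ratio lies within [min_ratio, max_ratio].

    Hypothetical helper for illustration; defaults mirror the
    constructor signature documented on this page.
    """
    return min_ratio <= ratio <= max_ratio


# With the default thresholds, a ratio of 0.10 is kept and 0.30 is dropped.
decisions = [keep_sample(r) for r in (0.10, 0.25, 0.30)]
```

Raising min_ratio above 0.0 would additionally drop samples with very few special characters, for instance `keep_sample(0.05, min_ratio=0.1)` evaluates to False.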