data_juicer.ops.filter.character_repetition_filter module
- class data_juicer.ops.filter.character_repetition_filter.CharacterRepetitionFilter(rep_len: Annotated[int, Gt(gt=0)] = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)
Bases: Filter
Filter that keeps samples whose char-level n-gram repetition ratio falls within a specified range.
- __init__(rep_len: Annotated[int, Gt(gt=0)] = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)
Initialization method.
- Parameters:
rep_len – Length of the char-level n-grams used to measure repetition. Must be a positive integer.
min_ratio – The minimum ratio allowed by this op. Samples are filtered out if their char-level n-gram repetition ratio is below this value.
max_ratio – The maximum ratio allowed by this op. Samples are filtered out if their char-level n-gram repetition ratio exceeds this value.
args – extra positional arguments
kwargs – extra keyword arguments
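The parameters above can be illustrated with a minimal, standalone sketch. It does not call data_juicer and is not the library's implementation: the helper names `char_repetition_ratio` and `keep_sample` are hypothetical, and the ratio used here (the share of n-gram occurrences that belong to an n-gram seen more than once) is a simplified assumption that may differ from the statistic the op actually computes. Only the keep rule, ratio between `min_ratio` and `max_ratio`, follows the parameter descriptions.

```python
from collections import Counter


def char_repetition_ratio(text: str, rep_len: int = 10) -> float:
    """Simplified repetition ratio (an assumption, not data_juicer's formula):
    the fraction of char-level n-gram occurrences whose n-gram appears
    more than once in the text."""
    if rep_len <= 0:
        raise ValueError("rep_len must be a positive integer")
    # Slide a window of rep_len characters over the text.
    ngrams = [text[i:i + rep_len] for i in range(len(text) - rep_len + 1)]
    if not ngrams:
        # Text shorter than rep_len has no n-grams, so nothing repeats.
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def keep_sample(text: str, rep_len: int = 10,
                min_ratio: float = 0.0, max_ratio: float = 0.5) -> bool:
    """Keep the sample unless its ratio is below min_ratio or exceeds max_ratio."""
    ratio = char_repetition_ratio(text, rep_len)
    return min_ratio <= ratio <= max_ratio


if __name__ == "__main__":
    print(keep_sample("abababababababab", rep_len=2))  # highly repetitive text
    print(keep_sample("the quick brown fox", rep_len=2))  # mostly unique bigrams
```

With the defaults shown in the signature (`rep_len=10`, `min_ratio=0.0`, `max_ratio=0.5`), only the upper bound has any effect, since a ratio cannot fall below zero.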