data_juicer.ops.filter.token_num_filter module
- class data_juicer.ops.filter.token_num_filter.TokenNumFilter(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)
Bases: Filter
Filter that keeps samples whose total token count falls within a specified range.
- __init__(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)
Initialization method.
- Parameters:
hf_tokenizer – The name of the Hugging Face tokenizer used to count tokens.
min_num – The minimum token count. Samples with fewer tokens than this value are filtered out.
max_num – The maximum token count. Samples with more tokens than this value are filtered out. The default, 9223372036854775807 (2^63 - 1), effectively disables the upper bound.
args – Extra positional arguments.
kwargs – Extra keyword arguments.
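The range check described by min_num and max_num can be illustrated with a minimal, library-free sketch. It is not Data-Juicer's implementation: the helper `keep_sample` and its `tokenize` argument are hypothetical names introduced here, and whitespace splitting stands in for the Hugging Face tokenizer named by hf_tokenizer. The sketch assumes both bounds are inclusive, which matches the parameter descriptions (a sample is dropped only when its count is below min_num or exceeds max_num).

```python
# Documented default for max_num: 2**63 - 1, i.e. 9223372036854775807.
DEFAULT_MAX_NUM = 2**63 - 1


def keep_sample(text, min_num=10, max_num=DEFAULT_MAX_NUM, tokenize=str.split):
    """Return True if the token count of `text` lies in [min_num, max_num].

    `tokenize` is a stand-in (assumption): the real operator counts tokens
    with a Hugging Face tokenizer, not with whitespace splitting.
    """
    num_tokens = len(tokenize(text))
    return min_num <= num_tokens <= max_num


samples = [
    "too short",                              # 2 tokens
    "this one has four",                      # 4 tokens
    "this sample is longer than the limit",   # 7 tokens
]

# Keep only samples with between 3 and 5 tokens, inclusive.
kept = [s for s in samples if keep_sample(s, min_num=3, max_num=5)]
print(kept)
```

With the default max_num, only the lower bound has any practical effect, so the filter behaves as a minimum-length filter unless max_num is set explicitly.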