data_juicer.ops.filter.token_num_filter module
- class data_juicer.ops.filter.token_num_filter.TokenNumFilter(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)
Bases: Filter
Filter to keep samples with a total token number within a specified range.
This operator uses a Hugging Face tokenizer to count the tokens in each sample and keeps samples whose token count lies between the minimum and maximum thresholds. The count is stored in the ‘num_token’ field of the sample’s stats; if it has not already been computed, it is calculated with the specified tokenizer.
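The compute-once behaviour described above can be illustrated with a minimal sketch. It is not the operator's actual implementation: the `whitespace_tokenize` function is a hypothetical stand-in for a Hugging Face tokenizer, and the `text` and `stats` keys of the sample dictionary are assumed names used only for illustration.

```python
from typing import Callable, Dict, List


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a Hugging Face tokenizer: split on whitespace."""
    return text.split()


def compute_num_token(
    sample: Dict,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Dict:
    """Store the token count under stats['num_token'] unless it already exists."""
    # 'text' and 'stats' are assumed key names for this sketch.
    stats = sample.setdefault("stats", {})
    if "num_token" not in stats:
        stats["num_token"] = len(tokenize(sample["text"]))
    return sample


fresh = compute_num_token({"text": "one two three"})
cached = compute_num_token({"text": "a b", "stats": {"num_token": 99}})
```

In this sketch `fresh` receives a newly computed count, while `cached` keeps the value that was already present, so the tokenizer is not run a second time.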
- __init__(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)
Initialization method.
- Parameters:
hf_tokenizer – The name of the Hugging Face tokenizer to use.
min_num – The minimum token count. Samples are filtered out if their token count is below this value.
max_num – The maximum token count. Samples are filtered out if their token count exceeds this value.
args – Extra positional args.
kwargs – Extra keyword args.
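As a rough illustration of how `min_num` and `max_num` act together, the sketch below applies the two thresholds to a list of precomputed token counts. The `within_range` helper is hypothetical and not part of data_juicer, and it treats both bounds as inclusive, which is how the parameter descriptions read ("below" and "exceeds"). The default `max_num` shown in the signature equals 2**63 - 1, so leaving it unset effectively imposes no upper limit.

```python
# Default upper bound copied from the signature above; it equals 2**63 - 1.
DEFAULT_MAX_NUM = 9223372036854775807


def within_range(num_token: int, min_num: int = 10,
                 max_num: int = DEFAULT_MAX_NUM) -> bool:
    """Hypothetical helper: keep a sample if min_num <= num_token <= max_num."""
    return min_num <= num_token <= max_num


# Token counts assumed to have been computed already.
counts = [5, 10, 512, 2048, 2049]

# With explicit thresholds, only counts inside [10, 2048] survive.
kept = [n for n in counts if within_range(n, min_num=10, max_num=2048)]

# With the defaults, only the lower bound has any practical effect.
kept_default = [n for n in counts if within_range(n)]
```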