data_juicer.ops.filter.token_num_filter module

class data_juicer.ops.filter.token_num_filter.TokenNumFilter(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with a total token number within a specified range.

This operator uses a Hugging Face tokenizer to count the number of tokens in each sample and keeps samples whose token count falls between the minimum and maximum thresholds (inclusive). The token count is stored in the 'num_token' field of the sample's stats; if it has not already been computed, it is calculated with the specified tokenizer.

__init__(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_tokenizer – the name of the Hugging Face tokenizer used for token counting.

  • min_num – the minimum token number; samples with fewer tokens than this are filtered out.

  • max_num – the maximum token number; samples with more tokens than this are filtered out.

  • args – extra positional args

  • kwargs – extra keyword args
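The min/max semantics above can be sketched as follows. This is a minimal, hypothetical stand-in that counts tokens by whitespace splitting instead of loading the EleutherAI/pythia-6.9b-deduped tokenizer; the function name `keep_by_token_num` is illustrative, not part of the data_juicer API:

```python
import sys


def keep_by_token_num(text: str, min_num: int = 10,
                      max_num: int = sys.maxsize) -> bool:
    """Return True if the token count lies within [min_num, max_num].

    Stand-in tokenization: splits on whitespace rather than using a
    Hugging Face tokenizer, so counts will differ from the real op.
    """
    num_token = len(text.split())
    return min_num <= num_token <= max_num


print(keep_by_token_num("one two three", min_num=2, max_num=5))  # True
print(keep_by_token_num("too short", min_num=3))                 # False
```

Note that the default `max_num` of `sys.maxsize` (9223372036854775807) matches the signature above, so by default only the lower bound is effective.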

compute_stats_single(sample, context=False)[source]

Compute the stats for a single sample; these stats are used as the metric to decide whether to filter the sample.

Parameters:
  • sample – input sample.

  • context – whether to temporarily store context information of intermediate variables in the sample.

Returns:

sample with computed stats

process_single(sample)[source]

At the sample level, maps a sample to a Boolean.

Parameters:
  • sample – the sample to decide whether to filter.

Returns:

True to keep the sample, False to filter it out.
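Together, compute_stats_single and process_single form a two-phase pattern: stats are computed once and cached on the sample, then the keep/drop decision reads the cached value. A minimal sketch under stated assumptions: whitespace splitting stands in for the Hugging Face tokenizer, and the stats field name '__dj__stats__' is an assumption about data_juicer's internal layout; the class here is illustrative, not the real implementation:

```python
import sys


class TokenNumFilterSketch:
    """Simplified two-phase filter: compute stats once, then decide."""

    def __init__(self, min_num: int = 10, max_num: int = sys.maxsize):
        self.min_num = min_num
        self.max_num = max_num

    def compute_stats_single(self, sample: dict) -> dict:
        # '__dj__stats__' is an assumed stats field name, for illustration.
        stats = sample.setdefault('__dj__stats__', {})
        if 'num_token' not in stats:
            # Stand-in tokenization: whitespace split, not a HF tokenizer.
            stats['num_token'] = len(sample['text'].split())
        return sample

    def process_single(self, sample: dict) -> bool:
        num_token = sample['__dj__stats__']['num_token']
        return self.min_num <= num_token <= self.max_num


op = TokenNumFilterSketch(min_num=2, max_num=4)
sample = op.compute_stats_single({'text': 'keep this short sample'})
print(op.process_single(sample))  # True (4 tokens, within [2, 4])
```

Because compute_stats_single skips samples whose 'num_token' stat already exists, calling it repeatedly in a pipeline does not re-tokenize the text.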