data_juicer.ops.filter.token_num_filter module

class data_juicer.ops.filter.token_num_filter.TokenNumFilter(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter that keeps samples whose total token count falls within a specified range.

__init__(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_tokenizer – name of the Hugging Face tokenizer to use.

  • min_num – minimum token count for this op; samples with fewer tokens than this are filtered out.

  • max_num – maximum token count for this op; samples with more tokens than this are filtered out. The default, 9223372036854775807 (2^63 − 1), effectively leaves the upper bound unset.

  • args – extra positional args

  • kwargs – extra keyword args

compute_stats_single(sample)[source]

Compute the stats for the sample, which serve as the metric for deciding whether to filter it.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats
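As a rough illustration of this step, the sketch below counts the tokens in a sample and stores the count on the sample itself. It is not the library's implementation: the helper name compute_token_stats, the whitespace tokenizer standing in for the Hugging Face tokenizer, and the "text", "stats" and "num_token" keys are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace."""
    return text.split()


def compute_token_stats(
    sample: Dict[str, Any],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Dict[str, Any]:
    """Hypothetical helper: record the token count under sample["stats"]."""
    # "text", "stats" and "num_token" are illustrative key names.
    stats = sample.setdefault("stats", {})
    stats["num_token"] = len(tokenize(sample["text"]))
    return sample


sample = compute_token_stats({"text": "a short example sentence"})
print(sample["stats"]["num_token"])
```

Passing the tokenizer in as a callable keeps the sketch runnable without downloading a model; a real tokenizer would generally produce different counts than whitespace splitting.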

process_single(sample)[source]

Sample-level decision: maps a sample to a Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out
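The keep-or-filter decision can be sketched as a range check on a previously computed token count. Going by the parameter descriptions above, the sketch treats both bounds as inclusive: only counts below min_num or above max_num are rejected. The helper keep_sample, the "stats" and "num_token" keys, and the token counts in the sample data are hypothetical and chosen for illustration; they are not the library's API.

```python
from typing import Any, Dict

# Same value as the documented default for max_num.
MAX_DEFAULT = 2**63 - 1


def keep_sample(
    sample: Dict[str, Any],
    min_num: int = 10,
    max_num: int = MAX_DEFAULT,
) -> bool:
    """Hypothetical helper: True keeps the sample, False filters it out."""
    # Assumes the token count was stored earlier under illustrative keys.
    num_token = sample["stats"]["num_token"]
    return min_num <= num_token <= max_num


# Illustrative samples with made-up, precomputed token counts.
samples = [
    {"text": "too short", "stats": {"num_token": 2}},
    {"text": "on the lower bound", "stats": {"num_token": 10}},
    {"text": "too long", "stats": {"num_token": 5000}},
]

kept = [s for s in samples if keep_sample(s, min_num=10, max_num=2048)]
print([s["text"] for s in kept])
```

Separating the stats computation from the decision means the thresholds can be changed and re-applied without tokenizing the text again.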