data_juicer.ops.filter.stopwords_filter module

class data_juicer.ops.filter.stopwords_filter.StopWordsFilter(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with stopword ratio within a specified range.

This operator computes the ratio of stopwords in each sample and keeps samples whose ratio lies between the specified minimum and maximum values. The stopword ratio is the number of stopwords divided by the total number of words. If the tokenization parameter is set, a Hugging Face tokenizer is used to tokenize the text. Stopword lists are loaded from a directory; if the language is set to “all”, the stopword lists of all available languages are merged. The key metric is stopwords_ratio. The operator also supports word augmentation for specific languages (e.g. Chinese and Vietnamese).
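The metric described above can be sketched in plain Python. This is a minimal, self-contained illustration assuming simple whitespace tokenization and a made-up stopword list; the real operator loads its lists from stopwords_dir and can use a Hugging Face tokenizer when tokenization=True:

```python
# Illustrative stopword list only -- the operator loads real lists from disk.
STOPWORDS = {"the", "a", "is", "of", "and"}


def stopwords_ratio(text: str) -> float:
    """Number of stopwords divided by total number of words."""
    words = text.lower().split()  # naive whitespace tokenization
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def keep_sample(text: str, min_ratio: float = 0.3, max_ratio: float = 1.0) -> bool:
    # Keep the sample only if its ratio falls within [min_ratio, max_ratio].
    return min_ratio <= stopwords_ratio(text) <= max_ratio
```

With the defaults above, "the cat is black" has a ratio of 0.5 and is kept, while text containing no stopwords has a ratio of 0.0 and is filtered out.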

__init__(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – The language whose stopword list to use. If lang == “all”, the stopword lists merged from all available languages are adopted

  • tokenization – whether to use a Hugging Face tokenizer to tokenize the documents

  • min_ratio – The minimum stopword ratio; samples with a lower ratio are filtered out

  • max_ratio – The maximum stopword ratio; samples with a higher ratio are filtered out

  • stopwords_dir – The directory storing the stopword file(s); each file name must include “stopwords” and the files must be in JSON format

  • use_words_aug – Whether to augment words, especially for Chinese and Vietnamese

  • words_aug_group_sizes – The group sizes of words to augment

  • words_aug_join_char – The character used to join augmented word groups

  • args – extra positional arguments

  • kwargs – extra keyword arguments
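Assuming the standard Data-Juicer YAML configuration format, where operators are listed under a process key by their registered snake_case name, this filter could be configured as follows (a sketch; parameter names follow the signature above):

```yaml
process:
  - stopwords_filter:
      lang: en            # use the English stopword list
      tokenization: false # plain word splitting, no Hugging Face tokenizer
      min_ratio: 0.3      # keep samples with at least 30% stopwords
      max_ratio: 1.0
```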

compute_stats_single(sample, context=False)[source]

Compute stats for the sample; the stats are used as the metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

Sample-level decision: maps a sample to a Boolean (sample –> Boolean).

Parameters:

sample – the sample to decide whether to keep or filter

Returns:

True to keep the sample, False to filter it out