data_juicer.ops.filter.stopwords_filter module

class data_juicer.ops.filter.stopwords_filter.StopWordsFilter(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with stopword ratio within a specified range.

This operator computes the ratio of stopwords in each sample and keeps samples whose ratio lies between the specified minimum and maximum values. The stopword ratio is the number of stopwords divided by the total number of words. If the tokenization parameter is set, a Hugging Face tokenizer is used to tokenize the text. Stopword lists are loaded from a directory; if the language is set to “all”, the stopword lists of all available languages are merged. The key metric is stopwords_ratio. The operator also supports word augmentation for specific languages (e.g. Chinese and Vietnamese).
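The metric described above can be sketched in plain Python. This is a minimal, self-contained illustration assuming simple whitespace tokenization and a made-up stopword list; the real operator loads its lists from stopwords_dir and can use a Hugging Face tokenizer when tokenization=True:

```python
# Illustrative stopword list only -- the operator loads real lists from disk.
STOPWORDS = {"the", "a", "is", "of", "and"}


def stopwords_ratio(text: str) -> float:
    """Number of stopwords divided by total number of words."""
    words = text.lower().split()  # naive whitespace tokenization
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def keep_sample(text: str, min_ratio: float = 0.3, max_ratio: float = 1.0) -> bool:
    # Keep the sample only if its ratio falls within [min_ratio, max_ratio].
    return min_ratio <= stopwords_ratio(text) <= max_ratio
```

With the defaults above, "the cat is black" has a ratio of 0.5 and is kept, while text containing no stopwords has a ratio of 0.0 and is filtered out.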

__init__(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – The language whose stopword list to use. If lang == “all”, the stopword lists merged from all available languages are adopted

  • tokenization – whether to use a Hugging Face tokenizer to tokenize the documents

  • min_ratio – The minimum stopword ratio; samples with a lower ratio are filtered out

  • max_ratio – The maximum stopword ratio; samples with a higher ratio are filtered out

  • stopwords_dir – The directory storing the stopword file(s); each file name must include “stopwords” and the files must be in JSON format

  • use_words_aug – Whether to augment words, especially for Chinese and Vietnamese

  • words_aug_group_sizes – The group sizes of words to augment

  • words_aug_join_char – The character used to join augmented word groups

  • args – extra positional arguments

  • kwargs – extra keyword arguments
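Assuming the standard Data-Juicer YAML configuration format, where operators are listed under a process key by their registered snake_case name, this filter could be configured as follows (a sketch; parameter names follow the signature above):

```yaml
process:
  - stopwords_filter:
      lang: en            # use the English stopword list
      tokenization: false # plain word splitting, no Hugging Face tokenizer
      min_ratio: 0.3      # keep samples with at least 30% stopwords
      max_ratio: 1.0
```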

compute_stats_single(sample, context=False)[source]

Compute stats for the sample; the stats are used as the metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

Sample-level decision: maps a sample to a Boolean (sample –> Boolean).

Parameters:

sample – the sample to decide whether to keep or filter

Returns:

True to keep the sample, False to filter it out