data_juicer.ops.filter.stopwords_filter module¶
- class data_juicer.ops.filter.stopwords_filter.StopWordsFilter(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]¶
Bases:
Filter
Filter to keep samples with stopword ratio in a specified range.
- __init__(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]¶
Initialization method.
- Parameters:
lang – The language of the stopwords to use. If lang == “all”, the stopword list merged from all available languages is adopted
tokenization – whether to use a model to tokenize documents
min_ratio – The minimum stopword ratio required to keep a sample
max_ratio – The maximum stopword ratio allowed to keep a sample
stopwords_dir – The directory storing the stopword files, whose filenames include “stopwords” and which are in JSON format
use_words_aug – Whether to augment words, especially for Chinese and Vietnamese
words_aug_group_sizes – The group size of words to augment
words_aug_join_char – The join char between words to augment
args – extra args
kwargs – extra args
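The keep criterion behind these parameters can be sketched as follows. This is a hedged, self-contained illustration, not the library's implementation: the stopword set here is a tiny hand-written sample, and tokenization is a plain whitespace split, whereas the real op loads language-specific stopword files from stopwords_dir and may use a model tokenizer when tokenization is True.

```python
# Minimal sketch of StopWordsFilter's decision rule: a sample is kept
# when its stopword ratio lies within [min_ratio, max_ratio].
# STOPWORDS_EN below is an illustrative subset, not the real asset file.

STOPWORDS_EN = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}


def stopword_ratio(text: str, stopwords=STOPWORDS_EN) -> float:
    """Ratio of stopwords among whitespace-separated lowercase tokens."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)


def keep_sample(text: str, min_ratio: float = 0.3, max_ratio: float = 1.0) -> bool:
    """Keep the sample iff min_ratio <= stopword ratio <= max_ratio."""
    return min_ratio <= stopword_ratio(text) <= max_ratio


natural = "the cat is on the mat and the dog is in the yard"
gibberish = "qwerty asdf zxcv foo bar baz"
```

With the defaults, natural-language text (rich in stopwords) passes, while keyword soup or gibberish (almost no stopwords) is filtered out.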
- compute_stats_single(sample, context=False)[source]¶
Compute stats for the sample, which are used as a metric to decide whether to filter this sample.
- Parameters:
sample – input sample.
context – whether to temporarily store context information of intermediate variables in the sample.
- Returns:
sample with computed stats
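The compute-then-filter flow can be sketched like this. The stats field name `__dj__stats__` and the key `stopwords_ratio` follow data_juicer conventions but are assumptions here, and the whitespace tokenizer again stands in for the real one: compute_stats_single writes the metric into the sample's stats dict, and the later filtering step reads it back.

```python
# Hedged sketch: compute_stats_single caches the metric in the sample,
# then the keep/drop decision reads the cached value.
# STATS_KEY and the "stopwords_ratio" key are assumed names.

STATS_KEY = "__dj__stats__"
STOPWORDS = {"the", "is", "of", "and", "a", "in"}


def compute_stats_single(sample: dict) -> dict:
    """Store the sample's stopword ratio under its stats dict and return it."""
    words = sample["text"].lower().split()
    ratio = sum(w in STOPWORDS for w in words) / len(words) if words else 0.0
    sample.setdefault(STATS_KEY, {})["stopwords_ratio"] = ratio
    return sample


def process_single(sample: dict, min_ratio: float = 0.3, max_ratio: float = 1.0) -> bool:
    """Read the cached ratio and keep the sample iff it is in range."""
    ratio = sample[STATS_KEY]["stopwords_ratio"]
    return min_ratio <= ratio <= max_ratio


sample = compute_stats_single({"text": "the quick brown fox is in the barn"})
```

Caching the stat in the sample lets the computation run once per sample even if several downstream steps consult it.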