data_juicer.ops.filter.perplexity_filter module
- class data_juicer.ops.filter.perplexity_filter.PerplexityFilter(lang: str = 'en', min_ppl: float = 0, max_ppl: float = 1500, *args, **kwargs)[source]
Bases: Filter
Filter to keep samples with perplexity score in a specified range.
This operator computes the perplexity of text samples using a Hugging Face tokenizer and a KenLM language model, and keeps samples whose perplexity lies between the specified minimum and maximum values. By default, the perplexity is calculated on a character basis. If a sample's stats already contain a 'perplexity' field, that value is reused instead of being recomputed. The operator supports batched processing for efficiency.
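The compute-once, reuse-afterwards behaviour can be sketched in plain Python. This is only an illustration under stated assumptions, not the operator's actual code: the `"stats"` key, the `compute_stats` helper, the whitespace splitting, and the toy scorer are all hypothetical stand-ins for the real tokenizer and KenLM model. The formula is the usual perplexity derived from a summed log10 probability.

```python
from typing import Any, Callable, Dict


def perplexity_from_log10(total_log10_prob: float, n_units: int) -> float:
    """Standard perplexity from a summed log10 probability over n_units."""
    return 10.0 ** (-total_log10_prob / n_units)


def compute_stats(sample: Dict[str, Any],
                  score_unit: Callable[[str], float]) -> Dict[str, Any]:
    """Attach a 'perplexity' stat unless one is already present."""
    stats = sample.setdefault("stats", {})  # hypothetical stats container
    if "perplexity" in stats:
        return sample  # already computed: reuse the stored value
    units = sample["text"].split()  # assumption: whitespace-separated units
    total = sum(score_unit(u) for u in units)
    stats["perplexity"] = perplexity_from_log10(total, max(len(units), 1))
    return sample


def toy_scorer(unit: str) -> float:
    """Stand-in for a language model: log10 probability of -1.0 per unit."""
    return -1.0


fresh = compute_stats({"text": "a b c"}, toy_scorer)
cached = compute_stats({"text": "a b c", "stats": {"perplexity": 42.0}},
                       toy_scorer)
print(fresh["stats"]["perplexity"], cached["stats"]["perplexity"])
```

The second call returns the stored value untouched, which is the point of the reuse rule: a pipeline that has already scored its samples does not pay for the language model twice.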
- __init__(lang: str = 'en', min_ppl: float = 0, max_ppl: float = 1500, *args, **kwargs)[source]
Initialization method.
- Parameters:
lang -- The language of the samples for which perplexity is computed.
min_ppl -- The minimum perplexity a sample may have to be kept by this op.
max_ppl -- The maximum perplexity a sample may have to be kept by this op.
args -- extra positional arguments
kwargs -- extra keyword arguments
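As a rough illustration of how min_ppl and max_ppl select samples, the sketch below applies the range check to a batch of perplexity scores that have already been computed. The `keep_mask` helper and the column-style batch layout are hypothetical, and treating both bounds as inclusive is an assumption rather than something the reference above states. The defaults mirror the signature (0 and 1500).

```python
from typing import Dict, List


def keep_mask(batch_stats: Dict[str, List[float]],
              min_ppl: float = 0,
              max_ppl: float = 1500) -> List[bool]:
    """Return one keep/drop flag per sample in the batch.

    Assumption: a sample is kept when min_ppl <= perplexity <= max_ppl.
    """
    return [min_ppl <= ppl <= max_ppl for ppl in batch_stats["perplexity"]]


# Hypothetical batch of precomputed perplexity scores.
batch = {"perplexity": [12.5, 980.0, 1500.0, 2300.0]}

default_mask = keep_mask(batch)                          # default range
strict_mask = keep_mask(batch, min_ppl=50, max_ppl=1000)  # narrower range
print(default_mask, strict_mask)
```

Raising min_ppl above 0 drops very low-perplexity text, which is often repetitive or templated, while lowering max_ppl drops text the language model finds unlikely.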