data_juicer.ops.filter.perplexity_filter module

class data_juicer.ops.filter.perplexity_filter.PerplexityFilter(lang: str = 'en', min_ppl: float = 0, max_ppl: float = 1500, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with perplexity score in a specified range.

This operator computes the perplexity of each text sample using a Hugging Face tokenizer and a KenLM language model, and keeps samples whose perplexity lies between the specified minimum and maximum values. By default, perplexity is computed on a character basis. If a perplexity value is already present in the 'perplexity' field of the sample's stats, it is reused rather than recomputed. The operator supports batched processing for efficiency.
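As a rough illustration of the scoring step, the sketch below turns a summed log10 probability (the unit in which KenLM reports scores) into a perplexity value. The function name `perplexity_from_log10` and the `length` argument are hypothetical; the operator's own normalisation unit and rounding are not shown on this page, so treat this as a sketch under those assumptions rather than the operator's code.

```python
import math


def perplexity_from_log10(total_log10_prob: float, length: int) -> float:
    """Perplexity of a sequence given its summed log10 probability.

    `length` is the number of units (characters or tokens) the score is
    normalised over. Which unit the real operator uses is an assumption
    left open here.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    # Perplexity is the inverse geometric mean of the per-unit probabilities.
    return 10.0 ** (-total_log10_prob / length)


# A sequence of 3 units with a total log10 probability of -6.0
# has an average per-unit probability of 10 ** -2.
ppl = perplexity_from_log10(-6.0, 3)
print(ppl)
```

Lower values mean the language model finds the text more predictable, which is why the operator's default range starts at 0 and caps at `max_ppl`.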

__init__(lang: str = 'en', min_ppl: float = 0, max_ppl: float = 1500, *args, **kwargs)[source]

Initialization method.

Parameters:

lang -- Language of the samples for which perplexity is computed.
min_ppl -- Minimum perplexity a sample may have and still be kept.
max_ppl -- Maximum perplexity a sample may have and still be kept.
args -- Extra positional arguments.
kwargs -- Extra keyword arguments.

compute_stats_batched(samples, context=False)[source]

process_batched(samples)[source]
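
The two methods above suggest a two-phase flow: first fill in the perplexity stat for each sample, then decide which samples to keep. The sketch below mimics that flow with standalone functions. It is not the library's implementation: `sketch_compute_stats`, `sketch_process`, `toy_score`, the `"stats"` key, the dict-of-lists batch layout, and the inclusive bounds are all assumptions made for illustration. Only the `'perplexity'` stat name comes from the description above.

```python
from typing import Callable, Dict, List

STATS_KEY = "stats"      # hypothetical stand-in for the real stats field name
PPL_KEY = "perplexity"   # stat name mentioned in the class description


def sketch_compute_stats(samples: Dict[str, list],
                         score: Callable[[str], float]) -> Dict[str, list]:
    """Fill in a perplexity stat for every sample that lacks one."""
    for text, stats in zip(samples["text"], samples[STATS_KEY]):
        if PPL_KEY not in stats:  # reuse a previously computed value
            stats[PPL_KEY] = score(text)
    return samples


def sketch_process(samples: Dict[str, list],
                   min_ppl: float = 0,
                   max_ppl: float = 1500) -> List[bool]:
    """Return one keep/drop flag per sample (bounds assumed inclusive)."""
    return [min_ppl <= s[PPL_KEY] <= max_ppl for s in samples[STATS_KEY]]


def toy_score(text: str) -> float:
    """Hypothetical scorer used only for this demo; NOT a language model."""
    return 100.0 * len(text)


batch = {
    "text": ["short", "a sample whose perplexity was computed earlier"],
    STATS_KEY: [{}, {PPL_KEY: 42.0}],  # second sample already has a stat
}
batch = sketch_compute_stats(batch, toy_score)
keep = sketch_process(batch, max_ppl=100)
print(keep)
```

Separating the stat computation from the keep/drop decision means a stored perplexity can be reused when the thresholds change, without rescoring the text.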