data_juicer.ops.filter.llm_perplexity_filter module

class data_juicer.ops.filter.llm_perplexity_filter.LLMPerplexityFilter(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with perplexity scores within a specified range, computed using a specified LLM.

This operator computes the perplexity score for each sample using a Hugging Face LLM. It then filters the samples based on whether their perplexity scores fall within the specified minimum and maximum score range. The perplexity score is calculated as the exponential of the loss value from the LLM. The operator uses a query and response template to format the input text for the LLM. If the perplexity score is not already cached in the sample’s stats under the key ‘llm_perplexity’, it will be computed.

__init__(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – huggingface embedding model name.

  • model_params – Parameters for initializing the API model.

  • min_score – Minimum perplexity score.

  • max_score – Maximum perplexity score.

  • query_template – Template for building the query string.

  • response_template – Template for building the response string.

  • args – extra args

  • kwargs – extra args

sample_with_messages(sample, system_prompt=None)[source]
compute_stats_single(sample, rank=None)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering