data_juicer.ops.filter.llm_perplexity_filter module

class data_juicer.ops.filter.llm_perplexity_filter.LLMPerplexityFilter(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range.

__init__(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – huggingface model name; this language model is used to compute the perplexity of each sample.

  • model_params – Parameters for initializing the Hugging Face model.

  • min_score – Minimum perplexity score.

  • max_score – Maximum perplexity score.

  • query_template – Template for building the query string.

  • response_template – Template for building the response string.

  • args – extra positional args

  • kwargs – extra keyword args
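As a usage sketch, the op can be enabled in a data-juicer recipe. The fragment below follows data-juicer's usual op-list convention; the exact keys should be checked against the current config schema:

```yaml
# Hedged sketch: keep samples whose Qwen-scored perplexity is in [1, 100].
process:
  - llm_perplexity_filter:
      hf_model: 'Qwen/Qwen2.5-0.5B'
      min_score: 1.0
      max_score: 100.0
```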

sample_with_messages(sample, system_prompt=None)[source]
compute_stats_single(sample, rank=None)[source]

Compute stats for the sample; the stats are used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • rank – optional rank of the worker process, used when loading the model (e.g., to place it on a specific device).

Returns:

sample with computed stats
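The stat recorded here is a perplexity score. Conceptually, perplexity is the exponential of the mean per-token negative log-likelihood under the language model; a minimal sketch of the formula (not data-juicer's actual implementation, which runs the HF model to obtain the token losses):

```python
import math

def perplexity(token_nlls):
    # Perplexity = exp(mean negative log-likelihood per token, in nats).
    return math.exp(sum(token_nlls) / len(token_nlls))

# Three tokens, each with NLL 2.0 nats -> perplexity e^2
print(round(perplexity([2.0, 2.0, 2.0]), 2))  # → 7.39
```

Lower perplexity means the model finds the text more predictable; a model that assigns every token probability 1 (NLL 0) yields the minimum perplexity of 1.0, which is why `min_score` defaults to 1.0.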

process_single(sample)[source]

At the sample level: sample –> Boolean.

Parameters:

sample – the sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out
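The decision itself is a range check against the bounds passed to __init__; a minimal sketch of the behaviour described above:

```python
def keep_sample(ppl, min_score=1.0, max_score=100.0):
    # Keep the sample only if its perplexity lies within [min_score, max_score].
    return min_score <= ppl <= max_score

print(keep_sample(42.0))   # → True  (inside [1, 100])
print(keep_sample(350.0))  # → False (perplexity too high: model finds the text unlikely)
```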