data_juicer.ops.filter.llm_perplexity_filter module¶
- class data_juicer.ops.filter.llm_perplexity_filter.LLMPerplexityFilter(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)[source]¶
Bases:
Filter
Filter to keep samples with perplexity scores within a specified range, computed using a specified LLM.
This operator computes the perplexity score for each sample using a Hugging Face LLM. It then filters the samples based on whether their perplexity scores fall within the specified minimum and maximum score range. The perplexity score is calculated as the exponential of the loss value from the LLM. The operator uses a query and response template to format the input text for the LLM. If the perplexity score is not already cached in the sample’s stats under the key ‘llm_perplexity’, it will be computed.
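The core computation described above is simple: perplexity is the exponential of the language-model loss, and a sample is kept only if that value falls inside `[min_score, max_score]`. A minimal sketch of that logic (the function names here are illustrative, not the operator's actual internals):

```python
import math


def perplexity_from_loss(loss: float) -> float:
    # Perplexity is the exponential of the LM cross-entropy loss,
    # as described for this operator.
    return math.exp(loss)


def keep_sample(ppl: float, min_score: float = 1.0,
                max_score: float = 100.0) -> bool:
    # Keep the sample only when its perplexity lies in the
    # configured [min_score, max_score] range (defaults mirror
    # the operator's defaults).
    return min_score <= ppl <= max_score
```

For example, a loss of 0.0 corresponds to a perplexity of 1.0, the lowest value the default range accepts.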
- __init__(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)[source]¶
Initialization method.
- Parameters:
hf_model – huggingface model name of the LLM used to compute perplexity.
model_params – Parameters for initializing the Hugging Face model.
min_score – Minimum perplexity score.
max_score – Maximum perplexity score.
query_template – Template for building the query string.
response_template – Template for building the response string.
args – extra args
kwargs – extra args
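In a Data-Juicer recipe, these parameters would typically be set under the operator's entry in the `process` list. A plausible config fragment (values shown are the documented defaults; adjust to your recipe):

```yaml
process:
  - llm_perplexity_filter:
      hf_model: 'Qwen/Qwen2.5-0.5B'
      min_score: 1.0
      max_score: 100.0
```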
- compute_stats_single(sample, rank=None)[source]¶
Compute stats for the sample, which are used as a metric to decide whether to filter this sample.
- Parameters:
sample – input sample.
rank – rank of the current process, used to place the model on the proper device.
- Returns:
sample with computed stats
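As noted in the class description, the score is cached in the sample's stats under the key `llm_perplexity` and is only computed when missing. A sketch of that caching pattern, assuming a dict-like sample with a stats field and a hypothetical `ppl_fn` standing in for the actual model call (the stats field name `__dj__stats__` is Data-Juicer's convention):

```python
def compute_stats_single(sample: dict, ppl_fn, key: str = 'llm_perplexity') -> dict:
    # Fetch (or create) the per-sample stats dict.
    stats = sample.setdefault('__dj__stats__', {})
    # Only compute perplexity when it is not already cached.
    if key not in stats:
        stats[key] = ppl_fn(sample['text'])
    return sample
```

A second call on the same sample is a no-op, so already-scored datasets are not re-scored.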