data_juicer.ops.filter.image_nsfw_filter module

class data_juicer.ops.filter.image_nsfw_filter.ImageNSFWFilter(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, min_score: float = 0.0, max_score: float = 0.5, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep samples whose images have NSFW scores within a specified range.

This operator uses a Hugging Face model to compute an NSFW score for each image in a sample, then keeps or drops the sample according to the min_score and max_score thresholds. Two strategies are supported: ‘any’ (keep the sample if any image meets the condition) and ‘all’ (keep the sample only if all images meet the condition). The NSFW scores are cached in the ‘image_nsfw_score’ field of the sample’s stats.

__init__(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, min_score: float = 0.0, max_score: float = 0.5, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:

hf_nsfw_model – name of the NSFW detection model on Hugging Face.
trust_remote_code – whether to trust remote code shipped with Hugging Face models.
min_score – the minimum NSFW score threshold for samples, in the range 0 to 1.
max_score – the maximum NSFW score threshold for samples, in the range 0 to 1.
any_or_all – strategy for combining the per-image results of a sample. ‘any’: keep the sample if any image meets the condition. ‘all’: keep the sample only if all images meet the condition.
args – extra positional arguments.
kwargs – extra keyword arguments.
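
The constraints listed above can be checked before the operator is constructed. The helper below is a hypothetical sketch and not part of data_juicer: the name check_nsfw_filter_args is invented for illustration, and rejecting a min_score greater than max_score is an assumption rather than documented behaviour.

```python
def check_nsfw_filter_args(min_score=0.0, max_score=0.5, any_or_all="any"):
    """Hypothetical pre-check for the documented parameters (not a data_juicer API)."""
    # Both thresholds are documented as ranging from 0 to 1.
    for name, value in (("min_score", min_score), ("max_score", max_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # Assumption: an empty range (min above max) is treated as a mistake.
    if min_score > max_score:
        raise ValueError("min_score must not exceed max_score")
    # Only the two documented strategies are accepted.
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
    return {"min_score": min_score, "max_score": max_score, "any_or_all": any_or_all}
```

The returned dictionary could then be passed as keyword arguments to the constructor.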

compute_stats_single(sample, rank=None, context=False)

Compute the stats for the sample, which are used as the metric to decide whether to filter it.

Parameters:

sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats
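
The caching behaviour described for the ‘image_nsfw_score’ field can be sketched as follows. This is a minimal illustration, not the library's implementation: the sample layout (an "images" list and a "stats" dict) and the caller-supplied score_image callable are assumptions, and the actual model inference is left out.

```python
def compute_stats_cached(sample, score_image):
    """Illustrative only: fill stats["image_nsfw_score"] at most once per sample.

    Assumptions: sample is a dict with an "images" list and a "stats" dict,
    and score_image is a callable that returns a float for one image.
    """
    stats = sample.setdefault("stats", {})
    # If the scores were already computed, reuse them instead of re-scoring.
    if "image_nsfw_score" in stats:
        return sample
    # Otherwise score every image and store the list of scores in the stats.
    stats["image_nsfw_score"] = [score_image(img) for img in sample.get("images", [])]
    return sample
```

Calling the function twice on the same sample invokes score_image only during the first call.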

process_single(sample, rank=None)

Decide at the sample level whether to keep a sample (sample –> Boolean).

Parameters:

sample – the sample to decide whether to filter.

Returns:

True to keep the sample, False to filter it out.
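
The keep-or-filter decision under the ‘any’ and ‘all’ strategies can be sketched in plain Python. This is a hedged illustration rather than the operator's code: the function name keep_sample is hypothetical, and both the inclusive bounds and the choice to keep samples without images are assumptions.

```python
def keep_sample(scores, min_score=0.0, max_score=0.5, any_or_all="any"):
    """Illustrative decision rule over a list of per-image NSFW scores."""
    # Assumption: a sample with no images has nothing to reject, so keep it.
    if not scores:
        return True
    # Assumption: both thresholds are inclusive.
    in_range = [min_score <= s <= max_score for s in scores]
    if any_or_all == "any":
        return any(in_range)  # one image in range is enough
    if any_or_all == "all":
        return all(in_range)  # every image must be in range
    raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
```

With the default thresholds, a sample with scores [0.1, 0.9] is kept under ‘any’ but filtered under ‘all’, because only the first image falls in the range.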