data_juicer.ops.filter.video_nsfw_filter module

class data_juicer.ops.filter.video_nsfw_filter.VideoNSFWFilter(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, min_score: float = 0.0, max_score: float = 0.5, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter that keeps samples whose videos have NSFW scores within a specified range.

__init__(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, min_score: float = 0.0, max_score: float = 0.5, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_nsfw_model – name of the NSFW detection model on Hugging Face.

  • trust_remote_code – whether to trust remote code when loading the Hugging Face model. Default: False.

  • min_score – the lower bound of the NSFW score range for samples. Default: 0.0.

  • max_score – the maximum NSFW score threshold for samples, in the range 0 to 1. Samples with an NSFW score below this threshold are kept.

  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video); the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • reduce_mode – reduce mode for multiple sampled video frames. ‘avg’: take the average of the values. ‘max’: take the maximum of the values. ‘min’: take the minimum of the values.

  • any_or_all – strategy for keeping a sample across all of its videos. ‘any’: keep the sample if any video meets the condition. ‘all’: keep the sample only if all videos meet the condition.

  • args – extra positional args

  • kwargs – extra keyword args
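
The frame_num rules for “uniform” sampling can be illustrated with a minimal sketch. The helper uniform_frame_indices below is hypothetical and not part of data_juicer; it assumes frames are addressed by integer index, and the library's actual index rounding may differ.

```python
from typing import List


def uniform_frame_indices(total_frames: int, frame_num: int) -> List[int]:
    """Hypothetical helper: choose frame indices following the frame_num rules."""
    if total_frames <= 0 or frame_num <= 0:
        raise ValueError("total_frames and frame_num must be positive")
    # Never request more frames than the video contains (an assumption).
    frame_num = min(frame_num, total_frames)
    if frame_num == 1:
        # Only the middle frame.
        return [total_frames // 2]
    if frame_num == 2:
        # Only the first and the last frames.
        return [0, total_frames - 1]
    # First and last frames, plus evenly spaced frames in between.
    step = (total_frames - 1) / (frame_num - 1)
    return [round(i * step) for i in range(frame_num)]


if __name__ == "__main__":
    print(uniform_frame_indices(10, 1))
    print(uniform_frame_indices(10, 2))
    print(uniform_frame_indices(10, 4))
```

With the default “all_keyframes” method, frame_num is ignored, so a helper like this would only matter when frame_sampling_method is “uniform”.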

compute_stats_single(sample, rank=None, context=False)[source]

Compute the stats for the sample, which are used as the metric to decide whether to filter it.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats
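
As a rough illustration of how reduce_mode could turn per-frame scores into one score per video, consider the sketch below. The function names and the plain-list input are assumptions made for this example; they do not reproduce the operator's internals, which obtain the frame scores from the Hugging Face model.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Maps each reduce_mode value to the function that aggregates frame scores.
REDUCERS: Dict[str, Callable[[Sequence[float]], float]] = {
    "avg": mean,
    "max": max,
    "min": min,
}


def reduce_frame_scores(frame_scores: Sequence[float], reduce_mode: str = "avg") -> float:
    """Hypothetical helper: collapse the frame scores of one video into one value."""
    if reduce_mode not in REDUCERS:
        raise ValueError(f"reduce_mode must be one of {sorted(REDUCERS)}")
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    return float(REDUCERS[reduce_mode](frame_scores))


def video_scores(per_video_frames: List[Sequence[float]], reduce_mode: str = "avg") -> List[float]:
    """Hypothetical helper: one reduced NSFW score per video in a sample."""
    return [reduce_frame_scores(frames, reduce_mode) for frames in per_video_frames]


if __name__ == "__main__":
    # Two videos, each with two sampled frames (made-up scores).
    print(video_scores([[0.25, 0.75], [0.0, 0.5]], reduce_mode="max"))
```

Choosing ‘max’ is the strictest option for this filter, since a single high-scoring frame then determines the score of the whole video.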

process_single(sample, rank=None)[source]

Sample-level decision: maps a sample to a Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out
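
The keep decision can be sketched as follows, assuming one already-reduced NSFW score per video. keep_sample is a hypothetical function, not the library's code. Two details are assumptions of this sketch rather than documented behavior: the range check is inclusive at both ends, and a sample without videos is kept.

```python
from typing import Sequence


def keep_sample(
    video_scores: Sequence[float],
    min_score: float = 0.0,
    max_score: float = 0.5,
    any_or_all: str = "any",
) -> bool:
    """Hypothetical helper: decide whether to keep a sample from its video scores."""
    if any_or_all not in ("any", "all"):
        raise ValueError("any_or_all must be 'any' or 'all'")
    if not video_scores:
        # Assumption: a sample with no videos has nothing to filter on.
        return True
    # Assumption: both bounds are inclusive.
    in_range = [min_score <= score <= max_score for score in video_scores]
    return any(in_range) if any_or_all == "any" else all(in_range)


if __name__ == "__main__":
    scores = [0.1, 0.9]  # made-up scores for a sample with two videos
    print(keep_sample(scores, any_or_all="any"))
    print(keep_sample(scores, any_or_all="all"))
```

Under the ‘any’ strategy one in-range video is enough to keep the sample, so ‘all’ is the safer choice when every video in a kept sample must pass the check.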