data_juicer.ops.filter.video_watermark_filter module
- class data_juicer.ops.filter.video_watermark_filter.VideoWatermarkFilter(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)
Bases: Filter
Filter that keeps samples whose videos are predicted, with high probability, to contain no watermark.
- __init__(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)
Initialization method.
- Parameters:
hf_watermark_model – name of the watermark detection model on Hugging Face. Default: ‘amrul-hzz/watermark_detector’.
trust_remote_code – whether to trust remote code when loading the Hugging Face model. Default: False.
prob_threshold – the threshold on the predicted watermark probability, in the range from 0 to 1. Samples with a watermark probability less than this threshold will be kept. Default: 0.8.
frame_sampling_method – the method used to extract frame images from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.
frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.
reduce_mode – reduce mode for the values computed on multiple sampled video frames. ‘avg’: take the average of the values. ‘max’: take the maximum of the values. ‘min’: take the minimum of the values. Default: ‘avg’.
any_or_all – strategy for keeping a sample that contains multiple videos. ‘any’: keep this sample if any video meets the condition. ‘all’: keep this sample only if all videos meet the condition. Default: ‘any’.
args – extra positional arguments
kwargs – extra keyword arguments
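The frame_num rule for “uniform” sampling can be illustrated with a minimal sketch. The helper name uniform_frame_indices is hypothetical, and working on frame indices rather than timestamps is an assumption made for simplicity; this is not the library's own sampling code.

```python
from __future__ import annotations


def uniform_frame_indices(total_frames: int, frame_num: int) -> list[int]:
    """Hypothetical helper: pick frame indices following the documented rule.

    Assumption: frames are addressed by index 0..total_frames-1; the real
    operator may sample by timestamp within the video duration instead.
    """
    if total_frames <= 0 or frame_num <= 0:
        raise ValueError("total_frames and frame_num must be positive")
    if frame_num == 1:
        # Only the middle frame is extracted.
        return [total_frames // 2]
    last = total_frames - 1
    # For frame_num == 2 this yields the first and last frames only;
    # for larger values the remaining frames are spaced evenly in between.
    step = last / (frame_num - 1)
    return [round(i * step) for i in range(frame_num)]


print(uniform_frame_indices(101, 1))
print(uniform_frame_indices(101, 2))
print(uniform_frame_indices(101, 5))
```

If a video has fewer frames than frame_num, this sketch returns repeated indices; how the actual operator handles that case is not stated here.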
- compute_stats_single(sample, rank=None, context=False)
Compute stats for the sample, which are used as metrics to decide whether to filter this sample.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
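How the computed stats might feed the keep decision can be sketched as follows, combining reduce_mode, prob_threshold and any_or_all as described in the parameter list. The function names reduce_frame_probs and keep_sample are hypothetical, the per-frame probabilities are made-up inputs, and keeping samples that contain no videos is an assumption; this is a sketch, not the operator's implementation.

```python
from statistics import mean

# Map each documented reduce_mode to a plain Python reducer.
REDUCERS = {"avg": mean, "max": max, "min": min}


def reduce_frame_probs(frame_probs, reduce_mode="avg"):
    """Hypothetical helper: collapse per-frame watermark probabilities
    of one video into a single value."""
    if reduce_mode not in REDUCERS:
        raise ValueError(f"unsupported reduce_mode: {reduce_mode!r}")
    return REDUCERS[reduce_mode](frame_probs)


def keep_sample(video_probs, prob_threshold=0.8, any_or_all="any"):
    """Hypothetical helper: decide whether to keep a sample given one
    reduced watermark probability per video."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"unsupported any_or_all: {any_or_all!r}")
    # A video meets the condition when its probability is below the threshold.
    flags = [p < prob_threshold for p in video_probs]
    if not flags:
        # Assumption: a sample without videos is kept.
        return True
    return any(flags) if any_or_all == "any" else all(flags)


# Made-up per-frame probabilities for two videos in one sample.
frames_per_video = [[0.1, 0.9, 0.2], [0.95, 0.9, 1.0]]
video_probs = [reduce_frame_probs(f, "avg") for f in frames_per_video]
print(keep_sample(video_probs, 0.8, "any"))
print(keep_sample(video_probs, 0.8, "all"))
```

With ‘max’ as the reduce mode, a single high-probability frame is enough to push a video over the threshold, which makes the filter stricter than ‘avg’.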