data_juicer.ops.filter.image_watermark_filter module

class data_juicer.ops.filter.image_watermark_filter.ImageWatermarkFilter(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter that keeps samples whose images are unlikely to contain a watermark.

This operator uses a Hugging Face watermark detection model to filter samples based on the presence of watermarks in their images. It keeps samples where the predicted watermark probability is below a specified threshold. The operator supports two strategies: ‘any’ (keep if any image meets the condition) and ‘all’ (keep only if all images meet the condition). The key metric ‘image_watermark_prob’ is computed for each image, representing the probability that the image contains a watermark. If no images are present in the sample, the metric is set to an empty array.
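The keep/drop rule described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: `keep_sample` is a hypothetical helper, and keeping samples that contain no images is an assumption, since the description only says the metric is an empty array in that case.

```python
def keep_sample(watermark_probs, prob_threshold=0.8, any_or_all="any"):
    """Hypothetical helper mirroring the documented 'any'/'all' strategies."""
    if any_or_all not in ("any", "all"):
        raise ValueError("any_or_all must be 'any' or 'all'")

    # Assumption: a sample without images has nothing to reject, so keep it.
    if len(watermark_probs) == 0:
        return True

    # An image meets the condition when its watermark probability is
    # strictly below the threshold.
    flags = [p < prob_threshold for p in watermark_probs]
    return any(flags) if any_or_all == "any" else all(flags)


# One clean image and one likely-watermarked image.
probs = [0.10, 0.95]
print(keep_sample(probs, any_or_all="any"))
print(keep_sample(probs, any_or_all="all"))
```

With the 'any' strategy a single clean image is enough to keep the sample, so 'all' is the stricter choice when every image must be free of watermarks.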

__init__(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:

hf_watermark_model – name of the watermark detection model on Hugging Face.
trust_remote_code – whether to trust the remote code of Hugging Face models.
prob_threshold – the predicted watermark probability threshold, in the range 0 to 1. Samples whose images have a watermark probability below this threshold are kept.
any_or_all – strategy for combining the per-image results of a sample. ‘any’: keep the sample if any image meets the condition. ‘all’: keep the sample only if all images meet the condition.
args – extra positional args
kwargs – extra keyword args

compute_stats_single(sample, rank=None, context=False)

Compute the stats for the sample, which are used as the metric to decide whether to filter it.

Parameters:

sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats
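As a rough sketch of what computing this stat involves, the snippet below turns per-image classifier logits into watermark probabilities with a softmax and stores them under ‘image_watermark_prob’. Everything except that metric name is an assumption made for illustration: the helper names, the `"stats"` key, the precomputed `per_image_logits` input, and the choice of class index 1 as the watermark class. The real operator runs the Hugging Face model on the images itself.

```python
import math


def watermark_probability(logits, watermark_index=1):
    """Softmax over class logits; returns the probability of one class.

    Assumption: `watermark_index` marks the 'watermark' class. The actual
    label order depends on the model being used.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[watermark_index] / sum(exps)


def compute_stats_sketch(sample, per_image_logits):
    """Hypothetical stand-in for compute_stats_single on precomputed logits."""
    stats = sample.setdefault("stats", {})  # "stats" is an illustrative key

    # Skip the work if the metric was already computed for this sample.
    if "image_watermark_prob" in stats:
        return sample

    # One probability per image; an empty list when the sample has no images.
    stats["image_watermark_prob"] = [
        watermark_probability(logits) for logits in per_image_logits
    ]
    return sample


sample = compute_stats_sketch({}, [[2.0, -1.0], [-0.5, 3.0]])
print(sample["stats"]["image_watermark_prob"])
```

Storing one probability per image, rather than a single aggregate, is what lets the ‘any’ and ‘all’ strategies be applied later without rerunning the model.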

process_single(sample, rank=None)

Decide at the sample level whether to keep the sample (sample → Boolean).

Parameters:

sample – the sample to decide whether to filter.

Returns:

True to keep the sample, False to filter it out.