data_juicer.ops.deduplicator.image_deduplicator module

data_juicer.ops.deduplicator.image_deduplicator.get_hash_method(method_name)[source]
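No docstring is provided for this helper. Judging from the method parameter of ImageDeduplicator below (default 'phash'), it presumably maps a method name to the corresponding image-hashing function. The following is a minimal, hypothetical sketch assuming the imagehash package; the actual lookup in data-juicer may support a different set of names:

    # Hypothetical sketch only -- assumes the `imagehash` package; the real
    # get_hash_method in data-juicer may differ.
    import imagehash

    _HASH_METHODS = {
        'phash': imagehash.phash,          # perceptual hash (the documented default)
        'dhash': imagehash.dhash,          # difference hash
        'whash': imagehash.whash,          # wavelet hash
        'ahash': imagehash.average_hash,   # average hash
    }

    def get_hash_method(method_name):
        # Return the hashing function registered under the given name.
        return _HASH_METHODS[method_name]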
class data_juicer.ops.deduplicator.image_deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicates samples at the document level by exact matching of images.

This operator compares images across documents to identify and remove duplicates.

  • It uses a specified hash method (default 'phash') to compute image hashes.

  • If consider_text is enabled, it also considers text content for deduplication, using a text deduplicator in conjunction with the image hashes.

  • The key metric, imagehash, is computed for each sample. If consider_text is enabled, an additional hash field is used as well.

  • Duplicates are identified by comparing these hash values: samples with identical hashes are considered duplicates.

  • When show_num is greater than 0, the operator also returns a subset of duplicate pairs for tracing purposes.

  • The operator caches the imagehash field and, if applicable, the hash field.
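The core deduplication idea can be pictured as follows. This is an illustrative sketch of hash-based exact matching, not data-juicer's actual implementation; the imagehash field name follows the description above, everything else is assumed:

    # Illustrative sketch: keep the first sample for each distinct image hash.
    def dedup_by_hash(samples):
        seen = set()
        kept = []
        for sample in samples:
            key = sample['imagehash']   # hash key described above (assumed layout)
            if key in seen:
                continue                # identical hash -> exact duplicate, drop it
            seen.add(key)
            kept.append(sample)
        return kept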

__init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • method -- hash method to use for images (defaults to 'phash')

  • consider_text -- whether to consider text hash together with image hash when applying deduplication.

  • args -- extra positional arguments

  • kwargs -- extra keyword arguments
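A minimal usage sketch based only on the documented signature (the import path follows the module name above):

    from data_juicer.ops.deduplicator.image_deduplicator import ImageDeduplicator

    # Default configuration: perceptual hashing of images only.
    op = ImageDeduplicator(method='phash')

    # Also take the text hash into account when marking duplicates.
    op_text_aware = ImageDeduplicator(method='phash', consider_text=True)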

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample -- input sample

Returns:

sample with computed hash value.
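For orientation, this is roughly what computing a perceptual hash for a sample's images looks like with PIL and imagehash. It is a hedged stand-in, not the operator's actual compute_hash; the images field name is an assumption:

    from PIL import Image
    import imagehash

    def toy_compute_hash(sample, image_key='images'):
        # `image_key` is a hypothetical field holding image paths.
        hashes = [str(imagehash.phash(Image.open(p))) for p in sample[image_key]]
        sample['imagehash'] = '|'.join(hashes)   # hash key used for dedup
        return sample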

process(dataset, show_num=0)[source]

Deduplicate the input dataset at the document level (dataset --> dataset).

Parameters:
  • dataset -- input dataset

  • show_num -- number of traced duplicate samples to return when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
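A hedged end-to-end sketch of the documented call pattern; dataset is assumed to be a mappable (HuggingFace-style) dataset whose samples contain image paths, and the exact wiring inside data-juicer pipelines may differ:

    op = ImageDeduplicator(method='phash')

    # 1. Attach hash values to every sample.
    dataset = dataset.map(op.compute_hash)

    # 2. Drop samples whose hashes match exactly; with show_num > 0 a subset
    #    of duplicate pairs is also returned for tracing.
    dataset, dup_pairs = op.process(dataset, show_num=2)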