data_juicer.ops.deduplicator.image_deduplicator module

data_juicer.ops.deduplicator.image_deduplicator.get_hash_method(method_name)[source]
class data_juicer.ops.deduplicator.image_deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator that removes duplicate samples at the document level by exact matching of images (compared via their hash values) between documents.

__init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • method -- name of the hash method applied to images (default: 'phash')

  • consider_text -- whether to consider text hash together with image hash when applying deduplication.

  • args -- extra positional arguments

  • kwargs -- extra keyword arguments

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample -- input sample

Returns:

sample with computed hash value.
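
As a rough illustration of what "sample with computed hash value" can look like, here is a minimal sketch that does not use data_juicer itself. The function name compute_hash_sketch, the sample layout (a dict with "text" and "images" keys), and the field names image_hash and text_hash are all hypothetical. MD5 over raw bytes stands in for whichever image hash method is configured, so it is not a perceptual hash such as 'phash'.

```python
import hashlib


def compute_hash_sketch(sample, consider_text=False):
    """Return a copy of `sample` with hypothetical hash fields added.

    Assumption: `sample` is a dict with "text" (str) and "images"
    (list of bytes). MD5 is only a stand-in for the real image hash.
    """
    hasher = hashlib.md5()
    for image_bytes in sample["images"]:
        # Fold every image of the sample into one digest.
        hasher.update(image_bytes)

    out = dict(sample)  # shallow copy so the input is not mutated
    out["image_hash"] = hasher.hexdigest()
    if consider_text:
        # Optionally hash the text too, so both must match for a duplicate.
        text_bytes = sample["text"].encode("utf-8")
        out["text_hash"] = hashlib.md5(text_bytes).hexdigest()
    return out


sample = {"text": "a cat", "images": [b"fake-image-bytes"]}
hashed = compute_hash_sketch(sample, consider_text=True)
```

With consider_text left at False, only the image digest is attached, so two samples with identical images but different captions would share the same key.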

process(dataset, show_num=0)[source]

Document-level deduplication: dataset --> dataset.

Parameters:
  • dataset -- input dataset

  • show_num -- number of traced samples used when tracer is open.

Returns:

deduplicated dataset and the sampled duplicate pairs.
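
The general shape of such a step (keep the first sample per hash, and collect a few duplicate pairs for inspection) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: process_sketch, the list-of-dicts dataset, and the "image" key are hypothetical, MD5 again stands in for the configured image hash, and the real tracer may choose its sampled pairs differently.

```python
import hashlib


def process_sketch(dataset, show_num=0):
    """Drop samples whose image hash was already seen.

    Assumption: `dataset` is a list of dicts with an "image" (bytes) key.
    Returns (deduplicated list, up to `show_num` duplicate pairs).
    """
    first_seen = {}  # hash -> first sample carrying that hash
    kept = []
    dup_pairs = []
    for sample in dataset:
        key = hashlib.md5(sample["image"]).hexdigest()
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            # Record (original, duplicate) for later inspection.
            dup_pairs.append((first_seen[key], sample))
    return kept, dup_pairs


dataset = [
    {"id": 0, "image": b"aaa"},
    {"id": 1, "image": b"bbb"},
    {"id": 2, "image": b"aaa"},  # same bytes as id 0
]
kept, pairs = process_sketch(dataset, show_num=1)
```

In this sketch, show_num=0 skips pair collection entirely, matching the idea that duplicate pairs are only sampled when tracing is requested.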