data_juicer.ops.deduplicator.image_deduplicator module
- class data_juicer.ops.deduplicator.image_deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)
Bases:
Deduplicator
Deduplicates samples at the document level using exact matching of images between documents.
- __init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)
Initialization method.
- Parameters:
method -- hash method applied to images (default: 'phash')
consider_text -- whether to combine the text hash with the image hash when deduplicating (default: False)
args -- extra positional arguments
kwargs -- extra keyword arguments
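
The effect of consider_text can be illustrated with a minimal, standalone sketch. It does not call Data-Juicer and is not the operator's actual implementation: the helpers image_key, dedup_key and deduplicate, the sample layout with "text" and "images" keys, and the use of an MD5 byte digest in place of an image hash such as 'phash' are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def image_key(image_bytes: bytes) -> str:
    """Hypothetical stand-in for an image hash: an exact digest of the raw bytes."""
    return hashlib.md5(image_bytes).hexdigest()


def dedup_key(sample: Dict, consider_text: bool = False) -> str:
    """Build a deduplication key from a sample's images, optionally plus its text."""
    key = "".join(image_key(img) for img in sample["images"])
    if key and consider_text:
        # Assumption: the text digest is simply appended to the image digests.
        key += hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
    return key


def deduplicate(samples: List[Dict], consider_text: bool = False) -> List[Dict]:
    """Keep the first sample seen for each key; samples without images are kept."""
    seen = set()
    kept = []
    for sample in samples:
        key = dedup_key(sample, consider_text)
        if not key:
            # No images, so there is nothing to compare on: keep the sample.
            kept.append(sample)
            continue
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


samples = [
    {"text": "a cat", "images": [b"fake-image-bytes-cat"]},
    {"text": "a kitten", "images": [b"fake-image-bytes-cat"]},
    {"text": "a dog", "images": [b"fake-image-bytes-dog"]},
    {"text": "no picture here", "images": []},
]

# Image hash only: the second sample shares its image with the first and is dropped.
image_only = deduplicate(samples)

# Image hash plus text hash: the captions differ, so both samples survive.
image_and_text = deduplicate(samples, consider_text=True)
```

In this sketch, enabling consider_text makes the key stricter, so two samples count as duplicates only when both their images and their text match.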