data_juicer.ops.deduplicator.ray_image_deduplicator module

data_juicer.ops.deduplicator.ray_image_deduplicator.get_hash_method(method_name)

class data_juicer.ops.deduplicator.ray_image_deduplicator.RayImageDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', method: str = 'phash', *args, **kwargs)

Bases: RayBasicDeduplicator

Deduplicator that removes duplicate samples at the document level by exact matching of image hashes between documents.

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', method: str = 'phash', *args, **kwargs)

Initialization.

:param backend: the backend for dedup, either 'ray_actor' or 'redis'
:param redis_address: the address of the Redis server
:param method: the hash method to use
:param args: extra positional args
:param kwargs: extra keyword args

calculate_hash(sample, context=False)

Calculate the hash value for the sample.
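The idea of hash-based exact deduplication can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the module's implementation: the `HASHERS` table, `sample_hash`, `deduplicate`, and the `{"id", "images"}` sample layout are hypothetical names chosen for the example, and cryptographic digests from `hashlib` stand in for perceptual hashes such as 'phash'. It also runs in a single process, with no Ray actor or Redis backend.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical stand-in hashers. The real module resolves names such as
# 'phash' to image hash methods, which are not reproduced here.
HASHERS: Dict[str, Callable[[bytes], str]] = {
    "md5": lambda data: hashlib.md5(data).hexdigest(),
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
}


def sample_hash(sample: dict, method: str = "md5") -> str:
    """Concatenate the hashes of all images in a sample into one key."""
    hasher = HASHERS[method]
    return "".join(hasher(img) for img in sample["images"])


def deduplicate(samples: List[dict], method: str = "md5") -> List[dict]:
    """Keep the first sample seen for each distinct image hash key."""
    seen = set()
    kept = []
    for sample in samples:
        key = sample_hash(sample, method)
        # Assumption of this sketch: samples without images produce an
        # empty key and are never treated as duplicates.
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept


samples = [
    {"id": 0, "images": [b"cat-pixels"]},
    {"id": 1, "images": [b"dog-pixels"]},
    {"id": 2, "images": [b"cat-pixels"]},  # same image bytes as id 0
]
kept_ids = [s["id"] for s in deduplicate(samples)]
print(kept_ids)  # [0, 1]
```

Because matching is exact on the hash key, two samples are considered duplicates only when every image hashes to the same value in the same order. The choice of hash method therefore determines how tolerant the comparison is to small pixel-level differences.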