data_juicer.ops.deduplicator.ray_video_deduplicator module

class data_juicer.ops.deduplicator.ray_video_deduplicator.RayVideoDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)

Bases: RayBasicDeduplicator

Deduplicates samples at the document level using exact matching of videos in Ray distributed mode.

This operator computes the MD5 hash of video streams in each sample and compares them to identify duplicates. It uses Ray distributed mode for parallel processing. The hash is computed by demuxing the video streams and updating the MD5 hash with each video packet. If a sample does not contain a valid video, it is assigned an empty hash value. The operator supports 'ray_actor' or 'redis' backends for deduplication.
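The hashing step described above can be illustrated with a minimal, standard-library-only sketch. It is not the operator's actual implementation: the packets are assumed to be already-demuxed `bytes` objects (the real operator demuxes the video streams itself), and `EMPTY_HASH`, `hash_video_packets`, and `keep_first_occurrences` are hypothetical names introduced only for this example. Keeping samples that have an empty hash without comparing them is likewise an assumption of the sketch.

```python
import hashlib
from typing import Iterable, List, Optional, Tuple

# Assumption: placeholder value for samples without a valid video.
EMPTY_HASH = ""


def hash_video_packets(packets: Optional[Iterable[bytes]]) -> str:
    """Feed each packet into one MD5 object and return the hex digest.

    Returns EMPTY_HASH when there is no video or no packets at all.
    """
    if packets is None:
        return EMPTY_HASH
    md5 = hashlib.md5()
    seen_packet = False
    for packet in packets:
        md5.update(packet)  # incremental update, one packet at a time
        seen_packet = True
    return md5.hexdigest() if seen_packet else EMPTY_HASH


def keep_first_occurrences(
    samples: List[Tuple[str, Optional[List[bytes]]]]
) -> List[str]:
    """Exact-match dedup: keep the first sample seen for each hash."""
    seen = set()
    kept = []
    for name, packets in samples:
        digest = hash_video_packets(packets)
        if digest == EMPTY_HASH:
            # Assumption of this sketch: samples without a valid video
            # are kept and never compared against each other.
            kept.append(name)
            continue
        if digest in seen:
            continue  # duplicate video content, drop the sample
        seen.add(digest)
        kept.append(name)
    return kept
```

In the distributed setting, the local `seen` set would be replaced by shared state reachable from every worker, which is the role the 'ray_actor' and 'redis' backends play.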

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)

Initialization.

:param backend: the backend for dedup, either 'ray_actor' or 'redis'
:param redis_address: the address of the Redis server
:param args: extra args
:param kwargs: extra args

calculate_hash(sample, context=False)

Calculate hash value for the sample.