data_juicer.ops.deduplicator.ray_video_deduplicator module
- class data_juicer.ops.deduplicator.ray_video_deduplicator.RayVideoDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)
Bases: RayBasicDeduplicator
Deduplicates samples at the document level using exact matching of videos in Ray distributed mode.
This operator computes an MD5 hash over the video streams in each sample and compares the hashes to identify duplicates. It runs in Ray distributed mode for parallel processing. The hash is computed by demuxing the video streams and updating the MD5 hash with each video packet. If a sample does not contain a valid video, it is assigned an empty hash value. The operator supports the 'ray_actor' and 'redis' backends for deduplication, selected with the backend argument.
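The packet-by-packet hashing described above can be illustrated with a minimal, standard-library-only sketch. It is not the operator's implementation: the packets are plain `bytes` objects standing in for demuxed video packets, the helper names `packets_md5` and `keep_first_occurrences` are hypothetical, the empty string is only an assumed placeholder for the empty hash value, and keeping every sample that has an empty hash is an assumption rather than documented behavior. The shared "seen" set plays the role that the 'ray_actor' or 'redis' backend plays in distributed mode.

```python
import hashlib
from typing import Iterable, List

# Assumed placeholder for "no valid video"; the real operator's value may differ.
EMPTY_HASH = ""


def packets_md5(packets: Iterable[bytes]) -> str:
    """Update one MD5 digest with every packet, in order.

    Returns EMPTY_HASH when the iterable yields no packets.
    """
    md5 = hashlib.md5()
    saw_packet = False
    for packet in packets:
        md5.update(packet)
        saw_packet = True
    return md5.hexdigest() if saw_packet else EMPTY_HASH


def keep_first_occurrences(samples: List[List[bytes]]) -> List[int]:
    """Return the indices of samples to keep.

    The first sample with a given hash is kept; later samples with the same
    hash are dropped. Samples with an empty hash are always kept (assumption).
    """
    seen = set()  # stands in for the shared deduplication backend
    kept = []
    for index, packets in enumerate(samples):
        digest = packets_md5(packets)
        if digest == EMPTY_HASH:
            kept.append(index)
            continue
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(index)
    return kept
```

Because MD5 is updated incrementally, the digest depends only on the concatenated packet bytes, not on how they are split into packets. Two samples are therefore treated as duplicates only when their packet data match byte for byte.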