data_juicer.ops.deduplicator.video_deduplicator module
- class data_juicer.ops.deduplicator.video_deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)
Bases: Deduplicator
Deduplicates samples at the document level using exact matching of videos.
This operator computes a hash for each video in the sample and uses it to identify and remove duplicate documents. If consider_text is set to True, it also considers the text hash alongside the video hash for deduplication. The video hash is computed by hashing the video data, including all video streams in the container. The operator supports sampling and tracing of duplicate pairs when the show_num parameter is greater than 0. Important fields used for caching include 'videohash' and optionally 'hash' if text is considered.
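The exact-match idea described above can be sketched in plain Python. This is a minimal illustration, not the operator's implementation: the function names `content_hash` and `dedup_samples`, the sample layout (a dict with `"text"` and a list of raw video bytes under `"videos"`), and the choice of SHA-256 are all assumptions made for the example. The real operator hashes the streams of a video container; here each video is just a byte string.

```python
import hashlib
from typing import Dict, List, Optional, Set, Tuple


def content_hash(data: bytes) -> str:
    """Return a hex digest of raw bytes (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(data).hexdigest()


def dedup_samples(samples: List[Dict], consider_text: bool = False) -> List[Dict]:
    """Keep the first sample seen for each distinct key (hypothetical helper).

    The key is the tuple of per-video hashes, optionally paired with a hash
    of the sample text when consider_text is True.
    """
    seen: Set[Tuple[Tuple[str, ...], Optional[str]]] = set()
    kept: List[Dict] = []
    for sample in samples:
        video_key = tuple(content_hash(v) for v in sample["videos"])
        text_key = (
            content_hash(sample["text"].encode("utf-8")) if consider_text else None
        )
        key = (video_key, text_key)
        if key in seen:
            continue  # exact duplicate of an earlier sample: drop it
        seen.add(key)
        kept.append(sample)
    return kept
```

With `consider_text=False`, two samples that carry byte-identical videos collapse into one even if their text differs; with `consider_text=True`, both the video hashes and the text hash must match before a sample is treated as a duplicate.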
- __init__(consider_text: bool = False, *args, **kwargs)
Initialization.
- Parameters:
consider_text -- whether to consider text hash together with video hash when applying deduplication.
args -- extra positional arguments
kwargs -- extra keyword arguments