data_juicer.ops.deduplicator.video_deduplicator module

class data_juicer.ops.deduplicator.video_deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicates samples at the document level using exact matching of videos.

This operator computes a hash for each video in the sample and uses it to identify and remove duplicate documents. If consider_text is True, the text hash is used together with the video hash when deciding whether two samples are duplicates. The video hash is computed over the video data, including all video streams in the container. When the show_num parameter of process is greater than 0, the operator also samples duplicate pairs for tracing. The fields used for caching are ‘videohash’ and, if text is considered, ‘hash’.
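As an illustration of the exact-match idea described above, here is a minimal sketch that builds a deduplication key from video bytes and, optionally, the text. It is not the operator's implementation: the helper name dedup_key, the use of MD5, and a sample layout with raw bytes under "videos" and a string under "text" are all assumptions made for this example.

```python
import hashlib


def dedup_key(sample: dict, consider_text: bool = False) -> tuple:
    """Build an exact-match key for a sample (hypothetical helper).

    Assumes sample["videos"] is a list of raw video bytes and
    sample["text"] is a string; the real operator reads video files.
    """
    # Hash all video data of the sample into a single digest.
    md5 = hashlib.md5()
    for video_bytes in sample["videos"]:
        md5.update(video_bytes)
    key = (md5.hexdigest(),)

    # Optionally extend the key with a hash of the text.
    if consider_text:
        text_hash = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
        key += (text_hash,)
    return key
```

Two samples with identical video data share a key when consider_text is False; with consider_text set to True they count as duplicates only if their text matches as well.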

__init__(consider_text: bool = False, *args, **kwargs)[source]

Initialization.

Parameters:
  • consider_text – whether to consider text hash together with video hash when applying deduplication.

  • args – extra positional arguments

  • kwargs – extra keyword arguments

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

Apply document-level deduplication: takes a dataset and returns a dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of duplicate pairs to sample for tracing when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
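The behaviour described for process can be sketched as follows: keep the first sample seen for each hash, and collect up to show_num duplicate pairs for inspection. This is a simplified illustration on plain Python lists, not the operator's code; the function name dedup_with_trace and the precomputed keys argument are hypothetical.

```python
def dedup_with_trace(samples, keys, show_num=0):
    """Keep the first sample per key; optionally trace duplicate pairs.

    samples and keys are parallel lists (an assumption of this sketch).
    Returns (kept_samples, dup_pairs), where dup_pairs maps a key to
    the first sample seen and one of its duplicates.
    """
    first_seen = {}  # key -> first sample carrying that key
    kept = []
    dup_pairs = {}
    for sample, key in zip(samples, keys):
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif show_num > 0 and key not in dup_pairs and len(dup_pairs) < show_num:
            # Record at most one pair per key, and at most show_num pairs.
            dup_pairs[key] = [first_seen[key], sample]
    return kept, dup_pairs
```

With show_num left at 0 the second return value stays empty, which mirrors tracing being switched off.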