data_juicer.ops.deduplicator.video_deduplicator module

class data_juicer.ops.deduplicator.video_deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)

Bases: Deduplicator

Deduplicates samples at the document level using exact matching of videos.

This operator computes a hash for each video in the sample and uses it to identify and remove duplicate documents. If consider_text is set to True, the text hash is considered alongside the video hash, so two samples count as duplicates only when both hashes match. The video hash is computed over the video data, including all video streams in the container. When the show_num parameter of process is greater than 0, the operator also samples duplicate pairs for tracing. The fields used for caching are ‘videohash’ and, if text is considered, ‘hash’.
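The role of consider_text can be illustrated with a minimal sketch. It is not the operator's implementation: dedup_key is a hypothetical helper, raw bytes stand in for decoded video streams, and SHA-256 is chosen only for illustration.

```python
import hashlib


def dedup_key(video_bytes_list, text, consider_text=False):
    """Build a deduplication key (hypothetical helper, not the library API)."""
    # Assumption: all videos of a sample are folded into one digest.
    digest = hashlib.sha256()
    for data in video_bytes_list:
        digest.update(data)
    video_hash = digest.hexdigest()
    if consider_text:
        # Pair the video hash with a hash of the text.
        text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return (video_hash, text_hash)
    return video_hash


# Same video bytes, different captions.
same_video_only = dedup_key([b"clip"], "a cat") == dedup_key([b"clip"], "a dog")
same_with_text = (
    dedup_key([b"clip"], "a cat", consider_text=True)
    == dedup_key([b"clip"], "a dog", consider_text=True)
)
```

With the video hash alone, the two samples share a key and would be treated as duplicates. Once the text hash is included, their keys differ and both would be kept.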

__init__(consider_text: bool = False, *args, **kwargs)

Initialization.

Parameters:

consider_text – whether to consider text hash together with video hash when applying deduplication.
args – extra positional arguments
kwargs – extra keyword arguments

compute_hash(sample, context=False)

Compute hash values for the sample.

Parameters: sample – input sample
Returns: sample with computed hash value.
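As a rough sketch of this step, the function below stores a hash on the sample and reuses a cached value when one is present. It assumes the sample is a plain dict; the "videos" key, the use of raw bytes instead of decoded streams, and SHA-256 are all illustrative assumptions. Only the ‘videohash’ field name comes from the description above.

```python
import hashlib


def compute_hash_sketch(sample, video_key="videos"):
    """Attach a 'videohash' field to the sample (illustrative only)."""
    # Reuse a previously computed hash instead of recomputing it.
    if "videohash" in sample:
        return sample
    digest = hashlib.sha256()
    # Assumption: each entry is the raw bytes of one video.
    for video in sample.get(video_key, []):
        digest.update(video)
    sample["videohash"] = digest.hexdigest()
    return sample


fresh = compute_hash_sketch({"videos": [b"frame-data"]})
cached = compute_hash_sketch({"videos": [b"frame-data"], "videohash": "precomputed"})
```

In this sketch, fresh gains a newly computed ‘videohash’, while cached keeps the value it already carried.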

process(dataset, show_num=0)

Document-level deduplication: maps a dataset to a deduplicated dataset.

Parameters:

dataset – input dataset
show_num – number of duplicate pairs to sample for tracing when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
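The deduplication pass and the effect of show_num can be sketched as follows. This is a simplified stand-in, not the library's code: process_sketch is a hypothetical function, the dataset is modelled as a list of dicts that already carry a ‘videohash’ field, and keeping the first occurrence of each hash is an assumption.

```python
def process_sketch(samples, show_num=0):
    """Drop exact duplicates and sample up to show_num duplicate pairs."""
    first_seen = {}  # hash -> first sample carrying that hash
    kept = []        # deduplicated output, in input order
    dup_pairs = {}   # hash -> [first sample, one later duplicate]
    for sample in samples:
        key = sample["videohash"]
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif show_num > 0 and key not in dup_pairs and len(dup_pairs) < show_num:
            # Record one pair per duplicated hash, capped at show_num pairs.
            dup_pairs[key] = [first_seen[key], sample]
    return kept, dup_pairs


samples = [
    {"id": 1, "videohash": "a"},
    {"id": 2, "videohash": "b"},
    {"id": 3, "videohash": "a"},
]
kept, pairs = process_sketch(samples, show_num=1)
kept_untraced, pairs_untraced = process_sketch(samples)
```

Here kept holds the samples with ids 1 and 2, and pairs holds the single duplicate pair for hash "a". With the default show_num=0 the same samples are kept but no pairs are collected.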