data_juicer.ops.deduplicator.video_deduplicator module

class data_juicer.ops.deduplicator.video_deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicates samples at the document level using exact matching of videos.

This operator computes a hash for each video in the sample and uses it to identify and remove duplicate documents. If consider_text is set to True, it also considers the text hash alongside the video hash for deduplication. The video hash is computed by hashing the video data, including all video streams in the container. The operator supports sampling and tracing of duplicate pairs when the show_num parameter is greater than 0. Important fields used for caching include 'videohash' and optionally 'hash' if text is considered.
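The effect of consider_text on the deduplication key can be sketched in plain Python. This is a minimal illustration, not the operator's implementation: the sample layout (a "videos" list of raw bytes and a "text" string), the use of MD5, and the helper name dedup_key are assumptions made for the example. Only the field names 'videohash' and 'hash' come from the description above.

```python
import hashlib


def compute_hash(sample, consider_text=False):
    """Attach hash fields to a sample dict (hypothetical layout)."""
    md5 = hashlib.md5()  # hash choice is an assumption for this sketch
    for blob in sample["videos"]:  # raw bytes of each video (assumption)
        md5.update(blob)
    sample["videohash"] = md5.hexdigest()
    if consider_text:
        text_bytes = sample["text"].encode("utf-8")
        sample["hash"] = hashlib.md5(text_bytes).hexdigest()
    return sample


def dedup_key(sample, consider_text=False):
    """Hypothetical helper: the value two samples must share to be duplicates."""
    if consider_text:
        return (sample["hash"], sample["videohash"])
    return sample["videohash"]


a = compute_hash({"text": "a cat", "videos": [b"\x00\x01"]}, consider_text=True)
b = compute_hash({"text": "a dog", "videos": [b"\x00\x01"]}, consider_text=True)

# Same video bytes: duplicates when only the video hash is compared.
same_video = dedup_key(a) == dedup_key(b)
# Different text: distinct once the text hash is part of the key.
same_video_and_text = dedup_key(a, True) == dedup_key(b, True)
```

With consider_text left at False the two samples collapse into one. With it set to True both are kept, because the text differs.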

__init__(consider_text: bool = False, *args, **kwargs)[source]

Initialization.

Parameters:

consider_text -- whether to consider text hash together with video hash when applying deduplication.
args -- extra positional args
kwargs -- extra keyword args

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters: sample -- input sample
Returns: sample with computed hash value.

process(dataset, show_num=0)[source]

Deduplicates at the document level, mapping a dataset to a dataset.

Parameters:

dataset -- input dataset
show_num -- number of duplicate samples to trace when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
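The two-part return value can be illustrated with a small standalone sketch. It assumes the dataset is a plain list of dicts that already carry a 'videohash' field, and that duplicate pairs are reported as a dict from hash to the first two samples sharing it. Both are assumptions made for illustration. The function name process_sketch is hypothetical, and the real operator works on a dataset object.

```python
def process_sketch(samples, show_num=0):
    """Keep the first sample per 'videohash'; trace up to show_num duplicate pairs."""
    first_seen = {}  # hash -> first sample carrying it
    kept = []
    dup_pairs = {}   # hash -> [first sample, first duplicate] (assumed format)
    for sample in samples:
        h = sample["videohash"]
        if h not in first_seen:
            first_seen[h] = sample
            kept.append(sample)
        elif show_num > 0 and h not in dup_pairs and len(dup_pairs) < show_num:
            dup_pairs[h] = [first_seen[h], sample]
    return kept, dup_pairs


samples = [
    {"id": 0, "videohash": "aa"},
    {"id": 1, "videohash": "bb"},
    {"id": 2, "videohash": "aa"},
    {"id": 3, "videohash": "aa"},
]

kept, dup_pairs = process_sketch(samples, show_num=1)
kept_ids = [s["id"] for s in kept]                     # first occurrences only
traced_ids = [s["id"] for s in dup_pairs.get("aa", [])]
```

When show_num is 0, no pairs are collected and the second return value is empty.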