video_deduplicator

Deduplicates samples at the document level using exact matching of videos.

This operator computes a hash for each video in a sample and uses it to identify and remove duplicate documents. If `consider_text` is set to `True`, the text hash is considered together with the video hash, so two samples count as duplicates only when both match. The video hash is computed over the video data, including all video streams in the container. When the `show_num` parameter is greater than 0, the operator also samples and traces duplicate pairs. The hashes are cached in the `videohash` field and, when text is considered, in the `hash` field.
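As a rough illustration of the hashing step described above, the sketch below builds the two cached fields for one sample. It is a minimal sketch under stated assumptions, not the operator's implementation: `compute_hashes` is a hypothetical helper, MD5 is an assumed hash function, and videos are represented as in-memory bytes rather than decoded container streams. Only the field names `videohash` and `hash` come from the description.

```python
import hashlib


def compute_hashes(sample: dict, consider_text: bool = False) -> dict:
    """Return the hash fields for one sample (hypothetical helper)."""
    digest = hashlib.md5()  # assumption: the real hash function may differ
    # Feed every video of the sample into a single digest, in order.
    for video_bytes in sample["videos"]:
        digest.update(video_bytes)
    hashes = {"videohash": digest.hexdigest()}
    if consider_text:
        # The text hash is only computed when text takes part in matching.
        hashes["hash"] = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
    return hashes


# Placeholder bytes stand in for real video data.
a = compute_hashes({"videos": [b"frames-A"], "text": "<video> text1"}, consider_text=True)
b = compute_hashes({"videos": [b"frames-A"], "text": "<video> text2"}, consider_text=True)
video_only = compute_hashes({"videos": [b"frames-A"], "text": "<video> text1"})
```

Here `a` and `b` share the same `videohash` but differ in `hash`, so they would count as duplicates only when text is ignored.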


Type: deduplicator

Tags: cpu, video

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| `consider_text` | `bool` | `False` | Whether to consider the text hash together with the video hash when deduplicating. |
| `args` | | `''` | extra args |
| `kwargs` | | `''` | extra args |

📊 Effect demonstration

test_2

VideoDeduplicator()

📥 input data

Sample 1: 1 video

video1.mp4:

Sample 2: 1 video

video2.mp4:

Sample 3: 1 video

video2.mp4:

📤 output data

Sample 1: 1 video

video1.mp4:

Sample 2: 1 video

video2.mp4:

✨ explanation

The operator removes samples whose video content duplicates that of an earlier sample. In this case, the second and third samples contain the same video (video2.mp4), so the third sample is removed and only one copy of each unique video remains.
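The first-occurrence behaviour in this example can be sketched as follows. This is an illustrative sketch, not the operator's code: `deduplicate` is a hypothetical function, MD5 and the placeholder byte strings are assumptions, and the handling of `show_num` (recording at most that many duplicate pairs) is one plausible reading of the tracing feature.

```python
import hashlib


def deduplicate(samples, show_num=0):
    """Keep the first sample for each distinct video hash (hypothetical)."""
    first_seen = {}           # video hash -> first sample that carried it
    kept, dup_pairs = [], []
    for sample in samples:
        digest = hashlib.md5()  # assumption: the real hash may differ
        for video_bytes in sample["videos"]:
            digest.update(video_bytes)
        key = digest.hexdigest()
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            # Trace (original, duplicate) pairs, capped at show_num.
            dup_pairs.append((first_seen[key], sample))
    return kept, dup_pairs


# Placeholder bytes stand in for the contents of video1.mp4 and video2.mp4.
VIDEO1, VIDEO2 = b"content-of-video1", b"content-of-video2"
samples = [
    {"name": "sample1", "videos": [VIDEO1]},
    {"name": "sample2", "videos": [VIDEO2]},
    {"name": "sample3", "videos": [VIDEO2]},
]
kept, dup_pairs = deduplicate(samples, show_num=1)
kept_names = [s["name"] for s in kept]
```

With this input, `kept_names` contains the first two samples, and the single traced pair links the second sample to the removed third one.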

test_3_consider_text

VideoDeduplicator(consider_text=True)

📥 input data

Sample 1: text | 1 video

<video> text1

video1.mp4:

Sample 2: text | 1 video

<video> text2

video2.mp4:

Sample 3: text | 1 video

<video> text3

video3.mp4:

Sample 4: text | 1 video

<video> text1

video6.mp4:

Sample 5: text | 1 video

<video> text5

video7.mp4:

Sample 6: text | 1 video

<video> text3

video8.mp4:

Sample 7: text | 1 video

<video> text7

video9.mp4:

📤 output data

Sample 1: text | 1 video

<video> text1

video1.mp4:

Sample 2: text | 1 video

<video> text2

video2.mp4:

Sample 3: text | 1 video

<video> text3

video3.mp4:

Sample 4: text | 1 video

<video> text5

video7.mp4:

Sample 5: text | 1 video

<video> text7

video9.mp4:

✨ explanation

This test demonstrates deduplication on the combination of video and text. The operator keeps the first occurrence of each unique video-text pair and removes later duplicates, so a sample that shares its video with an earlier sample but carries different text is kept. Here the fourth sample (text1) and the sixth sample (text3) are removed because they repeat both the video content and the text of the first and third samples, respectively; their file names differ (video6.mp4, video8.mp4), but matching is based on the hashed content. The fifth and seventh samples are kept because their text (text5, text7) does not appear in any earlier sample.
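The pair-based matching above can be reproduced with a small sketch that uses a (video hash, text hash) tuple as the deduplication key. The function names, MD5, and the placeholder bytes are assumptions for illustration. The contents assigned to video6.mp4 and video8.mp4 follow from the output shown above; the contents assigned to video7.mp4 and video9.mp4 are assumed to repeat earlier videos so that the effect of differing text is visible.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()  # assumption: real hash may differ


def dedup_video_text(samples):
    """Keep the first sample for each (video hash, text hash) pair."""
    seen, kept = set(), []
    for s in samples:
        key = (md5_hex(s["video"]), md5_hex(s["text"].encode("utf-8")))
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


# Placeholder bytes stand in for three distinct video contents.
V1, V2, V3 = b"video-one", b"video-two", b"video-three"
samples = [
    {"text": "<video> text1", "video": V1},  # video1.mp4
    {"text": "<video> text2", "video": V2},  # video2.mp4
    {"text": "<video> text3", "video": V3},  # video3.mp4
    {"text": "<video> text1", "video": V1},  # video6.mp4: same content as video1
    {"text": "<video> text5", "video": V2},  # video7.mp4: content assumed
    {"text": "<video> text3", "video": V3},  # video8.mp4: same content as video3
    {"text": "<video> text7", "video": V3},  # video9.mp4: content assumed
]
kept_texts = [s["text"] for s in dedup_video_text(samples)]
```

The surviving texts are text1, text2, text3, text5, and text7, matching the output data above.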
