video_deduplicator

Deduplicates samples at the document level using exact matching of videos.

This operator computes a hash for each video in the sample and uses it to identify and remove duplicate documents. If consider_text is set to True, the text hash is considered alongside the video hash, so two samples count as duplicates only when both match. The video hash is computed over the video data, including all video streams in the container. When the show_num parameter is greater than 0, the operator also samples and traces duplicate pairs. The hashes are cached in the ‘videohash’ field and, when text is considered, the ‘hash’ field.
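The keying logic described above can be sketched as follows. This is a minimal illustration, not the operator's implementation: the helper names (`video_hash`, `dedup_key`, `deduplicate`) are hypothetical, MD5 is an assumed choice of hash, raw in-memory bytes stand in for the video stream data the operator reads from the container, and folding all videos of a sample into one digest is a simplification made here.

```python
import hashlib


def video_hash(videos):
    """Return one MD5 hex digest covering every video (as bytes) in a sample."""
    md5 = hashlib.md5()
    for data in videos:
        md5.update(data)
    return md5.hexdigest()


def dedup_key(sample, consider_text=False):
    """Build the key on which two samples are compared."""
    key = video_hash(sample["videos"])
    if consider_text:
        # Assumption: the text hash is simply paired with the video hash.
        text_digest = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
        return (key, text_digest)
    return (key,)


def deduplicate(samples, consider_text=False):
    """Keep the first sample seen for each key and drop later ones."""
    seen, kept = set(), []
    for sample in samples:
        key = dedup_key(sample, consider_text)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


# Two hypothetical samples with identical video bytes but different text.
samples = [
    {"text": "a cat", "videos": [b"clip-A"]},
    {"text": "a dog", "videos": [b"clip-A"]},
]
print(len(deduplicate(samples)))                      # 1
print(len(deduplicate(samples, consider_text=True)))  # 2
```

With video-only keys the second sample is a duplicate; once the text hash joins the key, both samples survive.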


Type: deduplicator

Tags: cpu, video

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| consider_text | <class 'bool'> | False | whether to consider the text hash together with the video hash |
| args | | '' | extra args |
| kwargs | | '' | extra args |

📊 Effect demonstration

test_2

VideoDeduplicator()

📥 input data

Sample 1: 1 video
video1.mp4
Sample 2: 1 video
video2.mp4
Sample 3: 1 video
video2.mp4

📤 output data

Sample 1: 1 video
video1.mp4
Sample 2: 1 video
video2.mp4

✨ explanation

The operator removes samples whose video content duplicates that of an earlier sample. Here the second and third samples contain the same video (video2.mp4), so the third sample is removed and one sample is kept for each unique video.
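The same three-sample case can be used to illustrate how duplicate pairs might be traced when show_num is greater than 0. The sketch below is an assumption about the general idea only: `dedup_with_trace` is a hypothetical helper, the byte strings are placeholders for real video data, and the operator's actual tracer may sample and record pairs differently.

```python
import hashlib


def dedup_with_trace(samples, show_num=0):
    """Drop later samples with a repeated video hash; record up to show_num pairs."""
    first_seen = {}          # digest -> first sample carrying it
    kept, dup_pairs = [], []
    for sample in samples:
        digest = hashlib.md5(b"".join(sample["videos"])).hexdigest()
        if digest not in first_seen:
            first_seen[digest] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            # Remember which kept sample this duplicate collided with.
            dup_pairs.append((first_seen[digest], sample))
    return kept, dup_pairs


# Placeholder bytes: samples 2 and 3 carry the same content.
samples = [
    {"name": "sample 1", "videos": [b"bytes of video1"]},
    {"name": "sample 2", "videos": [b"bytes of video2"]},
    {"name": "sample 3", "videos": [b"bytes of video2"]},
]
kept, pairs = dedup_with_trace(samples, show_num=1)
print([s["name"] for s in kept])                   # ['sample 1', 'sample 2']
print([(a["name"], b["name"]) for a, b in pairs])  # [('sample 2', 'sample 3')]
```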

test_3_consider_text

VideoDeduplicator(consider_text=True)

📥 input data

Sample 1: text | 1 video
<video> text1
video1.mp4
Sample 2: text | 1 video
<video> text2
video2.mp4
Sample 3: text | 1 video
<video> text3
video3.mp4
Sample 4: text | 1 video
<video> text1
video6.mp4
Sample 5: text | 1 video
<video> text5
video7.mp4
Sample 6: text | 1 video
<video> text3
video8.mp4
Sample 7: text | 1 video
<video> text7
video9.mp4

📤 output data

Sample 1: text | 1 video
<video> text1
video1.mp4
Sample 2: text | 1 video
<video> text2
video2.mp4
Sample 3: text | 1 video
<video> text3
video3.mp4
Sample 4: text | 1 video
<video> text5
video7.mp4
Sample 5: text | 1 video
<video> text7
video9.mp4

✨ explanation

This test demonstrates deduplication on the combination of video and text. The operator keeps the first occurrence of each unique video-text pair and removes later duplicates. In the listing above, the fourth input sample (text1) and the sixth input sample (text3) are missing from the output: each repeats the text of an earlier sample (the first and the third, respectively) and, although the file names differ, is treated as carrying the same video content. A sample that shares its video with an earlier one but has different text is kept.
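The seven-sample case can be replayed with a small sketch. Everything here is illustrative: `deduplicate` and `md5_hex` are hypothetical helpers, MD5 is an assumed hash, and the byte contents are placeholders chosen so that the result matches the output listed above. Which files actually share content is an assumption of this sketch, not something the listing states.

```python
import hashlib


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


def deduplicate(samples, consider_text):
    """Return the texts of the samples that survive deduplication."""
    seen, kept = set(), []
    for text, video in samples:
        key = md5_hex(video)
        if consider_text:
            # Pair the video hash with the text hash.
            key = (key, md5_hex(text.encode("utf-8")))
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


# Assumed contents: three distinct videos reused across the seven samples.
A, B, C = b"content A", b"content B", b"content C"
samples = [
    ("<video> text1", A), ("<video> text2", B), ("<video> text3", C),
    ("<video> text1", A), ("<video> text5", B), ("<video> text3", C),
    ("<video> text7", C),
]
print(deduplicate(samples, consider_text=True))
# ['<video> text1', '<video> text2', '<video> text3', '<video> text5', '<video> text7']
print(deduplicate(samples, consider_text=False))
# ['<video> text1', '<video> text2', '<video> text3']
```

Under these assumptions, the fourth and sixth samples are dropped because both their video and text repeat an earlier sample, while the fifth and seventh are kept because their text differs. With consider_text=False, only the first three samples would remain.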