ray_video_deduplicator

Deduplicates samples at the document level using exact matching of videos in Ray distributed mode.

This operator computes an MD5 hash over the video streams in each sample and compares the hashes to identify duplicates. It uses Ray distributed mode for parallel processing. The hash is computed by demuxing the video streams and updating the MD5 hash with each video packet. If a sample does not contain a valid video, it is assigned an empty hash value. The operator supports the 'ray_actor' and 'redis' backends for deduplication.
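The packet-by-packet hashing described above can be sketched as follows. This is a minimal illustration and not the operator's actual code: the names `hash_video_packets`, `iter_video_packets`, and `EMPTY_HASH` are hypothetical, the empty string used as the "empty hash value" is an assumption, and the demuxing helper assumes PyAV (`av`) is installed.

```python
import hashlib
from typing import Iterable, Iterator

# Hypothetical sentinel for samples without a valid video (an assumption).
EMPTY_HASH = ""


def hash_video_packets(videos: Iterable[Iterable[bytes]]) -> str:
    """Fold every packet of every video in a sample into one MD5 digest."""
    md5 = hashlib.md5()
    saw_video = False
    for packets in videos:
        saw_video = True
        for packet in packets:
            # Update the running hash with the raw bytes of each packet.
            md5.update(packet)
    return md5.hexdigest() if saw_video else EMPTY_HASH


def iter_video_packets(path: str) -> Iterator[bytes]:
    """Yield raw video-stream packets from a file (assumes PyAV is installed)."""
    import av  # imported lazily so the hashing sketch runs without PyAV

    with av.open(path) as container:
        for packet in container.demux():
            # Skip audio, subtitle, and other non-video streams.
            if packet.stream.type == "video":
                yield bytes(packet)
```

Under these assumptions, a sample's hash could be obtained with `hash_video_packets(iter_video_packets(p) for p in sample["videos"])`, where the `"videos"` key is also hypothetical. Hashing demuxed packets rather than whole files means container-level metadata is not part of the digest.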

Type: deduplicator

Tags: cpu, video

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| backend | str | 'ray_actor' | the backend for dedup, either 'ray_actor' or 'redis' |
| redis_address | str | 'redis://localhost:6379' | the address of the Redis server |
| args | | '' | extra positional arguments |
| kwargs | | '' | extra keyword arguments |
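As a rough illustration of how these parameters fit together, the sketch below builds a configuration entry for the operator as a plain dictionary and checks the backend value. The helper name `ray_video_dedup_config` and the dictionary layout (operator name mapped to its parameters) are assumptions made for this example; only the parameter names and defaults come from the table above.

```python
from typing import Any, Dict

# The two backends named in the parameter description.
VALID_BACKENDS = ("ray_actor", "redis")


def ray_video_dedup_config(
    backend: str = "ray_actor",
    redis_address: str = "redis://localhost:6379",
) -> Dict[str, Dict[str, Any]]:
    """Build a hypothetical config entry for ray_video_deduplicator."""
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"backend must be one of {VALID_BACKENDS}, got {backend!r}"
        )
    return {
        "ray_video_deduplicator": {
            "backend": backend,
            "redis_address": redis_address,
        }
    }


# Example: select the Redis backend on a non-default host (hypothetical address).
redis_entry = ray_video_dedup_config(
    backend="redis", redis_address="redis://10.0.0.5:6379"
)
```

Rejecting an unknown backend up front keeps a typo from surfacing only after a distributed job has started.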

📊 Effect demonstration

test_2

RayVideoDeduplicator()

📥 input data

Sample 1: 1 video
video1.mp4
Sample 2: 1 video
video2.mp4
Sample 3: 1 video
video2.mp4

📤 output data

Sample 1: 1 video
video1.mp4
Sample 2: 1 video
video2.mp4

✨ explanation

The operator removes duplicate samples based on the MD5 hash of their videos, keeping only the first occurrence of each unique video. In this case, the second and third samples contain the same video, so the third sample is removed.
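The keep-first behaviour in this test can be reproduced with a small single-process sketch. It is an assumption-laden stand-in, not the operator itself: `InMemoryBackend` is a hypothetical replacement for the 'ray_actor' or 'redis' backend, the `video_bytes` field and its contents are made up, and hashing the raw bytes directly replaces the packet-level hashing of real video files.

```python
import hashlib
from typing import Any, Dict, List, Set


class InMemoryBackend:
    """Hypothetical stand-in for the 'ray_actor' or 'redis' backend."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_unique(self, key: str) -> bool:
        """Return True the first time a key is seen, False afterwards."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def dedup(samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the first sample for each distinct video hash."""
    backend = InMemoryBackend()
    kept = []
    for sample in samples:
        digest = hashlib.md5(sample["video_bytes"]).hexdigest()
        if backend.is_unique(digest):
            kept.append(sample)
    return kept


# Made-up contents mirroring test_2: samples 2 and 3 carry the same video.
samples = [
    {"name": "video1.mp4", "video_bytes": b"content-A"},
    {"name": "video2.mp4", "video_bytes": b"content-B"},
    {"name": "video2.mp4", "video_bytes": b"content-B"},
]
kept = dedup(samples)  # the third sample is dropped as a duplicate
```

In a distributed run the set of seen hashes has to be shared across workers, which is the role the description gives to the backend.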

test_4

RayVideoDeduplicator()

📥 input data

Sample 1: 3 videos
video1.mp4 +2 more
Sample 2: 3 videos
video6.mp4 +2 more
Sample 3: 2 videos
video9.mp4 +1 more
Sample 4: 2 videos
video8.mp4 +1 more

📤 output data

Sample 1: 3 videos
video1.mp4 +2 more
Sample 2: 2 videos
video9.mp4 +1 more

✨ explanation

This test shows how the deduplicator handles samples that contain multiple videos. If videos are duplicated across samples, only the first sample in which they occur is kept. Here, the fourth sample is removed because its videos already appear in earlier samples. The second input sample is likewise absent from the output, so the retained samples are input samples 1 and 3.