ray_image_deduplicator

Deduplicates samples at the document level using exact matching of images in Ray distributed mode.

This operator computes an image hash for each sample with a configurable hash method and treats samples with identical hashes as duplicates. It runs in Ray distributed mode and supports either the 'ray_actor' or the 'redis' backend for deduplication. The hash method is set at initialization; the supported methods are listed in HASH_METHOD. The operator loads images from the specified keys and computes their combined hash for comparison. A sample that contains no image is assigned an empty hash value.
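As a rough illustration of the per-sample key described above, the sketch below hashes each image and joins the digests into one combined value, returning an empty value when a sample has no images. It is not the operator's code: `combined_image_hash` and `EMPTY_HASH` are hypothetical names, and `hashlib.md5` over raw bytes stands in for the methods in HASH_METHOD, which are not reproduced here.

```python
import hashlib
from typing import Sequence

EMPTY_HASH = ""  # assumption: placeholder value for samples without images


def combined_image_hash(images: Sequence[bytes]) -> str:
    """Return one comparison key for all images of a sample."""
    if not images:
        return EMPTY_HASH
    # Hash each image separately, then join the digests in input order.
    # MD5 is only a stand-in for the operator's configured hash method.
    return "".join(hashlib.md5(img).hexdigest() for img in images)


same_a = combined_image_hash([b"cat-pixels"])
same_b = combined_image_hash([b"cat-pixels"])
other = combined_image_hash([b"dog-pixels"])
print(same_a == same_b, same_a == other)  # True False
```

Two samples compare as duplicates only when their combined keys are equal, so in this sketch the order and number of images per sample both matter.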

Type: deduplicator

Tags: cpu, image

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| backend | <class 'str'> | 'ray_actor' | the backend for dedup, either 'ray_actor' or 'redis' |
| redis_address | <class 'str'> | 'redis://localhost:6379' | the address of the redis server |
| method | <class 'str'> | 'phash' | the hash method to use |
| args | | '' | extra args |
| kwargs | | '' | extra keyword args |
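The parameters above can be collected into keyword arguments before the operator is constructed. The helper below is a hypothetical sketch, not part of the library: `build_dedup_kwargs` and `ALLOWED_BACKENDS` are made-up names, and only the parameter names, defaults, and the two backend values come from the table.

```python
from typing import Any, Dict

# The two backend values named in the parameter table.
ALLOWED_BACKENDS = {"ray_actor", "redis"}


def build_dedup_kwargs(
    backend: str = "ray_actor",
    redis_address: str = "redis://localhost:6379",
    method: str = "phash",
) -> Dict[str, Any]:
    """Validate the backend and return kwargs mirroring the table defaults."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"backend must be one of {sorted(ALLOWED_BACKENDS)}")
    # redis_address is always passed through; presumably only the
    # 'redis' backend consults it (an assumption, not confirmed here).
    return {"backend": backend, "redis_address": redis_address, "method": method}


print(build_dedup_kwargs(backend="redis"))
```

The resulting dictionary could then be unpacked into the constructor, for example `RayImageDeduplicator(**build_dedup_kwargs())`, assuming the operator is importable in the environment.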

📊 Effect demonstration

test_2

RayImageDeduplicator()

📥 input data

Sample 1: 1 image (img1.png)
Sample 2: 1 image (img2.jpg)
Sample 3: 1 image (img2.jpg)

📤 output data

Sample 1: 1 image (img1.png)
Sample 2: 1 image (img2.jpg)

✨ explanation

The operator identifies duplicate images by their hash values and removes the samples that contain them. Here the second and third samples contain the same image (img2.jpg), so one of them is removed and two unique samples remain.

test_3

RayImageDeduplicator()

📥 input data

Sample 1: 1 image (img1.png)
Sample 2: 1 image (img2.jpg)
Sample 3: 1 image (img3.jpg)
Sample 4: 1 image (img4.png)
Sample 5: 1 image (img5.jpg)
Sample 6: 1 image (img6.jpg)
Sample 7: 1 image (img7.jpg)

📤 output data

Sample 1: 1 image (img1.png)
Sample 2: 1 image (img2.jpg)
Sample 3: 1 image (img3.jpg)

✨ explanation

This test demonstrates deduplication when the same images appear several times. The operator retains only the first occurrence of each unique image and removes all subsequent duplicates. Of the seven input samples, only the first three are kept, one per unique image; the remaining four are dropped as duplicates even though their file names differ.
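The first-occurrence behavior can be sketched with a shared "seen" store that answers whether a hash is new. The class below is an in-memory stand-in for the state that the 'ray_actor' or 'redis' backend would hold; `SeenHashes`, `keep_first_occurrences`, and the hash strings are hypothetical and chosen only for illustration.

```python
from typing import List, Set


class SeenHashes:
    """In-memory stand-in for the shared deduplication state."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_unique(self, key: str) -> bool:
        """Record the key and report whether it was seen for the first time."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def keep_first_occurrences(keys: List[str]) -> List[int]:
    """Return the indices of samples whose hash has not appeared before."""
    store = SeenHashes()
    return [i for i, key in enumerate(keys) if store.is_unique(key)]


# Seven samples with three distinct (made-up) hash values.
keys = ["h1", "h2", "h3", "h1", "h2", "h3", "h3"]
print(keep_first_occurrences(keys))  # [0, 1, 2]
```

In this single-process sketch "first" means first in input order. In a distributed run the surviving copy would depend on which sample reaches the shared store first.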