ray_image_deduplicator

Deduplicates samples at the document level using exact matching of images in Ray distributed mode.

This operator computes image hashes with a configurable hash method and identifies duplicates by comparing those hashes. It runs in Ray distributed mode and supports the 'ray_actor' and 'redis' backends for deduplication. The hash method is set at initialization; the supported methods are listed in HASH_METHOD. The operator loads images from the specified keys and computes their combined hash for comparison. A sample that contains no image is assigned an empty hash value.
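
The hash-and-compare flow described above can be sketched in a few lines. This is a minimal, single-process illustration and not the operator's implementation: `combined_image_hash`, `deduplicate`, `EMPTY_HASH`, and the `"images"` key are hypothetical names, SHA-256 over raw bytes stands in for whichever method from HASH_METHOD is configured, and keeping samples that have no image is an assumption made for this sketch.

```python
import hashlib
from typing import Dict, Iterable, List

# Assumed placeholder for samples without images (hypothetical value).
EMPTY_HASH = ""


def combined_image_hash(images: Iterable[bytes]) -> str:
    """Hash each image's bytes and join the digests in order.

    SHA-256 is a stand-in for the configured hash method.
    """
    digests = [hashlib.sha256(img).hexdigest() for img in images]
    return "".join(digests) if digests else EMPTY_HASH


def deduplicate(samples: List[Dict]) -> List[Dict]:
    """Keep the first sample seen for each combined hash."""
    seen = set()
    kept = []
    for sample in samples:
        key = combined_image_hash(sample.get("images", []))
        if key == EMPTY_HASH:
            # Assumption: samples without images are kept as-is.
            kept.append(sample)
            continue
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


# Three samples, where the second and third carry identical image bytes.
img1, img2 = b"bytes-of-img1", b"bytes-of-img2"
samples = [
    {"id": 1, "images": [img1]},
    {"id": 2, "images": [img2]},
    {"id": 3, "images": [img2]},
]
print([s["id"] for s in deduplicate(samples)])  # [1, 2]
```

In the distributed setting the `seen` set cannot live in one process, which is the role the 'ray_actor' or 'redis' backend plays.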

Type: deduplicator

Tags: cpu, image

🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| `backend` | `str` | `'ray_actor'` | the backend for deduplication, either `'ray_actor'` or `'redis'` |
| `redis_address` | `str` | `'redis://localhost:6379'` | the address of the Redis server |
| `method` | `str` | `'phash'` | the hash method to use |
| `args` | | `''` | extra args |
| `kwargs` | | `''` | extra keyword args |
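
The `backend` parameter selects the shared store against which hashes are checked. Whatever the store, distributed workers need a "first writer wins" check: exactly one worker may claim a given hash. The sketch below illustrates that contract with a lock-protected in-memory set. `InMemoryDedupStore` and `is_unique` are hypothetical names, not the operator's API. Redis offers comparable set-if-absent semantics through its `SETNX` command, though whether the 'redis' backend relies on it is an assumption here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryDedupStore:
    """Hypothetical stand-in for a shared deduplication backend."""

    def __init__(self) -> None:
        self._seen = set()
        self._lock = threading.Lock()

    def is_unique(self, key: str) -> bool:
        """Return True only for the first caller that registers `key`."""
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            return True


store = InMemoryDedupStore()
hashes = ["h1", "h2", "h2", "h1", "h3"]

# Several workers query the same store concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    flags = list(pool.map(store.is_unique, hashes))

# Three distinct hashes, so exactly three checks succeed.
print(sum(flags))
```

Under concurrency, which of two equal hashes is reported as unique depends on scheduling; only the count of unique hashes is fixed.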

📊 Effect demonstration

test_2

RayImageDeduplicator()

📥 input data

Sample 1: 1 image (img1.png)

Sample 2: 1 image (img2.jpg)

Sample 3: 1 image (img2.jpg)

📤 output data

Sample 1: 1 image (img1.png)

Sample 2: 1 image (img2.jpg)

✨ explanation

The operator identifies and removes duplicate samples by comparing image hash values. Here the second and third samples contain the same image (img2.jpg), so one of them is removed and only two unique samples are kept.

test_3

RayImageDeduplicator()

📥 input data

Sample 1: 1 image (img1.png)

Sample 2: 1 image (img2.jpg)

Sample 3: 1 image (img3.jpg)

Sample 4: 1 image (img4.png)

Sample 5: 1 image (img5.jpg)

Sample 6: 1 image (img6.jpg)

Sample 7: 1 image (img7.jpg)

📤 output data

Sample 1: 1 image (img1.png)

Sample 2: 1 image (img2.jpg)

Sample 3: 1 image (img3.jpg)

✨ explanation

This test covers an input that contains several copies of the same images. The operator keeps only the first occurrence of each unique image and removes every later duplicate. As a result, the output contains only the first three samples, each with a distinct image; samples 4 through 7 are removed as duplicates.
