image_deduplicator

Deduplicates samples at the document level by exact matching of images.

This operator compares images across documents to identify and remove duplicates.

It uses a specified hash method (default is `'phash'`) to compute image hashes.
If consider_text is set, it also considers text content for deduplication, using a text deduplicator in conjunction with the image hashes.
The key metric, imagehash, is computed for each sample. If consider_text is enabled, an additional hash field is used.
Duplicates are identified by comparing these hash values. Samples with identical hashes are considered duplicates.
When show_num is greater than 0, the operator also returns a subset of duplicate pairs for tracing purposes.
The operator caches the imagehash and, if applicable, the hash fields.
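The hash-and-compare flow described above can be sketched as follows. This is a minimal illustration, not the operator's actual implementation: `image_key` and `dedup_by_image` are hypothetical names, samples are plain dicts holding raw image bytes, and an MD5 digest of the bytes stands in for the configured image hash (such as `'phash'`).

```python
import hashlib


def image_key(image_bytes: bytes) -> str:
    # Assumption: MD5 of the raw bytes stands in for the real image hash.
    return hashlib.md5(image_bytes).hexdigest()


def dedup_by_image(samples, show_num=0):
    """Keep the first sample per hash; collect up to show_num duplicate pairs."""
    first_seen = {}  # hash -> first sample that produced it
    kept = []
    dup_pairs = {}   # hash -> [first sample, first duplicate found]
    for sample in samples:
        # One key per sample: the concatenated hashes of all its images.
        key = "".join(image_key(b) for b in sample["images"])
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif key not in dup_pairs and len(dup_pairs) < show_num:
            dup_pairs[key] = [first_seen[key], sample]
    return kept, dup_pairs


# Hypothetical data: samples 2 and 3 carry identical image bytes.
samples = [
    {"id": 1, "images": [b"pixels-of-img1"]},
    {"id": 2, "images": [b"pixels-of-img2"]},
    {"id": 3, "images": [b"pixels-of-img2"]},
]
kept, pairs = dedup_by_image(samples, show_num=1)
print([s["id"] for s in kept])  # [1, 2]
```

With `show_num=0` the sketch returns an empty dictionary of pairs, which mirrors the idea that tracing output is only produced when it is requested.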

Type: deduplicator

Tags: cpu, image

🔧 Parameter Configuration

name	type	default	description
`method`	`str`	`'phash'`	hash method applied to each image
`consider_text`	`bool`	`False`	whether to combine the text hash with the image hash when deduplicating
`args`		`''`	extra positional arguments
`kwargs`		`''`	extra keyword arguments

📊 Effect demonstration

test_2

ImageDeduplicator()

📥 input data

Sample 1: 1 image

img1.png:

Sample 2: 1 image

img2.jpg:

Sample 3: 1 image

img2.jpg:

📤 output data

Sample 1: 1 image

img1.png:

Sample 2: 1 image

img2.jpg:

✨ explanation

The operator keeps one sample per distinct image. Samples 2 and 3 both contain img2.jpg, so sample 3 is removed and samples 1 and 2 form the final output.

test_3_consider_text

ImageDeduplicator(consider_text=True)

📥 input data

Sample 1: text | 1 image

<video> text1

img1.png:

Sample 2: text | 1 image

<video> text2

img2.jpg:

Sample 3: text | 1 image

<video> text3

img3.jpg:

Sample 4: text | 1 image

<video> text1

img4.png:

Sample 5: text | 1 image

<video> text5

img5.jpg:

Sample 6: text | 1 image

<video> text3

img6.jpg:

Sample 7: text | 1 image

<video> text7

img7.jpg:

📤 output data

Sample 1: text | 1 image

<video> text1

img1.png:

Sample 2: text | 1 image

<video> text2

img2.jpg:

Sample 3: text | 1 image

<video> text3

img3.jpg:

Sample 4: text | 1 image

<video> text5

img5.jpg:

Sample 5: text | 1 image

<video> text7

img7.jpg:

✨ explanation

With consider_text=True, the operator deduplicates on the combination of image and text: it keeps the first occurrence of each distinct image-text pair and removes later ones. Here samples 4 and 6 are dropped. They repeat the text of samples 1 and 3, and their removal indicates that img4.png and img6.jpg hash identically to img1.png and img3.jpg.
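A minimal sketch of deduplicating on the image-text combination is shown below. It rests on stated assumptions rather than the operator's real code: `dedup_image_and_text` and `md5_hex` are hypothetical helpers, MD5 digests stand in for both the image hash and the text hash, and a sample counts as a duplicate only when both parts match an earlier sample.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    # Assumption: MD5 stands in for both the image hash and the text hash.
    return hashlib.md5(data).hexdigest()


def dedup_image_and_text(samples):
    """Keep the first sample for each (image hash, text hash) combination."""
    seen = set()
    kept = []
    for sample in samples:
        image_part = "".join(md5_hex(b) for b in sample["images"])
        text_part = md5_hex(sample["text"].encode("utf-8"))
        key = (image_part, text_part)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


# Hypothetical data covering the three interesting cases.
samples = [
    {"text": "text1", "images": [b"A"]},
    {"text": "text1", "images": [b"A"]},  # same image, same text: removed
    {"text": "text2", "images": [b"A"]},  # same image, different text: kept
    {"text": "text1", "images": [b"B"]},  # different image, same text: kept
]
kept = dedup_image_and_text(samples)
print(len(kept))  # 3
```

Using a tuple as the key keeps the two hashes separate, so a match on the image alone or on the text alone never removes a sample in this sketch.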
