image_deduplicator
Deduplicates samples at the document level by exact matching of images.

This operator compares images across documents to identify and remove duplicates. It uses a specified hash method (default: 'phash') to compute image hashes. If `consider_text` is set, it also takes text content into account, using a text deduplicator in conjunction with the image hashes.

The key metric, `imagehash`, is computed for each sample. If `consider_text` is enabled, an additional `hash` field is used. Duplicates are identified by comparing these hash values: samples with identical hashes are considered duplicates. When `show_num` is greater than 0, the operator also returns a subset of duplicate pairs for tracing purposes. The operator caches the `imagehash` and, if applicable, the `hash` fields.
Type: deduplicator
Tags: cpu, image
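The matching rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the operator's implementation: `image_key` and `dedup_by_image` are hypothetical names, a SHA-256 digest of raw bytes stands in for the configured image hash method such as 'phash', and joining per-image digests into one per-sample key is an assumption.

```python
import hashlib
from typing import Dict, List, Tuple


def image_key(image_bytes: bytes) -> str:
    # Stand-in for the operator's image hash (hypothetical helper).
    # A perceptual hash would be computed from decoded pixels instead.
    return hashlib.sha256(image_bytes).hexdigest()


def dedup_by_image(
    samples: List[dict], show_num: int = 0
) -> Tuple[List[dict], List[Tuple[dict, dict]]]:
    """Keep the first sample per image key; optionally trace duplicate pairs."""
    first_seen: Dict[str, dict] = {}
    kept: List[dict] = []
    dup_pairs: List[Tuple[dict, dict]] = []
    for sample in samples:
        # Assumption: one key per sample, built from all of its images.
        key = "".join(image_key(img) for img in sample["images"])
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            # Record (first occurrence, duplicate) for tracing.
            dup_pairs.append((first_seen[key], sample))
    return kept, dup_pairs
```

With `show_num=0` no pairs are collected; a positive value caps how many duplicate pairs are returned for inspection.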
🔧 Parameter Configuration
| name | type | default | desc |
|---|---|---|---|
| | <class 'str'> | 'phash' | hash method for image |
| `consider_text` | <class 'bool'> | | whether to consider the text hash together with the image hash |
| | | | extra args |
| | | | extra args |
📊 Effect demonstration
test_2
ImageDeduplicator()
📥 input data



📤 output data


✨ explanation
The operator removes duplicate images and keeps only unique ones. Here the second and third entries contain the same image, so the third entry is removed, leaving the first and second entries as the final output.
test_3_consider_text
ImageDeduplicator(consider_text=True)
📥 input data
<video> text1

<video> text2

<video> text3

<video> text1

<video> text5

<video> text3

<video> text7

📤 output data
<video> text1

<video> text2

<video> text3

<video> text5

<video> text7

✨ explanation
The operator deduplicates samples based on both image and text content. It keeps the first occurrence of each unique image-text combination and removes later duplicates. Here the fourth and sixth samples (text1 and text3) are removed as duplicates of the first and third, so five samples remain.
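To make the effect of `consider_text` concrete, the sketch below keys each sample either on its image hash alone or on the pair of image hash and text hash. The `digest` and `dedup` helpers and the byte strings standing in for images are hypothetical, and SHA-256 is only a stand-in for the real image and text hash methods.

```python
import hashlib
from typing import List


def digest(data: bytes) -> str:
    # Stand-in hash for both images and text (hypothetical helper).
    return hashlib.sha256(data).hexdigest()


def dedup(samples: List[dict], consider_text: bool = False) -> List[dict]:
    """Keep the first sample for each (image hash, text hash) key."""
    seen = set()
    kept = []
    for sample in samples:
        image_hash = "".join(digest(img) for img in sample["images"])
        # The text hash only contributes to the key when consider_text is on.
        text_hash = digest(sample["text"].encode("utf-8")) if consider_text else ""
        key = (image_hash, text_hash)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


# Hypothetical data: all three samples share one image.
samples = [
    {"text": "text1", "images": [b"img-a"]},
    {"text": "text2", "images": [b"img-a"]},  # same image, different text
    {"text": "text1", "images": [b"img-a"]},  # same image and text as the first
]

image_only = dedup(samples)
image_and_text = dedup(samples, consider_text=True)
```

In this toy data, matching on the image alone collapses all three samples into one, while matching on image and text keeps the second sample because its text differs.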