image_deduplicator

Deduplicates samples at the document level by exact matching of images.

This operator compares images across documents to identify and remove duplicates.

  • It uses a specified hash method (default is 'phash') to compute image hashes.

  • If consider_text is set, it also considers text content for deduplication, using a text deduplicator in conjunction with the image hashes.

  • The key metric, imagehash, is computed for each sample. If consider_text is enabled, an additional hash field is used.

  • Duplicates are identified by comparing these hash values. Samples with identical hashes are considered duplicates.

  • When show_num is greater than 0, the operator also returns a subset of duplicate pairs for tracing purposes.

  • The operator caches the imagehash and, if applicable, the hash fields.
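The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the operator's implementation: `toy_hash` and `dedup_by_hash` are hypothetical names, an MD5 digest of raw bytes stands in for the image hash so the example runs without image libraries, and the shape of the returned duplicate pairs is assumed.

```python
import hashlib


def toy_hash(image_bytes: bytes) -> str:
    """Stand-in for an image hash (assumption: the real operator hashes decoded images)."""
    return hashlib.md5(image_bytes).hexdigest()


def dedup_by_hash(samples, show_num=0):
    """Keep the first sample for each hash value; optionally trace duplicate pairs."""
    first_seen = {}  # hash value -> first sample that produced it
    kept = []
    dup_pairs = []   # (kept sample, removed sample), at most show_num entries
    for sample in samples:
        key = toy_hash(sample["image"])
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            dup_pairs.append((first_seen[key], sample))
    return kept, dup_pairs


samples = [
    {"id": 1, "image": b"pixels-of-img1"},
    {"id": 2, "image": b"pixels-of-img2"},
    {"id": 3, "image": b"pixels-of-img2"},  # same content as sample 2
]
kept, dup_pairs = dedup_by_hash(samples, show_num=1)
print([s["id"] for s in kept])                     # [1, 2]
print([(a["id"], b["id"]) for a, b in dup_pairs])  # [(2, 3)]
```

With `show_num=0` the sketch returns an empty pair list, mirroring the description that tracing output appears only when show_num is greater than 0.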

Type: deduplicator

Tags: cpu, image

🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| method | str | 'phash' | hash method for image |
| consider_text | bool | False | whether to consider text hash together with image |
| args |  | '' | extra args |
| kwargs |  | '' | extra args |
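To give a feel for what a hash method for images does, the sketch below implements an average hash in pure Python. It is a simplified relative of the default 'phash' (which is based on a frequency transform), not phash itself, and it is not taken from the operator's code. The name `average_hash` and the tiny 2x2 grayscale grids are assumptions made for illustration; a real implementation would first decode and downscale the image.

```python
def average_hash(gray):
    """Average hash of a small grayscale grid (rows of 0-255 integers).

    Each pixel contributes one bit: 1 if it is brighter than the mean, else 0.
    Assumption: the grid has already been downscaled to the hash size.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p > mean else "0" for p in pixels)


original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]   # same picture, uniformly brighter
different = [[200, 10], [30, 220]]  # inverted layout

print(average_hash(original))   # 0110
print(average_hash(brighter))   # 0110
print(average_hash(different))  # 1001
```

Because the hash depends on each pixel relative to the mean, a uniform brightness shift leaves it unchanged, while a different layout produces a different value. Samples are then treated as duplicates when their hash values are identical.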

📊 Effect demonstration

test_2

ImageDeduplicator()

📥 input data

Sample 1: 1 image
img1.png
Sample 2: 1 image
img2.jpg
Sample 3: 1 image
img2.jpg

📤 output data

Sample 1: 1 image
img1.png
Sample 2: 1 image
img2.jpg

✨ explanation

Samples 2 and 3 contain the same image (img2.jpg), so the operator keeps the first occurrence, sample 2, and removes sample 3. Sample 1 has a different image and is unaffected, leaving samples 1 and 2 as the final output.

test_3_consider_text

ImageDeduplicator(consider_text=True)

📥 input data

Sample 1: text | 1 image
<video> text1
img1.png
Sample 2: text | 1 image
<video> text2
img2.jpg
Sample 3: text | 1 image
<video> text3
img3.jpg
Sample 4: text | 1 image
<video> text1
img4.png
Sample 5: text | 1 image
<video> text5
img5.jpg
Sample 6: text | 1 image
<video> text3
img6.jpg
Sample 7: text | 1 image
<video> text7
img7.jpg

📤 output data

Sample 1: text | 1 image
<video> text1
img1.png
Sample 2: text | 1 image
<video> text2
img2.jpg
Sample 3: text | 1 image
<video> text3
img3.jpg
Sample 4: text | 1 image
<video> text5
img5.jpg
Sample 5: text | 1 image
<video> text7
img7.jpg

✨ explanation

With consider_text=True, the operator deduplicates on the combination of image and text: it keeps the first occurrence of each unique image-text pair and removes later ones. Of the seven input samples, the two that repeat an earlier text (input samples 4 and 6, with text1 and text3) are removed as duplicates, and the remaining five are kept and renumbered in the output.
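The effect of consider_text can be reproduced with a small sketch. Everything here is illustrative: `digest` and `dedup_image_text` are hypothetical names, MD5 over raw bytes stands in for both the image hash and the text hash, and the page does not say which image files share content, so the byte strings below are assumed (img4 matching img1, img5 matching img2, img6 and img7 matching img3) so that the result lines up with the output above.

```python
import hashlib


def digest(data: bytes) -> str:
    """Stand-in hash for both images and text (assumption, not the operator's method)."""
    return hashlib.md5(data).hexdigest()


def dedup_image_text(samples, consider_text=False):
    """Keep the first sample per image hash, or per (image hash, text hash) pair."""
    seen = set()
    kept = []
    for sample in samples:
        key = digest(sample["image"])
        if consider_text:
            key = (key, digest(sample["text"].encode("utf-8")))
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


# Assumed image contents; only the texts come from the example above.
samples = [
    {"id": 1, "text": "<video> text1", "image": b"A"},
    {"id": 2, "text": "<video> text2", "image": b"B"},
    {"id": 3, "text": "<video> text3", "image": b"C"},
    {"id": 4, "text": "<video> text1", "image": b"A"},
    {"id": 5, "text": "<video> text5", "image": b"B"},
    {"id": 6, "text": "<video> text3", "image": b"C"},
    {"id": 7, "text": "<video> text7", "image": b"C"},
]

with_text = [s["id"] for s in dedup_image_text(samples, consider_text=True)]
image_only = [s["id"] for s in dedup_image_text(samples, consider_text=False)]
print(with_text)   # [1, 2, 3, 5, 7]
print(image_only)  # [1, 2, 3]
```

Under these assumptions, samples 5 and 7 survive only because their text differs from the earlier sample with the same image, which is the distinction consider_text introduces.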