# image_deduplicator

Deduplicates samples at the document level by exact matching of images.

This operator compares images across documents to identify and remove duplicates.

- It uses a specified hash method (default is 'phash') to compute image hashes.
- If `consider_text` is set, it also considers text content for deduplication, using a text deduplicator in conjunction with the image hashes.
- The key metric, `imagehash`, is computed for each sample. If `consider_text` is enabled, an additional `hash` field is used.
- Duplicates are identified by comparing these hash values. Samples with identical hashes are considered duplicates.
- When `show_num` is greater than 0, the operator also returns a subset of duplicate pairs for tracing purposes.
- The operator caches the `imagehash` and, if applicable, the `hash` fields.

通过精确匹配图像在文档级别去重样本。

该算子比较文档间的图像以识别并移除重复项。

- 使用指定的哈希方法(默认是'phash')计算图像哈希。
- 如果设置了`consider_text`,还会考虑文本内容进行去重,结合使用文本去重器和图像哈希。
- 关键指标`imagehash`为每个样本计算。如果启用了`consider_text`,则使用额外的`hash`字段。
- 通过比较这些哈希值来识别重复项。具有相同哈希值的样本被视为重复项。
- 当`show_num`大于0时,该算子还会返回一部分重复对以供追踪。
- 该算子缓存`imagehash`,如果适用,还缓存`hash`字段。

Type 算子类型: **deduplicator**

Tags 标签: cpu, image

## 🔧 Parameter Configuration 参数配置

| name 参数名 | type 类型 | default 默认值 | desc 说明 |
|--------|------|--------|------|
| `method` | | `'phash'` | hash method for image |
| `consider_text` | | `False` | whether to consider text hash together with image |
| `args` | | `''` | extra args |
| `kwargs` | | `''` | extra args |

## 📊 Effect demonstration 效果演示

### test_2

```python
ImageDeduplicator()
```

#### 📥 input data 输入数据
Sample 1: 1 image
img1.png:
Sample 2: 1 image
img2.jpg:
Sample 3: 1 image
img2.jpg:
#### 📤 output data 输出数据
Sample 1: 1 image
img1.png:
Sample 2: 1 image
img2.jpg:
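
For tracing which samples were treated as duplicates in a case like this one, the `show_num` behaviour mentioned in the description can be pictured as grouping samples by hash and reporting a limited number of pairs. The following is only a minimal sketch under that assumption: `trace_duplicate_pairs` and the hash strings are hypothetical and are not part of the operator's API.

```python
from collections import defaultdict


def trace_duplicate_pairs(hashes, show_num):
    """Return up to `show_num` (first_index, duplicate_index) pairs.

    Hypothetical helper for illustration; not the operator's own code.
    """
    if show_num <= 0:
        return []

    # Group sample indices by their hash value, preserving input order.
    groups = defaultdict(list)
    for index, image_hash in enumerate(hashes):
        groups[image_hash].append(index)

    pairs = []
    for indices in groups.values():
        if len(indices) < 2:
            continue  # unique hash, nothing to trace
        # Pair the first occurrence with its first later duplicate.
        pairs.append((indices[0], indices[1]))
        if len(pairs) >= show_num:
            break
    return pairs


# Made-up hash strings standing in for the `imagehash` values of three samples,
# where the second and third samples share the same image.
hashes = ["a1", "b2", "b2"]
pairs = trace_duplicate_pairs(hashes, show_num=1)
print(pairs)  # [(1, 2)]
```

The indices are zero-based, so `(1, 2)` corresponds to Sample 2 and Sample 3 in the input above.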
#### ✨ explanation 解释

The operator removes one of the duplicate images, keeping only unique ones. In this case, the second and third entries contain the same image, so the third entry is removed, leaving the first and second entries as the final output.

算子移除重复的图片,只保留唯一的图片。在这种情况下,第二个和第三个条目包含相同的图片,因此移除了第三个条目,最终输出为第一个和第二个条目。

### test_3_consider_text

```python
ImageDeduplicator(consider_text=True)
```

#### 📥 input data 输入数据
Sample 1: text | 1 image
<video> text1
img1.png:
Sample 2: text | 1 image
<video> text2
img2.jpg:
Sample 3: text | 1 image
<video> text3
img3.jpg:
Sample 4: text | 1 image
<video> text1
img4.png:
Sample 5: text | 1 image
<video> text5
img5.jpg:
Sample 6: text | 1 image
<video> text3
img6.jpg:
Sample 7: text | 1 image
<video> text7
img7.jpg:
#### 📤 output data 输出数据
Sample 1: text | 1 image
<video> text1
img1.png:
Sample 2: text | 1 image
<video> text2
img2.jpg:
Sample 3: text | 1 image
<video> text3
img3.jpg:
Sample 4: text | 1 image
<video> text5
img5.jpg:
Sample 5: text | 1 image
<video> text7
img7.jpg:
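
The filtering shown in this example can be approximated with a keep-first pass over hash keys. This is a minimal sketch, not the operator's implementation. It assumes that, with `consider_text` enabled, a sample counts as a duplicate only when both its image hash and its text hash match an earlier sample. It also swaps in SHA-256 over raw bytes for the perceptual `'phash'` method, and the `dedup_keep_first` function and `image_bytes` field are hypothetical names.

```python
import hashlib


def dedup_keep_first(samples, consider_text=False):
    """Keep the first sample for each hash key and drop later repeats.

    Hypothetical helper for illustration; SHA-256 stands in for the
    operator's image hash method.
    """
    seen = set()
    kept = []
    for sample in samples:
        image_hash = hashlib.sha256(sample["image_bytes"]).hexdigest()
        key = image_hash
        if consider_text:
            # Assumption: image and text must both match to count as a duplicate.
            text_hash = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
            key = (image_hash, text_hash)
        if key in seen:
            continue  # a sample with the same key was already kept
        seen.add(key)
        kept.append(sample)
    return kept


# Made-up samples: byte strings stand in for image file contents.
samples = [
    {"text": "text1", "image_bytes": b"A"},
    {"text": "text2", "image_bytes": b"B"},
    {"text": "text1", "image_bytes": b"A"},  # same image and text as the first
    {"text": "text9", "image_bytes": b"B"},  # same image as the second, new text
]
kept = dedup_keep_first(samples, consider_text=True)
print([s["text"] for s in kept])  # ['text1', 'text2', 'text9']
```

With `consider_text=False`, the same input would keep only the first two samples, because the fourth sample shares its image with the second.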
#### ✨ explanation 解释

The operator deduplicates samples based on both image and text content, keeping the first occurrence of each unique image-text combination. Here, input samples 4 and 6 repeat the text of samples 1 and 3 (`text1` and `text3`) and are removed, which implies that their images also hash identically to those of samples 1 and 3. The remaining five samples form the output.

算子基于图片和文本内容进行去重,保留每个唯一图片-文本组合的首次出现。这里,输入的第4个和第6个样本的文本与第1个和第3个样本相同(`text1`和`text3`)并被移除,这说明它们的图片哈希也分别与第1个和第3个样本相同。其余5个样本构成最终输出。

## 🔗 related links 相关链接

- [source code 源代码](../../../data_juicer/ops/deduplicator/image_deduplicator.py)
- [unit test 单元测试](../../../tests/ops/deduplicator/test_image_deduplicator.py)
- [Return operator list 返回算子列表](../../Operators.md)