document_deduplicator

Deduplicates samples at the document level using exact matching.

This operator computes an MD5 hash of each sample's text and treats samples with identical hashes as duplicates. Before hashing, it can optionally convert the text to lowercase and ignore non-alphabetic characters (whitespace, digits, and punctuation). The compute_hash method adds a 'hash' key to each sample to store its MD5 hash. During processing, the first occurrence of each unique hash is kept and later duplicates are filtered out. If the show_num parameter is set, the operator also returns up to that number of duplicate pairs for inspection.
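The behaviour described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the function names `compute_hash` and `deduplicate`, the `NON_CHARACTER` pattern (ASCII punctuation only), and the shape of the returned duplicate pairs are all assumptions made for the example.

```python
import hashlib
import re
import string

# Assumed normalisation: drop whitespace, digits, and ASCII punctuation.
NON_CHARACTER = re.compile(rf"\s+|\d+|[{re.escape(string.punctuation)}]")


def compute_hash(sample, lowercase=False, ignore_non_character=False):
    """Return a copy of the sample with an added 'hash' key (MD5 hex digest)."""
    text = sample["text"]
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = NON_CHARACTER.sub("", text)
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return {**sample, "hash": digest}


def deduplicate(samples, show_num=0):
    """Keep the first sample per hash; collect up to show_num duplicate pairs."""
    first_seen = {}  # hash -> first sample that produced it
    kept = []
    dup_pairs = []
    for sample in samples:
        h = sample["hash"]
        if h not in first_seen:
            first_seen[h] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            dup_pairs.append((first_seen[h], sample))
    return kept, dup_pairs


texts = [
    "Today is Sunday and it's a happy day!",
    "Do you need a cup of coffee?",
    "Today is sunday and it's a happy day!",
    "This paper proposed a novel method on LLM pretraining.",
    "This paper proposed a novel method on LLM pretraining.",
]
hashed = [compute_hash({"text": t}) for t in texts]
kept, pairs = deduplicate(hashed, show_num=1)
print(len(kept), len(pairs))  # prints: 4 1
```

Hashing first and filtering second mirrors the two-step description (compute_hash, then processing), so the hash can be computed per sample independently before the order-dependent keep-first pass.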

Type: deduplicator

Tags: cpu, text

🔧 Parameter Configuration

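| name | type | default | desc |
|------|------|---------|------|
| lowercase | <class 'bool'> | False | Whether to convert sample text to lowercase |
| ignore_non_character | <class 'bool'> | False | Whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation |
| args | | '' | Extra positional arguments |
| kwargs | | '' | Extra keyword arguments |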

📊 Effect demonstration

test_english_deduplication

DocumentDeduplicator(lowercase=False, ignore_non_character=False)

📥 input data

Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Do you need a cup of coffee?
Sample 3: text
Today is sunday and it's a happy day!
Sample 4: text
This paper proposed a novel method on LLM pretraining.
Sample 5: text
This paper proposed a novel method on LLM pretraining.

📤 output data

Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Do you need a cup of coffee?
Sample 3: text
Today is sunday and it's a happy day!
Sample 4: text
This paper proposed a novel method on LLM pretraining.

✨ explanation

The operator computes an MD5 hash of each sample's text without converting it to lowercase or ignoring non-alphabetic characters. It keeps the first occurrence of each unique hash and removes later duplicates, so only one copy of 'This paper proposed a novel method on LLM pretraining.' remains. Samples 1 and 3 are both kept because they differ in case ('Sunday' vs. 'sunday') and therefore produce different hashes.
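To see why exact matching keeps both 'Sunday' and 'sunday' variants, the short check below compares their MD5 digests with and without lowercasing. It uses only the standard library; `md5_hex` is a hypothetical helper named for this example and is not part of the operator.

```python
import hashlib


def md5_hex(text):
    """MD5 hex digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


a = "Today is Sunday and it's a happy day!"
b = "Today is sunday and it's a happy day!"

# A single differing character changes the digest, so exact matching
# treats the two sentences as distinct documents.
same_exact = md5_hex(a) == md5_hex(b)

# After lowercasing both, the strings are identical and so are the digests.
same_lowercased = md5_hex(a.lower()) == md5_hex(b.lower())

print(same_exact, same_lowercased)  # prints: False True
```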

test_english_deduplication_with_params

DocumentDeduplicator(lowercase=True, ignore_non_character=True)

📥 input data

Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Do you need a cup of coffee?
Sample 3: text
Today is sunday and it's a happy day!
Sample 4: text
Today is sunday and it's a happy day?
Sample 5: text
This paper proposed a novel method on LLM pretraining.
Sample 6: text
This paper proposed a novel method on LLM pretraining.

📤 output data

Sample 1: text
Today is Sunday and it's a happy day!
Sample 2: text
Do you need a cup of coffee?
Sample 3: text
This paper proposed a novel method on LLM pretraining.

✨ explanation

The operator computes an MD5 hash of each sample's text after converting it to lowercase and ignoring non-alphabetic characters. Deduplication is therefore more aggressive: samples 1, 3, and 4 differ only in case and punctuation, so they produce the same hash and only sample 1 is kept. Of the two identical 'This paper proposed a novel method on LLM pretraining.' samples, only the first is kept.
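The collapse of the three 'Today is ...' variants can be checked by normalising them by hand. The sketch below assumes that "non-alphabetic characters" means whitespace, digits, and ASCII punctuation; the `NON_CHARACTER` pattern and the `normalize` helper are illustrative names, not the operator's API.

```python
import re
import string

# Assumed definition of characters to ignore: whitespace, digits,
# and ASCII punctuation.
NON_CHARACTER = re.compile(rf"\s+|\d+|[{re.escape(string.punctuation)}]")


def normalize(text):
    """Lowercase the text, then strip the ignored characters."""
    return NON_CHARACTER.sub("", text.lower())


variants = [
    "Today is Sunday and it's a happy day!",
    "Today is sunday and it's a happy day!",
    "Today is sunday and it's a happy day?",
]

# All three variants reduce to one string, so they would share one hash.
normalized = {normalize(t) for t in variants}
print(normalized)  # prints: {'todayissundayanditsahappyday'}
```

Because spaces are removed as well, sentences that differ only in word spacing would also be merged under these settings, which is worth keeping in mind when enabling ignore_non_character.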