document_deduplicator
Deduplicates samples at the document level using exact matching.
This operator computes an MD5 hash of each sample's text. Before hashing, it can optionally convert the text to lowercase and ignore non-alphabet characters, including whitespace, digits, and punctuation. Deduplication is based on the computed hash values: samples with identical hashes are considered duplicates. The `compute_hash` method adds a `hash` key to each sample, storing its MD5 hash. During processing, the first occurrence of each unique hash is kept and subsequent duplicates are filtered out. If the `show_num` parameter is set, the operator also returns a specified number of duplicate pairs for inspection.
Type: deduplicator
Tags: cpu, text
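The hash-and-keep-first behavior described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the operator's actual implementation: `md5_hex` and `dedup_exact` are hypothetical helper names, and the way `show_num` limits the collected duplicate pairs here is an assumption made for the example.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the hex MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_exact(texts, show_num=0):
    """Keep the first text per hash and collect up to show_num duplicate pairs.

    Hypothetical helper: the pair-collection rule is an assumption,
    not the operator's documented behavior.
    """
    first_seen = {}  # hash -> first text that produced it
    kept = []
    dup_pairs = []
    for text in texts:
        h = md5_hex(text)
        if h not in first_seen:
            # First occurrence of this hash: keep the sample.
            first_seen[h] = text
            kept.append(text)
        elif len(dup_pairs) < show_num:
            # Later occurrence: drop it, but remember the pair for inspection.
            dup_pairs.append((first_seen[h], text))
    return kept, dup_pairs


samples = [
    "Today is Sunday and it's a happy day!",
    "Do you need a cup of coffee?",
    "Today is sunday and it's a happy day!",
    "This paper proposed a novel method on LLM pretraining.",
    "This paper proposed a novel method on LLM pretraining.",
]
kept, dup_pairs = dedup_exact(samples, show_num=1)
print(len(kept))   # 4: only the exact repeat of the last sentence is dropped
print(dup_pairs)
```

Because the hash is taken over the raw text, any difference in case or punctuation yields a different hash, so only byte-identical texts are treated as duplicates in this sketch.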
🔧 Parameter Configuration
| name | type | default | desc |
|---|---|---|---|
| `lowercase` | `bool` | | Whether to convert sample text to lowercase. |
| `ignore_non_character` | `bool` | | Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation. |
| | | | Extra args. |
| | | | Extra args. |
📊 Effect demonstration
test_english_deduplication
DocumentDeduplicator(lowercase=False, ignore_non_character=False)
📥 input data
Today is Sunday and it's a happy day!
Do you need a cup of coffee?
Today is sunday and it's a happy day!
This paper proposed a novel method on LLM pretraining.
This paper proposed a novel method on LLM pretraining.
📤 output data
Today is Sunday and it's a happy day!
Do you need a cup of coffee?
Today is sunday and it's a happy day!
This paper proposed a novel method on LLM pretraining.
✨ explanation
With both options disabled, the operator hashes each sample's text exactly as written. The two "Today is ..." sentences differ in the capitalization of "Sunday", so they produce different hashes and both are kept. The two copies of "This paper proposed a novel method on LLM pretraining." hash identically, so only the first is kept.
test_english_deduplication_with_params
DocumentDeduplicator(lowercase=True, ignore_non_character=True)
📥 input data
Today is Sunday and it's a happy day!
Do you need a cup of coffee?
Today is sunday and it's a happy day!
Today is sunday and it's a happy day?
This paper proposed a novel method on LLM pretraining.
This paper proposed a novel method on LLM pretraining.
📤 output data
Today is Sunday and it's a happy day!
Do you need a cup of coffee?
This paper proposed a novel method on LLM pretraining.
✨ explanation
With both options enabled, the operator lowercases each sample's text and strips non-alphabet characters before computing the MD5 hash. This makes deduplication more aggressive: the three "Today is ..." variants, which differ only in capitalization and final punctuation, now share a single hash, so only the first is kept. The repeated "This paper proposed a novel method on LLM pretraining." sentence is likewise reduced to one instance.
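To see why normalization merges these near-duplicates, the sketch below hashes the three "Today is ..." variants with and without preprocessing. It is only an approximation under stated assumptions: `normalized_md5` is a hypothetical helper, and the regular expression used for "non-alphabet characters" (anything outside ASCII letters) is an assumed stand-in for whatever pattern the operator applies.

```python
import hashlib
import re


def normalized_md5(text, lowercase=False, ignore_non_character=False):
    """Hash text with MD5 after optional normalization (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        # Assumption: "non-alphabet" means anything outside A-Z / a-z,
        # which removes whitespace, digits, and punctuation.
        text = re.sub(r"[^a-zA-Z]", "", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


variants = [
    "Today is Sunday and it's a happy day!",
    "Today is sunday and it's a happy day!",
    "Today is sunday and it's a happy day?",
]

raw_hashes = {normalized_md5(v) for v in variants}
norm_hashes = {
    normalized_md5(v, lowercase=True, ignore_non_character=True)
    for v in variants
}
print(len(raw_hashes), len(norm_hashes))  # 3 distinct raw hashes, 1 after normalization
```

After normalization all three variants reduce to the same letter-only string, which is why only the first of them survives in the output above.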