data_juicer.ops.deduplicator.document_deduplicator module
- class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)
Bases: Deduplicator
Deduplicates samples at the document level using exact matching.
This operator computes an MD5 hash of each sample's text. Before hashing, it can optionally convert the text to lowercase and strip non-alphabetic characters, including whitespace, digits, and punctuation. Deduplication is based on the computed hash values: samples with identical hashes are considered duplicates. The compute_hash method adds a 'hash' key to each sample, storing its MD5 hash. During processing, the first occurrence of each unique hash is kept, and subsequent duplicates are filtered out. If the show_num parameter is set, the operator also returns the specified number of duplicate pairs for inspection.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)
Initialization method.
- Parameters:
lowercase – Whether to convert sample text to lowercase before hashing.
ignore_non_character – Whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation.
args – Extra positional arguments.
kwargs – Extra keyword arguments.
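The hash-and-keep-first behaviour described above can be illustrated with a minimal standalone sketch. It does not use or reproduce the data_juicer implementation: the helper names text_hash and deduplicate, and the regular expression used for "non-character" stripping (whitespace, digits, ASCII punctuation), are assumptions made for illustration only.

```python
import hashlib
import re
import string

# Assumed definition of "non-character": whitespace, digits, ASCII punctuation.
_NON_CHARACTER = re.compile(r"\s+|\d+|[" + re.escape(string.punctuation) + "]")


def text_hash(text, lowercase=False, ignore_non_character=False):
    """Return the MD5 hex digest of the (optionally normalized) text."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = _NON_CHARACTER.sub("", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def deduplicate(texts, lowercase=False, ignore_non_character=False):
    """Keep the first text for each distinct hash; drop later duplicates."""
    seen = set()
    kept = []
    for text in texts:
        digest = text_hash(text, lowercase, ignore_non_character)
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["Hello, World!", "hello world", "Hello, World!", "Goodbye"]

# Exact matching only: the byte-identical third entry is dropped.
print(deduplicate(docs))

# With both options on, the first three entries hash identically.
print(deduplicate(docs, lowercase=True, ignore_non_character=True))
```

With the default options only byte-identical texts collapse; enabling both options makes the match tolerant of case, spacing, digits, and punctuation, at the cost of treating texts that differ only in those respects as duplicates.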