data_juicer.ops.deduplicator.document_deduplicator module
- class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)
Bases: Deduplicator
Deduplicates samples at the document level using exact matching.
Each sample's text is hashed with MD5, and samples whose hashes match are treated as duplicates.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)
Initialization method.
- Parameters:
lowercase – Whether to convert sample text to lowercase.
ignore_non_character – Whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation.
args – Extra positional arguments.
kwargs – Extra keyword arguments.
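The effect of the two options on exact matching can be illustrated with a minimal standard-library sketch. This is not the Data-Juicer implementation: the helper names `text_md5` and `dedup_exact` are hypothetical, and the sketch assumes that normalization is applied before hashing and that "non-alphabetic characters" means whitespace, digits, and ASCII punctuation.

```python
import hashlib
import re
import string

# Assumption: characters to ignore are whitespace, digits, and ASCII punctuation.
_NON_CHAR = re.compile(r"\s+|\d+|[" + re.escape(string.punctuation) + "]")


def text_md5(text, lowercase=False, ignore_non_character=False):
    """Hypothetical helper: normalize the text, then return its MD5 hex digest."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = _NON_CHAR.sub("", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_exact(texts, lowercase=False, ignore_non_character=False):
    """Hypothetical helper: keep the first document seen for each distinct hash."""
    seen = set()
    kept = []
    for text in texts:
        digest = text_md5(text, lowercase, ignore_non_character)
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["Hello, World!", "hello world", "Hello, World!", "Goodbye"]

# With both options off, only byte-identical texts collide.
strict = dedup_exact(docs)

# With both options on, case and punctuation differences no longer matter.
relaxed = dedup_exact(docs, lowercase=True, ignore_non_character=True)
```

With the defaults, only the repeated "Hello, World!" is dropped. Enabling both options also collapses "hello world" into the first document, since both normalize to the same string before hashing.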