data_juicer.ops.deduplicator.document_deduplicator module
- class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]
Bases:
Deduplicator
Deduplicates samples at the document level using exact matching.
Each sample's text is hashed with MD5, and samples with identical hashes are treated as duplicates.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]
Initialization method.
- Parameters:
lowercase -- Whether to convert the sample text to lowercase before hashing
ignore_non_character -- Whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation
args -- extra positional arguments
kwargs -- extra keyword arguments
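The effect of the two flags on exact matching can be illustrated with a small standalone sketch. This is not the data_juicer implementation: the helper names `normalize`, `md5_of`, and `dedup_exact` are hypothetical, and the normalization regex (ASCII punctuation, digits, whitespace) is an assumption about what "non-alphabetic characters" covers.

```python
import hashlib
import re
import string

# Assumed definition of "non-alphabetic": whitespace, digits, ASCII punctuation.
_NON_CHAR = re.compile(r"[\s\d" + re.escape(string.punctuation) + r"]+")


def normalize(text, lowercase=False, ignore_non_character=False):
    """Apply the optional normalizations before hashing (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = _NON_CHAR.sub("", text)
    return text


def md5_of(text, lowercase=False, ignore_non_character=False):
    """Return the MD5 hex digest of the normalized text."""
    normalized = normalize(text, lowercase, ignore_non_character)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def dedup_exact(samples, lowercase=False, ignore_non_character=False):
    """Keep the first sample for each distinct hash, preserving input order."""
    seen = set()
    kept = []
    for sample in samples:
        digest = md5_of(sample, lowercase, ignore_non_character)
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


docs = ["Hello, World!", "hello world", "Hello, World!", "Goodbye 42"]
strict = dedup_exact(docs)
relaxed = dedup_exact(docs, lowercase=True, ignore_non_character=True)
```

With both flags left at their defaults, only the byte-identical third entry is dropped. With both enabled, "Hello, World!" and "hello world" normalize to the same string and collapse into one sample, which shows how the flags widen what counts as an exact match.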