data_juicer.ops.deduplicator.document_deduplicator module

class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicates samples at the document level using exact matching.

Each sample's text is hashed with MD5, and samples that share the same hash are treated as duplicates.

__init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lowercase – Whether to convert the sample text to lowercase.

  • ignore_non_character – Whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation.

  • args – Extra positional arguments.

  • kwargs – Extra keyword arguments.
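The effect of the two flags can be illustrated with a minimal sketch. The helper name normalize_text is hypothetical and is not part of data_juicer, and the ASCII-only character classes from the string module are an assumption; the library may use a different pattern.

```python
import string

# Assumption: "non-alphabetic" is approximated here by ASCII whitespace,
# digits, and punctuation. This is an illustration, not the library's code.
_STRIP_TABLE = str.maketrans(
    "", "", string.whitespace + string.digits + string.punctuation
)


def normalize_text(text, lowercase=False, ignore_non_character=False):
    """Hypothetical helper mirroring the two constructor flags."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        # Delete every character listed in the translation table.
        text = text.translate(_STRIP_TABLE)
    return text


a = normalize_text("Hello, World 42!", lowercase=True, ignore_non_character=True)
b = normalize_text("hello world", lowercase=True, ignore_non_character=True)
print(a == b)
```

With both flags enabled, texts that differ only in case, spacing, digits, or punctuation normalize to the same string, so an exact-match comparison would treat them as duplicates.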

compute_hash(sample)[source]

Compute the MD5 hash value of the sample.

Parameters:

sample – input sample

Returns:

The sample with its MD5 hash value added.
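As a rough sketch of this step, the snippet below hashes a sample's text with hashlib.md5 and stores the hex digest back on the sample. The function name compute_hash_sketch and the field names "text" and "hash" are assumptions made for illustration; the keys used by data_juicer may differ.

```python
import hashlib


def compute_hash_sketch(sample, text_key="text", hash_key="hash"):
    """Return a copy of the sample with an MD5 hex digest of its text.

    text_key and hash_key are hypothetical field names.
    """
    digest = hashlib.md5(sample[text_key].encode("utf-8")).hexdigest()
    hashed = dict(sample)  # shallow copy so the input sample is not mutated
    hashed[hash_key] = digest
    return hashed


first = compute_hash_sketch({"text": "the same document"})
second = compute_hash_sketch({"text": "the same document"})
other = compute_hash_sketch({"text": "a different document"})
print(first["hash"] == second["hash"], first["hash"] == other["hash"])
```

Identical text always produces an identical digest, which is what makes the hash usable as an exact-match key.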

process(dataset, show_num=0)[source]

Deduplicate at the document level: takes a dataset and returns a dataset.

Parameters:
  • dataset – input dataset

  • show_num – Number of traced samples to collect when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
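The general idea of hash-based exact deduplication can be sketched as follows. This is an illustration over a plain list of strings, not the library's implementation: the function name deduplicate_sketch is hypothetical, keeping the first occurrence of each hash is an assumption, and the way duplicate pairs are sampled here (the first show_num encountered) may differ from what the tracer records.

```python
import hashlib


def deduplicate_sketch(texts, show_num=0):
    """Keep the first text per MD5 digest and collect up to show_num pairs."""
    seen = {}        # digest -> first text seen with that digest
    kept = []        # deduplicated output, in original order
    dup_pairs = []   # (first occurrence, later duplicate) examples
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen[digest] = text
            kept.append(text)
        elif len(dup_pairs) < show_num:
            dup_pairs.append((seen[digest], text))
    return kept, dup_pairs


docs = ["a cat", "a dog", "a cat", "a dog", "a bird"]
kept, pairs = deduplicate_sketch(docs, show_num=1)
print(kept)
print(pairs)
```

With show_num=0 no pairs are collected, which matches the default of returning only the deduplicated data when tracing is not needed.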