data_juicer.ops.deduplicator.document_deduplicator module

class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)

Bases: Deduplicator

Deduplicates samples at the document level using exact matching.

This operator computes an MD5 hash of each sample’s text and treats samples with identical hashes as duplicates. Before hashing, it can optionally convert the text to lowercase and ignore non-alphabetic characters (whitespace, digits, and punctuation). The compute_hash method stores the MD5 hash under a ‘hash’ key on each sample. During processing, the first occurrence of each unique hash is kept and later duplicates are filtered out. If show_num is greater than 0, the operator also returns up to that many duplicate pairs for inspection.

__init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)

Initialization method.

Parameters:
  • lowercase – Whether to convert the sample text to lowercase before hashing.

  • ignore_non_character – Whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation.

  • args – extra positional args.

  • kwargs – extra keyword args.

compute_hash(sample)

Compute the MD5 hash value of the sample’s text.

Parameters:

sample – input sample

Returns:

the sample with its MD5 hash value added.
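As a rough illustration of this step, the sketch below adds a hash field to a dictionary-style sample. It is a standalone approximation rather than the operator's code: the function name add_md5_hash, the ‘text’ key, the UTF-8 encoding, and the use of a hex digest are all assumptions.

```python
import hashlib


def add_md5_hash(sample, text_key="text", hash_key="hash"):
    """Hypothetical stand-in for compute_hash: return a copy of the
    sample with the MD5 hex digest of its text stored under hash_key."""
    digest = hashlib.md5(sample[text_key].encode("utf-8")).hexdigest()
    # Build a new dict so the caller's sample is left untouched.
    return {**sample, hash_key: digest}


hashed = add_md5_hash({"text": "abc"})
```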

process(dataset, show_num=0)

Deduplicates at the document level: takes a dataset and returns a dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of duplicate pairs to sample for tracing; used only when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
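The keep-first behavior and the sampled duplicate pairs can be sketched as follows, assuming samples are dictionaries that already carry a ‘hash’ key. The function dedup_keep_first is hypothetical, and the way pairs are chosen here (the first and second occurrence of each duplicated hash, for at most show_num hashes) is an assumption, not a description of the library's internals.

```python
def dedup_keep_first(samples, show_num=0, hash_key="hash"):
    """Hypothetical sketch: keep the first sample per hash and collect
    up to show_num (first occurrence, duplicate) pairs for inspection."""
    first_seen = {}  # hash -> first sample carrying that hash
    kept = []
    dup_pairs = {}   # hash -> (first occurrence, first duplicate)
    for sample in samples:
        h = sample[hash_key]
        if h not in first_seen:
            first_seen[h] = sample
            kept.append(sample)
        elif h not in dup_pairs and len(dup_pairs) < show_num:
            dup_pairs[h] = (first_seen[h], sample)
    return kept, dup_pairs


samples = [
    {"text": "a", "hash": "h1"},
    {"text": "b", "hash": "h2"},
    {"text": "a", "hash": "h1"},
    {"text": "b", "hash": "h2"},
    {"text": "a", "hash": "h1"},
]
kept, pairs = dedup_keep_first(samples, show_num=1)
```

With show_num=0 the pair dictionary stays empty, which mirrors the default of returning no traced samples.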