data_juicer.ops.deduplicator.document_simhash_deduplicator module

class data_juicer.ops.deduplicator.document_simhash_deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using SimHash.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, we recommend using ‘space’; for Chinese-like languages, we recommend using ‘character’.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – pattern of the sub-strings to ignore when computing simhash (default: None)

  • num_blocks – number of blocks in simhash computing

  • hamming_distance – the maximum Hamming distance threshold for near-duplicate detection. When the Hamming distance between two sample texts is <= this threshold, they are regarded as similar samples, and this op keeps only one of them after deduplication. This threshold should always be less than num_blocks.
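To make the text-preparation parameters concrete, here is a minimal sketch of how tokenization, lowercase, ignore_pattern and window_size could combine to turn one text into shingles. It is not the operator's own code: the helper name `shingles` is hypothetical, the order of the lowercasing and ignore steps is an assumption, the pattern is treated as a regular expression for Python's standard `re` module, and only the ‘space’ and ‘character’ modes are covered.

```python
import re


def shingles(text, tokenization="space", window_size=6,
             lowercase=True, ignore_pattern=None):
    """Return the set of shingles for one text (illustrative sketch only)."""
    if lowercase:
        text = text.lower()
    if ignore_pattern:
        # Assumption: every sub-string matching the pattern is removed
        # before tokenizing.
        text = re.sub(ignore_pattern, "", text)

    if tokenization == "character":
        units, joiner = list(text), ""
    elif tokenization == "space":
        units, joiner = text.split(), " "
    else:
        raise ValueError("this sketch only covers 'space' and 'character'")

    if len(units) <= window_size:
        # Text no longer than one window: the whole text is one shingle.
        return {joiner.join(units)}

    # Slide a window of window_size units over the text, one unit at a time.
    return {joiner.join(units[i:i + window_size])
            for i in range(len(units) - window_size + 1)}
```

A larger window_size makes each shingle more specific, so two texts need longer shared runs before they look alike; a smaller one makes the comparison more tolerant.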

compute_hash(sample)[source]

Compute simhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with simhash value.
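As an illustration of what a simhash value is, the following sketch folds a collection of string features (for example, shingles) into a single fingerprint. The 64-bit width, the use of MD5 as the per-feature hash, and the function names `simhash64` and `hamming_distance` are assumptions made for this example; the operator may compute and store its value differently.

```python
import hashlib


def simhash64(features):
    """Fold an iterable of string features into one 64-bit fingerprint."""
    counts = [0] * 64
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for bit in range(64):
            # Each feature votes +1 where its hash bit is set, else -1.
            counts[bit] += 1 if (h >> bit) & 1 else -1

    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:  # majority of features had this bit set
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Because every bit is decided by a majority vote over all features, texts that share most of their features tend to produce fingerprints that differ in only a few bits, which is what the Hamming distance threshold measures.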

process(dataset, show_num=0)[source]

Document-level deduplication: dataset –> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is open.

Returns:

deduplicated dataset and the sampled duplicate pairs.
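The sketch below shows one way block-based near-duplicate detection can work on integer fingerprints, and why a threshold below num_blocks is useful: if two fingerprints differ in at most hamming_distance bits and are cut into more blocks than that, at least one block must be identical, so only texts sharing a block need to be compared. This is a simplified stand-in, not the operator's implementation. The function names, the 64-bit width, the single-block bucketing and the union-find clustering are all assumptions for illustration.

```python
from collections import defaultdict


def split_blocks(fingerprint, num_blocks=6, bits=64):
    """Cut a fingerprint into num_blocks contiguous bit ranges."""
    base, extra = divmod(bits, num_blocks)
    blocks, shift = [], 0
    for i in range(num_blocks):
        width = base + (1 if i < extra else 0)
        blocks.append((fingerprint >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def near_duplicate_pairs(fingerprints, num_blocks=6, hamming_distance=4):
    """Return index pairs (a, b), a < b, within the distance threshold."""
    if hamming_distance >= num_blocks:
        raise ValueError("hamming_distance must be less than num_blocks")
    # Bucket every fingerprint by each (block position, block value).
    buckets = defaultdict(list)
    for idx, fp in enumerate(fingerprints):
        for pos, block in enumerate(split_blocks(fp, num_blocks)):
            buckets[(pos, block)].append(idx)
    pairs = set()
    for members in buckets.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                # Sharing a block only makes a candidate; verify exactly.
                diff = bin(fingerprints[a] ^ fingerprints[b]).count("1")
                if diff <= hamming_distance:
                    pairs.add((a, b))
    return pairs


def deduplicate(fingerprints, num_blocks=6, hamming_distance=4):
    """Return (kept indices, sorted duplicate pairs); one index per cluster."""
    parent = list(range(len(fingerprints)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    pairs = near_duplicate_pairs(fingerprints, num_blocks, hamming_distance)
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest index stays root
    kept = [i for i in range(len(fingerprints)) if find(i) == i]
    return kept, sorted(pairs)
```

For example, `deduplicate([0, 0b111, 2**64 - 1, 0b1])` groups the first, second and fourth fingerprints into one cluster and keeps only the earliest of them along with the unrelated third one.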