data_juicer.ops.deduplicator.document_simhash_deduplicator module

class data_juicer.ops.deduplicator.document_simhash_deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)

Bases: Deduplicator

Deduplicates samples at the document level using SimHash.

This operator computes a SimHash value for each sample and removes near-duplicates whose SimHash values fall within a specified Hamming distance threshold. It supports three tokenization methods: ‘space’, ‘punctuation’, and ‘character’. The SimHash is computed over shingles of a given window size; the deduplication process then clusters similar documents and retains only one from each cluster. By default, text is converted to lowercase, and substrings matching a given pattern can optionally be ignored. Hamming distance between SimHash values is the similarity metric.

Important notes:

- The ignore_pattern parameter can be used to exclude certain substrings during SimHash computation.
- For punctuation-based tokenization, ignore_pattern should not include punctuation, to avoid conflicts.
- hamming_distance must be less than the number of blocks (num_blocks).
- Only the first sample in each cluster is retained by default.
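The preprocessing described above (lowercasing, pattern removal, tokenization, shingling) can be sketched in plain Python. This is a minimal illustration rather than the operator's actual code: the helper names tokenize and shingles are hypothetical, treating ignore_pattern as a regular expression is an assumption, and the exact splitting rules inside Data-Juicer may differ.

```python
import re
import string


def tokenize(text, tokenization="space", lowercase=True, ignore_pattern=None):
    """Hypothetical helper: normalize text, then split it into tokens."""
    if lowercase:
        text = text.lower()
    if ignore_pattern:
        # Assumption: the pattern is a regular expression; matches are dropped.
        text = re.sub(ignore_pattern, "", text)
    if tokenization == "space":
        return text.split()
    if tokenization == "punctuation":
        # Split on ASCII punctuation and discard empty pieces.
        parts = re.split("[" + re.escape(string.punctuation) + "]", text)
        return [p.strip() for p in parts if p.strip()]
    if tokenization == "character":
        return list(text)
    raise ValueError(f"unsupported tokenization: {tokenization}")


def shingles(tokens, window_size=6, joiner=" "):
    """Hypothetical helper: overlapping windows of window_size tokens."""
    if not tokens:
        return []
    if len(tokens) < window_size:
        # Too short for a full window: use the whole sequence as one shingle.
        return [joiner.join(tokens)]
    return [
        joiner.join(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    ]
```

For character tokenization, passing joiner="" keeps each shingle a contiguous substring of the normalized text.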

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)

Initialization method.

Parameters:

tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, ‘space’ is recommended; for Chinese-like languages, ‘character’ is recommended.
window_size – window size of shingling
lowercase – whether to convert text to lowercase first
ignore_pattern – pattern of sub-strings to ignore when computing simhash; None (the default) ignores nothing
num_blocks – number of blocks in simhash computing
hamming_distance – the maximum Hamming distance threshold in near-duplicate detection. When the Hamming distance between two sample texts is <= this threshold, they are regarded as similar samples, and this op keeps only one of them after deduplication. This threshold should always be less than num_blocks.
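One way to see why hamming_distance should stay below num_blocks is the pigeonhole argument: if two fingerprints differ in at most k bits and are split into more than k blocks, at least one block must be identical, so matching blocks can serve as a cheap candidate filter. The sketch below illustrates that argument only; the 64-bit fingerprint width and the helper names split_blocks and shares_a_block are assumptions, not the library's internals.

```python
def split_blocks(fingerprint, num_blocks=6, bits=64):
    """Split a fingerprint into num_blocks contiguous bit fields.

    Field widths differ by at most one bit (e.g. 64 bits over 6 blocks
    gives widths 11, 11, 11, 11, 10, 10).
    """
    base, extra = divmod(bits, num_blocks)
    blocks, shift = [], 0
    for i in range(num_blocks):
        width = base + (1 if i < extra else 0)
        blocks.append((fingerprint >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def shares_a_block(a, b, num_blocks=6):
    """True if the two fingerprints agree on at least one whole block."""
    blocks_a = split_blocks(a, num_blocks)
    blocks_b = split_blocks(b, num_blocks)
    return any(x == y for x, y in zip(blocks_a, blocks_b))
```

With the defaults (num_blocks=6, hamming_distance=4), any pair within the threshold agrees on at least two blocks, so a pair that shares no block can be skipped without computing the full distance.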

compute_hash(sample)

Compute simhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with simhash value.
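As a rough illustration of what a simhash value is, the sketch below builds a 64-bit fingerprint from a list of shingle strings by per-bit majority vote and compares two fingerprints by Hamming distance. The names simhash64 and hamming are hypothetical, and MD5 is only a stand-in feature hash; the operator's actual hash function and bit width are not stated here.

```python
import hashlib


def simhash64(features):
    """Hypothetical helper: 64-bit SimHash of an iterable of strings."""
    counts = [0] * 64
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for bit in range(64):
            # Each feature votes +1 for a set bit and -1 for a clear bit.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set where the positive votes win.
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Documents that share most of their shingles cast mostly the same votes, so their fingerprints tend to differ in only a few bits.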

process(dataset, show_num=0)

Runs document-level deduplication, mapping a dataset to a dataset.

Parameters:

dataset – input dataset
show_num – number of traced samples used when tracer is open.

Returns:

deduplicated dataset and the sampled duplicate pairs.
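The clustering and keep-first behaviour can be sketched with a small union-find over precomputed fingerprints. This is an assumption-laden illustration, not the operator's implementation: dedup_keep_first is a hypothetical name, fingerprints are plain integers, and all pairs are compared directly instead of using block-based candidate lookup.

```python
def dedup_keep_first(fingerprints, hamming_distance=4):
    """Return (indices of kept samples, list of duplicate index pairs)."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dup_pairs = []
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            distance = bin(fingerprints[i] ^ fingerprints[j]).count("1")
            if distance <= hamming_distance:
                dup_pairs.append((i, j))
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so that each root
                    # is the first sample of its cluster.
                    parent[max(ri, rj)] = min(ri, rj)

    kept = [i for i in range(len(fingerprints)) if find(i) == i]
    return kept, dup_pairs
```

Because clusters are formed transitively in this sketch, two samples can end up in the same cluster even when their direct distance exceeds the threshold, as long as a chain of near-duplicates connects them.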