data_juicer.ops.deduplicator.document_simhash_deduplicator module

class data_juicer.ops.deduplicator.document_simhash_deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using SimHash.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization -- tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'.

  • window_size -- window size of shingling

  • lowercase -- whether to convert text to lower case first

  • ignore_pattern -- regular expression pattern; sub-strings matching it are ignored when computing simhash

  • num_blocks -- number of blocks in simhash computing

  • hamming_distance -- the maximum hamming distance threshold for near-duplicate detection. When the hamming distance between two sample texts is <= this threshold, they are regarded as similar samples and this op keeps only one of them after deduplication. This threshold must always be less than num_blocks
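The constraint that hamming_distance be less than num_blocks follows from the pigeonhole principle: if two 64-bit fingerprints differ in at most k bits and each is split into b > k blocks, then at least b - k blocks are bit-identical, so candidate pairs can be retrieved by indexing on block subsets instead of comparing every pair. A minimal sketch of that idea (helper names are hypothetical, not the library's API):

```python
from itertools import combinations

def split_blocks(fp: int, num_blocks: int = 6, bits: int = 64):
    """Split a `bits`-wide fingerprint into `num_blocks` roughly equal blocks."""
    width = bits // num_blocks
    blocks = []
    for i in range(num_blocks):
        lo = i * width
        hi = bits if i == num_blocks - 1 else lo + width  # last block takes the remainder
        blocks.append((fp >> lo) & ((1 << (hi - lo)) - 1))
    return blocks

def candidate_keys(fp: int, num_blocks: int = 6, hamming_distance: int = 4):
    """Yield bucket keys. Two fingerprints within the hamming threshold share
    at least one key, because num_blocks - hamming_distance of their blocks
    must be identical (pigeonhole)."""
    blocks = split_blocks(fp, num_blocks)
    keep = num_blocks - hamming_distance
    for idx in combinations(range(num_blocks), keep):
        yield (idx, tuple(blocks[i] for i in idx))
```

Fingerprints are then grouped by key, and only pairs sharing a bucket are checked exactly, which is why the scheme breaks down if hamming_distance >= num_blocks (keep would be zero or negative).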

compute_hash(sample)[source]

Compute simhash values for the sample.

Parameters:

sample -- input sample

Returns:

sample with simhash value.
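The standard SimHash recipe behind this method: shingle the (optionally lowercased) text, hash every shingle, accumulate a +1/-1 vote per output bit, and keep the sign of each vote sum. A self-contained sketch under the 'character' tokenization, using md5 purely for illustration (the library's actual hash function and tokenization details may differ):

```python
import hashlib

def simhash(text: str, window_size: int = 6, lowercase: bool = True, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over character shingles of the text."""
    if lowercase:
        text = text.lower()
    tokens = list(text)  # 'character' tokenization
    shingles = [''.join(tokens[i:i + window_size])
                for i in range(max(1, len(tokens) - window_size + 1))]
    votes = [0] * bits
    for sh in shingles:
        h = int(hashlib.md5(sh.encode('utf-8')).hexdigest(), 16)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    # output bit is 1 wherever the weighted vote is positive
    return sum(1 << b for b in range(bits) if votes[b] > 0)
```

Because shared shingles vote identically, texts that overlap heavily produce fingerprints that agree on most bits, which is what makes the hamming-distance comparison in process meaningful.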

process(dataset, show_num=0)[source]

Deduplicate the dataset at document level (dataset --> dataset).

Parameters:
  • dataset -- input dataset

  • show_num -- number of traced samples to show when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
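Conceptually, the op groups samples whose simhash fingerprints lie within hamming_distance of each other and keeps one representative per group. A brute-force sketch of that filtering step using union-find (hypothetical helper, not the library's implementation, which prunes candidate pairs via the block index):

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count('1')

def dedup_by_simhash(hashes, hamming_distance: int = 4):
    """Return indices of samples to keep: one per near-duplicate cluster."""
    n = len(hashes)
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # brute-force pairing for clarity; O(n^2) comparisons
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(hashes[i], hashes[j]) <= hamming_distance:
                parent[find(j)] = find(i)

    seen, keep = set(), []
    for i in range(n):
        r = find(i)
        if r not in seen:
            seen.add(r)
            keep.append(i)
    return keep
```

The first sample of each cluster survives; the remaining members are the duplicate pairs that the tracer can sample for inspection via show_num.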