data_juicer.ops.deduplicator.document_simhash_deduplicator module
- class data_juicer.ops.deduplicator.document_simhash_deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]
Bases:
Deduplicator
Deduplicator that removes near-duplicate samples at the document level using SimHash.
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]
Initialization method.
- Parameters:
tokenization -- tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, we recommend using 'space'; for Chinese-like languages, we recommend using 'character'.
window_size -- window size of shingling
lowercase -- whether to convert text to lower case first
ignore_pattern -- pattern of sub-strings to ignore when computing simhash; with the default None, nothing is ignored
num_blocks -- number of blocks used in the simhash computation
hamming_distance -- the maximum Hamming distance threshold for near-duplicate detection. When the Hamming distance between two sample texts is <= this threshold, they are regarded as similar samples and this op keeps only one of them after deduplication. This threshold should always be less than num_blocks
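The parameters above can be illustrated with a minimal, standalone sketch. It is not Data-Juicer's implementation and does not import the library. The helper names (shingles, simhash64, hamming, split_blocks), the 64-bit fingerprint width, the MD5-based feature hash, and the whitespace tokenization are all assumptions made for illustration. The sketch shows how window_size shapes the shingles, and why hamming_distance is expected to be less than num_blocks: if two fingerprints differ in at most k bits and are cut into more than k blocks, at least one block must match exactly, which lets candidates be found by block lookup.

```python
import hashlib
import re


def shingles(text, window_size=6, lowercase=True, ignore_pattern=None):
    """Whitespace-tokenize text and return overlapping token windows (assumed scheme)."""
    if lowercase:
        text = text.lower()
    if ignore_pattern:
        # Drop every sub-string matching the pattern before tokenizing.
        text = re.sub(ignore_pattern, "", text)
    tokens = text.split()
    if len(tokens) <= window_size:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + window_size])
            for i in range(len(tokens) - window_size + 1)]


def simhash64(features):
    """Toy 64-bit SimHash: per-bit majority vote over the feature hashes."""
    counts = [0] * 64
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def split_blocks(fingerprint, num_blocks=6, bits=64):
    """Cut a fingerprint into num_blocks contiguous bit ranges, low bits first."""
    base, extra = divmod(bits, num_blocks)
    blocks, shift = [], 0
    for i in range(num_blocks):
        width = base + (1 if i < extra else 0)
        blocks.append((fingerprint >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def is_near_duplicate(a, b, hamming_distance=4):
    """Two fingerprints are near-duplicates when they differ in <= hamming_distance bits."""
    return hamming(a, b) <= hamming_distance
```

With the defaults num_blocks=6 and hamming_distance=4, any two fingerprints that pass is_near_duplicate share at least two identical blocks under split_blocks, so comparing only samples that collide on some block would not miss a qualifying pair in this sketch.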