data_juicer.ops.deduplicator.document_minhash_deduplicator module

data_juicer.ops.deduplicator.document_minhash_deduplicator.sha1_hash32(data)[source]

Directly taken from the datasketch package to avoid a dependency.

Parameters:

data (bytes)

Return type:

int
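
The datasketch implementation this mirrors is compact enough to sketch inline. A minimal reproduction, assuming only the standard-library hashlib and struct modules:

    import hashlib
    import struct

    def sha1_hash32(data: bytes) -> int:
        # SHA-1 the input, keep the first 4 bytes of the digest, and
        # read them as an unsigned little-endian 32-bit integer.
        return struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]

Truncating to 32 bits keeps MinHash signatures small while staying uniform enough for hashing purposes.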

data_juicer.ops.deduplicator.document_minhash_deduplicator.optimal_param(threshold: float, num_perm: int, false_positive_weight: float = 0.5, false_negative_weight: float = 0.5)[source]

Compute the optimal MinHashLSH parameters that minimize the weighted sum of the probabilities of false positives and false negatives; taken from datasketch.

Parameters:
  • threshold -- float. The threshold for similarity

  • num_perm -- int. The number of permutations

  • false_positive_weight -- float. The weight of false positive

  • false_negative_weight -- float. The weight of false negative

Returns:

Tuple[int, int]. The optimal b and r parameters: the number of bands and the number of rows per band, respectively.
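
As a sketch of what this search does (mirroring the datasketch routine; scipy's quad is assumed for the integrals), the false-positive and false-negative probabilities are integrated on each side of the threshold, and every (b, r) pair with b * r <= num_perm is scored:

    from scipy.integrate import quad

    def _fp_probability(threshold, b, r):
        # Probability mass of LSH collisions below the threshold.
        area, _ = quad(lambda s: 1 - (1 - s**r)**b, 0.0, threshold)
        return area

    def _fn_probability(threshold, b, r):
        # Probability mass of missed pairs at or above the threshold.
        area, _ = quad(lambda s: 1 - (1 - (1 - s**r)**b), threshold, 1.0)
        return area

    def optimal_param(threshold, num_perm,
                      false_positive_weight=0.5, false_negative_weight=0.5):
        min_error, opt = float('inf'), (0, 0)
        for b in range(1, num_perm + 1):
            for r in range(1, num_perm // b + 1):
                error = (false_positive_weight * _fp_probability(threshold, b, r)
                         + false_negative_weight * _fn_probability(threshold, b, r))
                if error < min_error:
                    min_error, opt = error, (b, r)
        return opt

For example, optimal_param(0.7, 256) yields the (num_bands, num_rows_per_band) pair the deduplicator below falls back to when those arguments are left as None.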

class data_juicer.ops.deduplicator.document_minhash_deduplicator.DocumentMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, Ge(ge=0), Le(le=1)] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicates samples at the document level using MinHash LSH.

This operator computes MinHash values for each sample and uses Locality-Sensitive Hashing (LSH) to identify and remove near-duplicate documents. The Jaccard similarity threshold determines when two documents are considered duplicates. The tokenization method can be customized, and a Hugging Face tokenizer can be used for 'sentencepiece' tokenization. The minhash values are stored as bytes and are not kept in the final dataset. The number of bands and rows per band in LSH can be set manually or determined by an optimal parameter computation algorithm.

Important notes:

  • If using 'punctuation' tokenization with an ignore pattern, ensure the pattern does not include punctuation.

  • For 'sentencepiece' tokenization, a tokenizer model path is required.

  • The deduplication process involves clustering and filtering, and only unique samples, or the first sample in each cluster, are retained.
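
A minimal construction sketch (values are illustrative and match the documented defaults; in practice the op is usually configured through a Data-Juicer recipe):

    from data_juicer.ops.deduplicator.document_minhash_deduplicator import (
        DocumentMinhashDeduplicator,
    )

    # English-like text: space tokenization with 5-gram shingles.
    op = DocumentMinhashDeduplicator(
        tokenization='space',
        window_size=5,
        lowercase=True,
        num_permutations=256,
        jaccard_threshold=0.7,  # pairs at >= 0.7 Jaccard count as duplicates
    )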

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, Ge(ge=0), Le(le=1)] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization -- tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'; and for multilingual corpora, we recommend 'sentencepiece'. If using 'sentencepiece', please provide the model path in the 'tokenizer_model' field.

  • window_size -- window size of shingling

  • lowercase -- whether to convert text to lower case first

  • ignore_pattern -- if provided, sub-strings matching this pattern are ignored when computing minhash

  • num_permutations -- number of permutations in minhash computing

  • jaccard_threshold -- the minimum Jaccard similarity threshold for near-duplicate detection. When the Jaccard similarity between two sample texts is >= this threshold, they are regarded as similar samples, and this op will keep only one of them after deduplication

  • num_bands -- number of bands in LSH. Defaults to None, in which case it is determined by an optimal parameter computation algorithm that minimizes the weighted sum of the probabilities of false positives and false negatives

  • num_rows_per_band -- number of rows in each band in LSH. Defaults to None, in which case it is determined by the same optimal parameter computation algorithm

  • tokenizer_model -- path to the sentencepiece model, used for 'sentencepiece' tokenization.
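
A hedged sketch of a multilingual configuration (the tokenizer path is a placeholder; the band/row split is pinned manually here instead of letting the optimal-parameter search choose it):

    # Hypothetical multilingual setup; supply your own sentencepiece model.
    op = DocumentMinhashDeduplicator(
        tokenization='sentencepiece',
        tokenizer_model='/path/to/tokenizer.model',  # placeholder path
        num_permutations=256,
        num_bands=32,
        num_rows_per_band=8,  # 32 bands * 8 rows = 256 permutations
    )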

compute_hash(sample)[source]

Compute minhash values for the sample.

Parameters:

sample -- input sample

Returns:

the sample with its minhash values attached.
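
A simplified, pure-Python sketch of the underlying MinHash computation (the helper name and the explicit loop are illustrative, not the op's actual internals, which shingle the text according to the configured tokenization and store the signature as bytes; sha1_hash32 is the function documented above):

    import random

    PRIME = (1 << 61) - 1     # Mersenne prime used for universal hashing
    MAX_HASH = (1 << 32) - 1  # truncate hashes back to 32 bits

    def minhash_signature(shingles, perm_a, perm_b):
        # One universal hash (a*h + b) mod p per permutation; the
        # signature keeps the per-permutation minimum over all shingles.
        base = [sha1_hash32(s.encode('utf-8')) for s in shingles]
        return [min(((a * h + b) % PRIME) & MAX_HASH for h in base)
                for a, b in zip(perm_a, perm_b)]

    # Hypothetical usage with 256 random permutations:
    rng = random.Random(42)
    perm_a = [rng.randint(1, PRIME - 1) for _ in range(256)]
    perm_b = [rng.randint(0, PRIME - 1) for _ in range(256)]
    signature = minhash_signature({'hello', 'world'}, perm_a, perm_b)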

process(dataset, show_num=0)[source]

For doc-level deduplication, dataset --> dataset.

Parameters:
  • dataset -- input dataset

  • show_num -- number of traced samples used when the tracer is enabled.

Returns:

the deduplicated dataset and the sampled duplicate pairs.
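
A hedged end-to-end sketch of how the two methods combine (the map call assumes a Hugging Face-style dataset; the tuple return follows the description above):

    # Hash stage: attach a MinHash signature to every sample.
    dataset = dataset.map(op.compute_hash)

    # Dedup stage: cluster near-duplicates via LSH and keep one sample
    # per cluster; show_num > 0 also samples duplicate pairs for tracing.
    dataset, dup_pairs = op.process(dataset, show_num=3)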