data_juicer.ops.deduplicator package¶
Submodules¶
data_juicer.ops.deduplicator.document_deduplicator module¶
- class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using exact matching.
Using md5 hash to deduplicate samples.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]¶
Initialization method.
- Parameters:
lowercase – Whether to convert sample text to lower case
ignore_non_character – Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation
args – extra args
kwargs – extra args.
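The exact-matching idea above can be sketched in a few lines of plain Python. This is a simplified, hypothetical stand-in for the op's internal logic, not the actual implementation; it only mirrors the lowercase and ignore_non_character options described above:

```python
import hashlib
import re

def md5_text_hash(text, lowercase=False, ignore_non_character=False):
    # Mirror the documented options: optional lowercasing and removal of
    # whitespace, digits, and punctuation before hashing.
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = re.sub(r'[^a-zA-Z]', '', text)
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def exact_dedup(texts, **options):
    # Keep only the first sample seen for each distinct md5 value.
    seen, kept = set(), []
    for t in texts:
        h = md5_text_hash(t, **options)
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept
```

With lowercase=True, "Hello" and "hello" hash identically and only the first is kept.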
data_juicer.ops.deduplicator.document_minhash_deduplicator module¶
- data_juicer.ops.deduplicator.document_minhash_deduplicator.sha1_hash32(data)[source]¶
Directly taken from datasketch package to avoid dependency.
- Parameters:
data (bytes)
- Return type:
int
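A minimal sketch of such a 32-bit SHA-1 hash, assuming (as datasketch does) that the value is read from the first four bytes of the digest:

```python
import hashlib
import struct

def sha1_hash32(data):
    # Interpret the first 4 bytes of the SHA-1 digest as a little-endian
    # unsigned 32-bit integer.
    return struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]
```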
- data_juicer.ops.deduplicator.document_minhash_deduplicator.optimal_param(threshold: float, num_perm: int, false_positive_weight: float = 0.5, false_negative_weight: float = 0.5)[source]¶
Compute the optimal MinHashLSH parameter that minimizes the weighted sum of probabilities of false positive and false negative, taken from datasketch.
- Parameters:
threshold – float. The threshold for similarity
num_perm – int. The number of permutations
false_positive_weight – float. The weight of false positive
false_negative_weight – float. The weight of false negative
- Returns:
Tuple[int, int]. The optimal b and r parameters. The number of bands, and the number of rows per band respectively
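The search can be sketched as a brute-force scan over feasible (b, r) pairs, numerically integrating the LSH collision curve P(s) = 1 - (1 - s^r)^b to estimate the false positive and false negative probabilities. This is a simplified re-implementation of the idea, not the exact datasketch code:

```python
def optimal_param(threshold, num_perm,
                  false_positive_weight=0.5, false_negative_weight=0.5):
    def prob(s, b, r):
        # Probability that two documents with Jaccard similarity s
        # collide in at least one of b bands of r rows each.
        return 1.0 - (1.0 - s ** r) ** b

    def integrate(f, lo, hi, steps=100):
        # Simple midpoint-rule numerical integration.
        w = (hi - lo) / steps
        return sum(f(lo + (i + 0.5) * w) for i in range(steps)) * w

    best, opt = float('inf'), (0, 0)
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            # False positives: collisions below the similarity threshold.
            fp = integrate(lambda s: prob(s, b, r), 0.0, threshold)
            # False negatives: misses above the similarity threshold.
            fn = integrate(lambda s: 1.0 - prob(s, b, r), threshold, 1.0)
            err = false_positive_weight * fp + false_negative_weight * fn
            if err < best:
                best, opt = err, (b, r)
    return opt
```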
- class data_juicer.ops.deduplicator.document_minhash_deduplicator.DocumentMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using MinHashLSH.
Different from simhash, minhash values are stored as bytes, so they won't be kept in the final dataset.
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]¶
Initialization method.
- Parameters:
tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend using 'space'; for Chinese-like languages, we recommend using 'character'; and for multiple languages, we recommend using 'sentencepiece'. If using 'sentencepiece', please provide the model path in the 'tokenizer_model' field.
window_size – window size of shingling
lowercase – whether to convert text to lower case first
ignore_pattern – whether to ignore sub-strings with specific pattern when computing minhash
num_permutations – number of permutations in minhash computing
jaccard_threshold – the minimum Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication
num_bands – number of bands in LSH. It's None by default, in which case it will be determined by an optimal-parameter computation that minimizes the weighted sum of the probabilities of false positives and false negatives
num_rows_per_band – number of rows in each band in LSH. It's None by default, in which case it will be determined by the same optimal-parameter computation
tokenizer_model – path to the sentencepiece model, used for sentencepiece tokenization.
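Putting the parameters above together, the MinHash-plus-banding pipeline can be sketched as follows. This is an illustrative toy, not the op's implementation: salting a single hash with the permutation index approximates the num_permutations independent permutations a real MinHash uses.

```python
import hashlib
import struct

def sha1_hash32(data):
    # First 4 bytes of the SHA-1 digest as an unsigned 32-bit int.
    return struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]

def minhash_signature(text, num_perm=256, window_size=5):
    # Space tokenization + shingling, then the minimum salted hash per
    # "permutation" forms one slot of the signature.
    tokens = text.lower().split()
    shingles = {' '.join(tokens[i:i + window_size])
                for i in range(max(1, len(tokens) - window_size + 1))}
    sig = []
    for p in range(num_perm):
        salt = p.to_bytes(4, 'little')
        sig.append(min(sha1_hash32(salt + s.encode('utf-8'))
                       for s in shingles))
    return sig

def lsh_band_keys(sig, num_bands=32):
    # Split the signature into bands; two documents sharing any band key
    # become candidate near-duplicates.
    r = len(sig) // num_bands
    return [tuple(sig[i * r:(i + 1) * r]) for i in range(num_bands)]
```

Identical texts produce identical signatures and thus collide in every band; dissimilar texts rarely share any band key.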
data_juicer.ops.deduplicator.document_simhash_deduplicator module¶
- class data_juicer.ops.deduplicator.document_simhash_deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using SimHash.
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]¶
Initialization method. :param tokenization: tokenization method for sample texts.
It should be one of [space, punctuation, character]. For English-like languages, we recommend using 'space', and for Chinese-like languages, we recommend using 'character'.
- Parameters:
window_size – window size of shingling
lowercase – whether to convert text to lower case first
ignore_pattern – whether to ignore sub-strings with specific pattern when computing simhash
num_blocks – number of blocks in simhash computing
hamming_distance – the max hamming distance threshold in near-duplicate detection. When the hamming distance of two sample texts is <= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication. This threshold should always be less than num_blocks
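The SimHash fingerprint itself can be sketched in pure Python. This is a toy 64-bit version using md5 as the shingle hash; the op's actual hashing and block-based candidate matching differ:

```python
import hashlib

def simhash64(text, window_size=6):
    # Shingle with space tokenization, hash each shingle to 64 bits, and
    # let each shingle vote +1/-1 per bit position; the sign of the vote
    # total gives the fingerprint bit.
    tokens = text.lower().split()
    shingles = [' '.join(tokens[i:i + window_size])
                for i in range(max(1, len(tokens) - window_size + 1))]
    votes = [0] * 64
    for s in shingles:
        h = int.from_bytes(hashlib.md5(s.encode('utf-8')).digest()[:8],
                           'little')
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)

def hamming_distance(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count('1')
```

Near-duplicate texts flip only a few votes, so their fingerprints stay within a small hamming distance.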
data_juicer.ops.deduplicator.image_deduplicator module¶
- class data_juicer.ops.deduplicator.image_deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using exact matching of images between documents.
- __init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]¶
Initialization method.
- Parameters:
method – hash method for image
consider_text – whether to consider text hash together with image hash when applying deduplication.
args – extra args
kwargs – extra args
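The hash-then-exact-match idea behind image deduplication can be illustrated with a toy average hash over a 2D grayscale image. Real 'phash' is DCT-based; this hypothetical sketch only shows how an image collapses to a bit fingerprint that can be matched exactly:

```python
def average_hash(gray, size=8):
    # gray: 2D list of 0-255 ints, dimensions divisible by `size`.
    # Downscale by block averaging, then set one bit per cell depending
    # on whether its average is above the global mean.
    h, w = len(gray), len(gray[0])
    bh, bw = h // size, w // size
    cells = []
    for by in range(size):
        for bx in range(size):
            block = [gray[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return sum(1 << i for i, c in enumerate(cells) if c > mean)
```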
data_juicer.ops.deduplicator.ray_basic_deduplicator module¶
- class data_juicer.ops.deduplicator.ray_basic_deduplicator.RayBasicDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]¶
Bases:
Filter
A basic exact-matching deduplicator for Ray. Although its functionality is deduplication, it is implemented as a Filter sub-class.
- EMPTY_HASH_VALUE = 'EMPTY'¶
- __init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]¶
Initialization. :param redis_host: the hostname of the redis server :param redis_port: the port of the redis server :param args: extra args :param kwargs: extra args
- compute_stats_single(sample, context=False)[source]¶
Compute stats for the sample, which are used as a metric to decide whether to filter this sample.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
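The Redis-backed flow reduces to an "insert key if absent" check per sample hash. A dict-backed stand-in (hypothetical, replacing the real Redis client) shows the pattern, including an EMPTY_HASH_VALUE fallback for empty text:

```python
import hashlib

class InMemoryDedupStore:
    # Stand-in for the Redis server: a SETNX-style "insert if absent"
    # decides whether a sample is the first of its kind.
    def __init__(self):
        self._seen = set()

    def set_if_absent(self, key):
        # True when the key was newly inserted (keep the sample),
        # False when it was already present (drop the duplicate).
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

def dedup_stream(texts, store):
    kept = []
    for t in texts:
        # Empty text gets a sentinel key instead of a content hash.
        h = hashlib.md5(t.encode('utf-8')).hexdigest() if t else 'EMPTY'
        if store.set_if_absent(h):
            kept.append(t)
    return kept
```

Because the store is shared, the check stays correct when many workers process different shards of the same dataset.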
data_juicer.ops.deduplicator.ray_document_deduplicator module¶
- class data_juicer.ops.deduplicator.ray_document_deduplicator.RayDocumentDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]¶
Bases:
RayBasicDeduplicator
Deduplicator to deduplicate samples at document-level using exact matching.
- __init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]¶
Initialization method. :param redis_host: the hostname of the redis server :param redis_port: the port of the redis server :param lowercase: whether to convert sample text to lower case :param ignore_non_character: whether to ignore non-alphabet characters, including whitespace, digits, and punctuation :param args: extra args :param kwargs: extra args
data_juicer.ops.deduplicator.ray_image_deduplicator module¶
- class data_juicer.ops.deduplicator.ray_image_deduplicator.RayImageDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, method: str = 'phash', *args, **kwargs)[source]¶
Bases:
RayBasicDeduplicator
Deduplicator to deduplicate samples at document-level using exact matching of images between documents.
data_juicer.ops.deduplicator.ray_video_deduplicator module¶
- class data_juicer.ops.deduplicator.ray_video_deduplicator.RayVideoDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]¶
Bases:
RayBasicDeduplicator
Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.
data_juicer.ops.deduplicator.video_deduplicator module¶
- class data_juicer.ops.deduplicator.video_deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.
- __init__(consider_text: bool = False, *args, **kwargs)[source]¶
Initialization.
- Parameters:
consider_text – whether to consider text hash together with video hash when applying deduplication.
args – extra args
kwargs – extra args
Module contents¶
- class data_juicer.ops.deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using exact matching.
Using md5 hash to deduplicate samples.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]¶
Initialization method.
- Parameters:
lowercase – Whether to convert sample text to lower case
ignore_non_character – Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation
args – extra args
kwargs – extra args.
- class data_juicer.ops.deduplicator.DocumentMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using MinHashLSH.
Different from simhash, minhash values are stored as bytes, so they won't be kept in the final dataset.
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]¶
Initialization method.
- Parameters:
tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend using 'space'; for Chinese-like languages, we recommend using 'character'; and for multiple languages, we recommend using 'sentencepiece'. If using 'sentencepiece', please provide the model path in the 'tokenizer_model' field.
window_size – window size of shingling
lowercase – whether to convert text to lower case first
ignore_pattern – whether to ignore sub-strings with specific pattern when computing minhash
num_permutations – number of permutations in minhash computing
jaccard_threshold – the minimum Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication
num_bands – number of bands in LSH. It's None by default, in which case it will be determined by an optimal-parameter computation that minimizes the weighted sum of the probabilities of false positives and false negatives
num_rows_per_band – number of rows in each band in LSH. It's None by default, in which case it will be determined by the same optimal-parameter computation
tokenizer_model – path to the sentencepiece model, used for sentencepiece tokenization.
- class data_juicer.ops.deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using SimHash.
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]¶
Initialization method. :param tokenization: tokenization method for sample texts.
It should be one of [space, punctuation, character]. For English-like languages, we recommend using 'space', and for Chinese-like languages, we recommend using 'character'.
- Parameters:
window_size – window size of shingling
lowercase – whether to convert text to lower case first
ignore_pattern – whether to ignore sub-strings with specific pattern when computing simhash
num_blocks – number of blocks in simhash computing
hamming_distance – the max hamming distance threshold in near-duplicate detection. When the hamming distance of two sample texts is <= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication. This threshold should always be less than num_blocks
- class data_juicer.ops.deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using exact matching of images between documents.
- __init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]¶
Initialization method.
- Parameters:
method – hash method for image
consider_text – whether to consider text hash together with image hash when applying deduplication.
args – extra args
kwargs – extra args
- class data_juicer.ops.deduplicator.RayBasicDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]¶
Bases:
Filter
A basic exact-matching deduplicator for Ray. Although its functionality is deduplication, it is implemented as a Filter sub-class.
- EMPTY_HASH_VALUE = 'EMPTY'¶
- __init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]¶
Initialization. :param redis_host: the hostname of the redis server :param redis_port: the port of the redis server :param args: extra args :param kwargs: extra args
- compute_stats_single(sample, context=False)[source]¶
Compute stats for the sample, which are used as a metric to decide whether to filter this sample.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
- class data_juicer.ops.deduplicator.RayDocumentDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]¶
Bases:
RayBasicDeduplicator
Deduplicator to deduplicate samples at document-level using exact matching.
- __init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]¶
Initialization method. :param redis_host: the hostname of the redis server :param redis_port: the port of the redis server :param lowercase: whether to convert sample text to lower case :param ignore_non_character: whether to ignore non-alphabet characters, including whitespace, digits, and punctuation :param args: extra args :param kwargs: extra args
- class data_juicer.ops.deduplicator.RayImageDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, method: str = 'phash', *args, **kwargs)[source]¶
Bases:
RayBasicDeduplicator
Deduplicator to deduplicate samples at document-level using exact matching of images between documents.
- class data_juicer.ops.deduplicator.RayVideoDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]¶
Bases:
RayBasicDeduplicator
Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.
- class data_juicer.ops.deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]¶
Bases:
Deduplicator
Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.
- __init__(consider_text: bool = False, *args, **kwargs)[source]¶
Initialization.
- Parameters:
consider_text – whether to consider text hash together with video hash when applying deduplication.
args – extra args
kwargs – extra args