data_juicer.ops.deduplicator package

Submodules

data_juicer.ops.deduplicator.document_deduplicator module

class data_juicer.ops.deduplicator.document_deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching.

It uses md5 hashes to deduplicate samples.

__init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lowercase – whether to convert sample text to lower case

  • ignore_non_character – whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation

  • args – extra args

  • kwargs – extra args

compute_hash(sample)[source]

Compute md5 hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with md5 hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
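
For orientation, here is a minimal usage sketch in Python. The 'text' sample field and the manual map-then-process sequence are assumptions based on how Data-Juicer's executor typically drives deduplicators, not the exact pipeline:

    # Hedged sketch: run DocumentDeduplicator by hand over a toy dataset.
    from datasets import Dataset
    from data_juicer.ops.deduplicator.document_deduplicator import DocumentDeduplicator

    op = DocumentDeduplicator(lowercase=True, ignore_non_character=False)
    ds = Dataset.from_list([
        {'text': 'Hello, World!'},
        {'text': 'hello, world!'},   # duplicate once lowercased
        {'text': 'Something else.'},
    ])
    ds = ds.map(op.compute_hash)         # attach an md5 hash to each sample
    deduped, dup_pairs = op.process(ds)  # keep one sample per distinct hash
    print(len(deduped))                  # expect 2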

data_juicer.ops.deduplicator.document_minhash_deduplicator module

data_juicer.ops.deduplicator.document_minhash_deduplicator.sha1_hash32(data)[source]

Taken directly from the datasketch package to avoid the dependency.

Parameters:

data (bytes)

Return type:

int
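
For reference, the datasketch implementation this mirrors is essentially the following (an illustrative re-derivation, not a copy of this module's source):

    import hashlib
    import struct

    def sha1_hash32(data: bytes) -> int:
        # Interpret the first 4 bytes of the SHA-1 digest as a
        # little-endian unsigned 32-bit integer.
        return struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]

    assert 0 <= sha1_hash32(b'hello') < 2 ** 32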

data_juicer.ops.deduplicator.document_minhash_deduplicator.optimal_param(threshold: float, num_perm: int, false_positive_weight: float = 0.5, false_negative_weight: float = 0.5)[source]

Compute the optimal MinHashLSH parameters that minimize the weighted sum of the probabilities of false positives and false negatives. Taken from datasketch.

Parameters:
  • threshold – float. The threshold for similarity

  • num_perm – int. The number of permutations

  • false_positive_weight – float. The weight of false positive

  • false_negative_weight – float. The weight of false negative

Returns:

Tuple[int, int]. The optimal (b, r) parameters: the number of bands and the number of rows per band, respectively.
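
The returned (b, r) pair defines the usual LSH S-curve: a pair of documents with Jaccard similarity s lands in the same bucket in at least one band with probability 1 - (1 - s^r)^b. A short sketch of how the result is typically used:

    # Hedged sketch: pick LSH parameters for a 0.7 threshold and 256 permutations.
    b, r = optimal_param(threshold=0.7, num_perm=256)
    assert b * r <= 256  # bands times rows cannot exceed the permutation count

    # Collision probability for a candidate pair with Jaccard similarity s.
    def collision_prob(s: float) -> float:
        return 1 - (1 - s ** r) ** b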

class data_juicer.ops.deduplicator.document_minhash_deduplicator.DocumentMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using MinHashLSH.

Different from simhash, minhash values are stored as bytes, so they won't be kept in the final dataset.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'; and for multiple languages, we recommend 'sentencepiece'. If using 'sentencepiece', please provide the model path in the 'tokenizer_model' field.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – a regular expression; sub-strings matching this pattern are ignored when computing minhash

  • num_permutations – number of permutations in minhash computing

  • jaccard_threshold – the minimum Jaccard similarity threshold for near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar and this op keeps only one of them after deduplication

  • num_bands – number of bands in LSH. Defaults to None, in which case it is determined by an optimal-parameter computation that minimizes the weighted sum of the probabilities of false positives and false negatives

  • num_rows_per_band – number of rows in each LSH band. Defaults to None, in which case it is determined by the same optimal-parameter computation

  • tokenizer_model – path to the sentencepiece model, used for sentencepiece tokenization.

compute_hash(sample)[source]

Compute minhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with minhash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
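
A hedged usage sketch mirroring the exact-match example above; whether the two near-duplicates below are actually merged depends on their shingle overlap relative to jaccard_threshold:

    from datasets import Dataset
    from data_juicer.ops.deduplicator.document_minhash_deduplicator import \
        DocumentMinhashDeduplicator

    op = DocumentMinhashDeduplicator(
        tokenization='space',
        window_size=5,
        num_permutations=256,
        jaccard_threshold=0.7,
    )
    ds = Dataset.from_list([
        {'text': 'data juicer provides many operators for cleaning text data at scale'},
        {'text': 'data juicer provides many operators for cleaning text data at large scale'},
        {'text': 'an entirely different document about something else'},
    ])
    ds = ds.map(op.compute_hash)         # attach minhash signatures (bytes)
    deduped, dup_pairs = op.process(ds)  # cluster via LSH, keep one per cluster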

data_juicer.ops.deduplicator.document_simhash_deduplicator module

class data_juicer.ops.deduplicator.document_simhash_deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using SimHash.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – a regular expression; sub-strings matching this pattern are ignored when computing simhash

  • num_blocks – number of blocks in simhash computing

  • hamming_distance – the max hamming distance threshold for near-duplicate detection. When the hamming distance of two sample texts is <= this threshold, they are regarded as similar and this op keeps only one of them after deduplication. This threshold should always be less than num_blocks

compute_hash(sample)[source]

Compute simhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with simhash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
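
The num_blocks/hamming_distance interplay follows the standard block-permuted simhash search: if two signatures are within Hamming distance k, then by the pigeonhole principle they agree exactly on at least num_blocks - k blocks, which is what makes block-wise candidate indexing sound. A small illustrative check (the signature width below is arbitrary, not taken from this module):

    # Illustrative Hamming-distance check between two simhash signatures.
    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count('1')

    sig_a = 0b1011_0110_0101_1100
    sig_b = 0b1011_0100_0101_1110  # differs in 2 bit positions
    assert hamming(sig_a, sig_b) == 2
    # With num_blocks=6 and hamming_distance=4, signatures within distance 4
    # must agree on at least 6 - 4 = 2 whole blocks, so candidates can be
    # found by block indexing instead of all-pairs comparison.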

data_juicer.ops.deduplicator.image_deduplicator module

data_juicer.ops.deduplicator.image_deduplicator.get_hash_method(method_name)[source]

class data_juicer.ops.deduplicator.image_deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching of images between documents.

__init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • method – hash method for image

  • consider_text – whether to consider text hash together with image hash when applying deduplication.

  • args – extra args

  • kwargs – extra args

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
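
A hedged usage sketch; the 'images' list-of-paths sample layout and file names are assumptions based on Data-Juicer's usual multimodal sample schema:

    from datasets import Dataset
    from data_juicer.ops.deduplicator.image_deduplicator import ImageDeduplicator

    op = ImageDeduplicator(method='phash', consider_text=False)
    ds = Dataset.from_list([
        {'text': 'a cat', 'images': ['imgs/cat.jpg']},
        {'text': 'also a cat', 'images': ['imgs/cat_copy.jpg']},  # identical file content
        {'text': 'a dog', 'images': ['imgs/dog.jpg']},
    ])
    ds = ds.map(op.compute_hash)         # hash each sample's images
    deduped, dup_pairs = op.process(ds)  # drop samples whose image hashes match exactly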

data_juicer.ops.deduplicator.ray_basic_deduplicator module

class data_juicer.ops.deduplicator.ray_basic_deduplicator.RayBasicDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]

Bases: Filter

A basic exact-matching deduplicator for Ray. Although its functionality is deduplication, it is implemented as a Filter sub-class.

EMPTY_HASH_VALUE = 'EMPTY'

__init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]

Initialization method.

Parameters:
  • redis_host – the hostname of the redis server

  • redis_port – the port of the redis server

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

compute_stats_single(sample, context=False)[source]

Compute stats for the sample; the stats are used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample-level, sample --> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
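
The Redis dependency suggests the dedup decision is a first-writer-wins check shared across Ray workers. A minimal sketch of that idea using redis-py (the key prefix is illustrative, not this class's actual scheme):

    import redis

    r = redis.StrictRedis(host='localhost', port=6380)

    def keep(sample_hash: str) -> bool:
        # SETNX stores the key only if it does not already exist and
        # reports whether it did, so exactly one worker wins the race
        # for each hash and only the first such sample is kept.
        return bool(r.setnx(f'dedup:{sample_hash}', 1))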

data_juicer.ops.deduplicator.ray_document_deduplicator module

class data_juicer.ops.deduplicator.ray_document_deduplicator.RayDocumentDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching.

__init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • redis_host – the hostname of the redis server

  • redis_port – the port of the redis server

  • lowercase – whether to convert sample text to lower case

  • ignore_non_character – whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.
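
A hedged construction sketch against a local Redis (the Ray job setup and dataset plumbing are omitted; this only shows the constructor):

    from data_juicer.ops.deduplicator.ray_document_deduplicator import \
        RayDocumentDeduplicator

    # Assumes a Redis server is reachable at localhost:6380 (the class default).
    op = RayDocumentDeduplicator(
        redis_host='localhost',
        redis_port=6380,
        lowercase=True,
        ignore_non_character=False,
    )
    # As a Filter sub-class it is driven via compute_stats_single /
    # process_single: a sample passes only if its hash was unseen.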

data_juicer.ops.deduplicator.ray_image_deduplicator module

data_juicer.ops.deduplicator.ray_image_deduplicator.get_hash_method(method_name)[source]

class data_juicer.ops.deduplicator.ray_image_deduplicator.RayImageDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, method: str = 'phash', *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching of images between documents.

__init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, method: str = 'phash', *args, **kwargs)[source]

Initialization method.

Parameters:
  • redis_host – the hostname of the redis server

  • redis_port – the port of the redis server

  • method – hash method for image

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

data_juicer.ops.deduplicator.ray_video_deduplicator module

class data_juicer.ops.deduplicator.ray_video_deduplicator.RayVideoDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.

__init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]

Initialization method.

Parameters:
  • redis_host – the hostname of the redis server

  • redis_port – the port of the redis server

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

data_juicer.ops.deduplicator.video_deduplicator module

class data_juicer.ops.deduplicator.video_deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.

__init__(consider_text: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • consider_text – whether to consider text hash together with video hash when applying deduplication.

  • args – extra args

  • kwargs – extra args

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
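
A hedged usage sketch mirroring the image example; the 'videos' list-of-paths field and file names are assumptions based on Data-Juicer's usual multimodal sample schema:

    from datasets import Dataset
    from data_juicer.ops.deduplicator.video_deduplicator import VideoDeduplicator

    op = VideoDeduplicator(consider_text=True)  # require both text and video hashes to match
    ds = Dataset.from_list([
        {'text': 'clip A', 'videos': ['vids/a.mp4']},
        {'text': 'clip A', 'videos': ['vids/a_copy.mp4']},  # identical file content
        {'text': 'clip B', 'videos': ['vids/b.mp4']},
    ])
    ds = ds.map(op.compute_hash)         # hash each sample's videos (and text)
    deduped, dup_pairs = op.process(ds)  # drop exact-duplicate samples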

Module contents

class data_juicer.ops.deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching.

It uses md5 hashes to deduplicate samples.

__init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lowercase – whether to convert sample text to lower case

  • ignore_non_character – whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation

  • args – extra args

  • kwargs – extra args

compute_hash(sample)[source]

Compute md5 hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with md5 hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.DocumentMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using MinHashLSH.

Different from simhash, minhash values are stored as bytes, so they won't be kept in the final dataset.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'; and for multiple languages, we recommend 'sentencepiece'. If using 'sentencepiece', please provide the model path in the 'tokenizer_model' field.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – a regular expression; sub-strings matching this pattern are ignored when computing minhash

  • num_permutations – number of permutations in minhash computing

  • jaccard_threshold – the minimum Jaccard similarity threshold for near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar and this op keeps only one of them after deduplication

  • num_bands – number of bands in LSH. Defaults to None, in which case it is determined by an optimal-parameter computation that minimizes the weighted sum of the probabilities of false positives and false negatives

  • num_rows_per_band – number of rows in each LSH band. Defaults to None, in which case it is determined by the same optimal-parameter computation

  • tokenizer_model – path to the sentencepiece model, used for sentencepiece tokenization.

compute_hash(sample)[source]

Compute minhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with minhash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using SimHash.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – a regular expression; sub-strings matching this pattern are ignored when computing simhash

  • num_blocks – number of blocks in simhash computing

  • hamming_distance – the max hamming distance threshold for near-duplicate detection. When the hamming distance of two sample texts is <= this threshold, they are regarded as similar and this op keeps only one of them after deduplication. This threshold should always be less than num_blocks

compute_hash(sample)[source]

Compute simhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with simhash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching of images between documents.

__init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • method – hash method for image

  • consider_text – whether to consider text hash together with image hash when applying deduplication.

  • args – extra args

  • kwargs – extra args

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.RayBasicDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]

Bases: Filter

A basic exact-matching deduplicator for Ray. Although its functionality is deduplication, it is implemented as a Filter sub-class.

EMPTY_HASH_VALUE = 'EMPTY'

__init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]

Initialization method.

Parameters:
  • redis_host – the hostname of the redis server

  • redis_port – the port of the redis server

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

compute_stats_single(sample, context=False)[source]

Compute stats for the sample; the stats are used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample-level, sample --> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.deduplicator.RayDocumentDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching.

__init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • redis_host – the hostname of the redis server

  • redis_port – the port of the redis server

  • lowercase – whether to convert sample text to lower case

  • ignore_non_character – whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

class data_juicer.ops.deduplicator.RayImageDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, method: str = 'phash', *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching of images between documents.

__init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, method: str = 'phash', *args, **kwargs)[source]

Initialization method.

Parameters:
  • redis_host – the hostname of the redis server

  • redis_port – the port of the redis server

  • method – hash method for image

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

class data_juicer.ops.deduplicator.RayVideoDeduplicator(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.

__init__(redis_host: str = 'localhost', redis_port: Annotated[int, Gt(gt=0)] = 6380, *args, **kwargs)[source]

Initialization method.

Parameters:
  • redis_host – the hostname of the redis server

  • redis_port – the port of the redis server

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

class data_juicer.ops.deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.

__init__(consider_text: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • consider_text – whether to consider text hash together with video hash when applying deduplication.

  • args – extra args

  • kwargs – extra args

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.