data_juicer.ops.deduplicator

class data_juicer.ops.deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching.

It uses an md5 hash to deduplicate samples.

__init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lowercase – Whether to convert sample text to lower case

  • ignore_non_character – Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation

  • args – extra args

  • kwargs – extra args.
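
A minimal usage sketch; the 'text' field in the sample dict is an assumption about the sample layout, not prescribed by this page:

from data_juicer.ops.deduplicator import DocumentDeduplicator

op = DocumentDeduplicator(lowercase=True, ignore_non_character=True)
sample = {'text': 'Hello, World!'}   # hypothetical sample layout
hashed = op.compute_hash(sample)     # attaches an md5 hash field to the sample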

compute_hash(sample)[source]

Compute md5 hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with md5 hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
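
A rough sketch of how compute_hash and process are typically combined, assuming a HuggingFace-style dataset with a map() method; in practice the Data-Juicer executor drives these calls:

from datasets import Dataset
from data_juicer.ops.deduplicator import DocumentDeduplicator

op = DocumentDeduplicator()
ds = Dataset.from_list([{'text': 'a'}, {'text': 'a'}, {'text': 'b'}])
ds = ds.map(op.compute_hash)                      # step 1: attach md5 hashes
deduped, dup_pairs = op.process(ds, show_num=1)   # step 2: drop exact duplicates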

class data_juicer.ops.deduplicator.DocumentMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using MinHashLSH.

Unlike simhash, the minhash values are stored as bytes, so they will not be kept in the final dataset.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend using 'space'; for Chinese-like languages, we recommend using 'character'; and for multiple languages, we recommend using 'sentencepiece'. If using 'sentencepiece', please provide the model path via the 'tokenizer_model' field.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – whether to ignore sub-strings matching a specific pattern when computing minhash

  • num_permutations – number of permutations in minhash computing

  • jaccard_threshold – the minimum Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication

  • num_bands – number of bands in LSH. By default it's None, and it will be determined by an optimal parameter computation algorithm that minimizes the weighted sum of the probabilities of false positives and false negatives

  • num_rows_per_band – number of rows in each band in LSH. By default it's None, and it will be determined by the same optimal parameter computation algorithm

  • tokenizer_model – path for the sentencepiece model, used for sentencepiece tokenization.
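
For intuition when choosing num_bands (b) and num_rows_per_band (r): in standard MinHashLSH, b * r should not exceed num_permutations, and two texts with Jaccard similarity s collide in at least one band with probability 1 - (1 - s^r)^b. A small sketch (the helper name is illustrative, not part of this op):

def lsh_collision_probability(s: float, b: int, r: int) -> float:
    # Probability that two texts with Jaccard similarity s share at least one band.
    return 1.0 - (1.0 - s ** r) ** b

# One possible split of 256 permutations into b=32 bands of r=8 rows:
print(lsh_collision_probability(0.7, b=32, r=8))  # ~0.85, likely flagged as near-duplicates
print(lsh_collision_probability(0.5, b=32, r=8))  # ~0.12, likely kept as distinct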

compute_hash(sample)[source]

Compute minhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with minhash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using SimHash.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, we recommend using 'space', and for Chinese-like languages, we recommend using 'character'.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – whether to ignore sub-strings matching a specific pattern when computing simhash

  • num_blocks – number of blocks in simhash computing

  • hamming_distance – the maximum Hamming distance threshold in near-duplicate detection. When the Hamming distance of two sample texts is <= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication. This threshold must always be less than num_blocks
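
A minimal instantiation sketch; note that hamming_distance must stay below num_blocks, and the 'text' field in the sample is an assumption about the sample layout:

from data_juicer.ops.deduplicator import DocumentSimhashDeduplicator

op = DocumentSimhashDeduplicator(
    tokenization='character',   # recommended for Chinese-like languages
    window_size=6,
    num_blocks=6,
    hamming_distance=4,         # must be less than num_blocks
)
hashed = op.compute_hash({'text': '一些中文文本'})   # attaches a simhash value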

compute_hash(sample)[source]

Compute simhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with simhash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching of images between documents.

__init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • method – hash method for image

  • consider_text – whether to consider text hash together with image hash when applying deduplication.

  • args – extra args

  • kwargs – extra args
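
A hedged configuration sketch; the 'images' and 'text' field names in the sample are assumptions about the multimodal sample layout:

from data_juicer.ops.deduplicator import ImageDeduplicator

op = ImageDeduplicator(method='phash', consider_text=True)
sample = {'images': ['path/to/img.png'], 'text': 'a caption'}   # hypothetical layout
hashed = op.compute_hash(sample)   # attaches an image hash (plus a text hash here)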

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.RayBasicDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]

Bases: Filter

A basic exact-matching deduplicator for Ray. Although its functionality is deduplication, it is implemented as a Filter sub-class.

EMPTY_HASH_VALUE = 'EMPTY'
__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]

Initialization method.

Parameters:
  • backend – the backend for dedup, either 'ray_actor' or 'redis'

  • redis_address – the address of the redis server

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

compute_stats_single(sample, context=False)[source]

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample --> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
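
A hedged sketch of the intended subclass pattern: concrete Ray deduplicators only override calculate_hash, while the base class turns "hash already seen" into a keep/filter decision through compute_stats_single and process_single. The subclass and the 'text' field below are illustrative:

import hashlib

from data_juicer.ops.deduplicator import RayBasicDeduplicator

class MyExactTextDeduplicator(RayBasicDeduplicator):   # hypothetical subclass
    def calculate_hash(self, sample, context=False):
        text = sample.get('text', '')
        if not text:
            return RayBasicDeduplicator.EMPTY_HASH_VALUE
        return hashlib.md5(text.encode('utf-8')).hexdigest()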

class data_juicer.ops.deduplicator.RayDocumentDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching.

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • backend – the backend for dedup, either 'ray_actor' or 'redis'

  • redis_address – the address of the redis server

  • lowercase – Whether to convert sample text to lower case

  • ignore_non_character – Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.
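
A configuration sketch using the documented parameters; switch backend to 'redis' only when an external redis server is reachable at redis_address:

from data_juicer.ops.deduplicator import RayDocumentDeduplicator

op = RayDocumentDeduplicator(
    backend='redis',                          # or 'ray_actor' (default)
    redis_address='redis://localhost:6379',
    lowercase=True,
    ignore_non_character=False,
)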

class data_juicer.ops.deduplicator.RayImageDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', method: str = 'phash', *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching of images between documents.

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', method: str = 'phash', *args, **kwargs)[source]

Initialization method.

Parameters:
  • backend – the backend for dedup, either 'ray_actor' or 'redis'

  • redis_address – the address of the redis server

  • method – hash method for image

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

class data_juicer.ops.deduplicator.RayVideoDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]

Initialization method.

Parameters:
  • backend – the backend for dedup, either 'ray_actor' or 'redis'

  • redis_address – the address of the redis server

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

class data_juicer.ops.deduplicator.RayBTSMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, union_find_parallel_num: int | str = 'auto', union_threshold: int | None = 256, max_pending_edge_buffer_task: int | None = 20, num_edge_buffer_task_returns: int | None = 10, max_pending_filter_tasks: int | None = 20, num_filter_task_returns: int | None = 10, merge_batch_size: int | None = 1000, *args, **kwargs)[source]

Bases: Deduplicator

A MinhashLSH deduplicator based on Ray.

EMPTY_HASH_VALUE = 'EMPTY'
__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, union_find_parallel_num: int | str = 'auto', union_threshold: int | None = 256, max_pending_edge_buffer_task: int | None = 20, num_edge_buffer_task_returns: int | None = 10, max_pending_filter_tasks: int | None = 20, num_filter_task_returns: int | None = 10, merge_batch_size: int | None = 1000, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend using 'space'; for Chinese-like languages, we recommend using 'character'; and for multiple languages, we recommend using 'sentencepiece'. If using 'sentencepiece', please provide the model path via the 'tokenizer_model' field.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – whether to ignore sub-strings matching a specific pattern when computing minhash

  • num_permutations – number of permutations in minhash computing

  • jaccard_threshold – the minimum Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication

  • num_bands – number of bands in LSH. By default it's None, and it will be determined by an optimal parameter computation algorithm that minimizes the weighted sum of the probabilities of false positives and false negatives

  • num_rows_per_band – number of rows in each band in LSH. By default it's None, and it will be determined by the same optimal parameter computation algorithm

  • tokenizer_model – path for the sentencepiece model, used for sentencepiece tokenization.

  • union_find_parallel_num – number of parallel workers for the union-find algorithm. By default it's 'auto', and it will be set to half of the number of CPUs.

  • union_threshold – threshold for minhash value groups to perform the union-find algorithm. By default it's 256.

  • max_pending_edge_buffer_task – maximum number of pending edge buffer Ray tasks. By default it's 20.

  • num_edge_buffer_task_returns – number of edge buffer tasks for ray.wait to return. By default it's 10.

  • max_pending_filter_tasks – maximum number of pending filter Ray tasks. By default it's 20.

  • num_filter_task_returns – number of filter tasks for ray.wait to return. By default it's 10.

  • merge_batch_size – batch size for BTS operations. By default it's 1000.

  • tmp_file_name – the temporary folder name for deduplication.
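
A configuration sketch; the values for the Ray-specific knobs are illustrative and the tokenizer model path is a placeholder:

from data_juicer.ops.deduplicator import RayBTSMinhashDeduplicator

op = RayBTSMinhashDeduplicator(
    tokenization='sentencepiece',
    tokenizer_model='path/to/tokenizer.model',   # placeholder path
    num_permutations=256,
    jaccard_threshold=0.7,
    union_find_parallel_num='auto',              # half of the CPU count by default
    union_threshold=256,
    merge_batch_size=1000,
)
# run(dataset) then executes the whole Ray-based MinhashLSH pipeline.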

calc_minhash(text_list: Array, uid_list: List) → Table[source]

merge_op_batch(object_refs)[source]

merge()[source]

filter_with_union_find(samples: Table) → Table[source]

run(dataset)[source]

class data_juicer.ops.deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.

__init__(consider_text: bool = False, *args, **kwargs)[source]

Initialization.

Parameters:
  • consider_text – whether to consider text hash together with video hash when applying deduplication.

  • args – extra args

  • kwargs – extra args
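
A hedged sketch; the 'videos' and 'text' field names in the sample are assumptions about the multimodal sample layout:

from data_juicer.ops.deduplicator import VideoDeduplicator

op = VideoDeduplicator(consider_text=True)
sample = {'videos': ['path/to/clip.mp4'], 'text': 'a caption'}   # hypothetical layout
hashed = op.compute_hash(sample)   # attaches a video hash (plus a text hash here)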

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.