data_juicer.ops.deduplicator

class data_juicer.ops.deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching.

It uses an md5 hash to deduplicate samples.

__init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lowercase – Whether to convert sample text to lower case

  • ignore_non_character – Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation

  • args – extra args

  • kwargs – extra args.
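
A minimal usage sketch; the 'text' field in the sample dict is an assumption about the sample layout, not prescribed by this page:

from data_juicer.ops.deduplicator import DocumentDeduplicator

op = DocumentDeduplicator(lowercase=True, ignore_non_character=True)
sample = {'text': 'Hello, World!'}   # hypothetical sample layout
hashed = op.compute_hash(sample)     # attaches an md5 hash field to the sample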

compute_hash(sample)[source]

Compute md5 hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with md5 hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
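
A rough sketch of how compute_hash and process are typically combined, assuming a HuggingFace-style dataset with a map() method; in practice the Data-Juicer executor drives these calls:

from datasets import Dataset
from data_juicer.ops.deduplicator import DocumentDeduplicator

op = DocumentDeduplicator()
ds = Dataset.from_list([{'text': 'a'}, {'text': 'a'}, {'text': 'b'}])
ds = ds.map(op.compute_hash)                      # step 1: attach md5 hashes
deduped, dup_pairs = op.process(ds, show_num=1)   # step 2: drop exact duplicates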

class data_juicer.ops.deduplicator.DocumentMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using MinHashLSH.

Unlike simhash, the minhash values are stored as bytes, so they will not be kept in the final dataset.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend using 'space'; for Chinese-like languages, we recommend using 'character'; and for multiple languages, we recommend using 'sentencepiece'. If using 'sentencepiece', please provide the model path via the 'tokenizer_model' field.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – whether to ignore sub-strings matching a specific pattern when computing minhash

  • num_permutations – number of permutations in minhash computing

  • jaccard_threshold – the minimum Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication

  • num_bands – number of bands in LSH. By default it's None, and it will be determined by an optimal parameter computation algorithm that minimizes the weighted sum of the probabilities of false positives and false negatives

  • num_rows_per_band – number of rows in each band in LSH. By default it's None, and it will be determined by the same optimal parameter computation algorithm

  • tokenizer_model – path for the sentencepiece model, used for sentencepiece tokenization.
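
For intuition when choosing num_bands (b) and num_rows_per_band (r): in standard MinHashLSH, b * r should not exceed num_permutations, and two texts with Jaccard similarity s collide in at least one band with probability 1 - (1 - s^r)^b. A small sketch (the helper name is illustrative, not part of this op):

def lsh_collision_probability(s: float, b: int, r: int) -> float:
    # Probability that two texts with Jaccard similarity s share at least one band.
    return 1.0 - (1.0 - s ** r) ** b

# One possible split of 256 permutations into b=32 bands of r=8 rows:
print(lsh_collision_probability(0.7, b=32, r=8))  # ~0.85, likely flagged as near-duplicates
print(lsh_collision_probability(0.5, b=32, r=8))  # ~0.12, likely kept as distinct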

compute_hash(sample)[source]

Compute minhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with minhash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using SimHash.

__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, we recommend using 'space', and for Chinese-like languages, we recommend using 'character'.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – whether to ignore sub-strings matching a specific pattern when computing simhash

  • num_blocks – number of blocks in simhash computing

  • hamming_distance – the maximum Hamming distance threshold in near-duplicate detection. When the Hamming distance of two sample texts is <= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication. This threshold must always be less than num_blocks
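
A minimal instantiation sketch; note that hamming_distance must stay below num_blocks, and the 'text' field in the sample is an assumption about the sample layout:

from data_juicer.ops.deduplicator import DocumentSimhashDeduplicator

op = DocumentSimhashDeduplicator(
    tokenization='character',   # recommended for Chinese-like languages
    window_size=6,
    num_blocks=6,
    hamming_distance=4,         # must be less than num_blocks
)
hashed = op.compute_hash({'text': '一些中文文本'})   # attaches a simhash value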

compute_hash(sample)[source]

Compute simhash values for the sample.

Parameters:

sample – input sample

Returns:

sample with simhash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching of images between documents.

__init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • method – hash method for image

  • consider_text – whether to consider text hash together with image hash when applying deduplication.

  • args – extra args

  • kwargs – extra args
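
A hedged configuration sketch; the 'images' and 'text' field names in the sample are assumptions about the multimodal sample layout:

from data_juicer.ops.deduplicator import ImageDeduplicator

op = ImageDeduplicator(method='phash', consider_text=True)
sample = {'images': ['path/to/img.png'], 'text': 'a caption'}   # hypothetical layout
hashed = op.compute_hash(sample)   # attaches an image hash (plus a text hash here)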

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

class data_juicer.ops.deduplicator.RayBasicDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]

Bases: Filter

A basic exact-matching deduplicator for Ray. Although its functionality is deduplication, it is implemented as a Filter sub-class.

EMPTY_HASH_VALUE = 'EMPTY'
__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]

Initialization method.

Parameters:
  • backend – the backend for dedup, either 'ray_actor' or 'redis'

  • redis_address – the address of the redis server

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

compute_stats_single(sample, context=False)[source]

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample --> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
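
A hedged sketch of the intended subclass pattern: concrete Ray deduplicators only override calculate_hash, while the base class turns "hash already seen" into a keep/filter decision through compute_stats_single and process_single. The subclass and the 'text' field below are illustrative:

import hashlib

from data_juicer.ops.deduplicator import RayBasicDeduplicator

class MyExactTextDeduplicator(RayBasicDeduplicator):   # hypothetical subclass
    def calculate_hash(self, sample, context=False):
        text = sample.get('text', '')
        if not text:
            return RayBasicDeduplicator.EMPTY_HASH_VALUE
        return hashlib.md5(text.encode('utf-8')).hexdigest()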

class data_juicer.ops.deduplicator.RayDocumentDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching.

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • backend – the backend for dedup, either 'ray_actor' or 'redis'

  • redis_address – the address of the redis server

  • lowercase – Whether to convert sample text to lower case

  • ignore_non_character – Whether to ignore non-alphabet characters, including whitespace, digits, and punctuation

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.
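
A configuration sketch using the documented parameters; switch backend to 'redis' only when an external redis server is reachable at redis_address:

from data_juicer.ops.deduplicator import RayDocumentDeduplicator

op = RayDocumentDeduplicator(
    backend='redis',                          # or 'ray_actor' (default)
    redis_address='redis://localhost:6379',
    lowercase=True,
    ignore_non_character=False,
)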

class data_juicer.ops.deduplicator.RayImageDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', method: str = 'phash', *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching of images between documents.

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', method: str = 'phash', *args, **kwargs)[source]

Initialization method.

Parameters:
  • backend – the backend for dedup, either 'ray_actor' or 'redis'

  • redis_address – the address of the redis server

  • method – hash method for image

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

class data_juicer.ops.deduplicator.RayVideoDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]

Initialization method.

Parameters:
  • backend – the backend for dedup, either 'ray_actor' or 'redis'

  • redis_address – the address of the redis server

  • args – extra args

  • kwargs – extra args

calculate_hash(sample, context=False)[source]

Calculate hash value for the sample.

class data_juicer.ops.deduplicator.RayBTSMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, union_find_parallel_num: int | str = 'auto', union_threshold: int | None = 256, max_pending_edge_buffer_task: int | None = 20, num_edge_buffer_task_returns: int | None = 10, max_pending_filter_tasks: int | None = 20, num_filter_task_returns: int | None = 10, merge_batch_size: int | None = 1000, *args, **kwargs)[source]

Bases: Deduplicator

A MinhashLSH deduplicator based on Ray.

EMPTY_HASH_VALUE = 'EMPTY'
__init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, union_find_parallel_num: int | str = 'auto', union_threshold: int | None = 256, max_pending_edge_buffer_task: int | None = 20, num_edge_buffer_task_returns: int | None = 10, max_pending_filter_tasks: int | None = 20, num_filter_task_returns: int | None = 10, merge_batch_size: int | None = 1000, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend using 'space'; for Chinese-like languages, we recommend using 'character'; and for multiple languages, we recommend using 'sentencepiece'. If using 'sentencepiece', please provide the model path via the 'tokenizer_model' field.

  • window_size – window size of shingling

  • lowercase – whether to convert text to lower case first

  • ignore_pattern – whether to ignore sub-strings matching a specific pattern when computing minhash

  • num_permutations – number of permutations in minhash computing

  • jaccard_threshold – the minimum Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication

  • num_bands – number of bands in LSH. By default it's None, and it will be determined by an optimal parameter computation algorithm that minimizes the weighted sum of the probabilities of false positives and false negatives

  • num_rows_per_band – number of rows in each band in LSH. By default it's None, and it will be determined by the same optimal parameter computation algorithm

  • tokenizer_model – path for the sentencepiece model, used for sentencepiece tokenization.

  • union_find_parallel_num – number of parallel workers for the union-find algorithm. By default it's 'auto', and it will be set to half of the number of CPUs.

  • union_threshold – threshold for minhash value groups to perform the union-find algorithm. By default it's 256.

  • max_pending_edge_buffer_task – maximum number of pending edge buffer Ray tasks. By default it's 20.

  • num_edge_buffer_task_returns – number of edge buffer tasks for ray.wait to return. By default it's 10.

  • max_pending_filter_tasks – maximum number of pending filter Ray tasks. By default it's 20.

  • num_filter_task_returns – number of filter tasks for ray.wait to return. By default it's 10.

  • merge_batch_size – batch size for BTS operations. By default it's 1000.

  • tmp_file_name – the temporary folder name for deduplication.
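
A configuration sketch; the values for the Ray-specific knobs are illustrative and the tokenizer model path is a placeholder:

from data_juicer.ops.deduplicator import RayBTSMinhashDeduplicator

op = RayBTSMinhashDeduplicator(
    tokenization='sentencepiece',
    tokenizer_model='path/to/tokenizer.model',   # placeholder path
    num_permutations=256,
    jaccard_threshold=0.7,
    union_find_parallel_num='auto',              # half of the CPU count by default
    union_threshold=256,
    merge_batch_size=1000,
)
# run(dataset) then executes the whole Ray-based MinhashLSH pipeline.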

calc_minhash(text_list: Array, uid_list: List) → Table[source]

merge_op_batch(object_refs)[source]

merge()[source]

filter_with_union_find(samples: Table) → Table[source]

run(dataset)[source]

class data_juicer.ops.deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]

Bases: Deduplicator

Deduplicator to deduplicate samples at document-level using exact matching of videos between documents.

__init__(consider_text: bool = False, *args, **kwargs)[source]

Initialization.

Parameters:
  • consider_text – whether to consider text hash together with video hash when applying deduplication.

  • args – extra args

  • kwargs – extra args
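
A hedged sketch; the 'videos' and 'text' field names in the sample are assumptions about the multimodal sample layout:

from data_juicer.ops.deduplicator import VideoDeduplicator

op = VideoDeduplicator(consider_text=True)
sample = {'videos': ['path/to/clip.mp4'], 'text': 'a caption'}   # hypothetical layout
hashed = op.compute_hash(sample)   # attaches a video hash (plus a text hash here)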

compute_hash(sample, context=False)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset --> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.