data_juicer.ops.deduplicator
- class data_juicer.ops.deduplicator.DocumentDeduplicator(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]
Bases: Deduplicator
Deduplicates samples at the document level using exact matching.
This operator computes an MD5 hash for each sample's text. It can optionally convert the text to lowercase and ignore non-alphabet characters, including whitespace, digits, and punctuation. The deduplication is based on the computed hash values, where samples with identical hashes are considered duplicates. The compute_hash method adds a 'hash' key to each sample, storing its MD5 hash. During processing, the first occurrence of each unique hash is kept, and subsequent duplicates are filtered out. If the show_num parameter is set, the operator also returns a specified number of duplicate pairs for inspection.
- __init__(lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]
Initialization method.
- Parameters:
lowercase -- whether to convert sample text to lower case
ignore_non_character -- whether to ignore non-alphabet characters, including whitespace, digits, and punctuation
args -- extra args
kwargs -- extra args
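A minimal usage sketch (not part of the original docstring) is shown below; it assumes samples are plain dicts using the default 'text' key and a standard data-juicer installation:

    from data_juicer.ops.deduplicator import DocumentDeduplicator

    # Minimal sketch: lowercase + ignore_non_character normalize both texts
    # to 'helloworld', so their MD5 hashes collide and one of the two samples
    # would be dropped during processing.
    op = DocumentDeduplicator(lowercase=True, ignore_non_character=True)
    s1 = op.compute_hash({'text': 'Hello, World!'})
    s2 = op.compute_hash({'text': 'hello world'})
    assert s1['hash'] == s2['hash']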
- class data_juicer.ops.deduplicator.DocumentMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]
Bases: Deduplicator
Deduplicates samples at the document level using MinHash LSH.
This operator computes MinHash values for each sample and uses Locality-Sensitive Hashing (LSH) to identify and remove near-duplicate documents. The Jaccard similarity threshold determines when two documents are considered duplicates. The tokenization method can be customized, and a Hugging Face tokenizer can be used for 'sentencepiece' tokenization. The minhash values are stored as bytes and are not kept in the final dataset. The number of bands and rows per band in LSH can be set manually or determined by an optimal parameter computation algorithm. Important notes:
- If using 'punctuation' tokenization with an ignore pattern, ensure the pattern does not include punctuation.
- For 'sentencepiece' tokenization, a tokenizer model path is required.
- The deduplication process involves clustering and filtering, and only unique samples or the first sample in a cluster are retained.
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, *args, **kwargs)[source]
Initialization method.
- Parameters:
tokenization -- tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'; and for multiple languages, we recommend 'sentencepiece'. If using 'sentencepiece', please provide the model path in the 'tokenizer_model' field.
window_size -- window size of shingling
lowercase -- whether to convert text to lower case first
ignore_pattern -- regex pattern of sub-strings to ignore when computing minhash
num_permutations -- number of permutations in minhash computing
jaccard_threshold -- the minimum Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication
num_bands -- number of bands in LSH. Defaults to None, in which case it is determined by an optimal-parameter computation algorithm that minimizes the weighted sum of the probabilities of false positives and false negatives
num_rows_per_band -- number of rows in each band in LSH. Defaults to None, in which case it is determined by the same optimal-parameter computation algorithm
tokenizer_model -- path for the sentencepiece model, used for sentencepiece tokenization.
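The following instantiation sketch is illustrative rather than canonical; it assumes that when num_bands and num_rows_per_band are set manually, their product should not exceed num_permutations, while leaving both as None triggers the optimal-parameter search described above:

    from data_juicer.ops.deduplicator import DocumentMinhashDeduplicator

    # Illustrative parameters: 32 bands x 8 rows reuse all 256 permutations.
    op = DocumentMinhashDeduplicator(
        tokenization='space',
        window_size=5,
        num_permutations=256,
        jaccard_threshold=0.7,
        num_bands=32,
        num_rows_per_band=8,
    )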
- class data_juicer.ops.deduplicator.DocumentSimhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]
Bases: Deduplicator
Deduplicates samples at the document level using SimHash.
This operator computes SimHash values for each sample and removes duplicates based on a specified Hamming distance threshold. It supports different tokenization methods: 'space', 'punctuation', and 'character'. The SimHash is computed over shingles of a given window size, and the deduplication process clusters similar documents and retains only one from each cluster. The default mode converts text to lowercase and can ignore specific patterns. The key metric, Hamming distance, is used to determine similarity between SimHash values. Important notes:
- The ignore_pattern parameter can be used to exclude certain substrings during SimHash computation.
- For punctuation-based tokenization, the ignore_pattern should not include punctuation to avoid conflicts.
- The hamming_distance must be less than the number of blocks (num_blocks).
- Only the first sample in each cluster is retained by default.
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 6, lowercase: bool = True, ignore_pattern: str | None = None, num_blocks: Annotated[int, Gt(gt=0)] = 6, hamming_distance: Annotated[int, Gt(gt=0)] = 4, *args, **kwargs)[source]
Initialization method.
- Parameters:
tokenization -- tokenization method for sample texts. It should be one of [space, punctuation, character]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'
window_size -- window size of shingling
lowercase -- whether to convert text to lower case first
ignore_pattern -- regex pattern of sub-strings to ignore when computing simhash
num_blocks -- number of blocks in simhash computing
hamming_distance -- the maximum Hamming distance threshold in near-duplicate detection. When the Hamming distance of two sample texts is <= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication. This threshold must always be less than num_blocks
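The sketch below illustrates the hamming_distance < num_blocks constraint: by the pigeonhole argument, two SimHash values differing in at most hamming_distance bits must then agree on at least one of the num_blocks blocks, which is what makes block-based candidate lookup possible. Parameter values are illustrative:

    from data_juicer.ops.deduplicator import DocumentSimhashDeduplicator

    # hamming_distance=4 is strictly less than num_blocks=6, so any pair of
    # near-duplicates within the threshold shares at least one identical block.
    op = DocumentSimhashDeduplicator(
        tokenization='character',  # recommended for Chinese-like languages
        window_size=6,
        num_blocks=6,
        hamming_distance=4,
    )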
- class data_juicer.ops.deduplicator.ImageDeduplicator(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]
Bases: Deduplicator
Deduplicates samples at the document level by exact matching of images.
This operator compares images across documents to identify and remove duplicates.
- It uses a specified hash method (default is 'phash') to compute image hashes.
- If consider_text is set, it also considers text content for deduplication, using a text deduplicator in conjunction with the image hashes.
- The key metric, imagehash, is computed for each sample. If consider_text is enabled, an additional hash field is used.
- Duplicates are identified by comparing these hash values. Samples with identical hashes are considered duplicates.
- When show_num is greater than 0, the operator also returns a subset of duplicate pairs for tracing purposes.
- The operator caches the imagehash and, if applicable, the hash fields.
- __init__(method: str = 'phash', consider_text: bool = False, *args, **kwargs)[source]
Initialization method.
- Parameters:
method -- hash method for image
consider_text -- whether to consider text hash together with image hash when applying deduplication.
args -- extra args
kwargs -- extra args
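A minimal sketch of hashing one sample directly ('cat.jpg' is a placeholder path that must point to a real image; the default 'images' and 'text' keys and an installed image-hashing backend are assumed):

    from data_juicer.ops.deduplicator import ImageDeduplicator

    op = ImageDeduplicator(method='phash', consider_text=True)
    sample = {'text': 'a photo of a cat', 'images': ['cat.jpg']}  # placeholder path
    sample = op.compute_hash(sample)
    # 'imagehash' is now cached on the sample; with consider_text=True the
    # text-based 'hash' field is cached as well, and both must match for two
    # samples to count as duplicates.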
- class data_juicer.ops.deduplicator.RayBasicDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]
Bases: Filter
A basic exact matching deduplicator for RAY. Although its functionality is deduplication, it is implemented as a Filter subclass.
- EMPTY_HASH_VALUE = 'EMPTY'
- __init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]
Initialization.
- Parameters:
backend -- the backend for dedup, either 'ray_actor' or 'redis'
redis_address -- the address of the redis server
args -- extra args
kwargs -- extra args
- compute_stats_single(sample, context=False)[source]
Compute stats for the sample, which are used as a metric to decide whether to filter this sample.
- Parameters:
sample -- input sample.
context -- whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
- class data_juicer.ops.deduplicator.RayDocumentDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]
Bases: RayBasicDeduplicator
Deduplicates samples at the document level using exact matching in Ray distributed mode.
This operator computes a hash for each document and filters out duplicates based on exact matches. The hash is calculated from the text content, which can be optionally converted to lowercase and stripped of non-alphabet characters. The key metric used for deduplication is the MD5 hash of the processed text. If the lowercase parameter is set, the text is converted to lowercase before hashing. If ignore_non_character is enabled, all non-alphabet characters, including whitespace, digits, and punctuation, are removed. The operator supports two backends: 'ray_actor' and 'redis', with the default being 'ray_actor'.
- __init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]
Initialization method.
- Parameters:
backend -- the backend for dedup, either 'ray_actor' or 'redis'
redis_address -- the address of the redis server
lowercase -- whether to convert sample text to lower case
ignore_non_character -- whether to ignore non-alphabet characters, including whitespace, digits, and punctuation
args -- extra args
kwargs -- extra args
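A minimal instantiation sketch, assuming a running Ray cluster for the default 'ray_actor' backend (or a reachable Redis server for 'redis'); in practice the operator is usually driven by the Ray executor from a recipe rather than constructed by hand:

    from data_juicer.ops.deduplicator import RayDocumentDeduplicator

    # Assumes ray.init() has already been called for the 'ray_actor' backend.
    op = RayDocumentDeduplicator(
        backend='ray_actor',
        lowercase=True,
        ignore_non_character=False,
    )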
- class data_juicer.ops.deduplicator.RayImageDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', method: str = 'phash', *args, **kwargs)[source]
Bases: RayBasicDeduplicator
Deduplicates samples at the document level using exact matching of images in Ray distributed mode.
This operator uses a specified hash method to compute image hashes and identifies duplicates by comparing these hashes. It operates in Ray distributed mode, supporting 'ray_actor' or 'redis' backends for deduplication. The hash method can be set during initialization, with supported methods listed in HASH_METHOD. If a sample does not contain an image, it is assigned an empty hash value. The operator loads images from the specified keys and computes their combined hash for comparison.
- __init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', method: str = 'phash', *args, **kwargs)[source]
Initialization.
- Parameters:
backend -- the backend for dedup, either 'ray_actor' or 'redis'
redis_address -- the address of the redis server
method -- the hash method to use
args -- extra args
kwargs -- extra args
- class data_juicer.ops.deduplicator.RayVideoDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]
Bases: RayBasicDeduplicator
Deduplicates samples at the document level using exact matching of videos in Ray distributed mode.
This operator computes the MD5 hash of video streams in each sample and compares them to identify duplicates. It uses Ray distributed mode for parallel processing. The hash is computed by demuxing the video streams and updating the MD5 hash with each video packet. If a sample does not contain a valid video, it is assigned an empty hash value. The operator supports 'ray_actor' or 'redis' backends for deduplication.
- class data_juicer.ops.deduplicator.RayBTSMinhashDeduplicator(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, union_find_parallel_num: int | str = 'auto', union_threshold: int | None = 256, max_pending_edge_buffer_task: int | None = 20, num_edge_buffer_task_returns: int | None = 10, max_pending_filter_tasks: int | None = 20, num_filter_task_returns: int | None = 10, merge_batch_size: int | None = 1000, minhash_batch_size: int | str | None = 'auto', memory_per_sample: float | None = 0.1, *args, **kwargs)[source]
Bases: Deduplicator
A MinhashLSH deduplicator that operates in Ray distributed mode.
This operator uses the MinHash LSH technique to identify and remove near-duplicate samples from a dataset. It supports various tokenization methods, including space, punctuation, character, and sentencepiece. The Jaccard similarity threshold is used to determine if two samples are considered duplicates. If the Jaccard similarity of two samples is greater than or equal to the specified threshold, one of the samples is filtered out. The operator computes the MinHash values for each sample and uses a union-find algorithm to group similar samples. The key metric, Jaccard similarity, is computed based on the shingling of the text. The operator can run on both CPU and GPU, with specific batch size and memory configurations for each.
- EMPTY_HASH_VALUE = 'EMPTY'
- __init__(tokenization: str = 'space', window_size: Annotated[int, Gt(gt=0)] = 5, lowercase: bool = True, ignore_pattern: str | None = None, num_permutations: Annotated[int, Gt(gt=0)] = 256, jaccard_threshold: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.7, num_bands: Annotated[int, Gt(gt=0)] | None = None, num_rows_per_band: Annotated[int, Gt(gt=0)] | None = None, tokenizer_model: str | None = None, union_find_parallel_num: int | str = 'auto', union_threshold: int | None = 256, max_pending_edge_buffer_task: int | None = 20, num_edge_buffer_task_returns: int | None = 10, max_pending_filter_tasks: int | None = 20, num_filter_task_returns: int | None = 10, merge_batch_size: int | None = 1000, minhash_batch_size: int | str | None = 'auto', memory_per_sample: float | None = 0.1, *args, **kwargs)[source]
Initialization method.
- Parameters:
tokenization -- tokenization method for sample texts. It should be one of [space, punctuation, character, sentencepiece]. For English-like languages, we recommend 'space'; for Chinese-like languages, we recommend 'character'; and for multiple languages, we recommend 'sentencepiece'. If using 'sentencepiece', please provide the model path in the 'tokenizer_model' field.
window_size -- window size of shingling
lowercase -- whether to convert text to lower case first
ignore_pattern -- regex pattern of sub-strings to ignore when computing minhash
num_permutations -- number of permutations in minhash computing
jaccard_threshold -- the minimum Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will keep only one of them after deduplication
num_bands -- number of bands in LSH. Defaults to None, in which case it is determined by an optimal-parameter computation algorithm that minimizes the weighted sum of the probabilities of false positives and false negatives
num_rows_per_band -- number of rows in each band in LSH. Defaults to None, in which case it is determined by the same optimal-parameter computation algorithm
tokenizer_model -- path for the sentencepiece model, used for sentencepiece tokenization.
union_find_parallel_num -- number of parallel workers for the union-find algorithm. Defaults to 'auto', in which case it is set to half of the number of CPUs.
union_threshold -- threshold for the minhash-value groups used in the union-find algorithm. Defaults to 256.
max_pending_edge_buffer_task -- max number of pending edge buffer ray tasks. Defaults to 20.
num_edge_buffer_task_returns -- number of edge buffer tasks for ray.wait to return. Defaults to 10.
max_pending_filter_tasks -- max number of pending filter ray tasks. Defaults to 20.
num_filter_task_returns -- number of filter tasks for ray.wait to return. Defaults to 10.
merge_batch_size -- batch size for BTS operations. Defaults to 1000.
minhash_batch_size -- batch size for MinHash computation. If 'auto', it is set to the default value on CPU (1024), or calculated automatically from the available GPU memory and the memory_per_sample setting on GPU.
memory_per_sample -- estimated memory needed per sample in MB. Used to calculate batch size based on available GPU memory. Default is 0.1 MB per sample.
- band_minhash(minhash_list, uid_list)[source]
Logic for creating and pushing LSH bands to the union-find list
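A hedged instantiation sketch; the 'auto' values are resolved at runtime as described in the parameter list, and the example assumes it runs inside a Ray job:

    from data_juicer.ops.deduplicator import RayBTSMinhashDeduplicator

    # Illustrative values only.
    op = RayBTSMinhashDeduplicator(
        tokenization='space',
        num_permutations=256,
        jaccard_threshold=0.7,
        union_find_parallel_num='auto',  # half of the CPU count by default
        minhash_batch_size='auto',       # 1024 on CPU, memory-derived on GPU
        memory_per_sample=0.1,           # MB per sample, used for GPU batching
    )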
- class data_juicer.ops.deduplicator.VideoDeduplicator(consider_text: bool = False, *args, **kwargs)[source]
Bases: Deduplicator
Deduplicates samples at the document level using exact matching of videos.
This operator computes a hash for each video in the sample and uses it to identify and remove duplicate documents. If consider_text is set to True, it also considers the text hash alongside the video hash for deduplication. The video hash is computed by hashing the video data, including all video streams in the container. The operator supports sampling and tracing of duplicate pairs when the show_num parameter is greater than 0. Important fields used for caching include 'videohash' and optionally 'hash' if text is considered.
- __init__(consider_text: bool = False, *args, **kwargs)[source]
Initialization.
- Parameters:
consider_text -- whether to consider text hash together with video hash when applying deduplication.
args -- extra args
kwargs -- extra args
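A minimal sketch of hashing a single sample ('clip.mp4' is a placeholder path; the default 'videos' key and an environment able to demux the video are assumed):

    from data_juicer.ops.deduplicator import VideoDeduplicator

    op = VideoDeduplicator(consider_text=False)
    sample = op.compute_hash({'videos': ['clip.mp4']})  # placeholder path
    # 'videohash' now holds the MD5 computed over all video-stream packets and
    # is compared exactly against other samples during processing.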