data_juicer.ops.filter

class data_juicer.ops.filter.ImageTextSimilarityFilter(hf_clip: str = 'openai/clip-vit-base-patch32', trust_remote_code: bool = False, min_score: float = 0.1, max_score: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose similarity between image and text is within a specific range.

__init__(hf_clip: str = 'openai/clip-vit-base-patch32', trust_remote_code: bool = False, min_score: float = 0.1, max_score: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_clip – clip model name on huggingface to compute the similarity between image and text.

  • min_score – The min similarity to keep samples.

  • max_score – The max similarity to keep samples.

  • horizontal_flip – Flip image horizontally (left to right).

  • vertical_flip – Flip image vertically (top to bottom).

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • reduce_mode – reduce mode when one text corresponds to multiple images in a chunk. ‘avg’: take the average of multiple values; ‘max’: take the max of multiple values; ‘min’: take the min of multiple values.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
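
A minimal standalone sketch of the compute-then-filter flow shared by the single-sample filters in this module. The ‘text’/‘images’ keys, the ‘<__dj__image>’ placeholder token, and the ‘__dj__stats__’ stats field are assumptions about data-juicer’s default sample schema; in practice an executor prepares samples and drives both calls.

    from data_juicer.ops.filter import ImageTextSimilarityFilter

    op = ImageTextSimilarityFilter(min_score=0.2, max_score=1.0)

    # Assumed default schema: one image placeholder token per image in the
    # text, plus an (normally executor-initialized) empty stats dict.
    sample = {
        'text': '<__dj__image> a photo of a cat',
        'images': ['cat.jpg'],
        '__dj__stats__': {},
    }

    sample = op.compute_stats_single(sample)  # writes the similarity into the stats field
    keep = op.process_single(sample)          # True to keep, False to filter out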

class data_juicer.ops.filter.VideoAspectRatioFilter(min_ratio: str = '9/21', max_ratio: str = '21/9', any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with video aspect ratio within a specific range. AspectRatio = W / H.

__init__(min_ratio: str = '9/21', max_ratio: str = '21/9', any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_ratio – The minimum aspect ratio to keep samples, supported format is a string, such as “9:21” or “9/21”.

  • max_ratio – The maximum aspect ratio to keep samples, supported format is a string, such as “21:9” or “21/9”.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
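
A quick instantiation sketch; per the parameter descriptions above, both the colon and slash ratio formats are accepted:

    from data_juicer.ops.filter import VideoAspectRatioFilter

    # '9:21' and '21/9' express the documented bounds; 'all' requires every
    # video in a sample to satisfy the ratio range.
    op = VideoAspectRatioFilter(min_ratio='9:21', max_ratio='21/9', any_or_all='all')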

class data_juicer.ops.filter.ImageTextMatchingFilter(hf_blip: str = 'Salesforce/blip-itm-base-coco', trust_remote_code: bool = False, min_score: float = 0.003, max_score: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose matching score between image and text is within a specific range.

__init__(hf_blip: str = 'Salesforce/blip-itm-base-coco', trust_remote_code: bool = False, min_score: float = 0.003, max_score: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_blip – blip model name on huggingface to compute the matching score between image and text.

  • min_score – The min matching score to keep samples.

  • max_score – The max matching score to keep samples.

  • horizontal_flip – Flip image horizontally (left to right).

  • vertical_flip – Flip image vertically (top to bottom).

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • reduce_mode – reduce mode when one text corresponds to multiple images in a chunk. ‘avg’: take the average of multiple values; ‘max’: take the max of multiple values; ‘min’: take the min of multiple values.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageNSFWFilter(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, score_threshold: float = 0.5, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose images have low nsfw scores.

__init__(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, score_threshold: float = 0.5, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_nsfw_model – nsfw detection model name on huggingface.

  • score_threshold – the nsfw score threshold for samples, ranging from 0 to 1. Samples with nsfw scores less than this threshold will be kept.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.TokenNumFilter(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with total token number within a specific range.

__init__(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_tokenizer – the tokenizer name of Hugging Face tokenizers.

  • min_num – The min filter token number in this op, samples will be filtered if their token number is below this parameter.

  • max_num – The max filter token number in this op, samples will be filtered if their token number exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.TextLengthFilter(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with total text length within a specific range.

__init__(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_len – The min text length in the filtering; samples will be filtered if their text length is below this parameter.

  • max_len – The max text length in the filtering; samples will be filtered if their text length exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_batched(samples)[source]
process_batched(samples)[source]
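
Unlike the single-sample filters above, this op is batched: compute_stats_batched and process_batched take a column-oriented batch (a dict of equal-length lists, as produced by Hugging Face datasets). A sketch, assuming the same ‘__dj__stats__’ field name as in the earlier example:

    from data_juicer.ops.filter import TextLengthFilter

    op = TextLengthFilter(min_len=10, max_len=100)

    # One (assumed) empty stats dict per sample in the batch.
    samples = {
        'text': ['too short', 'a sufficiently long piece of example text'],
        '__dj__stats__': [{}, {}],
    }

    samples = op.compute_stats_batched(samples)
    keep_flags = list(op.process_batched(samples))  # one boolean per sample
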
class data_juicer.ops.filter.SpecifiedNumericFieldFilter(field_key: str = '', min_value: float = -9223372036854775807, max_value: float = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter based on specified numeric field information.

If the specified numeric information in the sample is not within the specified range, the sample will be filtered.

__init__(field_key: str = '', min_value: float = -9223372036854775807, max_value: float = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • field_key – Filter based on the specified numeric value corresponding to the target key. For multi-level field information, the levels of the target key need to be separated by ‘.’.

  • min_value – The min filter value in SpecifiedNumericField op, samples will be filtered if their specified numeric field value is below this parameter.

  • max_value – The max filter value in SpecifiedNumericField op, samples will be filtered if their specified numeric field value exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
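
A sketch of the dotted field_key: the key ‘meta.quality_score’ below (a hypothetical field) resolves to sample['meta']['quality_score']. The ‘__dj__stats__’ field is again an assumption about the default schema.

    from data_juicer.ops.filter import SpecifiedNumericFieldFilter

    op = SpecifiedNumericFieldFilter(field_key='meta.quality_score',
                                     min_value=0.5, max_value=1.0)

    sample = {'text': 'an example document',
              'meta': {'quality_score': 0.7},
              '__dj__stats__': {}}
    sample = op.compute_stats_single(sample)
    keep = op.process_single(sample)  # True, since 0.5 <= 0.7 <= 1.0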

class data_juicer.ops.filter.AudioNMFSNRFilter(min_snr: float = 0, max_snr: float = 9223372036854775807, nmf_iter_num: int = 500, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Keep data samples whose audios’ SNRs (computed based on NMF) are within a specified range.

__init__(min_snr: float = 0, max_snr: float = 9223372036854775807, nmf_iter_num: int = 500, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_snr – The min audio SNR to keep samples in dB. It’s 0 by default.

  • max_snr – The max audio SNR to keep samples in dB. It’s sys.maxsize by default.

  • nmf_iter_num – The max number of iterations to run NMF. It’s 500 by default.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all audios. ‘any’: keep this sample if any audios meet the condition. ‘all’: keep this sample only if all audios meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoAestheticsFilter(hf_scorer_model: str = '', trust_remote_code: bool = False, min_score: float = 0.4, max_score: float = 1.0, frame_sampling_method: str = 'uniform', frame_num: int = 3, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[source]

Bases: Filter

Filter to keep data samples with aesthetics scores for specified frames in the videos within a specific range.

__init__(hf_scorer_model: str = '', trust_remote_code: bool = False, min_score: float = 0.4, max_score: float = 1.0, frame_sampling_method: str = 'uniform', frame_num: int = 3, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_scorer_model – Huggingface model name for the aesthetics predictor. By default, we will use ‘shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE’, refer to pypi.org/project/simple-aesthetics-predictor

  • min_score – Min score for the predicted aesthetics in a video.

  • max_score – Max score for the predicted aesthetics in a video.

  • frame_sampling_method – sampling method of extracting frame images from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames and the latter extracts a specified number of frames uniformly from the video. Default: “uniform” with frame_num=3, considering that the number of keyframes can be large while their difference is usually small in terms of their aesthetics.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • any_or_all – Keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • reduce_mode – reduce mode when one sample corresponds to multiple frames, must be one of [‘avg’, ‘max’, ‘min’]. ‘avg’: take the average of multiple values; ‘max’: take the max of multiple values; ‘min’: take the min of multiple values.

  • args – Extra positional arguments.

  • kwargs – Extra keyword arguments.

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.PerplexityFilter(lang: str = 'en', max_ppl: float = 1500, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with perplexity score less than a specific max value.

__init__(lang: str = 'en', max_ppl: float = 1500, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – Compute perplexity for samples in which language.

  • max_ppl – The max filter perplexity in this op, samples will be filtered if their perplexity exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_batched(samples, context=False)[source]
process_batched(samples)[source]
class data_juicer.ops.filter.PhraseGroundingRecallFilter(hf_owlvit: str = 'google/owlvit-base-patch32', trust_remote_code: bool = False, min_recall: float = 0.1, max_recall: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', iou_thr: float = 0.5, large_area_ratio_thr: float = 0.95, conf_thr: float = 0.0, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose grounding recall of phrases, extracted from the text and located in the images, is within a specified range.

__init__(hf_owlvit: str = 'google/owlvit-base-patch32', trust_remote_code: bool = False, min_recall: float = 0.1, max_recall: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', iou_thr: float = 0.5, large_area_ratio_thr: float = 0.95, conf_thr: float = 0.0, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_owlvit – Owl-ViT model name on huggingface to locate the phrases extracted from the text.

  • min_recall – The min phrase grounding recall to keep samples.

  • max_recall – The max phrase grounding recall to keep samples.

  • horizontal_flip – Flip image horizontally (left to right).

  • vertical_flip – Flip image vertically (top to bottom).

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • reduce_mode – reduce mode when one text corresponds to multiple images in a chunk. ‘avg’: take the average of multiple values; ‘max’: take the max of multiple values; ‘min’: take the min of multiple values.

  • iou_thr – the IoU threshold for NMS-like post-processing. If two predicted bboxes overlap with an IoU larger than this threshold, the bbox with lower confidence will be removed. Default: 0.5.

  • large_area_ratio_thr – the area ratio threshold for filtering out those large predicted bboxes. If the area of a predicted bbox accounts for more than this ratio threshold of the whole image area, this bbox will be removed. Default: 0.95.

  • conf_thr – the confidence score threshold for removing low-confidence bboxes. If the confidence score of a predicted bbox is lower than the threshold, this bbox will be removed. Default: 0.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.MaximumLineLengthFilter(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with maximum line length within a specific range.

__init__(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_len – The min filter length in this op, samples will be filtered if their maximum line length is below this parameter.

  • max_len – The max filter length in this op, samples will be filtered if their maximum line length exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_batched(samples, context=False)[source]
process_batched(samples)[source]
class data_juicer.ops.filter.AverageLineLengthFilter(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with average line length within a specific range.

__init__(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_len – The min filter length in this op, samples will be filtered if their average line length is below this parameter.

  • max_len – The max filter length in this op, samples will be filtered if their average line length exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_batched(samples, context=False)[source]
process_batched(samples)[source]
class data_juicer.ops.filter.SpecifiedFieldFilter(field_key: str = '', target_value: List = [], *args, **kwargs)[source]

Bases: Filter

Filter based on specified field information.

If the specified field information in the sample is not within the specified target value, the sample will be filtered.

__init__(field_key: str = '', target_value: List = [], *args, **kwargs)[source]

Initialization method.

Parameters:
  • field_key – Filter based on the specified value corresponding to the target key. For multi-level field information, the levels of the target key need to be separated by ‘.’.

  • target_value – The range of specified field information corresponding to the samples that need to be retained.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
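
The categorical counterpart of SpecifiedNumericFieldFilter above: target_value lists the field values to retain. A sketch with a hypothetical ‘meta.source’ field:

    from data_juicer.ops.filter import SpecifiedFieldFilter

    # Keep only samples whose sample['meta']['source'] is 'wiki' or 'arxiv'.
    op = SpecifiedFieldFilter(field_key='meta.source',
                              target_value=['wiki', 'arxiv'])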

class data_juicer.ops.filter.VideoTaggingFromFramesFilter(tags: List[str] = ['people'], contain: str = 'any', frame_sampling_method: str = 'all_keyframes', frame_num: int = 3, tag_field_name: str = '__dj__video_frame_tags__', any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose videos contain the given tags.

__init__(tags: List[str] = ['people'], contain: str = 'any', frame_sampling_method: str = 'all_keyframes', frame_num: int = 3, tag_field_name: str = '__dj__video_frame_tags__', any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • tags – a list of tags used to filter the videos; the full tag list can be found at https://github.com/xinyu1205/recognize-anything/blob/main/ram/data/ram_tag_list.txt

  • contain – require the videos to contain ‘any’ or ‘all’ of the given tags. When tags is [], ‘all’ keeps all samples and ‘any’ keeps none.

  • frame_sampling_method – sampling method of extracting frame images from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • tag_field_name – the field name to store the tags. It’s “__dj__video_frame_tags__” in default.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
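
Note the two distinct any/all switches: contain applies to the tag list within a single video, while any_or_all applies across the videos of one sample. An instantiation sketch:

    from data_juicer.ops.filter import VideoTaggingFromFramesFilter

    # Each video must carry both tags (contain='all'), and every video in the
    # sample must pass for the sample to be kept (any_or_all='all').
    op = VideoTaggingFromFramesFilter(tags=['people', 'dog'],
                                      contain='all', any_or_all='all')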

class data_juicer.ops.filter.TextEntityDependencyFilter(lang: str = 'en', min_dependency_num: int = 1, any_or_all: str = 'all', *args, **kwargs)[source]

Bases: Filter

Identify entities in the text that are independent of other tokens, and filter samples accordingly. Samples whose text contains no entities will be omitted.

__init__(lang: str = 'en', min_dependency_num: int = 1, any_or_all: str = 'all', *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – language of the text in the samples. ‘en’ for detection of entities in English and ‘zh’ for detection of entities in Chinese.

  • min_dependency_num – The min number of dependency edges in the filtering. An entity is considered independent if its number of edges in the dependency tree is below this parameter.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy. ‘any’: keep this sample if any entity is dependent. ‘all’: keep this sample only if all entities are dependent.

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoResolutionFilter(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Keep data samples whose videos’ resolutions are within a specified range.

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_width – The min horizontal resolution.

  • max_width – The max horizontal resolution.

  • min_height – The min vertical resolution.

  • max_height – The max vertical resolution.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.AlphanumericFilter(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with alphabet/numeric ratio within a specific range.

__init__(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • tokenization – Whether to count the ratio of alphanumeric characters to the total number of tokens. If tokenization=False, it will count the ratio of alphanumeric characters to the total number of characters.

  • min_ratio – The min filter ratio in alphanumeric op, samples will be filtered if their alphabet/numeric ratio is below this parameter.

  • max_ratio – The max filter ratio in alphanumeric op, samples will be filtered if their alphabet/numeric ratio exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_batched(samples)[source]
process_batched(samples)[source]
class data_juicer.ops.filter.ImageWatermarkFilter(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose images have no watermark with high probability.

__init__(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_watermark_model – watermark detection model name on huggingface.

  • prob_threshold – the predicted watermark probability threshold for samples, ranging from 0 to 1. Samples with watermark probabilities less than this threshold will be kept.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageAestheticsFilter(hf_scorer_model: str = '', trust_remote_code: bool = False, min_score: float = 0.5, max_score: float = 1.0, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with aesthetics scores within a specific range.

__init__(hf_scorer_model: str = '', trust_remote_code: bool = False, min_score: float = 0.5, max_score: float = 1.0, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_scorer_model – Huggingface model name for the aesthetics predictor. By default, we will use ‘shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE’, refer to pypi.org/project/simple-aesthetics-predictor

  • min_score – Min score for the predicted aesthetics in an image.

  • max_score – Max score for the predicted aesthetics in an image.

  • any_or_all – Keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – Extra positional arguments.

  • kwargs – Extra keyword arguments.

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.AudioSizeFilter(min_size: str = '0', max_size: str = '1TB', any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Keep data samples whose audio size (in bytes/KB/MB/…) is within a specific range.

__init__(min_size: str = '0', max_size: str = '1TB', any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_size – The min audio size to keep samples. Set to “0” by default for no size constraint.

  • max_size – The max audio size to keep samples. Set to “1TB” by default, an approximation of the unlimited case.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all audios. ‘any’: keep this sample if any audios meet the condition. ‘all’: keep this sample only if all audios meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
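
The size bounds are human-readable strings; the bounds below follow the bytes/KB/MB/TB forms named above (the full set of accepted unit spellings is an assumption):

    from data_juicer.ops.filter import AudioSizeFilter

    # Keep samples in which every audio file is between 10 KB and 5 MB.
    op = AudioSizeFilter(min_size='10KB', max_size='5MB', any_or_all='all')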

class data_juicer.ops.filter.StopWordsFilter(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[int] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with stopword ratio larger than a specific min value.

__init__(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[int] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – Consider stopwords in what language. If lang == “all”, we will adopt the list merged from all the available languages.

  • tokenization – whether to use model to tokenize documents

  • min_ratio – The min filter ratio in this op.

  • stopwords_dir – The directory storing the stopwords file(s) whose name includes “stopwords” and in json format

  • use_words_aug – Whether to augment words, especially for Chinese and Vietnamese

  • words_aug_group_sizes – The group size of words to augment

  • words_aug_join_char – The join char between words to augment

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.CharacterRepetitionFilter(rep_len: int = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with char-level n-gram repetition ratio within a specific range.

__init__(rep_len: int = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)[source]

Initialization method.

Parameters:
  • rep_len – Repetition length for char-level n-gram.

  • min_ratio – The min filter ratio in this op, samples will be filtered if their char-level n-gram repetition ratio is below this parameter.

  • max_ratio – The max filter ratio in this op, samples will be filtered if their char-level n-gram repetition ratio exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_batched(samples)[source]
process_batched(samples)[source]
class data_juicer.ops.filter.ImageShapeFilter(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with image shape (w, h) within specific ranges.

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_width – The min width to keep samples.

  • max_width – The max width to keep samples.

  • min_height – The min height to keep samples.

  • max_height – The max height to keep samples.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoDurationFilter(min_duration: float = 0, max_duration: float = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Keep data samples whose videos’ durations are within a specified range.

__init__(min_duration: float = 0, max_duration: float = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_duration – The min video duration to keep samples in seconds. It’s 0 by default.

  • max_duration – The max video duration to keep samples in seconds. It’s sys.maxsize by default.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.TextActionFilter(lang: str = 'en', min_action_num: int = 1, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples that contain actions in the text.

__init__(lang: str = 'en', min_action_num: int = 1, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – language of the text in the samples. ‘en’ for detection of actions in English and ‘zh’ for detection of actions in Chinese.

  • min_action_num – The min action number in the filtering; samples will be filtered if the number of actions in their text is below this parameter.

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoOcrAreaRatioFilter(min_area_ratio: float = 0, max_area_ratio: float = 1.0, frame_sample_num: int = 3, languages_to_detect: str | List[str] = ['ch_sim', 'en'], any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Keep data samples whose detected text area ratios for specified frames in the video are within a specified range.

__init__(min_area_ratio: float = 0, max_area_ratio: float = 1.0, frame_sample_num: int = 3, languages_to_detect: str | List[str] = ['ch_sim', 'en'], any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_area_ratio – The min ocr area ratio to keep samples. It’s 0 by default.

  • max_area_ratio – The max ocr area ratio to keep samples. It’s 1.0 by default.

  • frame_sample_num – The number of sampled frames to calculate the ocr area ratio. If it’s 1, only the middle frame will be selected. If it’s 2, only the first and the last frames will be selected. If it’s larger than 2, in addition to the first and the last frames, other frames will be sampled evenly within the video duration.

  • languages_to_detect – texts in which languages should be detected. Default: [‘ch_sim’, ‘en’]. Full language list can be found here: https://www.jaided.ai/easyocr/.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • args – extra args

  • kwargs – extra args

get_reader(rank)[source]
compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoNSFWFilter(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, score_threshold: float = 0.5, frame_sampling_method: str = 'all_keyframes', frame_num: int = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose videos have low nsfw scores.

__init__(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, score_threshold: float = 0.5, frame_sampling_method: str = 'all_keyframes', frame_num: int = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_nsfw_model – nsfw detection model name on huggingface.

  • score_threshold – the nsfw score threshold for samples, ranging from 0 to 1. Samples with nsfw scores less than this threshold will be kept.

  • frame_sampling_method – sampling method of extracting frame images from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • reduce_mode – reduce mode for multiple sampled video frames. ‘avg’: take the average of multiple values; ‘max’: take the max of multiple values; ‘min’: take the min of multiple values.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.SpecialCharactersFilter(min_ratio: float = 0.0, max_ratio: float = 0.25, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with special-char ratio within a specific range.

__init__(min_ratio: float = 0.0, max_ratio: float = 0.25, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_ratio – The min filter ratio in this op, samples will be filtered if their special-char ratio is below this parameter.

  • max_ratio – The max filter ratio in this op, samples will be filtered if their special-char ratio exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_batched(samples)[source]
process_batched(samples)[source]
class data_juicer.ops.filter.VideoFramesTextSimilarityFilter(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: float = 0.1, max_score: float = 1.0, frame_sampling_method: str = 'all_keyframes', frame_num: int = 3, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose similarities between sampled video frame images and text are within a specific range.

__init__(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: float = 0.1, max_score: float = 1.0, frame_sampling_method: str = 'all_keyframes', frame_num: int = 3, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_clip – clip model name on huggingface to compute the similarity between frame image and text. It’s kind of language-related. For example, for Chinese datasets, ChineseCLIP might be a better choice.

  • min_score – the min similarity to keep samples.

  • max_score – the max similarity to keep samples.

  • frame_sampling_method – sampling method of extracting frame images from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip – flip frame image horizontally (left to right).

  • vertical_flip – flip frame image vertically (top to bottom).

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • reduce_mode – reduce mode when one text corresponds to multiple video frame images in a chunk. ‘avg’: take the average of multiple values; ‘max’: take the max of multiple values; ‘min’: take the min of multiple values.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageAspectRatioFilter(min_ratio: float = 0.333, max_ratio: float = 3.0, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with image aspect ratio within a specific range. AspectRatio = W / H.

__init__(min_ratio: float = 0.333, max_ratio: float = 3.0, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_ratio – The min aspect ratio to keep samples.

  • max_ratio – The max aspect ratio to keep samples.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.AudioDurationFilter(min_duration: int = 0, max_duration: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Keep data samples whose audios’ durations are within a specified range.

__init__(min_duration: int = 0, max_duration: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_duration – The min audio duration to keep samples in seconds. It’s 0 by default.

  • max_duration – The max audio duration to keep samples in seconds. It’s sys.maxsize by default.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all audios. ‘any’: keep this sample if any audios meet the condition. ‘all’: keep this sample only if all audios meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.LanguageIDScoreFilter(lang: str | List[str] = '', min_score: float = 0.8, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples in a specific language with confidence score larger than a specific min value.

__init__(lang: str | List[str] = '', min_score: float = 0.8, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – The language(s) of the samples to keep.

  • min_score – The min language identification confidence scores of samples to keep.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.SuffixFilter(suffixes: str | List[str] = [], *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with specified suffix.

__init__(suffixes: str | List[str] = [], *args, **kwargs)[source]

Initialization method.

Parameters:
  • suffixes – the suffixes of files to keep. For example: ‘.txt’, ‘txt’ or [‘txt’, ‘.pdf’, ‘docx’].

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
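
A sketch using the suffix forms listed above; per the parameter description, the leading dot is optional:

    from data_juicer.ops.filter import SuffixFilter

    # Mixed dotted and dotless forms, as in the parameter description.
    op = SuffixFilter(suffixes=['txt', '.pdf', 'docx'])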

class data_juicer.ops.filter.ImageSizeFilter(min_size: str = '0', max_size: str = '1TB', any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Keep data samples whose image size (in bytes/KB/MB/…) is within a specific range.

__init__(min_size: str = '0', max_size: str = '1TB', any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_size – The min image size to keep samples. Set to “0” by default for no size constraint.

  • max_size – The max image size to keep samples. Set to “1TB” by default, an approximation of the unlimited case.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoWatermarkFilter(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, frame_sampling_method: str = 'all_keyframes', frame_num: int = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose videos have no watermark with high probability.

__init__(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, frame_sampling_method: str = 'all_keyframes', frame_num: int = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_watermark_model – watermark detection model name on huggingface.

  • prob_threshold – the predicted watermark probability threshold for samples, ranging from 0 to 1. Samples with watermark probabilities less than this threshold will be kept.

  • frame_sampling_method – sampling method of extracting frame images from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • reduce_mode – reduce mode for multiple sampled video frames. ‘avg’: take the average of multiple values; ‘max’: take the max of multiple values; ‘min’: take the min of multiple values.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • args – extra args

  • kwargs – extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.WordsNumFilter(lang: str = 'en', tokenization: bool = False, min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with total words number within a specific range.

__init__(lang: str = 'en', tokenization: bool = False, min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – The language of the samples.

  • tokenization – whether to use model to tokenize documents

  • min_num – The min filter word number in this op, samples will be filtered if their word number is below this parameter.

  • max_num – The max filter word number in this op, samples will be filtered if their word number exceeds this parameter.

  • args – extra args

  • kwargs – extra args

compute_stats_batched(samples, context=False)[source]
process_batched(samples)[source]
class data_juicer.ops.filter.ImageFaceCountFilter(cv_classifier: str = '', min_face_count: int = 1, max_face_count: int = 1, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with the number of faces within a specific range.

__init__(cv_classifier: str = '', min_face_count: int = 1, max_face_count: int = 1, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • cv_classifier – OpenCV classifier path for face detection. By default, we will use ‘haarcascade_frontalface_alt.xml’.

  • min_face_count – Minimum number of faces required for samples.

  • max_face_count – Maximum number of faces allowed for samples.

  • any_or_all – Keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – Extra positional arguments.

  • kwargs – Extra keyword arguments.

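A minimal construction sketch; the face-count range is illustrative, and leaving cv_classifier empty selects the default cascade:

    from data_juicer.ops.filter import ImageFaceCountFilter

    # Keep a sample only if every image contains 1 to 3 detected
    # faces, using the default 'haarcascade_frontalface_alt.xml'
    # classifier.
    op = ImageFaceCountFilter(min_face_count=1, max_face_count=3,
                              any_or_all='all')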
compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageFaceRatioFilter(cv_classifier: str = '', min_ratio: float = 0.0, max_ratio: float = 0.4, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with face area ratios within a specific range.

__init__(cv_classifier: str = '', min_ratio: float = 0.0, max_ratio: float = 0.4, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • cv_classifier – OpenCV classifier path for face detection. By default, we will use ‘haarcascade_frontalface_alt.xml’.

  • min_ratio – Min ratio for the largest face area in an image.

  • max_ratio – Max ratio for the largest face area in an image.

  • any_or_all – Keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – Extra positional arguments.

  • kwargs – Extra keyword arguments.

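A minimal construction sketch with the documented default bounds spelled out:

    from data_juicer.ops.filter import ImageFaceRatioFilter

    # Keep a sample if any of its images has a largest-face area
    # ratio of at most 0.4, i.e. the biggest detected face covers
    # no more than 40% of the image.
    op = ImageFaceRatioFilter(min_ratio=0.0, max_ratio=0.4,
                              any_or_all='any')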
compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.FlaggedWordFilter(lang: str = 'en', tokenization: bool = False, max_ratio: float = 0.045, flagged_words_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[int] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with flagged-word ratio less than a specific max value.

__init__(lang: str = 'en', tokenization: bool = False, max_ratio: float = 0.045, flagged_words_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[int] = [2], words_aug_join_char: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – Consider flagged words in what language. If lang == “all”, the word list merged from all available languages is adopted.

  • tokenization – Whether to use a model to tokenize documents.

  • max_ratio – The max flagged-word ratio to keep samples; samples with a higher ratio will be filtered.

  • flagged_words_dir – The directory storing the flagged-words file(s), whose names include “flagged_words” and which are in JSON format.

  • use_words_aug – Whether to augment words, especially for Chinese and Vietnamese.

  • words_aug_group_sizes – The group sizes of words to augment.

  • words_aug_join_char – The join char between words to augment.

  • args – extra args

  • kwargs – extra args

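A minimal construction sketch; the Chinese variant shows the word-augmentation options described above, with illustrative values:

    from data_juicer.ops.filter import FlaggedWordFilter

    # English: filter out samples whose flagged-word ratio
    # exceeds 0.045.
    op_en = FlaggedWordFilter(lang='en', max_ratio=0.045)

    # Chinese: tokenize with a model and augment words into 2-grams
    # so that multi-character flagged terms can be matched.
    op_zh = FlaggedWordFilter(lang='zh', tokenization=True,
                              use_words_aug=True,
                              words_aug_group_sizes=[2],
                              words_aug_join_char='')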
compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.WordRepetitionFilter(lang: str = 'en', tokenization: bool = False, rep_len: int = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with word-level n-gram repetition ratio within a specific range.

__init__(lang: str = 'en', tokenization: bool = False, rep_len: int = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – sample in which language.

  • tokenization – whether to use model to tokenize documents

  • rep_len – Repetition length for word-level n-gram.

  • min_ratio – The min filter ratio in this op; samples will be filtered if their word-level n-gram repetition ratio is below this parameter.

  • max_ratio – The max filter ratio in this op; samples will be filtered if their word-level n-gram repetition ratio exceeds this parameter.

  • args – extra args

  • kwargs – extra args

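A minimal construction sketch with the documented defaults spelled out:

    from data_juicer.ops.filter import WordRepetitionFilter

    # Drop samples whose word-level 10-gram repetition ratio exceeds
    # 0.5; highly repetitive text is a common sign of low quality.
    op = WordRepetitionFilter(lang='en', rep_len=10,
                              min_ratio=0.0, max_ratio=0.5)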
compute_stats_batched(samples, context=False)[source]

process_batched(samples)[source]

class data_juicer.ops.filter.VideoMotionScoreFilter(min_score: float = 0.25, max_score: float = 1.7976931348623157e+308, sampling_fps: float = 2, size: int | Tuple[int] | Tuple[int, int] | None = None, max_size: int | None = None, relative: bool = False, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with video motion scores within a specific range. Farneback's algorithm from OpenCV is used to compute dense optical flow.

__init__(min_score: float = 0.25, max_score: float = 1.7976931348623157e+308, sampling_fps: float = 2, size: int | Tuple[int] | Tuple[int, int] | None = None, max_size: int | None = None, relative: bool = False, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_score – The minimum motion score to keep samples.

  • max_score – The maximum motion score to keep samples.

  • sampling_fps – The sampling rate in frames_per_second for optical flow calculations.

  • size – Resize frames before computing optical flow. If size is a sequence like (h, w), the frame size will be matched to this. If size is an int, the smaller edge of frames will be matched to this number, i.e., if height > width, then the frame will be rescaled to (size * height / width, size). Default None to keep the original size.

  • max_size – The maximum allowed for the longer edge of resized frames. If the longer edge of frames is greater than max_size after being resized according to size, size will be overruled so that the longer edge is equal to max_size. As a result, the smaller edge may be shorter than size. This is only supported if size is an int.

  • relative – If True, the optical flow magnitude is normalized to a [0, 1] range, relative to the frame’s diagonal length.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all videos. ‘any’: keep this sample if any videos meet the condition. ‘all’: keep this sample only if all videos meet the condition.

  • args – extra args

  • kwargs – extra args

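A minimal construction sketch wiring together the resizing and normalization options above; the threshold values are illustrative:

    from data_juicer.ops.filter import VideoMotionScoreFilter

    # Compute dense optical flow on 2 sampled frames per second,
    # with frames resized so the shorter edge is 256 px but the
    # longer edge never exceeds 512 px (max_size only applies when
    # size is an int).
    op = VideoMotionScoreFilter(
        min_score=0.005,  # illustrative bound for relative scores
        sampling_fps=2,
        size=256,
        max_size=512,
        relative=True,    # normalize flow by the frame diagonal
        any_or_all='any',
    )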
compute_stats_single(sample, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImagePairSimilarityFilter(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: ClosedUnitInterval = 0.1, max_score: ClosedUnitInterval = 1.0, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep image pairs with similarities between images within a specific range.

__init__(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: ClosedUnitInterval = 0.1, max_score: ClosedUnitInterval = 1.0, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_clip – clip model name on huggingface to compute the similarity between images.

  • min_score – The min similarity to keep samples.

  • max_score – The max similarity to keep samples.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

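A minimal construction sketch; instantiation downloads the CLIP checkpoint from HuggingFace, and the score bounds are illustrative:

    from data_juicer.ops.filter import ImagePairSimilarityFilter

    # Keep pairs whose CLIP image-image similarity falls in
    # [0.1, 0.9]: high enough to exclude unrelated pairs, low
    # enough to exclude near-duplicates.
    op = ImagePairSimilarityFilter(hf_clip='openai/clip-vit-base-patch32',
                                   min_score=0.1, max_score=0.9,
                                   any_or_all='any')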
compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering