data_juicer.ops.filter

class data_juicer.ops.filter.AlphanumericFilter(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)

Bases: Filter

Filter to keep samples with an alphabet/numeric ratio within a specific range.

This operator filters samples based on the ratio of alphanumeric characters or tokens. It keeps samples where the ratio of alphanumeric characters (or tokens) to the total number of characters (or tokens) is within the specified range. The ratio is computed at the character or token level, depending on the tokenization parameter. If tokenization is True, a Hugging Face tokenizer is used to count tokens. The key metric used for filtering is 'alpha_token_ratio' if tokenization is enabled, otherwise 'alnum_ratio'. The operator caches these metrics in the stats field for each sample.
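
A minimal usage sketch (the sample schema below, including the '__dj__stats__' stats field name, is an assumption based on Data-Juicer conventions, not a guaranteed contract):

    from data_juicer.ops.filter import AlphanumericFilter

    op = AlphanumericFilter(min_ratio=0.25)  # character-based by default
    samples = {
        'text': ['Hello world 123', '#@!% ... ---'],
        '__dj__stats__': [{}, {}],  # populated by compute_stats_batched
    }
    samples = op.compute_stats_batched(samples)
    keep_flags = list(op.process_batched(samples))  # e.g. [True, False]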

__init__(tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = 9223372036854775807, *args, **kwargs)

Initialization method.

Parameters:
  • tokenization -- Whether to count the ratio of alphanumeric tokens to the total number of tokens. If tokenization=False, the ratio of alphanumeric characters to the total number of characters is counted instead.

  • min_ratio -- The min filter ratio in this op; samples will be filtered if their alphabet/numeric ratio is below this parameter.

  • max_ratio -- The max filter ratio in this op; samples will be filtered if their alphabet/numeric ratio exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples)
process_batched(samples)
class data_juicer.ops.filter.AudioDurationFilter(min_duration: int = 0, max_duration: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Keep data samples whose audio durations are within a specified range.

This operator filters data samples based on the duration of their audio files. It keeps samples where the audio duration is between a minimum and maximum value, in seconds. The operator supports two strategies for keeping samples: 'any' (keep if any audio meets the condition) or 'all' (keep only if all audios meet the condition). The audio duration is computed using the librosa library. If the audio duration has already been computed, it is retrieved from the sample's stats under the key 'audio_duration'. If no audio is present in the sample, an empty array is stored in the stats.
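
A quick sketch of the underlying duration check (the description above names librosa; the file name and exact call form here are illustrative assumptions):

    import librosa

    duration = librosa.get_duration(path='clip.wav')  # duration in seconds
    keep = 0 <= duration <= 600  # e.g. min_duration=0, max_duration=600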

__init__(min_duration: int = 0, max_duration: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • min_duration -- The min audio duration to keep samples in seconds. It's 0 by default.

  • max_duration -- The max audio duration to keep samples in seconds. It's sys.maxsize by default.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all audios. 'any': keep this sample if any audio meets the condition. 'all': keep this sample only if all audios meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.AudioNMFSNRFilter(min_snr: float = 0, max_snr: float = 9223372036854775807, nmf_iter_num: Annotated[int, Gt(gt=0)] = 500, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Keep data samples whose audio Signal-to-Noise Ratios (SNRs) are within a specified range.

This operator computes the SNR of each audio in a sample using Non-negative Matrix Factorization (NMF). It then filters the samples based on whether their SNRs fall within the given minimum and maximum thresholds. The SNR is computed for each audio, and the filtering strategy can be set to either 'any' or 'all'. In 'any' mode, a sample is kept if at least one of its audios meets the SNR criteria. In 'all' mode, all audios must meet the criteria for the sample to be kept. The NMF computation uses a specified number of iterations. If no audio is present in the sample, the SNR is recorded as an empty array. The key metric is stored in the 'audio_nmf_snr' field.

__init__(min_snr: float = 0, max_snr: float = 9223372036854775807, nmf_iter_num: Annotated[int, Gt(gt=0)] = 500, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • min_snr -- The min audio SNR to keep samples in dB. It's 0 by default.

  • max_snr -- The max audio SNR to keep samples in dB. It's sys.maxsize by default.

  • nmf_iter_num -- The max number of iterations to run NMF. It's 500 by default.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all audios. 'any': keep this sample if any audio meets the condition. 'all': keep this sample only if all audios meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.AudioSizeFilter(min_size: str = '0', max_size: str = '1TB', any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Keep data samples based on the size of their audio files.

This operator filters data samples by checking if the size of their audio files falls within a specified range. The size can be in bytes, kilobytes, megabytes, or any other unit. The key metric used is 'audio_sizes', which is an array of file sizes in bytes. If no audio files are present, the 'audio_sizes' field will be an empty array. The operator supports two strategies for keeping samples: 'any' and 'all'. In 'any' mode, a sample is kept if at least one of its audio files meets the size criteria. In 'all' mode, all audio files must meet the size criteria for the sample to be kept.

__init__(min_size: str = '0', max_size: str = '1TB', any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • min_size -- The min audio size to keep samples. Set to "0" by default for no size constraint.

  • max_size -- The max audio size to keep samples. Set to "1TB" by default, an approximation of the unlimited case.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all audios. 'any': keep this sample if any audio meets the condition. 'all': keep this sample only if all audios meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.AverageLineLengthFilter(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)

Bases: Filter

Filter to keep samples with average line length within a specific range.

This operator filters out samples based on their average line length. It keeps samples where the average line length is between the specified minimum and maximum values. The average line length is calculated as the total text length divided by the number of lines. If the context is provided, it uses precomputed lines from the context. The computed average line length is stored in the 'avg_line_length' key in the stats field.

__init__(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)

Initialization method.

Parameters:
  • min_len -- The min filter length in this op; samples will be filtered if their average line length is below this parameter.

  • max_len -- The max filter length in this op; samples will be filtered if their average line length exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples, context=False)
process_batched(samples)
class data_juicer.ops.filter.CharacterRepetitionFilter(rep_len: Annotated[int, Gt(gt=0)] = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)

Bases: Filter

Filter to keep samples with character-level n-gram repetition ratio within a specific range.

This operator calculates the character-level n-gram repetition ratio for each sample and filters out samples that do not fall within the specified range. The repetition ratio is computed based on the frequency of n-grams in the text. The key metric 'char_rep_ratio' is cached in the stats field. Samples are kept if their 'char_rep_ratio' is between the specified min and max ratios. The n-gram length, minimum, and maximum ratios are configurable.
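
The repetition ratio can be pictured with a small standalone sketch (this mirrors the idea of counting repeated char-level n-grams; it is not necessarily the operator's exact formula):

    from collections import Counter

    def char_rep_ratio(text: str, rep_len: int = 10) -> float:
        # collect all char-level n-grams of length rep_len
        ngrams = [text[i:i + rep_len] for i in range(len(text) - rep_len + 1)]
        if not ngrams:
            return 0.0
        counts = Counter(ngrams)
        # fraction of n-gram occurrences that belong to a repeated n-gram
        repeated = sum(c for c in counts.values() if c > 1)
        return repeated / len(ngrams)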

__init__(rep_len: Annotated[int, Gt(gt=0)] = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)

Initialization method.

Parameters:
  • rep_len -- Repetition length for char-level n-gram.

  • min_ratio -- The min filter ratio in this op; samples will be filtered if their char-level n-gram repetition ratio is below this parameter.

  • max_ratio -- The max filter ratio in this op; samples will be filtered if their char-level n-gram repetition ratio exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples)
process_batched(samples)
class data_juicer.ops.filter.FlaggedWordFilter(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.0, max_ratio: float = 0.045, flagged_words_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)

Bases: Filter

Filter to keep samples with flagged-word ratio in a specified range.

This operator filters out samples based on the ratio of flagged words. It uses a list of flagged words, which can be language-specific or combined from multiple languages. The flagged-word ratio is computed as the number of flagged words divided by the total number of words in the sample. If tokenization is enabled, a Hugging Face tokenizer is used to split the text into words. The operator supports word augmentation for certain languages, which can be configured. The key metric, 'flagged_words_ratio', is cached and reused if already computed. Samples are kept if their flagged-word ratio falls within the specified min and max ratio.

__init__(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.0, max_ratio: float = 0.045, flagged_words_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)

Initialization method.

Parameters:
  • lang -- The language of flagged words to consider. If lang == "all", the list merged from all available languages is adopted.

  • tokenization -- Whether to use a model to tokenize documents.

  • min_ratio -- The min filter ratio in this op.

  • max_ratio -- The max filter ratio in this op.

  • flagged_words_dir -- The directory storing the flagged_words file(s), whose names include "flagged_words" and which are in JSON format.

  • use_words_aug -- Whether to augment words, especially for Chinese and Vietnamese

  • words_aug_group_sizes -- The group size of words to augment

  • words_aug_join_char -- The join char between words to augment

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples, context=False)
process_batched(samples)
class data_juicer.ops.filter.ImageAestheticsFilter(hf_scorer_model: str = '', trust_remote_code: bool = False, min_score: float = 0.5, max_score: float = 1.0, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep samples with aesthetics scores within a specific range.

This operator uses a Hugging Face model to predict the aesthetics score of images. It keeps samples where the predicted scores fall within the specified min and max score range. The operator supports two strategies: 'any' (keep if any image meets the condition) and 'all' (keep only if all images meet the condition). Aesthetics scores are cached in the 'image_aesthetics_scores' field. If no images are present, the sample is kept. Scores are normalized by dividing by 10 if the model name includes 'shunk031/aesthetics-predictor'.

__init__(hf_scorer_model: str = '', trust_remote_code: bool = False, min_score: float = 0.5, max_score: float = 1.0, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • hf_scorer_model -- Huggingface model name for the aesthetics predictor. By default, we will use 'shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE', refer to pypi.org/project/simple-aesthetics-predictor

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- Min score for the predicted aesthetics in an image.

  • max_score -- Max score for the predicted aesthetics in an image.

  • any_or_all -- Keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.

compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageAspectRatioFilter(min_ratio: float = 0.333, max_ratio: float = 3.0, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep samples with image aspect ratio within a specific range.

The operator computes the aspect ratio for each image in the sample, defined as the width divided by the height (W / H). It caches the computed aspect ratios in the 'aspect_ratios' field. Samples are kept if their images' aspect ratios fall within the specified minimum and maximum range. The 'any_or_all' parameter determines the strategy: 'any' keeps samples if at least one image meets the criteria, while 'all' requires all images to meet the criteria. If no images are present in a sample, the sample is not filtered out.
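
The 'any'/'all' strategies reduce the per-image checks to a single keep decision, as in this sketch with illustrative values:

    ratios = [0.5, 1.2, 4.0]  # W / H for each image in one sample
    in_range = [0.333 <= r <= 3.0 for r in ratios]
    keep_any = any(in_range)  # True: at least one image is in range
    keep_all = all(in_range)  # False: 4.0 exceeds max_ratio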

__init__(min_ratio: float = 0.333, max_ratio: float = 3.0, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • min_ratio -- The min aspect ratio to keep samples.

  • max_ratio -- The max aspect ratio to keep samples.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples, context=False)
process_batched(samples)
class data_juicer.ops.filter.ImageFaceCountFilter(cv_classifier: str = '', min_face_count: int = 1, max_face_count: int = 1, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep samples with the number of faces within a specific range.

This operator uses an OpenCV classifier for face detection. It filters samples based on the number of faces detected in the images, keeping only those with a face count within the specified range. The operator supports two strategies: 'any' (keep if any image meets the condition) and 'all' (keep only if all images meet the condition). The face counts are cached in the 'face_counts' field. If no images are present in the sample, the face count is set to an empty array.

__init__(cv_classifier: str = '', min_face_count: int = 1, max_face_count: int = 1, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • cv_classifier -- OpenCV classifier path for face detection. By default, we will use 'haarcascade_frontalface_alt.xml'.

  • min_face_count -- Minimum number of faces required for samples.

  • max_face_count -- Maximum number of faces required for samples.

  • any_or_all -- Keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageFaceRatioFilter(cv_classifier: str = '', min_ratio: float = 0.0, max_ratio: float = 0.4, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep samples with face area ratios within a specific range.

This operator filters samples based on the ratio of the largest face area to the total image area. It uses an OpenCV classifier for face detection. The key metric, 'face_ratios', is computed for each image in the sample. Samples are kept if the face area ratios fall within the specified min and max ratio range. The filtering strategy can be set to 'any' (keep if any image meets the condition) or 'all' (keep only if all images meet the condition). If no images are present in the sample, the sample is retained.

__init__(cv_classifier: str = '', min_ratio: float = 0.0, max_ratio: float = 0.4, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • cv_classifier -- OpenCV classifier path for face detection. By default, we will use 'haarcascade_frontalface_alt.xml'.

  • min_ratio -- Min ratio for the largest face area in an image.

  • max_ratio -- Max ratio for the largest face area in an image.

  • any_or_all -- Keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageNSFWFilter(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, min_score: float = 0.0, max_score: float = 0.5, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep samples whose images have nsfw scores in a specified range.

This operator uses a Hugging Face model to compute the nsfw scores for each image in a sample. It keeps samples based on the specified min_score and max_score thresholds. The operator supports two strategies: 'any' (keep the sample if any image meets the condition) or 'all' (keep the sample only if all images meet the condition). The nsfw scores are cached in the 'image_nsfw_score' field of the sample's stats.

__init__(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, min_score: float = 0.0, max_score: float = 0.5, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • hf_nsfw_model -- nsfw detection model name on huggingface.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- the min NSFW score threshold for samples, ranging from 0 to 1.

  • max_score -- the max NSFW score threshold for samples, ranging from 0 to 1.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImagePairSimilarityFilter(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: ClosedUnitInterval = 0.1, max_score: ClosedUnitInterval = 1.0, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep image pairs with similarities between images within a specific range.

This operator uses a Hugging Face CLIP model to compute the cosine similarity between two images in each sample. It retains samples where the similarity score falls within the specified minimum and maximum thresholds. The 'any' strategy keeps a sample if any of the image pairs meet the condition, while the 'all' strategy requires all image pairs to meet the condition. The similarity scores are cached in the 'image_pair_similarity' field. Each sample must include exactly two distinct images.

__init__(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: ClosedUnitInterval = 0.1, max_score: ClosedUnitInterval = 1.0, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • hf_clip -- clip model name on huggingface to compute the similarity between the two images.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- The min similarity to keep samples.

  • max_score -- The max similarity to keep samples.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageShapeFilter(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep samples with image shape (width, height) within specific ranges.

This operator filters samples based on the width and height of images. It keeps samples where the image dimensions fall within the specified ranges. The operator supports two strategies: 'any' and 'all'. In 'any' mode, a sample is kept if at least one image meets the criteria. In 'all' mode, all images in the sample must meet the criteria for the sample to be kept. The image width and height are stored in the 'image_width' and 'image_height' fields of the sample's stats. If no images are present in the sample, the corresponding stats fields will be empty arrays.

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • min_width -- The min width to keep samples.

  • max_width -- The max width to keep samples.

  • min_height -- The min height to keep samples.

  • max_height -- The max height to keep samples.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageSizeFilter(min_size: str = '0', max_size: str = '1TB', any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Keep data samples whose image size (in Bytes/KB/MB/...) is within a specific range.

This operator filters data samples based on the size of their images. It keeps samples if the image sizes fall within the specified minimum and maximum size range. The operator supports two strategies: 'any' (keep the sample if any image meets the size condition) and 'all' (keep the sample only if all images meet the size condition). If no images are present in the sample, the 'image_sizes' field will be an empty array.

__init__(min_size: str = '0', max_size: str = '1TB', any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • min_size -- The min image size to keep samples. Set to "0" by default for no size constraint.

  • max_size -- The max image size to keep samples. Set to "1TB" by default, an approximation of the unlimited case.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageTextMatchingFilter(hf_blip: str = 'Salesforce/blip-itm-base-coco', trust_remote_code: bool = False, min_score: float = 0.003, max_score: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)

Bases: Filter

Filter to keep samples with image-text matching scores within a specific range.

This operator uses a Hugging Face BLIP model to compute the matching score between images and text. It keeps samples where the matching score falls within the specified min_score and max_score range. The key metric, image_text_matching_score, is computed for each image-text pair. If multiple images are associated with a single text, the scores can be reduced using 'avg', 'max', or 'min' modes. The operator supports horizontal and vertical flipping of images. Samples are kept based on either 'any' or 'all' strategy: 'any' keeps the sample if any image meets the condition, while 'all' keeps the sample only if all images meet the condition.

__init__(hf_blip: str = 'Salesforce/blip-itm-base-coco', trust_remote_code: bool = False, min_score: float = 0.003, max_score: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)

Initialization method.

Parameters:
  • hf_blip -- blip model name on huggingface to compute the matching score between image and text.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- The min matching score to keep samples.

  • max_score -- The max matching score to keep samples.

  • horizontal_flip -- Flip image horizontally (left to right).

  • vertical_flip -- Flip image vertically (top to bottom).

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • reduce_mode -- reduce mode when one text corresponds to multiple images in a chunk. 'avg': take the average of multiple values; 'max': take the max of multiple values; 'min': take the min of multiple values.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageTextSimilarityFilter(hf_clip: str = 'openai/clip-vit-base-patch32', trust_remote_code: bool = False, min_score: float = 0.1, max_score: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)

Bases: Filter

Filter to keep samples with image-text similarity within a specified range.

This operator uses a Hugging Face CLIP model to compute the similarity between images and text. It retains samples where the similarity scores fall within the given range. The similarity score is computed for each image-text pair, and the final score can be reduced using 'avg', 'max', or 'min' modes. The 'any' or 'all' strategy determines if at least one or all image-text pairs must meet the similarity criteria. The key metric 'image_text_similarity' is cached in the sample's stats. Images can be flipped horizontally or vertically before computing the similarity.
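
A hedged sketch of the underlying CLIP similarity (illustrative code, not the operator's internals; 'cat.jpg' is a hypothetical local file):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
    processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
    inputs = processor(text=['a photo of a cat'], images=Image.open('cat.jpg'),
                       return_tensors='pt', padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs['pixel_values'])
        txt_emb = model.get_text_features(input_ids=inputs['input_ids'],
                                          attention_mask=inputs['attention_mask'])
    score = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()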

__init__(hf_clip: str = 'openai/clip-vit-base-patch32', trust_remote_code: bool = False, min_score: float = 0.1, max_score: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)

Initialization method.

Parameters:
  • hf_clip -- clip model name on huggingface to compute the similarity between image and text.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- The min similarity to keep samples.

  • max_score -- The max similarity to keep samples.

  • horizontal_flip -- Flip image horizontally (left to right).

  • vertical_flip -- Flip image vertically (top to bottom).

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • reduce_mode -- reduce mode when one text corresponds to multiple images in a chunk. 'avg': take the average of multiple values; 'max': take the max of multiple values; 'min': take the min of multiple values.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.ImageWatermarkFilter(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Filter to keep samples whose images have no watermark with high probability.

This operator uses a Hugging Face watermark detection model to filter samples based on the presence of watermarks in their images. It keeps samples where the predicted watermark probability is below a specified threshold. The operator supports two strategies: 'any' (keep if any image meets the condition) and 'all' (keep only if all images meet the condition). The key metric 'image_watermark_prob' is computed for each image, representing the probability that the image contains a watermark. If no images are present in the sample, the metric is set to an empty array.

__init__(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:
  • hf_watermark_model -- watermark detection model name on huggingface.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • prob_threshold -- the predicted watermark probability threshold for samples, ranging from 0 to 1. Samples with watermark probability less than this threshold will be kept.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any image meets the condition. 'all': keep this sample only if all images meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.LanguageIDScoreFilter(lang: str | List[str] = '', min_score: float = 0.8, *args, **kwargs)

Bases: Filter

Filter to keep samples in a specific language with a confidence score above a threshold.

This operator uses a FastText model to identify the language of each sample. It keeps samples that are in the specified language(s) and have a language identification confidence score greater than or equal to the minimum score. If no specific language is provided, it only filters based on the confidence score. The language ID and its confidence score are stored in the 'lang' and 'lang_score' fields of the sample's stats, respectively.
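
For intuition, FastText language ID looks like this sketch (the 'lid.176.bin' path refers to the public FastText LID model and is an assumption here; Data-Juicer manages its own model assets):

    import fasttext

    model = fasttext.load_model('lid.176.bin')
    labels, scores = model.predict('This is an English sentence.')
    lang = labels[0].replace('__label__', '')  # e.g. 'en'
    keep = (lang == 'en') and (scores[0] >= 0.8)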

__init__(lang: str | List[str] = '', min_score: float = 0.8, *args, **kwargs)

Initialization method.

Parameters:
  • lang -- The language(s) of samples to keep.

  • min_score -- The min language identification confidence score of samples to keep.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.InContextInfluenceFilter(valid_dataset: List[Dict] | None = None, task_desc: str = None, valid_as_demo: bool = False, n_shot: int | None = None, *args, **kwargs)

Bases: LLMPerplexityFilter

Filter to keep texts based on their in-context influence on a validation set.

This operator calculates the in-context influence of each sample by comparing perplexities with and without the sample as context. The influence score is computed as the ratio of these perplexities. If valid_as_demo is True, the score is L(A|Q) / L(A|task_desc, Q_v, A_v, Q). Otherwise, it is L(A_v|Q) / L(A_v|task_desc, Q, A, Q_v). The operator retains samples whose in-context influence score is within a specified range. The in-context influence score is stored in the 'in_context_influence' field of the sample's stats. The validation set must be prepared using the prepare_valid_feature method if not provided during initialization.
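
Numerically, the score is just a ratio of two perplexities, as in this sketch with illustrative values (valid_as_demo=True case):

    import math

    ppl_plain = math.exp(2.1)        # perplexity for L(A|Q), no extra context
    ppl_in_context = math.exp(1.6)   # perplexity for L(A|task_desc, Q_v, A_v, Q)
    score = ppl_plain / ppl_in_context  # > 1 means the context lowers perplexity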

__init__(valid_dataset: List[Dict] | None = None, task_desc: str = None, valid_as_demo: bool = False, n_shot: int | None = None, *args, **kwargs)

Initialization method.

Parameters:
  • valid_dataset -- The dataset to use for validation. If None, 'self.prepare_valid_feature' should be manually called before applying the filter.

  • task_desc -- The description of the validation task.

  • valid_as_demo -- If true, score = L(A|Q) / L(A|task_desc, Q_v, A_v, Q); if false, score = L(A_v|Q) / L(A_v|task_desc, Q, A, Q_v).

  • n_shot -- The number of shots in validation.

compute_stats_single(sample, rank=None)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

prepare_valid_feature(dataset=None, task_desc=None, n_shot=None, *args, **kwargs)
process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

property valid_feature_ready
class data_juicer.ops.filter.InstructionFollowingDifficultyFilter(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)

Bases: LLMPerplexityFilter

Filter to keep texts based on their instruction following difficulty (IFD, https://arxiv.org/abs/2308.12032) score.

This operator computes the IFD score for each sample, which is the ratio of the loss with and without the query. It keeps samples where the IFD score falls within a specified range. The IFD score is calculated using a Hugging Face tokenizer and model. If the IFD score is already cached in the 'ifd_score' field, it will be reused. The operator decides to keep or filter samples based on the provided minimum and maximum IFD score thresholds.
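
The IFD ratio itself is simple, as in this sketch with illustrative loss values that follow the description above ("ratio of the loss with and without the query"):

    loss_with_query = 1.8     # loss of the response given the query
    loss_without_query = 2.4  # loss of the response alone
    ifd_score = loss_with_query / loss_without_query  # kept if within [min, max]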

compute_stats_single(sample, rank=None)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.LLMAnalysisFilter(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, max_score: float = 1.0, is_hf_model: bool = False, *, api_endpoint: str | None = None, response_path: str | None = None, input_keys: List[str] = ['text'], field_names: List[str] = ['Text'], system_prompt: str | None = None, input_template: str | None = None, field_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, dim_required_keys: List[str] | None = None, **kwargs)

Bases: Filter

Base filter class for leveraging LLMs to analyze and filter data samples.

This operator uses an LLM to score and tag each sample across multiple quality dimensions. It supports both API-based and Hugging Face models. The LLM evaluates the sample on clarity, relevance, usefulness, and fluency, providing scores from 1 to 5. Tags are assigned to categorize the sample, and a recommendation is made to keep, review, or discard the sample. The average score is computed based on the required dimension keys. Samples are kept if their average score falls within the specified min and max score thresholds. The key metric 'llm_analysis_score' is cached in the sample's stats.
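
A sketch of how a parsed response could be reduced to the average score (the JSON layout follows the system prompt below; any rescaling of 1-5 scores to the 0-1 threshold range is not shown and is left as an assumption):

    analysis = {
        'dimension_scores': {'clarity': 4, 'relevance': 5, 'usefulness': 3, 'fluency': 4},
        'tags': {'topic': 'AI', 'style': 'Informational'},
        'recommendation': 'keep',
    }
    keys = ['clarity', 'relevance', 'usefulness', 'fluency']  # DEFAULT_DIM_REQUIRED_KEYS
    avg_score = sum(analysis['dimension_scores'][k] for k in keys) / len(keys)  # 4.0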

DEFAULT_DIM_REQUIRED_KEYS = ['clarity', 'relevance', 'usefulness', 'fluency']
DEFAULT_FIELD_TEMPLATE = '**{field_name}**\n{field_data}'
DEFAULT_INPUT_TEMPLATE = "# Data\n'''\n{data}\n'''\n\n# Response\njson\n"
DEFAULT_SYSTEM_PROMPT = 'You are a meticulous data quality assessor for LLM training. Analyze each data sample across multiple quality dimensions and provide numerical scores, tags, and reasoning. Follow these guidelines:\n\n1. Evaluation Dimensions\nScore each dimension (1-5 scale: 1=lowest, 5=highest):\n- Clarity: How easy is the sample to understand?\n- Relevance: How relevant is the sample to the intended task or topic?\n- Usefulness: How helpful or valuable is the information in the sample?\n- Fluency: How natural and well-written is the sample (grammar, style)?\n\n2. Tagging:\nAssign descriptive tags to categorize the data sample (string or list of string).  Examples include:\n- "Topic": The main subject of the sample (e.g., "Machine Learning", "Historical Event").\n- "Style":  The writing style or genre (e.g., "Informational", "Narrative", "Technical").\n3. Scoring Protocol\n- Base scores and tags on concrete evidence from the text.\n- Flag samples needing human review (confidence <90%).\n- Compare with similar data points for consistency.\n- Penalize hallucination/misinformation severely (if applicable).\n\n4. Output Format\njson\n{\n  "dimension_scores": {\n    "clarity": ,\n    "relevance": ,\n    "usefulness": ,\n    "fluency":\n  },\n  "tags": {\n    "topic": ,\n    "style":\n  },\n  "flags": ["syntax_error", "insufficient_information", ...],\n  "rationale": "Concise analysis of quality dimensions and tagging decisions.",\n  "recommendation": ["keep", "review", "discard"]\n}\n\n5. Special Instructions\n- Prioritize accuracy and relevance over stylistic qualities.\n- Contextualize cultural references appropriately.\n- Clearly justify your scores, tags, and flags in the rationale.\n- Response a json dict\n\nExample Response:\n\njson\n{\n  "dimension_scores": {\n    "clarity": 4,\n    "relevance": 5,\n    "usefulness": 3,\n    "fluency": 4\n  },\n  "tags": {\n    "topic": "Artificial Intelligence",\n    "style": "Informational"\n  },\n  "flags": ["minor_grammar_issues"],\n  "rationale": "The text is highly relevant and generally well-written, but suffers from some minor grammar issues and could be more useful with additional examples.  The topic is clearly Artificial Intelligence, and the difficulty is appropriate for an intermediate audience.",\n  "recommendation": "review"\n}\n'
__init__(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, max_score: float = 1.0, is_hf_model: bool = False, *, api_endpoint: str | None = None, response_path: str | None = None, input_keys: List[str] = ['text'], field_names: List[str] = ['Text'], system_prompt: str | None = None, input_template: str | None = None, field_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, dim_required_keys: List[str] | None = None, **kwargs)

Initialization method.

Parameters:
  • api_or_hf_model -- API or huggingface model name.

  • min_score -- The min score threshold to keep the sample.

  • max_score -- The max score threshold to keep the sample.

  • is_hf_model -- If true, use Transformers for loading hugging face or local llm.

  • api_endpoint -- URL endpoint for the API.

  • response_path -- Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • input_keys -- Subset of keys in the sample. Supports data with multiple fields, such as 'query', 'analysis' and 'answer' in RFT data.

  • field_names -- Corresponding field names for input keys.

  • system_prompt -- System prompt for the task.

  • input_template -- Template for building the model input.

  • field_template -- Template for each field in the prompt.

  • try_num -- The number of retry attempts when there is an API call error or output parsing error.

  • enable_vllm -- If true, use VLLM for loading hugging face or local llm.

  • model_params -- Parameters for initializing the API model.

  • sampling_params -- Extra parameters passed to the API call. e.g. {'temperature': 0.9, 'top_p': 0.95}

  • dim_required_keys -- A list of keys used to calculate the average dimension score, only the dimension scores associated with these keys are used in the average calculation.

  • kwargs -- Extra keyword arguments.

build_input(sample)
compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

generate_llm_analysis(sample, rank)
parse_output(raw_output)
process_single(sample, rank=None)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.LLMQualityScoreFilter(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, max_score: float = 1.0, is_hf_model: bool = False, *, api_endpoint: str | None = None, response_path: str | None = None, input_keys: List[str] = ['text'], field_names: List[str] = ['Text'], system_prompt: str | None = None, input_template: str | None = None, field_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, dim_required_keys: List[str] | None = None, **kwargs)

Bases: LLMAnalysisFilter

Filter to keep samples with a high quality score estimated by a language model.

This operator uses a language model to evaluate the quality of each sample across multiple dimensions, including accuracy, grammar, informativeness, and coherence. The LLM provides a numerical score for each dimension on a 1-5 scale, where 1 is the lowest and 5 is the highest. The overall quality score is used to decide whether to keep or filter out the sample based on the specified minimum and maximum score thresholds. The evaluation results are cached in the 'llm_quality_score' and 'llm_quality_record' fields. Important flags and tags from the LLM's analysis may also be stored in the sample's stats.

DEFAULT_DIM_REQUIRED_KEYS = ['accuracy', 'grammar', 'informativeness', 'coherence']
DEFAULT_SYSTEM_PROMPT = '\nYou are a meticulous data quality assessor for LLM training. Analyze each data sample across multiple quality dimensions and provide numerical scores with reasoning. Follow these guidelines:\n\n1. Evaluation Dimensions\nScore each dimension (1-5 scale: 1=lowest, 5=highest):\n- Accuracy: Factual correctness & verifiability\n- Grammar: Linguistic correctness & fluency\n- Informativeness: Depth/utility of content\n- Coherence: Logical structure & consistency\n\n2. Scoring Protocol\n- Base scores on concrete evidence from text\n- Flag samples needing human review (confidence <90%)\n- Compare with similar data points for consistency\n- Penalize hallucination/misinformation severely\n\n3. Output Format\njson\n{\n  "dimension_scores": {\n    "accuracy": ,\n    "grammar": ,\n    "informativeness": ,\n    "coherence":\n  },\n  "flags": ["syntax_error", "insufficient_information", ...],\n  "rationale": "Concise technical analysis",\n  "recommendation": ["keep", "review", "discard"]\n}\n4. Special Instructions\n- Prioritize factual integrity over stylistic qualities\n- Treat unverified medical/legal claims as high-risk\n- Contextualize cultural references appropriately\n- Response a json dict\n\nExample Response:\n\njson\n{\n  "dimension_scores": {\n    "accuracy": 2,\n    "grammar": 4,\n    "informativeness": 4,\n    "coherence": 2\n  },\n  "flags": ["accuracy_concern", "logical_confusion"],\n  "rationale": "The text provides rich information but suffers from logical confusion and lacks contextual coherence. Excellent grammatical structure offset by factual inaccuracies.",\n  "recommendation": "review"\n}\n'
compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.LLMPerplexityFilter(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)

Bases: Filter

Filter to keep samples with perplexity scores within a specified range, computed using a specified LLM.

This operator computes the perplexity score for each sample using a Hugging Face LLM. It then filters the samples based on whether their perplexity scores fall within the specified minimum and maximum score range. The perplexity score is calculated as the exponential of the loss value from the LLM. The operator uses a query and response template to format the input text for the LLM. If the perplexity score is not already cached in the sample's stats under the key 'llm_perplexity', it will be computed.
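
A minimal sketch of exp(loss) perplexity with a Hugging Face causal LM (illustrative, not the operator's internal code; formatting via the query/response templates is omitted):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = 'Qwen/Qwen2.5-0.5B'
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tokenizer('The quick brown fox jumps over the lazy dog.',
                       return_tensors='pt')
    with torch.no_grad():
        loss = model(**inputs, labels=inputs['input_ids']).loss
    perplexity = torch.exp(loss).item()  # kept iff min_score <= ppl <= max_score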

__init__(hf_model: str = 'Qwen/Qwen2.5-0.5B', model_params: Dict | None = None, min_score: float = 1.0, max_score: float = 100.0, query_template: str | None = None, response_template: str | None = None, *args, **kwargs)

Initialization method.

Parameters:
  • hf_model -- huggingface model name used to compute perplexity.

  • model_params -- Parameters for initializing the model.

  • min_score -- Minimum perplexity score.

  • max_score -- Maximum perplexity score.

  • query_template -- Template for building the query string.

  • response_template -- Template for building the response string.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

sample_with_messages(sample, system_prompt=None)
class data_juicer.ops.filter.LLMDifficultyScoreFilter(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, max_score: float = 1.0, is_hf_model: bool = False, *, api_endpoint: str | None = None, response_path: str | None = None, input_keys: List[str] = ['text'], field_names: List[str] = ['Text'], system_prompt: str | None = None, input_template: str | None = None, field_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, dim_required_keys: List[str] | None = None, **kwargs)

Bases: LLMAnalysisFilter

Filter to keep samples with high difficulty scores estimated by an LLM.

This operator uses a Hugging Face LLM to evaluate the difficulty of each sample. The LLM analyzes the sample across multiple dimensions, including linguistic complexity, conceptual depth, prior knowledge, step complexity, and ambiguity. Each dimension is scored on a 1-5 scale, with 5 being the highest difficulty. The final difficulty score is computed as the average of these dimension scores. Samples are kept if their difficulty score falls within the specified range (min_score to max_score). The key metric 'llm_difficulty_score' is stored in the sample's stats, along with detailed records and flags.

DEFAULT_DIM_REQUIRED_KEYS = ['linguistic_complexity', 'conceptual_depth', 'prior_knowledge', 'step_complexity', 'ambiguity']
DEFAULT_SYSTEM_PROMPT = '\nYou are an expert pedagogical evaluator for LLM training data. Analyze each data sample through multiple difficulty lenses and provide calibrated scores with detailed reasoning. Follow these guidelines:\n\n1. Evaluation Dimensions\nRate each dimension (1-5 scale: 1=Novice-friendly, 3=Intermediate, 5=Expert-level):\n- Linguistic Complexity: Vocabulary sophistication & syntactic structures\n- Conceptual Depth: Abstraction level & theoretical requirements\n- Prior Knowledge: Required domain-specific understanding\n- Step Complexity: Problem-solving steps needed\n- Ambiguity: Multiple valid interpretations\n\n2. Output Format\njson\n{\n  "dimension_scores": {\n    "linguistic_complexity": ,\n    "conceptual_depth": ,\n    "prior_knowledge": ,\n    "step_complexity": ,\n    "ambiguity":\n  },\n  "flags": ["multistep_reasoning", "cultural_context", ...],\n  "rationale": "Technical analysis of challenge sources"\n}\n3. Special Instructions\n- Differentiate intrinsic vs. extrinsic difficulty factors\n- Account for varying cultural/educational backgrounds\n- Mark samples requiring cross-domain knowledge synthesis\n- Consider temporal aspects for time-sensitive subjects\n- Flag ambiguous samples needing difficulty bracketing\n- Response a json dict\n\nExample Response:\n\njson\n{\n  "dimension_scores": {\n    "linguistic_complexity": 3,\n    "conceptual_depth": 5,\n    "prior_knowledge": 4,\n    "step_complexity": 4,\n    "ambiguity": 5\n  },\n  "flags": ["nonlinear_reasoning", "semantic_ambiguity"],\n  "rationale": "High conceptual difficulty due to multi-layered metaphor interpretation requiring philosophy background. Moderate linguistic complexity offset by implicit cultural references."\n}\n'
compute_stats_single(sample, rank=None, context=False)

Compute stats for the sample, which are used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering

class data_juicer.ops.filter.LLMTaskRelevanceFilter(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, is_hf_model: bool = False, *, valid_dataset: List[Dict] | None = None, task_desc: str | None = None, n_shot: int | None = None, **kwargs)

Bases: LLMAnalysisFilter

Filter to keep samples with high relevance scores to validation tasks estimated by an LLM.

This operator evaluates the relevance of each sample to a specified validation task using an LLM. The LLM scores the sample on multiple dimensions, including topical relevance, linguistic style match, task match, knowledge alignment, and potential utility. Each dimension is scored on a 1-5 scale, with 5 being the highest. The key metric, 'llm_task_relevance', is the average score across these dimensions. Samples are kept if their average score meets or exceeds the specified minimum threshold. The operator uses either an API or a Hugging Face model for evaluation. If no validation dataset or task description is provided, the 'prepare_valid_feature' method must be called manually before applying the filter.

DEFAULT_DIM_REQUIRED_KEYS = ['topical_relevance', 'linguistic_style_match', 'task_match', 'knowledge_alignment', 'potential_utility']
DEFAULT_SYSTEM_PROMPT = '\nYou are a meticulous data quality assessor for LLM training. Evaluate whether each data sample is beneficial for improving model performance on a downstream task.\nThe downstream task will be characterized by a task description or/and some validation data in the user query.\n\n1. Evaluation Dimensions\nScore each dimension (1-5 scale: 1=lowest, 5=highest):\n- Topical Relevance: Does the content or theme of the sample relate to those seen in the validation set?\n- Linguistic Style Match: Does the style, tone, and complexity of the sample resemble those in the validation set?\n- Task Match: If the validation examples are from a task (e.g., summarization, classification, etc.), is the sample solving a similar task?\n- Knowledge Alignment: Is the type of knowledge or reasoning required in the sample aligned with that in the validation set?\n- Potential Utility: If this sample were added to the training data, is it likely to improve generalization to the validation set?\n\n2. Output Format\njson\n{\n  "dimension_scores": {\n    "topical_relevance": ,\n    "linguistic_style_match": ,\n    "task_match": ,\n    "knowledge_alignment": ,\n    "potential_utility": ,\n  },\n  "flags": ["topical_mismatch", "task_irrelevant", ...],\n  "rationale": "Technical analysis of the relevance",\n}\n3. Special Instructions\n- Focus on **alignment with the validation examples**, not general quality.\n- If the sample is entirely unrelated to the validation set (e.g., different topic, domain, or task), assign a score of 1 and explain briefly.\n- If the validation examples are ambiguous, make a **conservative judgment** based on their shared patterns.\n- Be consistent in your rating scale across evaluations.\n- Do **not** make up or reinterpret the sample content; base all reasoning on the actual text.\n- Avoid overrating stylistically impressive but **task-irrelevant** samples.\n\nExample Response:\n\njson\n{\n  "dimension_scores": {"topical_relevance": 2, "linguistic_style_match": 4, "task_match": 2, "knowledge_alignment": 2, "potential_utility": 2},\n  "flags": ["topical_mismatch"],\n  "rationale": "The text provides rich information about American history, while the validation tasks is on multistep reasoning to solve challenging math problems."\n}\n'
__init__(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, is_hf_model: bool = False, *, valid_dataset: List[Dict] | None = None, task_desc: str | None = None, n_shot: int | None = None, **kwargs)[源代码]

Initialization method.

参数:
  • api_or_hf_model -- API or huggingface model name.

  • min_score -- The lowest score threshold to keep the sample.

  • is_hf_model -- Indicates if the model is from HuggingFace.

  • valid_dataset -- The dataset to use for validation.

  • task_desc -- The description of the validation task. If valid_dataset=None and task_desc=None, 'self.prepare_valid_feature' should be manually called before applying the filter.

  • n_shot -- The number of shots in validation.

build_input(sample)[源代码]
compute_stats_single(sample, rank=None, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

prepare_valid_feature(dataset=None, task_desc=None, n_shot=None, *args, **kwargs)[源代码]
process_single(sample, rank=None)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

property valid_feature_ready
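
A minimal usage sketch for this filter. Assumptions: an OpenAI-compatible API key is configured in the environment for the 'gpt-4o' backend, and the task description below is a hypothetical placeholder:

    from data_juicer.ops.filter import LLMTaskRelevanceFilter

    # Hypothetical task description; with neither valid_dataset nor
    # task_desc, prepare_valid_feature() would have to be called manually
    # before applying the filter.
    op = LLMTaskRelevanceFilter(
        api_or_hf_model='gpt-4o',
        min_score=0.5,
        task_desc='Solve grade-school math word problems step by step.',
    )
    print(op.valid_feature_ready)  # expected True once the feature exists
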
class data_juicer.ops.filter.MaximumLineLengthFilter(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with a maximum line length within a specified range.

This operator filters out samples based on the length of their longest line. It retains samples where the maximum line length is within the specified min_len and max_len range. The maximum line length is computed by splitting the text into lines and measuring the length of each line. If the context is provided, it uses precomputed lines stored under the key 'lines' in the context. The maximum line length is cached in the 'max_line_length' field of the stats.

__init__(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_len -- The min filter length in this op; samples will be filtered if their maximum line length is below this parameter.

  • max_len -- The max filter length in this op; samples will be filtered if their maximum line length exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples, context=False)[源代码]
process_batched(samples)[源代码]
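
The statistic this operator keys on can be reproduced in a few lines; an illustrative sketch, not the operator's internal code:

    def max_line_length(text: str) -> int:
        # Split the text into lines and measure the longest one.
        return max((len(line) for line in text.split('\n')), default=0)

    print(max_line_length('short\na much longer second line'))  # 25
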
class data_juicer.ops.filter.PerplexityFilter(lang: str = 'en', min_ppl: float = 0, max_ppl: float = 1500, *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with perplexity score in a specified range.

This operator computes the perplexity of text samples using a Hugging Face tokenizer and a KenLM language model. It keeps samples with perplexity scores within the specified minimum and maximum values. If the perplexity has already been computed, it is reused from the 'perplexity' field in the sample's stats. The operator supports batched operations for efficiency.

__init__(lang: str = 'en', min_ppl: float = 0, max_ppl: float = 1500, *args, **kwargs)[源代码]

Initialization method.

参数:
  • lang -- Compute perplexity for samples in which language.

  • min_ppl -- The min filter perplexity in this op.

  • max_ppl -- The max filter perplexity in this op.

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples, context=False)[源代码]
process_batched(samples)[源代码]
class data_juicer.ops.filter.PhraseGroundingRecallFilter(hf_owlvit: str = 'google/owlvit-base-patch32', trust_remote_code: bool = False, min_recall: float = 0.1, max_recall: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', iou_thr: float = 0.5, large_area_ratio_thr: float = 0.95, conf_thr: float = 0.0, *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples based on the phrase grounding recall of phrases extracted from text in images.

This operator uses a Hugging Face Owl-ViT model to locate phrases extracted from the text within the images. It keeps samples where the phrase grounding recall is within a specified range. The recall is computed by comparing the number of correctly located phrases to the total number of phrases. The operator can handle multiple images per text chunk and supports different strategies for reducing the recall values (e.g., average, max, min). It also allows for flipping images horizontally or vertically. The key metric 'phrase_grounding_recall' is computed and stored in the sample's stats. If no images are present, the recall is set to an empty array.

__init__(hf_owlvit: str = 'google/owlvit-base-patch32', trust_remote_code: bool = False, min_recall: float = 0.1, max_recall: float = 1.0, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', iou_thr: float = 0.5, large_area_ratio_thr: float = 0.95, conf_thr: float = 0.0, *args, **kwargs)[源代码]

Initialization method.

参数:
  • hf_owlvit -- Owl-ViT model name on huggingface to locate the phrases extracted from the text.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_recall -- The min phrase grounding recall to keep samples.

  • max_recall -- The max phrase grounding recall to keep samples.

  • horizontal_flip -- Flip image horizontally (left to right).

  • vertical_flip -- Flip image vertically (top to bottom).

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all images. 'any': keep this sample if any images meet the condition. 'all': keep this sample only if all images meet the condition.

  • reduce_mode -- reduce mode when one text corresponds to multiple images in a chunk. 'avg': Take the average of multiple values. 'max': Take the max of multiple values. 'min': Take the min of multiple values.

  • iou_thr -- the IoU threshold for the NMS-like post-process. If two predicted bboxes overlap with an IoU larger than this threshold, the bbox with the lower confidence will be removed. Default: 0.5.

  • large_area_ratio_thr -- the area ratio threshold for filtering out those large predicted bboxes. If the area of a predicted bbox accounts for more than this ratio threshold of the whole image area, this bbox will be removed. Default: 0.95.

  • conf_thr -- the confidence score threshold for removing low-confidence bboxes. If the confidence score of a predicted bbox is lower than the threshold, this bbox will be removed. Default: 0.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering
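
For intuition about the NMS-like post-processing controlled by iou_thr, here is a standard IoU computation for two axis-aligned boxes (the operator's internal box format may differ):

    def iou(box_a, box_b):
        # Boxes given as (x1, y1, x2, y2) corner coordinates.
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union else 0.0

    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143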

class data_juicer.ops.filter.SpecialCharactersFilter(min_ratio: float = 0.0, max_ratio: float = 0.25, *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with special-character ratio within a specific range.

This operator filters out samples based on the ratio of special characters in the text. It keeps samples where the special-character ratio is within the specified minimum and maximum thresholds. The special-character ratio is computed as the number of special characters divided by the total number of characters in the text. If the 'special_char_ratio' is already cached in the stats, it will be reused. Otherwise, it will be computed and stored in the 'special_char_ratio' field.

__init__(min_ratio: float = 0.0, max_ratio: float = 0.25, *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_ratio -- The min filter ratio in this op; samples will be filtered if their special-char ratio is below this parameter.

  • max_ratio -- The max filter ratio in this op; samples will be filtered if their special-char ratio exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples)[源代码]
process_batched(samples)[源代码]
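
An illustrative sketch of the ratio described above; the operator's actual special-character set is a predefined constant, so string.punctuation below is only a stand-in:

    import string

    def special_char_ratio(text: str, special=frozenset(string.punctuation)) -> float:
        # Count special characters and normalize by the total length.
        if not text:
            return 0.0
        return sum(ch in special for ch in text) / len(text)

    print(special_char_ratio('hello, world!!'))  # 3 of 14 chars, about 0.214
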
class data_juicer.ops.filter.SpecifiedFieldFilter(field_key: str = '', target_value: List = [], *args, **kwargs)[源代码]

基类:Filter

Filter samples based on the specified field information.

This operator checks if the value of a specified field in each sample is within a given target value range. If the field value is not within the target range, the sample is filtered out. The field can be a multi-level key, with levels separated by dots. The target value is a list of acceptable values for the field. If the field value is not a list or tuple, it is converted to a list for comparison. Samples are retained if all values in the field match any of the target values.

  • Uses the 'field_key' and 'target_value' parameters.

  • Supports multi-level field keys, e.g., 'level1.level2'.

  • Converts non-list/tuple field values to a list for comparison.

__init__(field_key: str = '', target_value: List = [], *args, **kwargs)[源代码]

Initialization method.

参数:
  • field_key -- Filter based on the specified value corresponding to the target key. Multi-level keys in the target field should be separated by '.'.

  • target_value -- The range of specified field information corresponding to the samples that need to be retained.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering
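
The dot-separated multi-level lookup can be sketched as follows (illustrative, not the operator's exact implementation):

    def get_field(sample: dict, field_key: str):
        # Walk down the nested dict one dot-separated key at a time.
        value = sample
        for key in field_key.split('.'):
            value = value[key]
        return value

    sample = {'meta': {'source': 'wiki'}}
    print(get_field(sample, 'meta.source'))  # 'wiki'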

class data_juicer.ops.filter.SpecifiedNumericFieldFilter(field_key: str = '', min_value: float = -9223372036854775807, max_value: float = 9223372036854775807, *args, **kwargs)[源代码]

基类:Filter

Filter samples based on a specified numeric field value.

This operator filters out samples if the numeric value in the specified field is not within the given range. The field can be multi-level, with keys separated by dots. The sample is kept if the numeric value is between the minimum and maximum values, inclusive. If the field key is not provided, all samples are retained. The operator ensures that the field exists in the sample and that its value is numeric before performing the comparison.

  • Uses the 'min_value' and 'max_value' to define the acceptable range.

  • Supports multi-level fields using dot-separated keys.

  • Returns False for non-numeric or out-of-range values, filtering the sample.

__init__(field_key: str = '', min_value: float = -9223372036854775807, max_value: float = 9223372036854775807, *args, **kwargs)[源代码]

Initialization method.

参数:
  • field_key -- Filter based on the specified numeric value corresponding to the target key. Multi-level keys in the target field should be separated by '.'.

  • min_value -- The min filter value in SpecifiedNumericField op; samples will be filtered if their specified numeric field value is below this parameter.

  • max_value -- The max filter value in SpecifiedNumericField op; samples will be filtered if their specified numeric field value exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering
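
The keep decision reduces to a guarded numeric comparison; a hedged sketch:

    def keep_numeric(value, min_value: float, max_value: float) -> bool:
        # Non-numeric values are filtered out, mirroring the behavior above.
        try:
            v = float(value)
        except (TypeError, ValueError):
            return False
        return min_value <= v <= max_value

    print(keep_numeric('3.5', 0, 10))   # True
    print(keep_numeric('n/a', 0, 10))   # False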

class data_juicer.ops.filter.StopWordsFilter(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with stopword ratio within a specified range.

This operator calculates the ratio of stopwords in a sample and keeps samples where this ratio is between the specified minimum and maximum values. The stopword ratio is computed as the number of stopwords divided by the total number of words. If the tokenization parameter is set, a Hugging Face tokenizer is used to tokenize the text. The stopwords are loaded from a directory, and if the language is set to "all", it merges stopwords from all available languages. The key metric is 'stopwords_ratio'. The operator also supports word augmentation for specific languages.

__init__(lang: str = 'en', tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = '/home/runner/.cache/data_juicer/assets', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] = [2], words_aug_join_char: str = '', *args, **kwargs)[源代码]

Initialization method.

参数:
  • lang -- The language whose stopwords to consider. If lang == "all", the list merged from all available languages is adopted.

  • tokenization -- whether to use model to tokenize documents

  • min_ratio -- The min filter ratio in this op.

  • max_ratio -- The max filter ratio in this op.

  • stopwords_dir -- The directory storing the stopwords file(s); each file name should include "stopwords" and be in JSON format.

  • use_words_aug -- Whether to augment words, especially for Chinese and Vietnamese

  • words_aug_group_sizes -- The group size of words to augment

  • words_aug_join_char -- The join char between words to augment

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering
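
The core statistic is a ratio over words; a sketch with a toy stopword set (the real lists are loaded from stopwords_dir):

    def stopwords_ratio(words, stopwords) -> float:
        # Fraction of tokens that appear in the stopword set.
        if not words:
            return 0.0
        return sum(w in stopwords for w in words) / len(words)

    words = 'the cat sat on the mat'.split()
    print(stopwords_ratio(words, {'the', 'on', 'a'}))  # 3/6 = 0.5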

class data_juicer.ops.filter.SuffixFilter(suffixes: str | List[str] = [], *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with specified suffix.

This operator retains samples that have a suffix matching any of the provided suffixes. If no suffixes are specified, all samples are kept. The key metric 'keep' is computed based on whether the sample's suffix matches the specified list. The 'suffix' field of each sample is checked against the list of allowed suffixes. If the suffix matches, the sample is kept; otherwise, it is filtered out.

__init__(suffixes: str | List[str] = [], *args, **kwargs)[源代码]

Initialization method.

参数:
  • suffixes -- the suffixes of text that will be kept. For example: '.txt', 'txt', or ['txt', '.pdf', 'docx']

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

class data_juicer.ops.filter.TextActionFilter(lang: str = 'en', min_action_num: int = 1, *args, **kwargs)[源代码]

基类:Filter

Filter to keep texts that contain a minimum number of actions.

This operator uses a Spacy model to detect actions in the text. It keeps samples if the number of detected actions meets or exceeds the specified minimum. The supported languages are English ('en') and Chinese ('zh'). The 'num_action' statistic is computed and cached for each sample. Actions are identified based on part-of-speech (POS) tags and specific tags for verbs.

__init__(lang: str = 'en', min_action_num: int = 1, *args, **kwargs)[源代码]

Initialization method.

参数:
  • lang -- language of the text in the samples. 'en' for detection of actions in English and 'zh' for detection of actions in Chinese.

  • min_action_num -- The min action number in the filtering. Samples will be filtered if the number of actions in their text is below this parameter.

compute_stats_single(sample, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

class data_juicer.ops.filter.TextEmbdSimilarityFilter(api_or_hf_model: str = 'text-embedding-v4', is_hf_model: bool = False, api_endpoint: str = 'embeddings', response_path: str = 'data.0.embedding', model_params: Dict | None = None, min_score: ClosedUnitInterval = 0.1, max_score: ClosedUnitInterval = 1.0, valid_dataset: List[Dict] | None = None, ebd_dim: int = 4096, pooling: str | None = None, input_template: str | None = None, *args, **kwargs)[源代码]

基类:Filter

Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range.

This operator computes the cosine similarity between the text embeddings and a set of validation text embeddings. It keeps samples where the average similarity score is within the specified range. The key metric, 'text_embd_similarity', is computed as the mean cosine similarity. The operator supports both API-based and Hugging Face model-based embeddings. If no valid dataset is provided, the prepare_valid_feature method must be called manually before applying the filter.

__init__(api_or_hf_model: str = 'text-embedding-v4', is_hf_model: bool = False, api_endpoint: str = 'embeddings', response_path: str = 'data.0.embedding', model_params: Dict | None = None, min_score: ClosedUnitInterval = 0.1, max_score: ClosedUnitInterval = 1.0, valid_dataset: List[Dict] | None = None, ebd_dim: int = 4096, pooling: str | None = None, input_template: str | None = None, *args, **kwargs)[源代码]

Initialization method.

参数:
  • api_or_hf_model -- API or huggingface embedding model name.

  • is_hf_model -- Indicates if the model is from HuggingFace.

  • api_endpoint -- Embedding URL endpoint for the API.

  • response_path -- Path to extract content from the API response. Defaults to 'data.0.embedding' for embedding model.

  • model_params -- Parameters for initializing the API model.

  • min_score -- The min average similarity to keep samples.

  • max_score -- The max average similarity to keep samples.

  • valid_dataset -- The dataset to use for validation. If None, 'self.prepare_valid_feature' should be manually called before applying the filter.

  • ebd_dim -- The embedding dimension when using the API. This is an API-specific parameter, i.e., if is_hf_model=True, it takes no effect.

  • pooling -- strategy to extract the embedding from the hidden states (see https://arxiv.org/abs/2503.01807). None: default option, the hidden state of the last token. "mean": uniform mean of hidden states. "weighted_mean": weighted mean of hidden states (https://arxiv.org/abs/2202.08904). This is an HF-model-specific parameter, i.e., if is_hf_model=False, it takes no effect.

  • input_template -- Template for building the model input.

compute_stats_single(sample, rank=None)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

prepare_valid_feature(dataset, n_shot=None, *args, **kwargs)[源代码]
process_single(sample, rank=None)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

property valid_feature_ready
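
The 'text_embd_similarity' metric is a mean cosine similarity; a NumPy sketch assuming one sample embedding compared against a matrix of validation embeddings:

    import numpy as np

    def mean_cosine_similarity(sample_ebd: np.ndarray, valid_ebds: np.ndarray) -> float:
        # Normalize, then average the cosine similarities to all validation texts.
        sample_ebd = sample_ebd / np.linalg.norm(sample_ebd)
        valid_ebds = valid_ebds / np.linalg.norm(valid_ebds, axis=1, keepdims=True)
        return float((valid_ebds @ sample_ebd).mean())

    print(mean_cosine_similarity(np.ones(4), np.eye(4)))  # 0.5
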
class data_juicer.ops.filter.TextEntityDependencyFilter(lang: str = 'en', min_dependency_num: int = 1, any_or_all: str = 'all', *args, **kwargs)[源代码]

基类:Filter

Identify and filter text samples based on entity dependencies.

This operator uses a spaCy model to detect entities in the text and evaluates their dependency relationships. It filters out samples where entities have fewer than a specified number of dependency edges. The key metric is 'num_dependency_edges', which counts the number of edges for each entity in the dependency tree. Samples with no detected entities are omitted. The operator supports 'any' or 'all' strategies: 'any' keeps samples if at least one entity meets the dependency threshold, while 'all' requires all entities to meet the threshold. Supported languages are English ('en') and Chinese ('zh').

__init__(lang: str = 'en', min_dependency_num: int = 1, any_or_all: str = 'all', *args, **kwargs)[源代码]

Initialization method.

参数:
  • lang -- language of the text in the samples. 'en' for detection of entities in English and 'zh' for detection of entities in Chinese.

  • min_dependency_num -- The min number of dependency edges in the filtering. An entity is considered independent if its number of edges in the dependency tree is below this parameter.

  • any_or_all -- keep this sample with 'any' or 'all' strategy. 'any': keep this sample if any entity is dependent. 'all': keep this sample only if all entities are dependent.

compute_stats_single(sample, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

class data_juicer.ops.filter.TextLengthFilter(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with total text length within a specific range.

This operator filters out samples based on their total text length. It retains samples where the text length is between the specified minimum and maximum lengths. The text length is computed as the number of characters in the sample's text. If the 'text_len' key is already present in the sample's stats, it will be reused; otherwise, it will be computed. The operator processes samples in batches for efficiency.

__init__(min_len: int = 10, max_len: int = 9223372036854775807, *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_len -- The min text length in the filtering. Samples will be filtered if their text length is below this parameter.

  • max_len -- The max text length in the filtering. Samples will be filtered if their text length exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args

compute_stats_batched(samples)[源代码]
process_batched(samples)[源代码]
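
Because this operator has no model dependencies, it makes a compact end-to-end example; the '__dj__stats__' stats key shown below is an assumption about the conventional Data-Juicer sample layout:

    from data_juicer.ops.filter import TextLengthFilter

    op = TextLengthFilter(min_len=5, max_len=100)
    samples = {
        'text': ['hi', 'a sentence that is long enough to pass'],
        '__dj__stats__': [{}, {}],  # assumed stats field layout
    }
    samples = op.compute_stats_batched(samples)
    print(list(op.process_batched(samples)))  # expected: [False, True]
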
class data_juicer.ops.filter.TextPairSimilarityFilter(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: ClosedUnitInterval = 0.1, max_score: ClosedUnitInterval = 1.0, text_key_second=None, any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:Filter

Filter to keep text pairs with similarities within a specific range.

This operator computes the similarity between two texts in a pair using a Hugging Face CLIP model. It keeps samples where the similarity score falls within the specified min and max thresholds. The key metric, 'text_pair_similarity', is computed as the cosine similarity between the text embeddings. The operator supports two strategies for keeping samples: 'any' (keep if any pair meets the condition) and 'all' (keep only if all pairs meet the condition). If the second text key is not provided, the operator will raise an error. The similarity scores are cached under the 'text_pair_similarity' field in the sample's stats.

__init__(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: ClosedUnitInterval = 0.1, max_score: ClosedUnitInterval = 1.0, text_key_second=None, any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • hf_clip -- clip model name on huggingface to compute the similarity between the two texts in a pair.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- The min similarity to keep samples.

  • max_score -- The max similarity to keep samples.

  • text_key_second -- used to store the other sentence in the text pair.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all text pairs. 'any': keep this sample if any pair meets the condition. 'all': keep this sample only if all pairs meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample, rank=None)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

class data_juicer.ops.filter.TokenNumFilter(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with a total token number within a specified range.

This operator uses a Hugging Face tokenizer to count the number of tokens in each sample. It keeps samples where the token count is between the minimum and maximum thresholds. The token count is stored in the 'num_token' field of the sample's stats. If the token count is not already computed, it will be calculated using the specified tokenizer.

__init__(hf_tokenizer: str = 'EleutherAI/pythia-6.9b-deduped', min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[源代码]

Initialization method.

参数:
  • hf_tokenizer -- the tokenizer name of Hugging Face tokenizers.

  • min_num -- The min filter token number in this op; samples will be filtered if their token number is below this parameter.

  • max_num -- The max filter token number in this op; samples will be filtered if their token number exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering
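
A sketch of the underlying statistic using the Hugging Face tokenizers API directly (the operator manages its own tokenizer; this is only illustrative):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('EleutherAI/pythia-6.9b-deduped')
    num_token = len(tokenizer.tokenize('Data processing improves model quality.'))
    print(num_token)  # the sample is kept only if min_num <= num_token <= max_num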

class data_juicer.ops.filter.VideoAestheticsFilter(hf_scorer_model: str = '', trust_remote_code: bool = False, min_score: float = 0.4, max_score: float = 1.0, frame_sampling_method: str = 'uniform', frame_num: Annotated[int, Gt(gt=0)] = 3, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[源代码]

基类:Filter

Filter to keep data samples with aesthetics scores for specified frames in the videos within a specific range.

This operator evaluates the aesthetic quality of video frames using a Hugging Face model. It keeps samples where the aesthetics scores of the specified frames fall within a given range. The key metric, 'video_frames_aesthetics_score', is computed by averaging, taking the max, or taking the min of the frame scores, depending on the reduce mode. Frame sampling can be done uniformly or by extracting all keyframes. The filter applies an 'any' or 'all' strategy to decide whether a sample should be kept based on the scores of multiple videos.

__init__(hf_scorer_model: str = '', trust_remote_code: bool = False, min_score: float = 0.4, max_score: float = 1.0, frame_sampling_method: str = 'uniform', frame_num: Annotated[int, Gt(gt=0)] = 3, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[源代码]

Initialization method.

参数:
  • hf_scorer_model -- Huggingface model name for the aesthetics predictor. By default, we will use 'shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE'; refer to pypi.org/project/simple-aesthetics-predictor

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- Min score for the predicted aesthetics in a video.

  • max_score -- Max score for the predicted aesthetics in a video.

  • frame_sampling_method -- sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes and the latter extracts a specified number of frames uniformly from the video. Default: "uniform" with frame_num=3, considering that the number of keyframes can be large while their differences in aesthetics are usually small.

  • frame_num -- the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • any_or_all -- Keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • reduce_mode -- reduce mode when one sample corresponds to multiple frames, must be one of ['avg','max', 'min']. 'avg': Take the average of multiple values. 'max': Take the max of multiple values. 'min': Take the min of multiple values.

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.

compute_stats_single(sample, rank=None, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoAspectRatioFilter(min_ratio: str = '9/21', max_ratio: str = '21/9', any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with video aspect ratio within a specific range.

This operator filters samples based on the aspect ratios of their videos. It keeps samples where the video aspect ratios fall within a specified range. The aspect ratio is calculated as the width divided by the height (W / H). The operator supports two strategies for keeping samples: 'any' and 'all'. In 'any' mode, a sample is kept if at least one video meets the aspect ratio condition. In 'all' mode, all videos in the sample must meet the condition for the sample to be kept. The aspect ratios are computed and stored in the 'video_aspect_ratios' field of the sample's stats.

__init__(min_ratio: str = '9/21', max_ratio: str = '21/9', any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_ratio -- The minimum aspect ratio to keep samples, supported format is a string, such as "9:21" or "9/21".

  • max_ratio -- The maximum aspect ratio to keep samples, supported format is a string, such as "21:9" or "21/9".

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering
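
Both "9:21" and "9/21" style ratio strings reduce to exact fractions; an illustrative sketch of the keep decision for a single video:

    from fractions import Fraction

    def keep_aspect_ratio(width: int, height: int,
                          min_ratio: str = '9/21', max_ratio: str = '21/9') -> bool:
        # Normalize "W:H" to "W/H", then compare as exact fractions.
        lo = Fraction(min_ratio.replace(':', '/'))
        hi = Fraction(max_ratio.replace(':', '/'))
        return lo <= Fraction(width, height) <= hi

    print(keep_aspect_ratio(1920, 1080))  # True: 16/9 lies within [9/21, 21/9]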

class data_juicer.ops.filter.VideoDurationFilter(min_duration: float = 0, max_duration: float = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:Filter

Keep data samples whose videos' durations are within a specified range.

This operator filters data samples based on the duration of their associated videos. It keeps samples where the video durations fall within a specified minimum and maximum range. The filtering strategy can be set to 'any' or 'all': - 'any': Keep the sample if any of its videos meet the duration criteria. - 'all': Keep the sample only if all of its videos meet the duration criteria. The video durations are computed and stored in the 'video_duration' field of the sample's stats. If no videos are present, an empty array is stored.

__init__(min_duration: float = 0, max_duration: float = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_duration -- The min video duration to keep samples in seconds. It's 0 by default.

  • max_duration -- The max video duration to keep samples in seconds. It's sys.maxsize by default.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoFramesTextSimilarityFilter(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: float = 0.1, max_score: float = 1.0, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples based on the similarity between video frame images and text within a specific range.

This operator uses a Hugging Face CLIP model to compute the similarity between video frames and associated text. It keeps samples where the computed similarity scores fall within a specified range. The operator supports different frame sampling methods, including 'all_keyframes' and 'uniform', and allows for horizontal and vertical flipping of the frames. The similarity score is reduced using one of three modes: 'avg', 'max', or 'min'. The operator also supports two strategies for keeping samples: 'any' (keep if any video meets the condition) or 'all' (keep only if all videos meet the condition). The key metric is stored in the 'video_frames_text_similarity' field.

__init__(hf_clip='openai/clip-vit-base-patch32', trust_remote_code=False, min_score: float = 0.1, max_score: float = 1.0, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, any_or_all: str = 'any', reduce_mode: str = 'avg', *args, **kwargs)[源代码]

Initialization method.

参数:
  • hf_clip -- clip model name on huggingface to compute the similarity between frame image and text. It's kind of language-related. For example, for Chinese datasets, ChineseCLIP might be a better choice.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- the min similarity to keep samples.

  • max_score -- the max similarity to keep samples.

  • frame_sampling_method -- sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num -- the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip -- flip frame image horizontally (left to right).

  • vertical_flip -- flip frame image vertically (top to bottom).

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • reduce_mode -- reduce mode when one text corresponds to multiple video frame images in a chunk. 'avg': Take the average of multiple values. 'max': Take the max of multiple values. 'min': Take the min of multiple values.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample, rank=None)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering
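
The reduce_mode options map directly onto simple reductions; an illustrative sketch:

    import numpy as np

    def reduce_scores(scores, mode: str = 'avg') -> float:
        # 'avg', 'max', and 'min' select the corresponding NumPy reduction.
        reducer = {'avg': np.mean, 'max': np.max, 'min': np.min}[mode]
        return float(reducer(scores))

    print(reduce_scores([0.2, 0.4, 0.9], 'avg'))  # 0.5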

class data_juicer.ops.filter.VideoMotionScoreFilter(min_score: float = 0.25, max_score: float = 1.7976931348623157e+308, sampling_fps: Annotated[float, Gt(gt=0)] = 2, size: Annotated[int, Gt(gt=0)] | Tuple[Annotated[int, Gt(gt=0)]] | Tuple[Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)]] | None = None, max_size: Annotated[int, Gt(gt=0)] | None = None, divisible: Annotated[int, Gt(gt=0)] = 1, relative: bool = False, any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples with video motion scores within a specific range.

The operator uses Farneback's algorithm from OpenCV to compute dense optical flow. It calculates the average motion score for each video and retains samples based on the specified minimum and maximum score thresholds. The 'any' or 'all' strategy determines whether to keep a sample if any or all videos meet the criteria. The motion score is computed as the mean magnitude of the optical flow, which can be normalized relative to the frame's diagonal length. The stats are cached under the key 'video_motion_score'.

__init__(min_score: float = 0.25, max_score: float = 1.7976931348623157e+308, sampling_fps: Annotated[float, Gt(gt=0)] = 2, size: Annotated[int, Gt(gt=0)] | Tuple[Annotated[int, Gt(gt=0)]] | Tuple[Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)]] | None = None, max_size: Annotated[int, Gt(gt=0)] | None = None, divisible: Annotated[int, Gt(gt=0)] = 1, relative: bool = False, any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_score -- The minimum motion score to keep samples.

  • max_score -- The maximum motion score to keep samples.

  • sampling_fps -- The sampling rate in frames_per_second for optical flow calculations.

  • size -- Resize frames before computing optical flow. If size is a sequence like (h, w), the frame size will be matched to this. If size is an int, the smaller edge of frames will be matched to this number, i.e., if height > width, the frame will be rescaled to (size * height / width, size). Default: None, which keeps the original size.

  • max_size -- The maximum allowed size for the longer edge of resized frames. If the longer edge of a frame is greater than max_size after being resized according to size, size will be overruled so that the longer edge equals max_size. As a result, the smaller edge may be shorter than size. This is only supported when size is an int.

  • divisible -- The number that the dimensions must be divisible by.

  • relative -- If True, the optical flow magnitude is normalized to a [0, 1] range, relative to the frame's diagonal length.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_flow(prev_frame, curr_frame)[源代码]
compute_stats_single(sample, rank=None, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

setup_model(rank=None)[源代码]
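
For intuition, the per-frame-pair motion score can be sketched with OpenCV's Farneback implementation (the parameter values below are common defaults, not necessarily the operator's):

    import cv2
    import numpy as np

    def motion_score(prev_gray: np.ndarray, curr_gray: np.ndarray) -> float:
        # Dense optical flow between two grayscale frames, then mean magnitude.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        return float(np.linalg.norm(flow, axis=-1).mean())
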
class data_juicer.ops.filter.VideoMotionScoreRaftFilter(min_score: float = 1.0, max_score: float = 1.7976931348623157e+308, sampling_fps: Annotated[float, Gt(gt=0)] = 2, size: Annotated[int, Gt(gt=0)] | Tuple[Annotated[int, Gt(gt=0)]] | Tuple[Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)]] | None = None, max_size: Annotated[int, Gt(gt=0)] | None = None, divisible: Annotated[int, Gt(gt=0)] = 8, relative: bool = False, any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:VideoMotionScoreFilter

Filter to keep samples with video motion scores within a specified range.

This operator utilizes the RAFT (Recurrent All-Pairs Field Transforms) model from torchvision to predict optical flow between video frames. It keeps samples where the video motion score is within the given min and max score range. The motion score is computed based on the optical flow between frames, which is estimated using the RAFT model. The operator can sample frames at a specified FPS and apply transformations to the frames before computing the flow.

  • The RAFT model is used to estimate the optical flow.

  • Frames are preprocessed using a series of transformations including normalization and color channel flipping.

  • The motion score is calculated from the optical flow data.

  • The operator can be configured to filter based on any or all frames in the video.

  • The device for model inference (CPU or CUDA) is automatically detected and set.

For further details, refer to the official torchvision documentation: https://pytorch.org/vision/main/models/raft.html

The original paper on RAFT is available here: https://arxiv.org/abs/2003.12039

__init__(min_score: float = 1.0, max_score: float = 1.7976931348623157e+308, sampling_fps: Annotated[float, Gt(gt=0)] = 2, size: Annotated[int, Gt(gt=0)] | Tuple[Annotated[int, Gt(gt=0)]] | Tuple[Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)]] | None = None, max_size: Annotated[int, Gt(gt=0)] | None = None, divisible: Annotated[int, Gt(gt=0)] = 8, relative: bool = False, any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_score -- The minimum motion score to keep samples.

  • max_score -- The maximum motion score to keep samples.

  • sampling_fps -- The sampling rate in frames_per_second for optical flow calculations.

  • size -- Resize frames before computing optical flow. If size is a sequence like (h, w), the frame size will be matched to this. If size is an int, the smaller edge of frames will be matched to this number, i.e., if height > width, the frame will be rescaled to (size * height / width, size). Default: None, which keeps the original size.

  • max_size -- The maximum allowed size for the longer edge of resized frames. If the longer edge of a frame is greater than max_size after being resized according to size, size will be overruled so that the longer edge equals max_size. As a result, the smaller edge may be shorter than size. This is only supported when size is an int.

  • divisible -- The number that the dimensions must be divisible by.

  • relative -- If True, the optical flow magnitude is normalized to a [0, 1] range, relative to the frame's diagonal length.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_flow(prev_frame, curr_frame)[源代码]
setup_model(rank=None)[源代码]
class data_juicer.ops.filter.VideoNSFWFilter(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, min_score: float = 0.0, max_score: float = 0.5, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples whose videos have nsfw scores in a specified range.

This operator uses a Hugging Face model to detect NSFW content in video frames. It keeps samples where the NSFW score falls within the specified range. The operator supports two frame sampling methods: "all_keyframes" and "uniform". For "uniform", it extracts a specified number of frames. The NSFW scores are reduced using one of three modes: "avg", "max", or "min". The key metric, 'video_nsfw_score', is computed for each video and stored in the sample's stats. The operator can use either an "any" or "all" strategy to decide if a sample should be kept based on the NSFW scores of its videos.

__init__(hf_nsfw_model: str = 'Falconsai/nsfw_image_detection', trust_remote_code: bool = False, min_score: float = 0.0, max_score: float = 0.5, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • hf_nsfw_model -- nsfw detection model name on huggingface.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • min_score -- the minimum NSFW score threshold for samples, ranging from 0 to 1. Samples with an NSFW score greater than this threshold will be kept.

  • max_score -- the maximum NSFW score threshold for samples, ranging from 0 to 1. Samples with an NSFW score less than this threshold will be kept.

  • frame_sampling_method -- sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num -- the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • reduce_mode -- reduce mode for multiple sampled video frames. 'avg': Take the average of multiple values. 'max': Take the max of multiple values. 'min': Take the min of multiple values.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample, rank=None)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoOcrAreaRatioFilter(min_area_ratio: float = 0, max_area_ratio: float = 1.0, frame_sample_num: Annotated[int, Gt(gt=0)] = 3, languages_to_detect: str | List[str] = ['ch_sim', 'en'], any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:Filter

Keep data samples whose detected text area ratios for specified frames in the video are within a specified range.

This operator filters data based on the ratio of the detected text area to the total frame area. It uses EasyOCR to detect text in the specified languages and calculates the area ratio for each sampled frame. The operator then determines whether to keep a sample based on the 'any' or 'all' strategy, which checks whether any or all of the videos meet the specified area ratio range. The key metric, video_ocr_area_ratio, is computed as the mean of the text area ratios across the sampled frames. The number of sampled frames and the specific frames to be sampled can be configured.

__init__(min_area_ratio: float = 0, max_area_ratio: float = 1.0, frame_sample_num: Annotated[int, Gt(gt=0)] = 3, languages_to_detect: str | List[str] = ['ch_sim', 'en'], any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_area_ratio -- The min ocr area ratio to keep samples. It's 0 by default.

  • max_area_ratio -- The max ocr area ratio to keep samples. It's 1.0 by default.

  • frame_sample_num -- The number of sampled frames to calculate the OCR area ratio. If it's 1, only the middle frame will be selected. If it's 2, only the first and the last frames will be selected. If it's larger than 2, in addition to the first and the last frames, other frames will be sampled evenly within the video duration.

  • languages_to_detect -- texts in which languages should be detected. Default: ['ch_sim', 'en']. Full language list can be found here: https://www.jaided.ai/easyocr/.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

get_reader(rank)[源代码]
process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering
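
A simplified sketch of the per-frame statistic, assuming axis-aligned detection boxes (EasyOCR actually returns quadrilaterals, so this is an approximation):

    def ocr_area_ratio(boxes, frame_width: int, frame_height: int) -> float:
        # boxes: list of (w, h) sizes of detected text regions in one frame
        text_area = sum(w * h for w, h in boxes)
        return text_area / (frame_width * frame_height)

    print(ocr_area_ratio([(100, 30), (200, 40)], 1280, 720))  # about 0.0119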

class data_juicer.ops.filter.VideoResolutionFilter(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:Filter

Keep data samples whose videos' resolutions are within a specified range.

This operator filters data samples based on the resolution of the videos they contain. It keeps samples if the video resolutions fall within the defined width and height ranges. The filtering strategy can be set to 'any' or 'all': - 'any': Keeps the sample if any video meets the resolution criteria. - 'all': Keeps the sample only if all videos meet the resolution criteria.

The operator computes and caches the 'video_width' and 'video_height' for each video in the sample. If no videos are present, it sets these fields to empty arrays. These cached values are used to determine whether to keep or filter out the sample.

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • min_width -- The min horizontal resolution.

  • max_width -- The max horizontal resolution.

  • min_height -- The min vertical resolution.

  • max_height -- The max vertical resolution.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, context=False)[源代码]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

参数:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

返回:

sample with computed stats

process_single(sample)[源代码]

For sample level, sample --> Boolean.

参数:

sample -- sample to decide whether to filter

返回:

true for keeping and false for filtering

class data_juicer.ops.filter.VideoTaggingFromFramesFilter(tags: List[str] = ['people'], contain: str = 'any', frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, tag_field_name: str = 'video_frame_tags', any_or_all: str = 'any', *args, **kwargs)[源代码]

基类:Filter

Filter to keep samples whose videos contain specified tags.

This operator filters video samples based on the presence of given tags in the video frames. Frames are extracted from each video and tagged by a tagging model. The filtering can be configured to require any or all of the specified tags to be present. The operator supports two frame sampling methods: "all_keyframes" and "uniform". When using "uniform", the number of frames to sample can be specified. The extracted tags are stored in the meta field under the key 'video_frame_tags' by default. The decision to keep a sample is based on whether any or all of its videos meet the tag criteria, as specified by the 'any_or_all' parameter.

__init__(tags: List[str] = ['people'], contain: str = 'any', frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, tag_field_name: str = 'video_frame_tags', any_or_all: str = 'any', *args, **kwargs)[源代码]

Initialization method.

参数:
  • tags -- a tag list used to filter the videos. The full tag list can be found at https://github.com/xinyu1205/recognize-anything/blob/main/ram/data/ram_tag_list.txt

  • contain -- require the videos to contain 'any' or 'all' of the tags. When tags is [], 'all' keeps all samples while 'any' keeps none.

  • frame_sampling_method -- sampling method of extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num -- the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • tag_field_name -- the key name to store the tags in the meta field. It's "video_frame_tags" in default.

  • any_or_all -- keep this sample with 'any' or 'all' strategy of all videos. 'any': keep this sample if any videos meet the condition. 'all': keep this sample only if all videos meet the condition.

  • args -- extra args

  • kwargs -- extra args

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample, which are used as metrics to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For the sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

True for keeping and False for filtering

class data_juicer.ops.filter.VideoWatermarkFilter(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose videos have no watermark with high probability.

This operator uses a Hugging Face watermark detection model to predict the probability that video frames contain a watermark. It keeps samples whose predicted watermark probability is below a specified threshold. The key metric, 'video_watermark_prob', is computed by extracting frames from the video using the chosen sampling method and then reducing the per-frame probabilities by averaging, taking the max, or taking the min, depending on the reduce mode. If multiple videos are present, the operator uses either an 'any' or 'all' strategy to decide whether the sample should be kept. The frame sampling method can be 'all_keyframes' or 'uniform', and the reduce mode can be 'avg', 'max', or 'min'.

__init__(hf_watermark_model: str = 'amrul-hzz/watermark_detector', trust_remote_code: bool = False, prob_threshold: float = 0.8, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, reduce_mode: str = 'avg', any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_watermark_model -- watermark detection model name on Hugging Face.

  • trust_remote_code -- whether to trust the remote code of HF models.

  • prob_threshold -- the predicted watermark probability threshold for samples, ranging from 0 to 1. Samples with a watermark probability below this threshold will be kept.

  • frame_sampling_method -- sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num -- the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, the other frames will be extracted uniformly within the video duration.

  • reduce_mode -- reduce mode for multiple sampled video frames. 'avg': take the average of the values; 'max': take the max of the values; 'min': take the min of the values.

  • any_or_all -- strategy for keeping this sample across all its videos. 'any': keep this sample if any of its videos meet the condition. 'all': keep this sample only if all of its videos meet the condition.

  • args -- extra args

  • kwargs -- extra args
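
A minimal usage sketch (the video path is hypothetical; the '__dj__stats__' field is assumed from Data-Juicer's sample conventions):

    from data_juicer.ops.filter import VideoWatermarkFilter

    # Keep samples whose averaged per-frame watermark probability is below 0.8.
    op = VideoWatermarkFilter(
        prob_threshold=0.8,
        frame_sampling_method='uniform',
        frame_num=3,
        reduce_mode='avg',
    )
    sample = {'videos': ['demo.mp4'], '__dj__stats__': {}}  # hypothetical path
    sample = op.compute_stats_single(sample)  # stores 'video_watermark_prob'
    keep = op.process_single(sample)          # True if prob < prob_threshold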

compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample, which are used as metrics to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For the sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

True for keeping and False for filtering

class data_juicer.ops.filter.WordRepetitionFilter(lang: str = 'en', tokenization: bool = False, rep_len: Annotated[int, Gt(gt=0)] = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with a word-level n-gram repetition ratio within a specific range.

This operator calculates the word-level n-gram repetition ratio for each sample and filters out those that do not fall within the specified range. The n-gram length and the min/max ratio thresholds are configurable. If tokenization is enabled, a Hugging Face tokenizer is used to tokenize the text. The key metric, 'word_rep_ratio', is computed as the ratio of repeated n-grams to the total number of n-grams; samples whose ratio falls outside the [min_ratio, max_ratio] range are filtered out.

__init__(lang: str = 'en', tokenization: bool = False, rep_len: Annotated[int, Gt(gt=0)] = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang -- language of the samples.

  • tokenization -- whether to use a tokenizer model to tokenize the documents.

  • rep_len -- repetition length for word-level n-grams.

  • min_ratio -- the min filter ratio in this op; samples will be filtered if their word-level n-gram repetition ratio is below this parameter.

  • max_ratio -- the max filter ratio in this op; samples will be filtered if their word-level n-gram repetition ratio exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args
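
A minimal usage sketch. This is a batched op, so it takes a columnar dict with one list per field; the '__dj__stats__' layout is assumed from Data-Juicer's sample conventions:

    from data_juicer.ops.filter import WordRepetitionFilter

    # Filter out samples in which more than half of the 10-grams are repeats.
    op = WordRepetitionFilter(lang='en', rep_len=10, min_ratio=0.0, max_ratio=0.5)
    samples = {
        'text': ['first document ...', 'second document ...'],
        '__dj__stats__': [{}, {}],
    }
    samples = op.compute_stats_batched(samples)     # fills 'word_rep_ratio'
    keep_flags = list(op.process_batched(samples))  # one boolean per sample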

compute_stats_batched(samples, context=False)[source]

process_batched(samples)[source]

class data_juicer.ops.filter.WordsNumFilter(lang: str = 'en', tokenization: bool = False, min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with a total word count within a specified range.

This operator filters samples based on the number of words they contain, retaining a sample if its word count lies within the given minimum and maximum limits. If tokenization is enabled, it uses a Hugging Face tokenizer to count words. The key metric, 'num_words', is computed and cached in the sample's stats under the 'num_words' field; if the word count is already cached, the cached value is reused to avoid redundant computation.

__init__(lang: str = 'en', tokenization: bool = False, min_num: int = 10, max_num: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang -- language of the samples.

  • tokenization -- whether to use a tokenizer model to tokenize the documents.

  • min_num -- the min filter word number in this op; samples will be filtered if their word count is below this parameter.

  • max_num -- the max filter word number in this op; samples will be filtered if their word count exceeds this parameter.

  • args -- extra args

  • kwargs -- extra args
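
A minimal usage sketch in the same batched style (field layout assumed as above):

    from data_juicer.ops.filter import WordsNumFilter

    # Keep samples containing between 10 and 1000 words.
    op = WordsNumFilter(lang='en', min_num=10, max_num=1000)
    samples = {
        'text': ['too short', 'a document that is long enough to pass ...'],
        '__dj__stats__': [{}, {}],
    }
    samples = op.compute_stats_batched(samples)  # fills 'num_words'
    kept = [t for t, keep in zip(samples['text'],
                                 op.process_batched(samples)) if keep]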

compute_stats_batched(samples, context=False)[source]

process_batched(samples)[source]

class data_juicer.ops.filter.GeneralFieldFilter(filter_condition: str = '', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples based on a general field filter condition.

The filter condition is a string that can include logical operators (and/or) and chain comparisons. For example: "10 < num <= 30 and text != 'nothing here' and __dj__meta__.a == 3". The condition is evaluated for each sample, and only samples that meet the condition are kept. The result of the filter condition is stored in the sample's stats under the key 'general_field_filter_condition'. If the filter condition is empty or already computed, the sample is not re-evaluated.

__init__(filter_condition: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:

filter_condition -- the filter condition as a string. It can include logical operators (and/or) and chain comparisons. For example: "10 < num <= 30 and text != 'nothing here' and __dj__meta__.a == 3".
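
A minimal usage sketch (the 'num' and 'text' fields are illustrative; '__dj__stats__' is assumed from Data-Juicer's sample conventions):

    from data_juicer.ops.filter import GeneralFieldFilter

    op = GeneralFieldFilter(
        filter_condition="10 < num <= 30 and text != 'nothing here'")
    sample = {'num': 20, 'text': 'hello', '__dj__stats__': {}}
    # Caches the boolean result under 'general_field_filter_condition'.
    sample = op.compute_stats_single(sample)
    keep = op.process_single(sample)  # True: 10 < 20 <= 30 and text differs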

compute_stats_single(sample, context=False)[source]

Compute stats for the sample, which are used as metrics to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample: Dict) → bool[source]

For the sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

True for keeping and False for filtering