data_juicer.ops.mapper

class data_juicer.ops.mapper.VideoCaptioningFromAudioMapper(keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to caption a video according to its audio streams, based on the Qwen-Audio model.

__init__(keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the captioned samples will be kept in the final dataset and the original samples will be removed. It's True by default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]
class data_juicer.ops.mapper.VideoTaggingFromAudioMapper(hf_ast: str = 'MIT/ast-finetuned-audioset-10-10-0.4593', trust_remote_code: bool = False, tag_field_name: str = '__dj__video_audio_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video tags from audio streams extracted from videos, using the Audio Spectrogram Transformer.

__init__(hf_ast: str = 'MIT/ast-finetuned-audioset-10-10-0.4593', trust_remote_code: bool = False, tag_field_name: str = '__dj__video_audio_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_ast – path to the HF model to tag from audios.

  • trust_remote_code – whether to trust the remote code of HF models

  • tag_field_name – the field name to store the tags. It's "__dj__video_audio_tags__" by default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageCaptioningFromGPT4VMapper(mode: str = 'description', api_key: str = '', max_token: int = 500, temperature: float[float] = 1.0, system_prompt: str = '', user_prompt: str = '', user_prompt_key: str | None = None, keep_original_sample: bool = True, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose texts are generated based on gpt-4-vision and the image.

__init__(mode: str = 'description', api_key: str = '', max_token: int = 500, temperature: float[float] = 1.0, system_prompt: str = '', user_prompt: str = '', user_prompt_key: str | None = None, keep_original_sample: bool = True, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • mode – mode of text generated from images, can be one of ['reasoning', 'description', 'conversation', 'custom']

  • api_key – the API key to authenticate the request.

  • max_token – the maximum number of tokens to generate. Default is 500.

  • temperature – controls the randomness of the output (range from 0 to 1). Default is 1.0.

  • system_prompt – a string prompt used to set the context of the conversation and provide global guidance or rules for gpt-4-vision so that it can generate responses in the expected way. This parameter is only used when mode is set to 'custom'.

  • user_prompt – a string prompt to guide the generation of gpt-4-vision for each sample. It's "" by default, which means no prompt is provided.

  • user_prompt_key – the key name of the field in samples that stores the prompt for each sample. It's used to set different prompts for different samples. If it's None, the prompt in parameter "user_prompt" is used. It's None by default.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated texts will be kept in the final dataset and the original texts will be removed. It's True by default.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.PunctuationNormalizationMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to normalize Unicode punctuation to its English equivalents in text samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
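
A minimal usage sketch (the batched columnar layout {'text': [...]} and the 'text' key are assumptions about the typical sample format, not taken from this reference):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import PunctuationNormalizationMapper

op = PunctuationNormalizationMapper()
samples = {'text': ['，this is a test。 “quoted” text！']}
result = op.process_batched(samples)
print(result['text'][0])  # unicode punctuation normalized to English punctuation
```
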
class data_juicer.ops.mapper.RemoveBibliographyMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to remove the bibliography at the end of documents in LaTeX samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.SentenceSplitMapper(lang: str = 'en', *args, **kwargs)[source]

Bases: Mapper

Mapper to split text samples into sentences.

__init__(lang: str = 'en', *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – the language of the text, which determines the sentence-splitting model to use.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
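
A minimal usage sketch (sample layout assumed as above; the underlying sentence-splitting model may need to be downloaded on first use):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import SentenceSplitMapper

op = SentenceSplitMapper(lang='en')
samples = {'text': ['Smithfield employs 3,700 people. The plant slaughters 19,500 pigs a day.']}
result = op.process_batched(samples)
print(result['text'][0])  # the text is split into individual sentences
```
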
class data_juicer.ops.mapper.VideoSplitBySceneMapper(detector: str = 'ContentDetector', threshold: float[float] = 27.0, min_scene_len: int[int] = 15, show_progress: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to cut videos into scene clips.

avaliable_detectors = {'AdaptiveDetector': ['window_width', 'min_content_val', 'weights', 'luma_only', 'kernel_size', 'video_manager', 'min_delta_hsv'], 'ContentDetector': ['weights', 'luma_only', 'kernel_size'], 'ThresholdDetector': ['fade_bias', 'add_final_scene', 'method', 'block_size']}
__init__(detector: str = 'ContentDetector', threshold: float[float] = 27.0, min_scene_len: int[int] = 15, show_progress: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • detector – Algorithm from scenedetect.detectors. Should be one of ['ContentDetector', 'ThresholdDetector', 'AdaptiveDetector'].

  • threshold – Threshold passed to the detector.

  • min_scene_len – Minimum length of any scene.

  • show_progress – Whether to show progress from scenedetect.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
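
A construction sketch passing a detector-specific option from avaliable_detectors through the extra kwargs (whether such kwargs are forwarded to the chosen detector this way is an assumption; processing real samples additionally requires video files on disk):

```
# Hedged sketch: forwarding detector-specific kwargs is an assumption.
from data_juicer.ops.mapper import VideoSplitBySceneMapper

op = VideoSplitBySceneMapper(
    detector='AdaptiveDetector',  # one of the keys in avaliable_detectors
    threshold=3.0,
    min_scene_len=15,
    window_width=3,               # extra kwarg listed for AdaptiveDetector
)
# op.process_single(sample) would then cut the sample's videos into scene clips;
# building such a sample requires real video files.
```
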

class data_juicer.ops.mapper.CleanIpMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean IPv4 and IPv6 addresses in text samples.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
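
A minimal usage sketch replacing matched IP addresses with a placeholder (sample layout assumed as before):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import CleanIpMapper

op = CleanIpMapper(repl='[IP]')  # default pattern, custom replacement string
samples = {'text': ['server at 192.168.0.1 responded']}
print(op.process_batched(samples)['text'][0])  # expected: 'server at [IP] responded'
```
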
class data_juicer.ops.mapper.CleanLinksMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean links like http/https/ftp in text samples.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveHeaderMapper(drop_no_head: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove headers at the beginning of documents in LaTeX samples.

__init__(drop_no_head: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • drop_no_head – whether to drop sample texts without headers.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveTableTextMapper(min_col: int[int] = 2, max_col: int[int] = 20, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove table texts from text samples.

A regular expression is used to remove tables whose number of columns falls within the specified range.

__init__(min_col: int[int] = 2, max_col: int[int] = 20, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_col – The min number of columns of table to remove.

  • max_col – The max number of columns of table to remove.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.VideoRemoveWatermarkMapper(roi_strings: List[str] = ['0,0,0.1,0.1'], roi_type: str = 'ratio', roi_key: str | None = None, frame_num: int[int] = 10, min_frame_threshold: int[int] = 7, detection_method: str = 'pixel_value', *args, **kwargs)[source]

Bases: Mapper

Mapper to remove watermarks from videos, given the regions where they are located.

__init__(roi_strings: List[str] = ['0,0,0.1,0.1'], roi_type: str = 'ratio', roi_key: str | None = None, frame_num: int[int] = 10, min_frame_threshold: int[int] = 7, detection_method: str = 'pixel_value', *args, **kwargs)[source]

Initialization method.

Parameters:
  • roi_strings – a given list of regions where the watermarks are located. The format of each region can be "x1, y1, x2, y2", "(x1, y1, x2, y2)", or "[x1, y1, x2, y2]".

  • roi_type – the roi string type. When the type is 'pixel', (x1, y1) and (x2, y2) are the pixel locations of the top-left and bottom-right corners respectively. If the roi_type is 'ratio', the coordinates are normalized by the widths and heights of the video frames.

  • roi_key – the key name of the field in samples that stores the roi_strings for each sample. It's used to set different rois for different samples. If it's None, the rois in parameter "roi_strings" are used. It's None by default.

  • frame_num – the number of frames to be extracted uniformly from the video to detect the pixels of watermark.

  • min_frame_threshold – a coordinate is considered to be part of the watermark only if it is detected as a watermark pixel in at least min_frame_threshold frames.

  • detection_method – the method used to detect watermark pixels. If it is 'pixel_value', we consider the distribution of pixel values in each frame. If it is 'pixel_diversity', we consider the pixel diversity across different frames. In 'pixel_diversity' mode, min_frame_threshold is ignored and frame_num must be greater than 1.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.RemoveRepeatSentencesMapper(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove repeated sentences in text samples.

__init__(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lowercase – Whether to convert sample text to lower case

  • ignore_special_character – Whether to ignore special characters when judging repeated sentences. Special characters are all characters except Chinese characters, letters and numbers.

  • min_repeat_sentence_length – Sentences shorter than this length will not be deduplicated. If ignore_special_character is set to True, then special characters are not included in this length.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.ImageDiffusionMapper(hf_diffusion: str = 'CompVis/stable-diffusion-v1-4', trust_remote_code: bool = False, torch_dtype: str = 'fp32', revision: str = 'main', strength: float[float] = 0.8, guidance_scale: float = 7.5, aug_num: int[int] = 1, keep_original_sample: bool = True, caption_key: str | None = None, hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate images with a diffusion model.

__init__(hf_diffusion: str = 'CompVis/stable-diffusion-v1-4', trust_remote_code: bool = False, torch_dtype: str = 'fp32', revision: str = 'main', strength: float[float] = 0.8, guidance_scale: float = 7.5, aug_num: int[int] = 1, keep_original_sample: bool = True, caption_key: str | None = None, hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_diffusion – diffusion model name on huggingface to generate the image.

  • torch_dtype – the floating point type used to load the diffusion model. Can be one of [‘fp32’, ‘fp16’, ‘bf16’]

  • revision – The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier allowed by Git.

  • strength – the extent to which the reference image is transformed. Must be between 0 and 1. The image is used as a starting point, and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores the image.

  • guidance_scale – A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • aug_num – the number of images to be produced by the diffusion model.

  • keep_candidate_mode – retain strategy for the generated $caption_num$ candidates. 'random_any': retain a random one from the generated captions; 'similar_one_simhash': retain the generated caption that is most similar to the original caption; 'all': retain all generated captions by concatenation.

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. When keep_original_sample is True, the total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one_simhash' modes and $(1+M)Nb$ for the 'all' mode; when keep_original_sample is False, it is $Nb$ and $MNb$ respectively.

Parameters:
  • caption_key – the key name of the field in samples that stores the captions for the images. It can be a string if there is only one image in each sample; otherwise, it should be a list. If it's None, ImageDiffusionMapper will produce a caption for each image.

  • hf_img2seq – model name on huggingface to generate caption if caption_key is None.

process_batched(samples, rank=None, context=False)[source]

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote aug_num as $M$. The total number of samples after generation is $(1+M)Nb$.

Parameters:

samples

Returns:

class data_juicer.ops.mapper.ImageFaceBlurMapper(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float[float] = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur faces detected in images.

__init__(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float[float] = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • cv_classifier – OpenCV classifier path for face detection. By default, we will use ‘haarcascade_frontalface_alt.xml’.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoFFmpegWrappedMapper(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Bases: Mapper

Simple wrapper for FFmpeg video filters.

__init__(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • filter_name – ffmpeg video filter name.

  • filter_kwargs – keyword-arguments passed to ffmpeg filter.

  • global_args – list-arguments passed to ffmpeg command-line.

  • capture_stderr – whether to capture stderr.

  • overwrite_output – whether to overwrite output file.

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
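
A construction sketch wrapping FFmpeg's scale filter (the filter name and options follow ffmpeg conventions; the sample keys in the comment are hypothetical, and processing needs real video files):

```
# Hedged sketch: sample keys below are hypothetical placeholders.
from data_juicer.ops.mapper import VideoFFmpegWrappedMapper

op = VideoFFmpegWrappedMapper(
    filter_name='scale',                          # ffmpeg video filter
    filter_kwargs={'width': 400, 'height': 300},  # filter options
    capture_stderr=True,
    overwrite_output=True,
)
# sample = {'videos': ['/path/to/video.mp4'], ...}
# processed = op.process_single(sample)
```
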

class data_juicer.ops.mapper.ChineseConvertMapper(mode: str = 's2t', *args, **kwargs)[source]

Bases: Mapper

Mapper to convert Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji.

__init__(mode: str = 's2t', *args, **kwargs)[source]

Initialization method.

Parameters:
  • mode

    Choose the mode to convert Chinese:

    s2t: Simplified Chinese to Traditional Chinese,

    t2s: Traditional Chinese to Simplified Chinese,

    s2tw: Simplified Chinese to Traditional Chinese (Taiwan Standard),

    tw2s: Traditional Chinese (Taiwan Standard) to Simplified Chinese,

    s2hk: Simplified Chinese to Traditional Chinese (Hong Kong variant),

    hk2s: Traditional Chinese (Hong Kong variant) to Simplified Chinese,

    s2twp: Simplified Chinese to Traditional Chinese (Taiwan Standard) with Taiwanese idiom,

    tw2sp: Traditional Chinese (Taiwan Standard) to Simplified Chinese with Mainland Chinese idiom,

    t2tw: Traditional Chinese to Traditional Chinese (Taiwan Standard),

    tw2t: Traditional Chinese (Taiwan standard) to Traditional Chinese,

    hk2t: Traditional Chinese (Hong Kong variant) to Traditional Chinese,

    t2hk: Traditional Chinese to Traditional Chinese (Hong Kong variant),

    t2jp: Traditional Chinese Characters (Kyūjitai) to New Japanese Kanji,

    jp2t: New Japanese Kanji (Shinjitai) to Traditional Chinese Characters,

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
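
A minimal usage sketch for Simplified-to-Traditional conversion (sample layout assumed as before; the expected output in the comment is illustrative):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import ChineseConvertMapper

op = ChineseConvertMapper(mode='s2t')  # Simplified Chinese -> Traditional Chinese
samples = {'text': ['这是一个简体中文句子']}
print(op.process_batched(samples)['text'][0])  # expected: '這是一個簡體中文句子'
```
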
class data_juicer.ops.mapper.NlpcdaZhMapper(sequential: bool = False, aug_num: int[int] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to simply augment Chinese text samples based on the nlpcda library.

__init__(sequential: bool = False, aug_num: int[int] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)[source]

Initialization method. All augmentation methods use their default parameters by default. We recommend using only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might change significantly. Notice: some augmentation methods might not work for some special texts, so no augmented texts may be generated.

Parameters:
  • sequential – whether to combine all augmentation methods into a sequence. If it's True, a sample will be augmented by all opened augmentation methods sequentially. If it's False, each opened augmentation method generates its augmented samples independently.

  • aug_num – number of augmented samples to be generated. If sequential is True, there will be total aug_num augmented samples generated. If it’s False, there will be (aug_num * #opened_aug_method) augmented samples generated.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated texts will be kept in the final dataset and the original texts will be removed. It's True by default.

  • replace_similar_word – whether to open the augmentation method of replacing random words with their similar words in the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这边一共有5种不同的数据增强方法”

  • replace_homophone_char – whether to open the augmentation method of replacing random characters with their homophones in the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的濖据增强方法”

  • delete_random_char – whether to open the augmentation method of deleting random characters from the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的数据增强”

  • swap_random_char – whether to open the augmentation method of swapping random contiguous characters in the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的数据强增方法”

  • replace_equivalent_num – whether to open the augmentation method of replacing random numbers with their equivalent representations in the original texts. Notice: Only for numbers for now. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有伍种不同的数据增强方法”

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.OptimizeInstructionMapper(hf_model: str = 'alibaba-pai/Qwen2-7B-Instruct-Refine', trust_remote_code: bool = False, system_prompt: str | None = None, enable_vllm: bool = True, tensor_parallel_size: int | None = None, max_model_len: int | None = None, max_num_seqs: int = 256, sampling_params: Dict = {}, *args, **kwargs)[source]

Bases: Mapper

Mapper to optimize instructions. Recommended model list: ['alibaba-pai/Qwen2-1.5B-Instruct-Refine', 'alibaba-pai/Qwen2-7B-Instruct-Refine'].

__init__(hf_model: str = 'alibaba-pai/Qwen2-7B-Instruct-Refine', trust_remote_code: bool = False, system_prompt: str | None = None, enable_vllm: bool = True, tensor_parallel_size: int | None = None, max_model_len: int | None = None, max_num_seqs: int = 256, sampling_params: Dict = {}, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Hugging Face model id.

  • trust_remote_code – passed to transformers

  • system_prompt – System prompt for optimizing samples.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • tensor_parallel_size – It is only valid when enable_vllm is True. The number of GPUs to use for distributed execution with tensor parallelism.

  • max_model_len – It is only valid when enable_vllm is True. Model context length. If unspecified, will be automatically derived from the model config.

  • max_num_seqs – It is only valid when enable_vllm is True. Maximum number of sequences to be processed in a single iteration.

  • sampling_params – Sampling parameters for text generation. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • args – extra args

  • kwargs – extra args

process_single(sample=None, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageBlurMapper(p: float = 0.2, blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur images.

__init__(p: float = 0.2, blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • p – Probability of the image being blurred.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.CleanCopyrightMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to clean copyright comments at the beginning of the text samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveNonChineseCharacterlMapper(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove non-Chinese characters in text samples.

__init__(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_alphabet – whether to keep alphabet

  • keep_number – whether to keep number

  • keep_punc – whether to keep punctuation

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.VideoSplitByKeyFrameMapper(keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to split video by key frame.

__init__(keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the split samples will be kept in the final dataset and the original sample will be removed. It's True by default.

  • args – extra args

  • kwargs – extra args

get_split_key_frame(video_key, container)[source]
process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveSpecificCharsMapper(chars_to_remove: str | List[str] = '◆●■►▼▲▴∆▻▷❖♡□', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean specific chars in text samples.

__init__(chars_to_remove: str | List[str] = '◆●■►▼▲▴∆▻▷❖♡□', *args, **kwargs)[source]

Initialization method.

Parameters:
  • chars_to_remove – a list or a string including all characters that need to be removed from text.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
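
A minimal usage sketch (sample layout assumed as before):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import RemoveSpecificCharsMapper

op = RemoveSpecificCharsMapper(chars_to_remove='◆►▼')
samples = {'text': ['◆ item one ► item two ▼ item three']}
print(op.process_batched(samples)['text'][0])  # the listed characters are stripped out
```
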
class data_juicer.ops.mapper.VideoResizeAspectRatioMapper(min_ratio: str = '9/21', max_ratio: str = '21/9', strategy: str = 'increase', *args, **kwargs)[source]

Bases: Mapper

Mapper to resize videos by aspect ratio. AspectRatio = W / H.

STRATEGY = ['decrease', 'increase']
__init__(min_ratio: str = '9/21', max_ratio: str = '21/9', strategy: str = 'increase', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_ratio – The minimum aspect ratio to enforce; videos with an aspect ratio below min_ratio will be resized to match this minimum ratio. The ratio should be provided as a string in the format "9:21" or "9/21".

  • max_ratio – The maximum aspect ratio to enforce; videos with an aspect ratio above max_ratio will be resized to match this maximum ratio. The ratio should be provided as a string in the format "21:9" or "21/9".

  • strategy – The resizing strategy to apply when adjusting the video dimensions. It can be either ‘decrease’ to reduce the dimension or ‘increase’ to enlarge it. Accepted values are [‘decrease’, ‘increase’].

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.CleanHtmlMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to clean HTML code in text samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.WhitespaceNormalizationMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to normalize various kinds of whitespace characters to the normal whitespace ' ' (0x20) in text samples.

Different kinds of whitespaces can be found here: https://en.wikipedia.org/wiki/Whitespace_character

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
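
A minimal usage sketch (sample layout assumed as before):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

op = WhitespaceNormalizationMapper()
samples = {'text': ['hello\u00a0world\u2009again']}  # non-breaking and thin spaces
print(op.process_batched(samples)['text'][0])        # expected: 'hello world again'
```
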
class data_juicer.ops.mapper.VideoTaggingFromFramesMapper(frame_sampling_method: str = 'all_keyframes', frame_num: int[int] = 3, tag_field_name: str = '__dj__video_frame_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video tags from frames extracted from videos.

__init__(frame_sampling_method: str = 'all_keyframes', frame_num: int[int] = 3, tag_field_name: str = '__dj__video_frame_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • tag_field_name – the field name to store the tags. It’s “__dj__video_frame_tags__” in default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.RemoveCommentsMapper(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove comments in different kinds of documents.

Only 'tex' documents are supported for now.

__init__(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • doc_type – Type of document to remove comments.

  • inline – Whether to remove inline comments.

  • multiline – Whether to remove multiline comments.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.ExpandMacroMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to expand macro definitions in the document body of LaTeX samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.ExtractQAMapper(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', trust_remote_code: bool = False, pattern: str | None = None, qa_format: str = 'chatml', enable_vllm: bool = True, tensor_parallel_size: int | None = None, max_model_len: int | None = None, max_num_seqs: int = 256, sampling_params: Dict = {}, *args, **kwargs)[source]

Bases: Mapper

Mapper to extract question and answer pairs from text samples. Recommended model list: ['alibaba-pai/pai-llama3-8b-doc2qa', 'alibaba-pai/pai-baichuan2-7b-doc2qa', 'alibaba-pai/pai-qwen1_5-4b-doc2qa', 'alibaba-pai/pai-qwen1_5-7b-doc2qa', 'alibaba-pai/pai-qwen1_5-1b8-doc2qa', 'alibaba-pai/pai-qwen1_5-0b5-doc2qa']. These recommended models are all trained on Chinese data and are suitable for Chinese text.

__init__(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', trust_remote_code: bool = False, pattern: str | None = None, qa_format: str = 'chatml', enable_vllm: bool = True, tensor_parallel_size: int | None = None, max_model_len: int | None = None, max_num_seqs: int = 256, sampling_params: Dict = {}, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Hugging Face model id.

  • trust_remote_code – passed to transformers

  • pattern – regular expression pattern to search for within text.

  • qa_format – Output format of question and answer pair.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • tensor_parallel_size – It is only valid when enable_vllm is True. The number of GPUs to use for distributed execution with tensor parallelism.

  • max_model_len – It is only valid when enable_vllm is True. Model context length. If unspecified, will be automatically derived from the model config.

  • max_num_seqs – It is only valid when enable_vllm is True. Maximum number of sequences to be processed in a single iteration.

  • sampling_params – Sampling parameters for text generation. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • args – extra args

  • kwargs – extra args

The default data format parsed by this interface is as follows:

Model Input:

    蒙古国的首都是乌兰巴托(Ulaanbaatar)
    冰岛的首都是雷克雅未克(Reykjavik)

Model Output:

    蒙古国的首都是乌兰巴托(Ulaanbaatar)
    冰岛的首都是雷克雅未克(Reykjavik)
    Human: 请问蒙古国的首都是哪里?
    Assistant: 你好，根据提供的信息，蒙古国的首都是乌兰巴托(Ulaanbaatar)。
    Human: 冰岛的首都是哪里呢?
    Assistant: 冰岛的首都是雷克雅未克(Reykjavik)。
    …

process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageCaptioningMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: int[int] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on another model and the image.

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: int[int] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_img2seq – model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each image

  • keep_candidate_mode – retain strategy for the generated $caption_num$ candidates. 'random_any': retain a random one from the generated captions; 'similar_one_simhash': retain the generated caption that is most similar to the original caption; 'all': retain all generated captions by concatenation.

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. When keep_original_sample is True, the total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one_simhash' modes and $(1+M)Nb$ for the 'all' mode; when keep_original_sample is False, it is $Nb$ and $MNb$ respectively.
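
As a worked instance of the counting above: with $N=1$ list of $b=4$ samples and caption_num $M=3$, the 'all' mode yields $(1+3) \cdot 1 \cdot 4 = 16$ samples when keep_original_sample is True and $3 \cdot 1 \cdot 4 = 12$ samples when it is False; the other two modes yield $8$ and $4$ samples respectively.
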

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated captions will be kept in the final dataset and the original captions will be removed. It's True by default.

  • prompt – a string prompt to guide the generation of the blip2 model for all samples globally. It's None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores the prompt for each sample. It's used to set different prompts for different samples. If it's None, the prompt in parameter "prompt" is used. It's None by default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one' modes, and $(1+M)Nb$ for the 'all' mode.

Parameters:

samples

Returns:

class data_juicer.ops.mapper.RemoveWordsWithIncorrectSubstringsMapper(lang: str = 'en', tokenization: bool = False, substrings: List[str] | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove words with incorrect substrings.

__init__(lang: str = 'en', tokenization: bool = False, substrings: List[str] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – the language of the samples.

  • tokenization – whether to use a model to tokenize the documents.

  • substrings – The incorrect substrings in words.

  • args – extra args

  • kwargs – extra args

should_keep_word_with_incorrect_substrings(word, substrings)[source]
process_batched(samples)[source]
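
A minimal usage sketch with custom substrings (sample layout assumed as before):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import RemoveWordsWithIncorrectSubstringsMapper

op = RemoveWordsWithIncorrectSubstringsMapper(
    lang='en',
    tokenization=False,
    substrings=['http', '.com'],
)
samples = {'text': ['visit https://example.com for more info']}
print(op.process_batched(samples)['text'][0])  # words containing the substrings are dropped
```
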
class data_juicer.ops.mapper.VideoCaptioningFromVideoMapper(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: int[int] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: int[int] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on a video-to-text model and sampled video frames.

__init__(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: int[int] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: int[int] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_video_blip – video-blip model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each video

  • keep_candidate_mode – retain strategy for the generated $caption_num$ candidates. 'random_any': retain a random one from the generated captions; 'similar_one_simhash': retain the generated caption that is most similar to the original caption; 'all': retain all generated captions by concatenation.

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. When keep_original_sample is True, the total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one_simhash' modes and $(1+M)Nb$ for the 'all' mode; when keep_original_sample is False, it is $Nb$ and $MNb$ respectively.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated captions will be kept in the final dataset and the original captions will be removed. It's True by default.

  • prompt – a string prompt to guide the generation of the video-blip model for all samples globally. It's None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores the prompt for each sample. It's used to set different prompts for different samples. If it's None, the prompt in parameter "prompt" is used. It's None by default.

  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip – flip frame video horizontally (left to right).

  • vertical_flip – flip frame video vertically (top to bottom).

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None, context=False)[source]
Parameters:

samples

Returns:

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one' modes, and $(1+M)Nb$ for the 'all' mode.

class data_juicer.ops.mapper.VideoCaptioningFromSummarizerMapper(hf_summarizer: str | None = None, trust_remote_code: bool = False, consider_video_caption_from_video: bool = True, consider_video_caption_from_audio: bool = True, consider_video_caption_from_frames: bool = True, consider_video_tags_from_audio: bool = True, consider_video_tags_from_frames: bool = True, vid_cap_from_vid_args: Dict | None = None, vid_cap_from_frm_args: Dict | None = None, vid_tag_from_aud_args: Dict | None = None, vid_tag_from_frm_args: Dict | None = None, keep_tag_num: int[int] = 5, keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, …)

__init__(hf_summarizer: str | None = None, trust_remote_code: bool = False, consider_video_caption_from_video: bool = True, consider_video_caption_from_audio: bool = True, consider_video_caption_from_frames: bool = True, consider_video_tags_from_audio: bool = True, consider_video_tags_from_frames: bool = True, vid_cap_from_vid_args: Dict | None = None, vid_cap_from_frm_args: Dict | None = None, vid_tag_from_aud_args: Dict | None = None, vid_tag_from_frm_args: Dict | None = None, keep_tag_num: int[int] = 5, keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_summarizer – the summarizer model used to summarize texts generated by other methods.

  • consider_video_caption_from_video – whether to consider the video caption generated from video directly in the summarization process. Default: True.

  • consider_video_caption_from_audio – whether to consider the video caption generated from audio streams in the video in the summarization process. Default: True.

  • consider_video_caption_from_frames – whether to consider the video caption generated from sampled frames from the video in the summarization process. Default: True.

  • consider_video_tags_from_audio – whether to consider the video tags generated from audio streams in the video in the summarization process. Default: True.

  • consider_video_tags_from_frames – whether to consider the video tags generated from sampled frames from the video in the summarization process. Default: True.

  • vid_cap_from_vid_args – the arg dict for video captioning from video directly, whose keys are the arg names and values are the arg values. Default: None.

  • vid_cap_from_frm_args – the arg dict for video captioning from sampled frames from the video, whose keys are the arg names and values are the arg values. Default: None.

  • vid_tag_from_aud_args – the arg dict for video tagging from audio streams in the video, whose keys are the arg names and values are the arg values. Default: None.

  • vid_tag_from_frm_args – the arg dict for video tagging from sampled frames from the video, whose keys are the arg names and values are the arg values. Default: None.

  • keep_tag_num – max number N of tags from sampled frames to keep. Too many tags might negatively influence the summarized text, so we only keep the N most frequent tags. Default: 5.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the summarized captions will be kept in the final dataset and the original captions will be removed. It's True by default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]
class data_juicer.ops.mapper.GenerateInstructionMapper(hf_model: str = 'Qwen/Qwen-7B-Chat', seed_file: str = '', instruct_num: int[int] = 3, trust_remote_code: bool = False, similarity_threshold: float = 0.7, prompt_template: str | None = None, qa_pair_template: str | None = None, example_template: str | None = None, qa_extraction_pattern: str | None = None, enable_vllm: bool = True, tensor_parallel_size: int | None = None, max_model_len: int | None = None, max_num_seqs: int = 256, sampling_params: Dict = {}, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate new instruction text data. You should configure an empty dataset in your yaml config file:

```
generated_dataset_config:
  type: 'EmptyFormatter'  # use RayEmptyFormatter when enable ray
  length: ${The number of generated samples}
  feature_keys: ${text key}
```

The number of samples generated is determined by the length of the empty dataset.

__init__(hf_model: str = 'Qwen/Qwen-7B-Chat', seed_file: str = '', instruct_num: int[int] = 3, trust_remote_code: bool = False, similarity_threshold: float = 0.7, prompt_template: str | None = None, qa_pair_template: str | None = None, example_template: str | None = None, qa_extraction_pattern: str | None = None, enable_vllm: bool = True, tensor_parallel_size: int | None = None, max_model_len: int | None = None, max_num_seqs: int = 256, sampling_params: Dict = {}, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Hugging Face model id.

  • seed_file – Seed file path, in chatml format.

  • instruct_num – The number of instruction samples. Randomly select N samples from "seed_file" and put them into the prompt as instruction samples.

  • trust_remote_code – passed to transformers

  • similarity_threshold – The similarity score threshold between the generated samples and the seed samples. Range from 0 to 1. Samples with a similarity score less than this threshold will be kept.

  • prompt_template – Prompt template for generating samples. Please make sure the template contains "{augmented_data}", which corresponds to the augmented samples.

  • qa_pair_template – Prompt template for generating the question and answer pair description. Please make sure the template contains two "{}" to format the question and the answer. Default: '【问题】 {} 【回答】 {} '.

  • example_template – Prompt template for generating examples. Please make sure the template contains "{qa_pairs}", which corresponds to the question and answer pair description generated by qa_pair_template. Default: ' 如下是一条示例数据： {qa_pairs}'.

  • qa_extraction_pattern – Regular expression pattern for parsing the question and answer from the model response.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • tensor_parallel_size – It is only valid when enable_vllm is True. The number of GPUs to use for distributed execution with tensor parallelism.

  • max_model_len – It is only valid when enable_vllm is True. Model context length. If unspecified, will be automatically derived from the model config.

  • max_num_seqs – It is only valid when enable_vllm is True. Maximum number of sequences to be processed in a single iteration.

  • sampling_params – Sampling parameters for text generation. e.g. {'temperature': 0.9, 'top_p': 0.95}

  • args – extra args

  • kwargs – extra args

load_seed_qa_samples(seed_file)[source]

Load QA pairs from chatml format file.

build_prompt(qa_samples, prompt_template)[source]
parse_chatml_str(input_str)[source]
parse_response(response_str)[source]
max_rouge_l_score(reference, candidates)[source]
process_single(sample=None, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.FixUnicodeMapper(normalization: str | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to fix unicode errors in text samples.

__init__(normalization: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • normalization – the specified form of Unicode normalization mode, which can be one of ['NFC', 'NFKC', 'NFD', 'NFKD']; default is 'NFC'.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
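
A minimal usage sketch, assuming the op performs ftfy-style mojibake repair plus the selected Unicode normalization (sample layout assumed as before):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import FixUnicodeMapper

op = FixUnicodeMapper(normalization='NFC')
samples = {'text': ['The Mona Lisa doesnÃ¢â‚¬â„¢t have eyebrows.']}  # classic mojibake
print(op.process_batched(samples)['text'][0])  # expected: "The Mona Lisa doesn't have eyebrows."
```
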
class data_juicer.ops.mapper.NlpaugEnMapper(sequential: bool = False, aug_num: int[int] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to simply augment English text samples based on the nlpaug library.

__init__(sequential: bool = False, aug_num: int[int] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)[source]

Initialization method. All augmentation methods use their default parameters by default. We recommend using only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might change significantly.

Parameters:
  • sequential – whether to combine all augmentation methods into a sequence. If it's True, a sample will be augmented by all opened augmentation methods sequentially. If it's False, each opened augmentation method generates its augmented samples independently.

  • aug_num – number of augmented samples to be generated. If sequential is True, there will be total aug_num augmented samples generated. If it’s False, there will be (aug_num * #opened_aug_method) augmented samples generated.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated texts will be kept in the final dataset and the original texts will be removed. It's True by default.

  • delete_random_word – whether to open the augmentation method of deleting random words from the original texts. e.g. “I love LLM” –> “I LLM”

  • swap_random_word – whether to open the augmentation method of swapping random contiguous words in the original texts. e.g. “I love LLM” –> “Love I LLM”

  • spelling_error_word – whether to open the augmentation method of simulating the spelling error for words in the original texts. e.g. “I love LLM” –> “Ai love LLM”

  • split_random_word – whether to open the augmentation method of splitting words randomly with whitespaces in the original texts. e.g. “I love LLM” –> “I love LL M”

  • keyboard_error_char – whether to open the augmentation method of simulating the keyboard error for characters in the original texts. e.g. “I love LLM” –> “I ;ov4 LLM”

  • ocr_error_char – whether to open the augmentation method of simulating the OCR error for characters in the original texts. e.g. “I love LLM” –> “I 10ve LLM”

  • delete_random_char – whether to open the augmentation method of deleting random characters from the original texts. e.g. “I love LLM” –> “I oe LLM”

  • swap_random_char – whether to open the augmentation method of swapping random contiguous characters in the original texts. e.g. “I love LLM” –> “I ovle LLM”

  • insert_random_char – whether to open the augmentation method of inserting random characters into the original texts. e.g. “I love LLM” –> “I ^lKove LLM”

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
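
A construction sketch enabling two character-level augmentation methods (requires the nlpaug library; sample layout assumed as before):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import NlpaugEnMapper

op = NlpaugEnMapper(
    sequential=False,        # each enabled method augments independently
    aug_num=1,
    keep_original_sample=True,
    delete_random_char=True,
    swap_random_char=True,
)
samples = {'text': ['I love LLM']}
result = op.process_batched(samples)
# with 2 enabled methods and aug_num=1: 1 original + 2 augmented texts
print(result['text'])
```
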
class data_juicer.ops.mapper.VideoCaptioningFromFramesMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: int[int] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: int[int] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated to a single string.

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: int[int] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: int[int] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_img2seq – model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each video

  • keep_candidate_mode – retain strategy for the generated $caption_num$ candidates. 'random_any': retain a random one from the generated captions; 'similar_one_simhash': retain the generated caption that is most similar to the original caption; 'all': retain all generated captions by concatenation.

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. When keep_original_sample is True, the total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one_simhash' modes and $(1+M)Nb$ for the 'all' mode; when keep_original_sample is False, it is $Nb$ and $MNb$ respectively.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated captions will be kept in the final dataset and the original captions will be removed. It's True by default.

  • prompt – a string prompt to guide the generation of the image-to-text model for all samples globally. It's None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores the prompt for each sample. It's used to set different prompts for different samples. If it's None, the prompt in parameter "prompt" is used. It's None by default.

  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip – flip frame video horizontally (left to right).

  • vertical_flip – flip frame video vertically (top to bottom).

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None, context=False)[source]
Parameters:

samples

Returns:

Note

This is a batched_OP, whose input and output types are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one' modes, and $(1+M)Nb$ for the 'all' mode.

class data_juicer.ops.mapper.RemoveLongWordsMapper(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove words whose lengths are outside a specific range.

__init__(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_len – The minimum word length in this op; words shorter than min_len will be removed.

  • max_len – The maximum word length in this op; words longer than max_len will be removed.

  • args – extra args

  • kwargs – extra args

should_keep_long_word(word)[source]
process_batched(samples)[source]
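
A minimal usage sketch (sample layout assumed as before):

```
# Hedged sketch: the batched {'text': [...]} layout is an assumption.
from data_juicer.ops.mapper import RemoveLongWordsMapper

op = RemoveLongWordsMapper(min_len=1, max_len=10)
samples = {'text': ['a pneumonoultramicroscopic example sentence']}
print(op.process_batched(samples)['text'][0])  # the 24-letter word is removed
```
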
class data_juicer.ops.mapper.VideoResizeResolutionMapper(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, force_original_aspect_ratio: str = 'disable', force_divisible_by: int[int] = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to resize video resolution. We leave super-resolution with deep learning for future work.

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, force_original_aspect_ratio: str = 'disable', force_divisible_by: int[int] = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_width – Videos with width less than ‘min_width’ will be mapped to videos with equal or bigger width.

  • max_width – Videos with width greater than 'max_width' will be mapped to videos with equal or smaller width.

  • min_height – Videos with height less than ‘min_height’ will be mapped to videos with equal or bigger height.

  • max_height – Videos with height more than ‘max_height’ will be mapped to videos with equal or smaller height.

  • force_original_aspect_ratio – Enable decreasing or increasing output video width or height if necessary to keep the original aspect ratio, including [‘disable’, ‘decrease’, ‘increase’].

  • force_divisible_by – Ensures that both output dimensions, width and height, are divisible by the given integer when used together with force_original_aspect_ratio. Must be a positive even number.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.CleanEmailMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean email addresses in text samples.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.ReplaceContentMapper(pattern: str | List[str] | None = None, repl: str | List[str] = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to replace all content in the text that matches a specific regular expression pattern with a designated replacement string.

__init__(pattern: str | List[str] | None = None, repl: str | List[str] = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern(s) to search for within text

  • repl – replacement string(s), default is empty string

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
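
A usage sketch with multiple patterns (pairing list patterns with list replacements index-by-index is an assumption; sample layout assumed as before):

```
# Hedged sketch: index-wise pairing of pattern/repl lists is an assumption.
from data_juicer.ops.mapper import ReplaceContentMapper

op = ReplaceContentMapper(pattern=[r'\d{4}-\d{2}-\d{2}', r'v\d+\.\d+'],
                          repl=['<DATE>', '<VERSION>'])
samples = {'text': ['released v1.2 on 2024-05-01']}
print(op.process_batched(samples)['text'][0])  # expected: 'released <VERSION> on <DATE>'
```
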
class data_juicer.ops.mapper.AudioFFmpegWrappedMapper(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Bases: Mapper

Simple wrapper for FFmpeg audio filters.

__init__(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • filter_name – ffmpeg audio filter name.

  • filter_kwargs – keyword-arguments passed to ffmpeg filter.

  • global_args – list-arguments passed to ffmpeg command-line.

  • capture_stderr – whether to capture stderr.

  • overwrite_output – whether to overwrite output file.

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoSplitByDurationMapper(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to split video by duration.

__init__(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • split_duration – duration of each video split in seconds.

  • min_last_split_duration – The minimum allowable duration in seconds for the last video split. If the duration of the last split is less than this value, it will be discarded.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the cut samples will be kept in the final dataset and the original sample will be removed. It's True by default.

  • args – extra args

  • kwargs – extra args

split_videos_by_duration(video_key, container)[source]
process_batched(samples)[source]
class data_juicer.ops.mapper.VideoFaceBlurMapper(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur faces detected in videos.

__init__(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • cv_classifier – OpenCV classifier path for face detection. By default, we will use ‘haarcascade_frontalface_alt.xml’.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageTaggingMapper(tag_field_name: str = '__dj__image_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate image tags.

__init__(tag_field_name: str = '__dj__image_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • tag_field_name – the field name to store the tags. It's "__dj__image_tags__" by default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample