data_juicer.utils.mm_utils module

class data_juicer.utils.mm_utils.SpecialTokens[source]

Bases: object

image = '<__dj__image>'
audio = '<__dj__audio>'
video = '<__dj__video>'
eoc = '<|__dj__eoc|>'
data_juicer.utils.mm_utils.AV_STREAM_THREAD_TYPE = 'AUTO'

The av stream thread type. Supported values are “SLICE”, “FRAME”, and “AUTO”.

“SLICE”: decode more than one part of a single frame at once.

“FRAME”: decode more than one frame at once.

“AUTO”: use both “FRAME” and “SLICE”. AUTO is faster when there is no video latency.

data_juicer.utils.mm_utils.get_special_tokens()[source]
data_juicer.utils.mm_utils.remove_special_tokens(text)[source]
data_juicer.utils.mm_utils.remove_non_special_tokens(text)[source]
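
As a rough illustration of what removing special tokens can look like, the sketch below strips the four token strings listed under SpecialTokens from a piece of text. The helper name `strip_special_tokens` is hypothetical and the logic is an assumption based only on the function name; it is not the library's implementation.

```python
# Token strings copied from the SpecialTokens class attributes above.
SPECIAL_TOKENS = (
    "<__dj__image>",
    "<__dj__audio>",
    "<__dj__video>",
    "<|__dj__eoc|>",
)


def strip_special_tokens(text: str) -> str:
    """Hypothetical helper: drop every special token, then trim whitespace."""
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return text.strip()


print(strip_special_tokens("<__dj__image> a cat on a sofa <|__dj__eoc|>"))
# prints: a cat on a sofa
```
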
data_juicer.utils.mm_utils.load_mm_bytes_from_sample(sample, mm_idx, mm_bytes_key=None, sample_idx=None)[source]
data_juicer.utils.mm_utils.load_data_with_context(sample, context, loaded_data_keys, load_func, mm_bytes_key=None, sample_idx=None)[source]

The unified loading function with contexts for multimodal data.

Parameters:
  • sample – can be a single sample or a batch of samples.

  • context – whether the context fields are activated.

  • loaded_data_keys – the data keys (paths) to load.

  • load_func – the function used to load the data.

  • mm_bytes_key – the key to store the data bytes if it exists. It’s None by default.

  • sample_idx – the index of the current sample. Used for batched samples.

data_juicer.utils.mm_utils.load_images(paths)[source]
data_juicer.utils.mm_utils.load_images_byte(paths)[source]
data_juicer.utils.mm_utils.load_image(path_or_bytes)[source]
data_juicer.utils.mm_utils.load_image_byte(path)[source]
data_juicer.utils.mm_utils.image_path_to_base64(image_path)[source]
data_juicer.utils.mm_utils.image_byte_to_base64(image_byte)[source]
data_juicer.utils.mm_utils.pil_to_opencv(pil_image)[source]
data_juicer.utils.mm_utils.detect_faces(image, detector, **extra_kwargs)[source]
data_juicer.utils.mm_utils.get_file_size(path)[source]
data_juicer.utils.mm_utils.iou(box1, box2)[source]
data_juicer.utils.mm_utils.calculate_resized_dimensions(original_size: Tuple[Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)]], target_size: Annotated[int, Gt(gt=0)] | Tuple[Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)]], max_length: int | None = None, divisible: Annotated[int, Gt(gt=0)] = 1) Tuple[int, int][source]

Resize dimensions based on specified constraints.

Parameters:
  • original_size – The original dimensions as (height, width).

  • target_size – Desired target size; can be a single integer (short edge) or a tuple (height, width).

  • max_length – Maximum allowed length for the longer edge.

  • divisible – The number that the dimensions must be divisible by.

Returns:

Resized dimensions as (height, width).
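
The constraints above can be sketched in plain Python as follows. This is a minimal sketch, not the library's implementation: the helper name `sketch_resized_dimensions` is hypothetical, and the rounding direction (down to the nearest multiple of `divisible`) and the choice to apply `max_length` only when `target_size` is an integer are assumptions.

```python
from typing import Optional, Tuple, Union


def sketch_resized_dimensions(
    original_size: Tuple[int, int],
    target_size: Union[int, Tuple[int, int]],
    max_length: Optional[int] = None,
    divisible: int = 1,
) -> Tuple[int, int]:
    """Hypothetical helper: return (height, width) after applying the constraints."""
    height, width = original_size
    if isinstance(target_size, int):
        # Scale so the short edge equals target_size, keeping the aspect ratio.
        short_edge, long_edge = min(height, width), max(height, width)
        new_short = target_size
        new_long = int(target_size * long_edge / short_edge)
        # Assumption: shrink both edges if the long edge exceeds max_length.
        if max_length is not None and new_long > max_length:
            new_short = int(max_length * new_short / new_long)
            new_long = max_length
        if height <= width:
            new_h, new_w = new_short, new_long
        else:
            new_h, new_w = new_long, new_short
    else:
        new_h, new_w = target_size
    # Assumption: round down to a multiple of `divisible`, never below one multiple.
    new_h = max(divisible, new_h // divisible * divisible)
    new_w = max(divisible, new_w // divisible * divisible)
    return new_h, new_w


print(sketch_resized_dimensions((480, 640), 240))                               # (240, 320)
print(sketch_resized_dimensions((480, 640), 240, max_length=300))               # (225, 300)
print(sketch_resized_dimensions((480, 640), 240, max_length=300, divisible=16)) # (224, 288)
```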

data_juicer.utils.mm_utils.load_audios(paths)[source]
data_juicer.utils.mm_utils.load_audio(path, sampling_rate=None)[source]
data_juicer.utils.mm_utils.load_videos(paths)[source]
data_juicer.utils.mm_utils.load_video(path, mode='r')[source]

Load a video using its path.

Parameters:
  • path – the path to this video.

  • mode – the loading mode. It’s “r” by default.

Returns:

a container object from the PyAV library, which contains all streams in this video (video/audio/…) and can be used to decode these streams to frames.

data_juicer.utils.mm_utils.get_video_duration(input_video: str | InputContainer, video_stream_index: int = 0)[source]

Get the video’s duration from the container

Parameters:
  • input_video – the video path or the container object from the PyAV library, which contains all streams in this video (video/audio/…) and can be used to decode these streams to frames.

  • video_stream_index – the video stream index to decode, default set to 0.

Returns:

the duration of the video in seconds.

data_juicer.utils.mm_utils.get_decoded_frames_from_video(input_video: str | InputContainer, video_stream_index: int = 0)[source]

Get the video’s frames from the container

Parameters:
  • input_video – the video path or the container object from the PyAV library, which contains all streams in this video (video/audio/…) and can be used to decode these streams to frames.

  • video_stream_index – the video stream index to decode, default set to 0.

Returns:

an iterator of all the frames of the video

data_juicer.utils.mm_utils.cut_video_by_seconds(input_video: str | InputContainer, output_video: str, start_seconds: float, end_seconds: float | None = None)[source]

Cut a segment from a video by start and end times in seconds.

Parameters:
  • input_video – the path to input video or the video container.

  • output_video – the path to output video.

  • start_seconds – the start time in seconds.

  • end_seconds – the end time in seconds. If it’s None, the video is cut from start_seconds to the end of the video.

Returns:

a boolean flag indicating whether the video was successfully cut or not.

data_juicer.utils.mm_utils.process_each_frame(input_video: str | InputContainer, output_video: str, frame_func)[source]

Process each frame in a video by replacing it with frame_func(frame).

Parameters:
  • input_video – the path to input video or the video container.

  • output_video – the path to output video.

  • frame_func – a function that takes a frame and returns another frame.

data_juicer.utils.mm_utils.extract_key_frames_by_seconds(input_video: str | InputContainer, duration: float = 1)[source]

Extract key frames by seconds.

Parameters:
  • input_video – input video path or av.container.InputContainer.

  • duration – duration of each video split in seconds.

data_juicer.utils.mm_utils.extract_key_frames(input_video: str | InputContainer)[source]

Extract key frames from the input video. If there are no key frames in the video, return the first frame.

Parameters:

input_video – input video path or container.

Returns:

a list of key frames.

data_juicer.utils.mm_utils.get_key_frame_seconds(input_video: str | InputContainer)[source]

Get seconds of key frames in the input video.

data_juicer.utils.mm_utils.extract_video_frames_uniformly_by_seconds(input_video: str | InputContainer, frame_num: Annotated[int, Gt(gt=0)], duration: float = 1)[source]

Extract video frames uniformly by seconds.

Parameters:
  • input_video – input video path or av.container.InputContainer.

  • frame_num – the number of frames to be extracted uniformly from each video split by duration.

  • duration – duration of each video split in seconds.

data_juicer.utils.mm_utils.extract_video_frames_uniformly(input_video: str | InputContainer, frame_num: Annotated[int, Gt(gt=0)])[source]

Extract a number of video frames uniformly within the video duration.

Parameters:
  • input_video – input video path or container.

  • frame_num – The number of frames to be extracted. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

Returns:

a list of extracted frames.
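
The frame_num rule described above can be illustrated by computing the timestamps at which frames would be sampled. This is only a sketch under stated assumptions: the helper name `uniform_timestamps` is hypothetical, and the library may work with frame indices or stream timestamps rather than seconds.

```python
from typing import List


def uniform_timestamps(duration: float, frame_num: int) -> List[float]:
    """Hypothetical helper: timestamps (in seconds) for uniform frame sampling."""
    if frame_num <= 0:
        raise ValueError("frame_num must be a positive integer")
    if frame_num == 1:
        # Only the middle frame.
        return [duration / 2]
    # First and last frames, plus evenly spaced frames in between.
    step = duration / (frame_num - 1)
    return [i * step for i in range(frame_num)]


print(uniform_timestamps(10.0, 1))  # [5.0]
print(uniform_timestamps(10.0, 2))  # [0.0, 10.0]
print(uniform_timestamps(10.0, 5))  # [0.0, 2.5, 5.0, 7.5, 10.0]
```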

data_juicer.utils.mm_utils.extract_audio_from_video(input_video: str | InputContainer, output_audio: str | None = None, start_seconds: int = 0, end_seconds: int | None = None, stream_indexes: int | List[int] | None = None)[source]

Extract audio data for the given video.

Parameters:
  • input_video – input video. Can be a video path or an av.container.InputContainer.

  • output_audio – output audio path. If it’s None, the audio data won’t be written to file. If stream_indexes is not None, it will output multiple audio files with original filename and the stream indexes. Default: None.

  • start_seconds – the start seconds to extract audio data. Default: 0, which means extract from the start of the video.

  • end_seconds – the end seconds to stop extracting audio data. If it’s None, the extraction won’t stop until the end of the video. Default: None.

  • stream_indexes – there might be multiple audio streams in the video, so we need to decide which audio streams with stream_indexes will be extracted. It can be a single index or a list of indexes. If it’s None, all audio streams will be extracted. Default: None.

data_juicer.utils.mm_utils.size_to_bytes(size)[source]
data_juicer.utils.mm_utils.insert_texts_after_placeholders(original_string, placeholders, new_texts, delimiter_in_insert_pos=' ')[source]
data_juicer.utils.mm_utils.timecode_string_to_seconds(timecode: str)[source]

Convert a timecode string to seconds as a float.

Parameters:

timecode – the input timecode string. Must be in “HH:MM:SS.fff(fff)” format.
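
For reference, the conversion for the “HH:MM:SS.fff(fff)” format can be sketched with the standard library alone. The helper name `timecode_to_seconds` is hypothetical, and this is an illustration of the arithmetic rather than the library's code.

```python
def timecode_to_seconds(timecode: str) -> float:
    """Hypothetical helper: convert 'HH:MM:SS.fff' or 'HH:MM:SS.ffffff' to seconds."""
    hours, minutes, seconds = timecode.strip().split(":")
    # float() handles both three and six fractional digits.
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


print(timecode_to_seconds("00:01:30.500"))  # 90.5
print(timecode_to_seconds("01:00:00.000"))  # 3600.0
```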

data_juicer.utils.mm_utils.parse_string_to_roi(roi_string, roi_type='pixel')[source]

Convert an ROI string to four numbers x1, y1, x2, y2 that define the region. When roi_type is ‘pixel’, (x1, y1) and (x2, y2) are the pixel locations of the top-left and bottom-right corners respectively. When roi_type is ‘ratio’, the coordinates are normalized by the width and height.

Parameters:
  • roi_string – the ROI string.

  • roi_type – the ROI string type.

Returns:

a tuple of (x1, y1, x2, y2) if roi_string is valid, else None.
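
The difference between the two ROI types can be sketched as follows. The accepted string format (four comma-separated numbers, optionally wrapped in brackets or parentheses) is an assumption, and both helper names, `sketch_parse_roi` and `ratio_roi_to_pixels`, are hypothetical; neither is part of the library.

```python
import re
from typing import Optional, Tuple

# Assumed format: "x1, y1, x2, y2", optionally wrapped in [] or ().
_ROI_PATTERN = re.compile(
    r"^\s*[\[\(]?\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*[\]\)]?\s*$"
)


def sketch_parse_roi(roi_string: str, roi_type: str = "pixel") -> Optional[Tuple]:
    """Hypothetical helper: return (x1, y1, x2, y2), or None if the string is invalid."""
    match = _ROI_PATTERN.match(roi_string or "")
    if match is None:
        return None
    try:
        values = tuple(float(v) for v in match.groups())
    except ValueError:
        return None  # e.g. a malformed number such as "1.2.3"
    if roi_type == "pixel":
        return tuple(int(v) for v in values)
    return values


def ratio_roi_to_pixels(roi: Tuple[float, float, float, float],
                        width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a 'ratio' ROI back to pixel coordinates for a given image size."""
    x1, y1, x2, y2 = roi
    return int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height)


print(sketch_parse_roi("10, 20, 110, 220"))                            # (10, 20, 110, 220)
print(sketch_parse_roi("not a roi"))                                   # None
print(ratio_roi_to_pixels((0.25, 0.5, 0.75, 1.0), width=640, height=480))  # (160, 240, 480, 480)
```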

data_juicer.utils.mm_utils.close_video(container: InputContainer)[source]

Close the video stream and container to avoid memory leaks.

Parameters:

container – the video container.