data_juicer.ops.mapper.video_split_by_scene_mapper module

data_juicer.ops.mapper.video_split_by_scene_mapper.replace_func(match, scene_counts_iter)[source]
class data_juicer.ops.mapper.video_split_by_scene_mapper.VideoSplitBySceneMapper(detector: str = 'ContentDetector', threshold: Annotated[float, Ge(ge=0)] = 27.0, min_scene_len: Annotated[int, Ge(ge=0)] = 15, show_progress: bool = False, save_dir: str = None, *args, **kwargs)[source]

Bases: Mapper

Splits videos into scene clips based on detected scene changes.

This operator uses a specified scene detector to identify scene changes and split videos accordingly. It supports three detectors: ContentDetector, ThresholdDetector, and AdaptiveDetector. For each video in the sample, the operator detects scene boundaries and splits the video into individual clips. A minimum scene length can be enforced, and scenedetect progress can optionally be displayed. The resulting clips are saved in the specified directory, or in the same directory as the input files if no save directory is provided. The text field of the sample is updated to reference the new clips. A video in which no scene changes are detected remains unchanged.

avaliable_detectors = {
    'AdaptiveDetector': ['window_width', 'min_content_val', 'weights', 'luma_only', 'kernel_size', 'video_manager', 'min_delta_hsv'],
    'ContentDetector': ['weights', 'luma_only', 'kernel_size'],
    'ThresholdDetector': ['fade_bias', 'add_final_scene', 'method', 'block_size'],
}
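
The mapping above lists which extra options each detector accepts. A minimal, hedged sketch of consulting it before constructing the operator (illustrative only; it simply reads the class attribute documented here):

    from data_juicer.ops.mapper.video_split_by_scene_mapper import VideoSplitBySceneMapper

    # Look up which extra keyword arguments the chosen detector supports.
    detector = 'AdaptiveDetector'
    extra_keys = VideoSplitBySceneMapper.avaliable_detectors[detector]
    print(extra_keys)
    # ['window_width', 'min_content_val', 'weights', 'luma_only', 'kernel_size',
    #  'video_manager', 'min_delta_hsv']
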
__init__(detector: str = 'ContentDetector', threshold: Annotated[float, Ge(ge=0)] = 27.0, min_scene_len: Annotated[int, Ge(ge=0)] = 15, show_progress: bool = False, save_dir: str = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • detector – Algorithm from scenedetect.detectors. Should be one of ['ContentDetector', 'ThresholdDetector', 'AdaptiveDetector'].

  • threshold – Threshold passed to the detector.

  • min_scene_len – Minimum length of any scene.

  • show_progress – Whether to show progress from scenedetect.

  • save_dir – The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.

  • args – extra positional arguments

  • kwargs – extra keyword arguments (see the construction sketch after this parameter list)
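
A minimal construction sketch. It assumes, based on the avaliable_detectors mapping above, that detector-specific options such as luma_only can be passed through **kwargs; treat that, and the save_dir value, as illustrative assumptions rather than guarantees:

    from data_juicer.ops.mapper.video_split_by_scene_mapper import VideoSplitBySceneMapper

    op = VideoSplitBySceneMapper(
        detector='ContentDetector',
        threshold=27.0,            # documented default
        min_scene_len=15,          # documented default
        show_progress=False,
        save_dir='./scene_clips',  # hypothetical output directory
        luma_only=True,            # option listed for ContentDetector above (assumed to be forwarded)
    )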

process_single(sample, context=False)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

the processed sample, with its videos split into scene clips and its text updated accordingly
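
A hedged end-to-end sketch of processing one sample. The 'videos'/'text' field names and the '<__dj__video>' token follow Data-Juicer's default sample schema and are assumptions here, as is the input path:

    from data_juicer.ops.mapper.video_split_by_scene_mapper import VideoSplitBySceneMapper

    op = VideoSplitBySceneMapper(detector='ContentDetector', threshold=27.0)
    sample = {
        'videos': ['/path/to/input.mp4'],                  # hypothetical input path
        'text': '<__dj__video> a description of the clip',
    }
    result = op.process_single(sample)
    print(result['videos'])  # paths of the generated scene clips
    print(result['text'])    # text updated to reference the new clips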