data_juicer.ops.mapper.video_split_by_duration_mapper module
- class data_juicer.ops.mapper.video_split_by_duration_mapper.VideoSplitByDurationMapper(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, save_dir: str = None, *args, **kwargs)
Bases: Mapper
Splits videos into segments based on a specified duration.
This operator splits each video in the dataset into smaller segments of a fixed duration. The last segment is discarded if its duration is less than the specified minimum last split duration. The original sample can be kept or removed based on the keep_original_sample parameter. The generated video files are saved in the specified directory or, if none is provided, in the same directory as the input files. The key metric for this operation is the duration of each segment, measured in seconds.
Splits videos into segments of a specified duration.
Discards the last segment if it is shorter than the minimum allowed duration.
Keeps or removes the original sample based on the keep_original_sample parameter.
Saves the generated video files in the specified directory or the input file’s directory.
Uses the duration in seconds to determine the segment boundaries.
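The splitting rule described above can be illustrated with a small standalone sketch. It only computes segment boundaries in seconds and does not touch any video file. The function name split_boundaries is hypothetical and is not part of Data-Juicer; the operator's actual implementation may differ, for example in how it aligns cuts to frames.

```python
def split_boundaries(total_duration, split_duration=10.0, min_last_split_duration=0.0):
    """Return (start, end) pairs in seconds for a video of the given length.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    if split_duration <= 0:
        raise ValueError("split_duration must be positive")

    total_duration = float(total_duration)
    boundaries = []
    start = 0.0
    while start < total_duration:
        end = min(start + split_duration, total_duration)
        length = end - start
        # Only the last segment can be shorter than split_duration.
        # It is dropped when it is shorter than the configured minimum.
        if length < split_duration and length < min_last_split_duration:
            break
        boundaries.append((start, end))
        start = end
    return boundaries


# A 25-second video split into 10-second segments, keeping any remainder.
print(split_boundaries(25, split_duration=10, min_last_split_duration=0))
# The same video, dropping a final segment shorter than 6 seconds.
print(split_boundaries(25, split_duration=10, min_last_split_duration=6))
```

With min_last_split_duration left at 0 the trailing 5-second remainder is kept as its own segment; raising it to 6 drops that remainder, so only the two full 10-second segments remain.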
- __init__(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, save_dir: str = None, *args, **kwargs)
Initialization method.
- Parameters:
split_duration – The duration of each video split, in seconds.
min_last_split_duration – The minimum allowable duration in seconds for the last video split. If the duration of the last split is less than this value, it will be discarded.
keep_original_sample – Whether to keep the original sample. If set to False, only the split samples remain in the final dataset and the original sample is removed. Defaults to True.
save_dir – The directory where generated video files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the DJ_PRODUCED_DATA_DIR environment variable.
args – Extra positional arguments.
kwargs – Extra keyword arguments.
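As a rough usage sketch, the parameters above might be passed to the constructor as follows. The parameter names and the import path come from the signature on this page; the values and the output path are illustrative assumptions, and build_op is a hypothetical helper, not part of Data-Juicer.

```python
# Illustrative values only; the parameter names match the documented signature.
op_kwargs = {
    "split_duration": 10,            # each segment lasts 10 seconds
    "min_last_split_duration": 3,    # drop a final segment shorter than 3 seconds
    "keep_original_sample": False,   # keep only the split samples
    "save_dir": "./outputs/video_splits",  # hypothetical output directory
}


def build_op(**kwargs):
    """Hypothetical helper: construct the operator from keyword arguments."""
    # Imported lazily so the configuration above can be inspected
    # even when Data-Juicer is not installed.
    from data_juicer.ops.mapper.video_split_by_duration_mapper import (
        VideoSplitByDurationMapper,
    )
    return VideoSplitByDurationMapper(**kwargs)


# With Data-Juicer installed, the operator could be created like this:
# op = build_op(**op_kwargs)
```

Leaving save_dir out of op_kwargs would fall back to the documented default, where outputs are written next to their input files.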