video_split_by_duration_mapper

Splits videos into segments based on a specified duration.

This operator splits each video in the dataset into smaller segments of a fixed duration. The last segment is discarded if it is shorter than `min_last_split_duration`. The original sample is kept or removed according to the `keep_original_sample` parameter. The generated video files are saved in `save_dir` or, if that is not provided, in the same directory as the input files. The key metric for this operation is the duration of each segment, measured in seconds.

- Splits videos into segments of a specified duration.
- Discards the last segment if it is shorter than the minimum allowed duration.
- Keeps or removes the original sample based on the keep_original_sample parameter.
- Saves the generated video files in the specified directory or the input file's directory.
- Uses the duration in seconds to determine the segment boundaries.
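
The splitting rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the operator's actual implementation: `split_boundaries` is a hypothetical helper that only computes segment boundaries in seconds and never touches a video file.

```python
def split_boundaries(duration, split_duration=10.0, min_last_split_duration=0.0):
    """Return (start, end) pairs in seconds for a video of length `duration`.

    Hypothetical helper for illustration only; it mirrors the documented rule,
    not the operator's internal code.
    """
    if split_duration <= 0:
        raise ValueError("split_duration must be positive")

    boundaries = []
    start = 0.0
    while start < duration:
        end = min(start + split_duration, duration)
        is_last = end >= duration
        # The trailing segment is dropped when it is shorter than the minimum.
        if is_last and (end - start) < min_last_split_duration:
            break
        boundaries.append((start, end))
        start += split_duration
    return boundaries


if __name__ == "__main__":
    print(split_boundaries(25.0))             # three segments, the last one 5 s long
    print(split_boundaries(25.0, 10.0, 6.0))  # the 5 s tail is discarded
```

With the default `min_last_split_duration` of 0, every trailing segment is kept, however short it is.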

Type: mapper

Tags: cpu, multimodal

🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| `split_duration` | `float` | `10` | Duration of each video split, in seconds. |
| `min_last_split_duration` | `float` | `0` | Minimum allowable duration, in seconds, of the last video split. If the last split is shorter than this value, it is discarded. |
| `keep_original_sample` | `bool` | `True` | Whether to keep the original sample. If set to False, the final dataset contains only the cut samples and the original sample is removed. Defaults to True. |
| `save_dir` | `str` | `None` | Directory where generated video files are stored. If not specified, outputs are saved in the same directory as their corresponding input files. This path can alternatively be defined by setting the `DJ_PRODUCED_DATA_DIR` environment variable. |
| `args` | | `''` | Extra positional arguments. |
| `kwargs` | | `''` | Extra keyword arguments. |
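
The `save_dir` fallback described in the table can be pictured as a small lookup. The sketch below is an assumption-laden illustration: `resolve_output_dir` is a hypothetical helper, and the precedence it uses (explicit `save_dir` first, then `DJ_PRODUCED_DATA_DIR`, then the input file's directory) is an assumption, since the table does not say which setting wins when both are present.

```python
import os


def resolve_output_dir(input_path, save_dir=None):
    """Pick the directory for generated files (hypothetical helper).

    Assumed precedence, not confirmed by the operator's documentation:
    1. an explicit save_dir argument,
    2. the DJ_PRODUCED_DATA_DIR environment variable,
    3. the directory containing the input file.
    """
    if save_dir:
        return save_dir
    env_dir = os.environ.get("DJ_PRODUCED_DATA_DIR")
    if env_dir:
        return env_dir
    return os.path.dirname(os.path.abspath(input_path))


if __name__ == "__main__":
    print(resolve_output_dir("videos/video1.mp4"))
    print(resolve_output_dir("videos/video1.mp4", save_dir="outputs"))
```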

📊 Effect demonstration

test

VideoSplitByDurationMapper(split_duration=10, keep_original_sample=False)

📥 input data

Sample 1: text | 1 video

<__dj__video> 白色的小羊站在一旁讲话。旁边还有两只灰色猫咪和一只拉着灰狼的猫咪。

video1.mp4:

Sample 2: text | 1 video

<__dj__video> 身穿白色上衣的男子，拿着一个东西，拍打自己的胃部。<|__dj__eoc|>

video2.mp4:

Sample 3: text | 1 video

<__dj__video> 两个长头发的女子正坐在一张圆桌前讲话互动。 <|__dj__eoc|>

video3.mp4:

📤 output data

Sample 1: text

<__dj__video><__dj__video> 白色的小羊站在一旁讲话。旁边还有两只灰色猫咪和一只拉着灰狼的猫咪。<|__dj__eoc|>

split_frames_num
[2]

Sample 2: text

<__dj__video><__dj__video><__dj__video> 身穿白色上衣的男子，拿着一个东西，拍打自己的胃部。<|__dj__eoc|>

split_frames_num
[3]

Sample 3: text

<__dj__video><__dj__video><__dj__video><__dj__video><__dj__video> 两个长头发的女子正坐在一张圆桌前讲话互动。 <|__dj__eoc|>

split_frames_num
[5]

✨ explanation

This example splits each video into 10-second segments and, because keep_original_sample is False, drops the original samples. Each output sample's text carries one video token per generated segment, and split_frames_num in the meta field records that count: 2, 3, and 5 segments for the three input videos. The output shown here is summarized for readability by listing only the number of segments; the operator's actual output is the set of segmented video files.
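
The way the video tokens multiply in the output text can be reproduced with a short sketch. It is an illustration only: `expand_video_tokens` is a hypothetical helper, not part of the operator, and it simply repeats each `<__dj__video>` token once per generated segment without modelling how the operator handles `<|__dj__eoc|>`.

```python
VIDEO_TOKEN = "<__dj__video>"


def expand_video_tokens(text, segment_counts):
    """Repeat each video token once per segment (hypothetical helper).

    segment_counts[i] is the number of segments produced for the i-th
    video token found in `text`.
    """
    parts = text.split(VIDEO_TOKEN)
    if len(parts) - 1 != len(segment_counts):
        raise ValueError("one segment count is required per video token")

    pieces = [parts[0]]
    for count, tail in zip(segment_counts, parts[1:]):
        pieces.append(VIDEO_TOKEN * count)
        pieces.append(tail)
    return "".join(pieces)


if __name__ == "__main__":
    print(expand_video_tokens("<__dj__video> caption<|__dj__eoc|>", [3]))
```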

test_keep_ori_sample

VideoSplitByDurationMapper()

📥 input data

Sample 1: text | 1 video

<__dj__video> 白色的小羊站在一旁讲话。旁边还有两只灰色猫咪和一只拉着灰狼的猫咪。

video1.mp4:

Sample 2: text | 1 video

<__dj__video> 身穿白色上衣的男子，拿着一个东西，拍打自己的胃部。<|__dj__eoc|>

video2.mp4:

Sample 3: text | 1 video

<__dj__video> 两个长头发的女子正坐在一张圆桌前讲话互动。 <|__dj__eoc|>

video3.mp4:

📤 output data

Sample 1: text | 1 video

<__dj__video> 白色的小羊站在一旁讲话。旁边还有两只灰色猫咪和一只拉着灰狼的猫咪。

video1.mp4:

Sample 2: text

<__dj__video><__dj__video> 白色的小羊站在一旁讲话。旁边还有两只灰色猫咪和一只拉着灰狼的猫咪。<|__dj__eoc|>

split_frames_num
[2]

Sample 3: text | 1 video

<__dj__video> 身穿白色上衣的男子，拿着一个东西，拍打自己的胃部。<|__dj__eoc|>

video2.mp4:

Sample 4: text

<__dj__video><__dj__video><__dj__video> 身穿白色上衣的男子，拿着一个东西，拍打自己的胃部。<|__dj__eoc|>

split_frames_num
[3]

Sample 5: text | 1 video

<__dj__video> 两个长头发的女子正坐在一张圆桌前讲话互动。 <|__dj__eoc|>

video3.mp4:

Sample 6: text

<__dj__video><__dj__video><__dj__video><__dj__video><__dj__video> 两个长头发的女子正坐在一张圆桌前讲话互动。 <|__dj__eoc|>

split_frames_num
[5]

✨ explanation

This example runs the operator with its default parameters, so each video is split into 10-second segments and the original samples are kept. The output therefore contains each original sample (with its original video path) followed by a new sample whose meta field contains split_frames_num, giving six output samples for three inputs. This shows how the operator can retain the original data while also creating new segmented videos. The output shown here is summarized for readability by listing only the number of segments; the operator's actual output is the set of segmented video files.
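
The effect of keep_original_sample on the shape of the output can be sketched as follows. The dictionary layout and the `emit_samples` helper are hypothetical and exist only to show the ordering seen in the example above, with each original sample immediately followed by its split counterpart.

```python
def emit_samples(samples, segment_counts, keep_original_sample=True):
    """Build the output sample list (hypothetical helper and sample layout)."""
    output = []
    for sample, count in zip(samples, segment_counts):
        if keep_original_sample:
            # The untouched original sample comes first.
            output.append({"kind": "original", "videos": sample["videos"]})
        # The split sample records how many segments were generated.
        output.append({"kind": "split", "split_frames_num": [count]})
    return output


if __name__ == "__main__":
    inputs = [{"videos": ["video1.mp4"]},
              {"videos": ["video2.mp4"]},
              {"videos": ["video3.mp4"]}]
    for item in emit_samples(inputs, [2, 3, 5]):
        print(item)
```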
