data_juicer.ops.mapper.video_captioning_from_video_mapper module

class data_juicer.ops.mapper.video_captioning_from_video_mapper.VideoCaptioningFromVideoMapper(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are produced by a video-to-text model from sampled video frames.

__init__(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_video_blip – video-blip model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each video

  • keep_candidate_mode – retain strategy for the generated $caption_num$ candidates. Should be one of:

    'random_any': retain a random one of the generated captions

    'similar_one_simhash': retain the generated caption most similar (by SimHash) to the original caption

    'all': retain all generated captions by concatenation

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ lists of input samples, each with batch size $b$, and denote caption_num as $M$. For 'random_any' and 'similar_one_simhash' modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False. For 'all' mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.
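As a worked check of the formulas above, the expected output size can be computed with a small helper (hypothetical, not part of data_juicer):

```python
def expected_output_count(N, b, M, keep_candidate_mode, keep_original_sample):
    """Expected total sample count after caption generation.

    N: number of input sample lists; b: batch size; M: caption_num.
    Hypothetical helper mirroring the formulas in the note above.
    """
    if keep_candidate_mode in ("random_any", "similar_one_simhash"):
        generated_per_original = 1   # a single retained caption per sample
    elif keep_candidate_mode == "all":
        generated_per_original = M   # all M candidates are retained
    else:
        raise ValueError(f"unknown mode: {keep_candidate_mode}")
    original = N * b if keep_original_sample else 0
    return original + generated_per_original * N * b

# e.g. N=2 lists, batch size b=4, caption_num M=3
print(expected_output_count(2, 4, 3, "random_any", True))  # 2Nb = 16
print(expected_output_count(2, 4, 3, "all", True))         # (1+M)Nb = 32
```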

Parameters:
  • keep_original_sample – whether to keep the original sample. If set to False, the final dataset will contain only the generated captions and the original captions will be removed. Default: True.

  • prompt – a string prompt to guide the generation of the video-blip model for all samples globally. Default: None, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores the prompt for each sample. It is used to set different prompts for different samples. If it is None, the value of the "prompt" parameter is used instead. Default: None.

  • frame_sampling_method – sampling method for extracting frames from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all keyframes (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it is 1, only the middle frame is extracted. If it is 2, only the first and the last frames are extracted. If it is larger than 2, the first and last frames are extracted along with additional frames spaced uniformly within the video duration.

  • horizontal_flip – whether to flip the extracted frames horizontally (left to right). Default: False.

  • vertical_flip – whether to flip the extracted frames vertically (top to bottom). Default: False.

  • args – extra args

  • kwargs – extra args
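The uniform sampling rule described for frame_num can be sketched as follows. This is a simplified illustration of the documented behavior, not data_juicer's actual extraction code; it returns the relative positions (as fractions of the video duration) at which frames would be taken:

```python
def uniform_frame_positions(frame_num: int) -> list[float]:
    # Relative positions (fraction of video duration) for uniform sampling.
    # Mirrors the documented behavior: 1 -> middle frame only,
    # 2 -> first and last frames, >2 -> first, last, and evenly spaced
    # frames in between. Sketch only, not the library's implementation.
    if frame_num <= 0:
        raise ValueError("frame_num must be positive")
    if frame_num == 1:
        return [0.5]                  # middle of the video
    step = 1.0 / (frame_num - 1)      # spacing that includes both endpoints
    return [i * step for i in range(frame_num)]

print(uniform_frame_positions(1))  # [0.5]
print(uniform_frame_positions(3))  # [0.0, 0.5, 1.0]
```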

process_batched(samples, rank=None, context=False)[source]
Parameters:
  • samples – batched samples to process

Returns:
  batched samples with the generated captions

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for 'random_any' and 'similar_one_simhash' modes, and $(1+M)Nb$ for 'all' mode.
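In a Data-Juicer configuration, this op would typically be enabled as an entry in the process list, along these lines. This is a sketch: the parameter names come from the signature above, but the surrounding YAML layout is an assumption about the config format.

```yaml
# Sketch of a process-list entry for this op (surrounding layout assumed).
process:
  - video_captioning_from_video_mapper:
      hf_video_blip: 'kpyu/video-blip-opt-2.7b-ego4d'
      caption_num: 4
      keep_candidate_mode: 'random_any'
      keep_original_sample: true
      frame_sampling_method: 'uniform'
      frame_num: 3
```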