video_captioning_from_frames_mapper

Generates video captions from sampled frames using an image-to-text model. Captions from different frames are concatenated into a single string.

  • Uses a Hugging Face image-to-text model to generate captions for sampled video frames.

  • Supports different frame sampling methods: ‘all_keyframes’ or ‘uniform’.

  • Can apply horizontal and vertical flips to the frames before captioning.

  • Offers multiple strategies for retaining generated captions: ‘random_any’, ‘similar_one_simhash’, or ‘all’.

  • Optionally keeps the original sample in the final dataset.

  • Allows setting a global prompt or per-sample prompts to guide caption generation.

  • Generates a specified number of candidate captions per video, which can be reduced based on the selected retention strategy.

  • The number of output samples depends on the retention strategy and whether original samples are kept.

使用图像到文本模型从采样的帧中生成视频字幕。来自不同帧的字幕被连接成一个字符串。

  • 使用Hugging Face图像到文本模型为采样的视频帧生成字幕。

  • 支持不同的帧采样方法:’all_keyframes’或’uniform’。

  • 可以在字幕前对帧进行水平和垂直翻转。

  • 提供多种保留生成字幕的策略:’random_any’、’similar_one_simhash’或’all’。

  • 可选地在最终数据集中保留原始样本。

  • 允许设置全局提示或每个样本的提示来指导字幕生成。

  • 为每个视频生成指定数量的候选字幕,可以根据选定的保留策略进行减少。

  • 输出样本的数量取决于保留策略以及是否保留原始样本。

Type 算子类型: mapper

Tags 标签: gpu, hf, multimodal

🔧 Parameter Configuration 参数配置

name 参数名

type 类型

default 默认值

desc 说明

hf_img2seq

<class ‘str’>

'Salesforce/blip2-opt-2.7b'

model name on huggingface to generate caption

trust_remote_code

<class ‘bool’>

False

caption_num

typing.Annotated[int, Gt(gt=0)]

1

how many candidate captions to generate

keep_candidate_mode

<class ‘str’>

'random_any'

retain strategy for the generated

keep_original_sample

<class ‘bool’>

True

whether to keep the original sample. If

prompt

typing.Optional[str]

None

a string prompt to guide the generation of image-to-text

prompt_key

typing.Optional[str]

None

the key name of fields in samples to store prompts

frame_sampling_method

<class ‘str’>

'all_keyframes'

sampling method of extracting frame

frame_num

typing.Annotated[int, Gt(gt=0)]

3

the number of frames to be extracted uniformly from

horizontal_flip

<class ‘bool’>

False

flip frame video horizontally (left to right).

vertical_flip

<class ‘bool’>

False

flip frame video vertically (top to bottom).

args

''

extra args

kwargs

''

extra args

📊 Effect demonstration 效果演示

not available 暂无