# video_captioning_from_frames_mapper Generates video captions from sampled frames using an image-to-text model. Captions from different frames are concatenated into a single string. - Uses a Hugging Face image-to-text model to generate captions for sampled video frames. - Supports different frame sampling methods: 'all_keyframes' or 'uniform'. - Can apply horizontal and vertical flips to the frames before captioning. - Offers multiple strategies for retaining generated captions: 'random_any', 'similar_one_simhash', or 'all'. - Optionally keeps the original sample in the final dataset. - Allows setting a global prompt or per-sample prompts to guide caption generation. - Generates a specified number of candidate captions per video, which can be reduced based on the selected retention strategy. - The number of output samples depends on the retention strategy and whether original samples are kept. 使用图像到文本模型从采样的帧中生成视频字幕。来自不同帧的字幕被连接成一个字符串。 - 使用Hugging Face图像到文本模型为采样的视频帧生成字幕。 - 支持不同的帧采样方法:'all_keyframes'或'uniform'。 - 可以在字幕前对帧进行水平和垂直翻转。 - 提供多种保留生成字幕的策略:'random_any'、'similar_one_simhash'或'all'。 - 可选地在最终数据集中保留原始样本。 - 允许设置全局提示或每个样本的提示来指导字幕生成。 - 为每个视频生成指定数量的候选字幕,可以根据选定的保留策略进行减少。 - 输出样本的数量取决于保留策略以及是否保留原始样本。 Type 算子类型: **mapper** Tags 标签: gpu, hf, multimodal ## 🔧 Parameter Configuration 参数配置 | name 参数名 | type 类型 | default 默认值 | desc 说明 | |--------|------|--------|------| | `hf_img2seq` | | `'Salesforce/blip2-opt-2.7b'` | model name on huggingface to generate caption | | `trust_remote_code` | | `False` | | | `caption_num` | typing.Annotated[int, Gt(gt=0)] | `1` | how many candidate captions to generate | | `keep_candidate_mode` | | `'random_any'` | retain strategy for the generated | | `keep_original_sample` | | `True` | whether to keep the original sample. If | | `prompt` | typing.Optional[str] | `None` | a string prompt to guide the generation of image-to-text | | `prompt_key` | typing.Optional[str] | `None` | the key name of fields in samples to store prompts | | `frame_sampling_method` | | `'all_keyframes'` | sampling method of extracting frame | | `frame_num` | typing.Annotated[int, Gt(gt=0)] | `3` | the number of frames to be extracted uniformly from | | `horizontal_flip` | | `False` | flip frame video horizontally (left to right). | | `vertical_flip` | | `False` | flip frame video vertically (top to bottom). | | `args` | | `''` | extra args | | `kwargs` | | `''` | extra args | ## 📊 Effect demonstration 效果演示 not available 暂无 ## 🔗 related links 相关链接 - [source code 源代码](../../../data_juicer/ops/mapper/video_captioning_from_frames_mapper.py) - [unit test 单元测试](../../../tests/ops/mapper/test_video_captioning_from_frames_mapper.py) - [Return operator list 返回算子列表](../../Operators.md)