data_juicer.ops.mapper.image_captioning_mapper module¶
- class data_juicer.ops.mapper.image_captioning_mapper.ImageCaptioningMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]¶
Bases:
Mapper
Mapper to generate new samples whose captions are produced by another model based on the sample's image.
- __init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]¶
Initialization method.
- Parameters:
hf_img2seq – model name on HuggingFace used to generate captions
caption_num – how many candidate captions to generate for each image
keep_candidate_mode –
retain strategy for the generated caption_num candidates.
- 'random_any': retain one caption chosen at random from the generated captions
- 'similar_one_simhash': retain the generated caption that is most similar (by SimHash) to the original caption
- 'all': retain all generated captions by concatenation
Note
This is a batched_OP, whose input and output are both lists. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. For 'random_any' and 'similar_one_simhash' modes, the number of total samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when keep_original_sample is False. For 'all' mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when keep_original_sample is False.
- Parameters:
keep_original_sample – whether to keep the original sample. If set to False, the final dataset will contain only generated captions and the original captions will be removed. Defaults to True.
prompt – a string prompt to guide the generation of the blip2 model for all samples globally. Defaults to None, which means no prompt is provided.
prompt_key – the key name of the field in samples that stores per-sample prompts, used to set different prompts for different samples. If it is None, the parameter "prompt" is used instead. Defaults to None.
args – extra args
kwargs – extra args
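The three keep_candidate_mode strategies described above can be sketched as follows. This is a simplified illustration, not data-juicer's actual implementation: the `select_captions` helper is hypothetical, and the SimHash similarity of 'similar_one_simhash' is approximated here with a plain character-overlap score for brevity.

```python
import random

def select_captions(candidates, mode, original=None):
    """Pick which generated captions to keep, mimicking keep_candidate_mode.

    Simplified sketch: the real op uses SimHash for 'similar_one_simhash';
    here similarity is approximated by shared-character overlap.
    """
    if mode == "random_any":
        # keep one randomly chosen candidate
        return [random.choice(candidates)]
    if mode == "similar_one_simhash":
        # keep the candidate most similar to the original caption
        overlap = lambda c: len(set(c) & set(original or ""))
        return [max(candidates, key=overlap)]
    if mode == "all":
        # keep every generated candidate
        return list(candidates)
    raise ValueError(f"unknown keep_candidate_mode: {mode}")

caps = ["a dog on grass", "a brown dog", "dog playing outdoors"]
print(select_captions(caps, "all"))         # all three candidates retained
print(select_captions(caps, "random_any"))  # exactly one candidate retained
```

With 'random_any' and 'similar_one_simhash', each input sample yields one generated sample; with 'all', it yields caption_num generated samples, which is what drives the sample counts in the notes below.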
- process_batched(samples, rank=None)[source]¶
Note
This is a batched_OP, whose input and output are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The number of total samples after generation is $2Nb$ for 'random_any' and 'similar_one_simhash' modes, and $(1+M)Nb$ for 'all' mode.
- Parameters:
samples – a batch of samples to process
- Returns:
the batch of samples augmented with generated captions
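The sample-count arithmetic stated in the notes above can be captured in a small helper. This `expected_total` function is hypothetical, written only to make the formulas concrete; it is not part of data-juicer.

```python
def expected_total(n, b, m, mode, keep_original_sample=True):
    """Total number of samples after generation, per the notes above.

    n: number of input sample lists, b: batch size,
    m: caption_num, mode: keep_candidate_mode.
    """
    if mode in ("random_any", "similar_one_simhash"):
        # one caption kept per sample -> N*b generated samples
        return 2 * n * b if keep_original_sample else n * b
    if mode == "all":
        # all M captions kept per sample -> M*N*b generated samples
        return (1 + m) * n * b if keep_original_sample else m * n * b
    raise ValueError(f"unknown keep_candidate_mode: {mode}")

print(expected_total(n=3, b=4, m=5, mode="random_any"))  # 2*3*4 = 24
print(expected_total(n=3, b=4, m=5, mode="all",
                     keep_original_sample=False))        # 5*3*4 = 60
```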