data_juicer.ops.mapper.generate_qa_from_text_mapper module¶
- class data_juicer.ops.mapper.generate_qa_from_text_mapper.GenerateQAFromTextMapper(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', max_num: Annotated[int, Gt(gt=0)] | None = None, *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]¶
Bases:
Mapper
Mapper to generate question and answer pairs from text. Recommended models: [
'alibaba-pai/pai-llama3-8b-doc2qa', 'alibaba-pai/pai-baichuan2-7b-doc2qa', 'alibaba-pai/pai-qwen1_5-4b-doc2qa', 'alibaba-pai/pai-qwen1_5-7b-doc2qa', 'alibaba-pai/pai-qwen1_5-1b8-doc2qa', 'alibaba-pai/pai-qwen1_5-0b5-doc2qa'
]. All of these recommended models are trained on Chinese data and are suited to Chinese text.
- __init__(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', max_num: Annotated[int, Gt(gt=0)] | None = None, *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]¶
Initialization method.
- Parameters:
hf_model – Hugging Face model ID.
max_num – The maximum number of QA pairs to return for each text. No limit if it is None.
output_pattern – Regular expression pattern used to extract questions and answers from the model response.
enable_vllm – Whether to use vLLM for inference acceleration.
model_params – Parameters for initializing the model.
sampling_params – Sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.
kwargs – Extra keyword arguments.
The default data format parsed by this interface is as follows:
- Model Input:
The capital of Mongolia is Ulaanbaatar. The capital of Iceland is Reykjavik.
- Model Output:
The capital of Mongolia is Ulaanbaatar. The capital of Iceland is Reykjavik. Human: What is the capital of Mongolia? Assistant: Hello, based on the provided information, the capital of Mongolia is Ulaanbaatar. Human: And what is the capital of Iceland? Assistant: The capital of Iceland is Reykjavik. …