data_juicer.ops.mapper.generate_qa_from_text_mapper module

class data_juicer.ops.mapper.generate_qa_from_text_mapper.GenerateQAFromTextMapper(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', max_num: Annotated[int, Gt(gt=0)] | None = None, *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to generate question and answer pairs from text.

Recommended models:

  • alibaba-pai/pai-llama3-8b-doc2qa

  • alibaba-pai/pai-baichuan2-7b-doc2qa

  • alibaba-pai/pai-qwen1_5-4b-doc2qa

  • alibaba-pai/pai-qwen1_5-7b-doc2qa

  • alibaba-pai/pai-qwen1_5-1b8-doc2qa

  • alibaba-pai/pai-qwen1_5-0b5-doc2qa

These recommended models are all trained on Chinese data and are best suited to Chinese text.

__init__(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', max_num: Annotated[int, Gt(gt=0)] | None = None, *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Huggingface model ID.

  • max_num – The maximum number of QA samples returned for each text. No limit if it is None.

  • output_pattern – Regular expression pattern to extract questions and answers from model response.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation, e.g., {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.
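For context, data-juicer operators are usually enabled from a YAML recipe rather than instantiated directly. The snippet below is a hypothetical config entry using the parameters above; the field layout is assumed from data-juicer's usual operator-config style:

```yaml
process:
  - generate_qa_from_text_mapper:
      hf_model: 'alibaba-pai/pai-qwen1_5-7b-doc2qa'
      max_num: 3            # keep at most 3 QA pairs per text
      enable_vllm: false
      sampling_params:
        temperature: 0.9
        top_p: 0.95
```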

The default data format parsed by this interface is as follows.

Model Input:

蒙古国的首都是乌兰巴托(Ulaanbaatar)
冰岛的首都是雷克雅未克(Reykjavik)

Model Output:

蒙古国的首都是乌兰巴托(Ulaanbaatar)
冰岛的首都是雷克雅未克(Reykjavik)
Human: 请问蒙古国的首都是哪里?
Assistant: 你好,根据提供的信息,蒙古国的首都是乌兰巴托(Ulaanbaatar)。
Human: 冰岛的首都是哪里呢?
Assistant: 冰岛的首都是雷克雅未克(Reykjavik)。
…

(The input states that the capital of Mongolia is Ulaanbaatar and the capital of Iceland is Reykjavik; the model output echoes the input and appends Human/Assistant question-answer turns about it.)
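The Human/Assistant turns in such a response can be recovered with a regular expression supplied via output_pattern. The pattern below is an illustrative assumption for demonstration, not necessarily the op's built-in default:

```python
import re

# Hypothetical pattern in the spirit of output_pattern: capture each
# Human question and the Assistant answer that follows it.
pattern = r'Human:(.*?)Assistant:(.*?)(?=Human|$)'

raw_output = (
    "蒙古国的首都是乌兰巴托(Ulaanbaatar) 冰岛的首都是雷克雅未克(Reykjavik) "
    "Human: 请问蒙古国的首都是哪里? "
    "Assistant: 你好,根据提供的信息,蒙古国的首都是乌兰巴托(Ulaanbaatar)。 "
    "Human: 冰岛的首都是哪里呢? "
    "Assistant: 冰岛的首都是雷克雅未克(Reykjavik)。"
)

# re.DOTALL lets '.' cross line breaks in multi-line responses.
qa_pairs = [
    (q.strip(), a.strip())
    for q, a in re.findall(pattern, raw_output, re.DOTALL)
]
# Two (question, answer) tuples are recovered from this response.
```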

parse_output(raw_output)[source]

Parse the raw model response into question-answer pairs according to output_pattern.

process_batched(samples, rank=None)[source]

Generate QA samples from a batch of text samples.
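To illustrate the overall flow, here is a minimal sketch of batched processing in which each input text may expand into several output samples, one per QA pair. This is not the op's actual implementation; the 'query'/'response' field names and the generate_and_parse callable (standing in for the model call plus parse_output) are assumptions:

```python
def process_batched(samples, generate_and_parse, max_num=None):
    # Each input text may expand into several QA samples.
    # max_num caps the pairs kept per text; None means no limit,
    # mirroring the max_num parameter described above.
    out = {'query': [], 'response': []}
    for text in samples['text']:
        pairs = generate_and_parse(text)  # model call + parsing
        if max_num is not None:
            pairs = pairs[:max_num]
        for question, answer in pairs:
            out['query'].append(question)
            out['response'].append(answer)
    return out

# A stub in place of the real model, for demonstration only.
stub = lambda text: [('Q1?', 'A1.'), ('Q2?', 'A2.')]
result = process_batched({'text': ['doc']}, stub, max_num=1)
```

With max_num=1, only the first QA pair of each text is kept, so the single input document yields one query/response sample.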