data_juicer.ops.mapper.generate_qa_from_text_mapper module

class data_juicer.ops.mapper.generate_qa_from_text_mapper.GenerateQAFromTextMapper(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', max_num: Annotated[int, Gt(gt=0)] | None = None, *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to generate question and answer pairs from text.

Recommended models:

  • alibaba-pai/pai-llama3-8b-doc2qa

  • alibaba-pai/pai-baichuan2-7b-doc2qa

  • alibaba-pai/pai-qwen1_5-4b-doc2qa

  • alibaba-pai/pai-qwen1_5-7b-doc2qa

  • alibaba-pai/pai-qwen1_5-1b8-doc2qa

  • alibaba-pai/pai-qwen1_5-0b5-doc2qa

These recommended models are all trained on Chinese data and are best suited to Chinese text.

__init__(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', max_num: Annotated[int, Gt(gt=0)] | None = None, *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Huggingface model ID.

  • max_num – The maximum number of QA samples returned for each text. No limit if it is None.

  • output_pattern – Regular expression pattern to extract questions and answers from model response.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation, e.g., {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.
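For context, data-juicer operators are usually enabled from a YAML recipe rather than instantiated directly. The snippet below is a hypothetical config entry using the parameters above; the field layout is assumed from data-juicer's usual operator-config style:

```yaml
process:
  - generate_qa_from_text_mapper:
      hf_model: 'alibaba-pai/pai-qwen1_5-7b-doc2qa'
      max_num: 3            # keep at most 3 QA pairs per text
      enable_vllm: false
      sampling_params:
        temperature: 0.9
        top_p: 0.95
```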

The default data format parsed by this interface is as follows.

Model Input:

蒙古国的首都是乌兰巴托(Ulaanbaatar)
冰岛的首都是雷克雅未克(Reykjavik)

Model Output:

蒙古国的首都是乌兰巴托(Ulaanbaatar)
冰岛的首都是雷克雅未克(Reykjavik)
Human: 请问蒙古国的首都是哪里?
Assistant: 你好,根据提供的信息,蒙古国的首都是乌兰巴托(Ulaanbaatar)。
Human: 冰岛的首都是哪里呢?
Assistant: 冰岛的首都是雷克雅未克(Reykjavik)。
…

(The input states that the capital of Mongolia is Ulaanbaatar and the capital of Iceland is Reykjavik; the model output echoes the input and appends Human/Assistant question-answer turns about it.)
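The Human/Assistant turns in such a response can be recovered with a regular expression supplied via output_pattern. The pattern below is an illustrative assumption for demonstration, not necessarily the op's built-in default:

```python
import re

# Hypothetical pattern in the spirit of output_pattern: capture each
# Human question and the Assistant answer that follows it.
pattern = r'Human:(.*?)Assistant:(.*?)(?=Human|$)'

raw_output = (
    "蒙古国的首都是乌兰巴托(Ulaanbaatar) 冰岛的首都是雷克雅未克(Reykjavik) "
    "Human: 请问蒙古国的首都是哪里? "
    "Assistant: 你好,根据提供的信息,蒙古国的首都是乌兰巴托(Ulaanbaatar)。 "
    "Human: 冰岛的首都是哪里呢? "
    "Assistant: 冰岛的首都是雷克雅未克(Reykjavik)。"
)

# re.DOTALL lets '.' cross line breaks in multi-line responses.
qa_pairs = [
    (q.strip(), a.strip())
    for q, a in re.findall(pattern, raw_output, re.DOTALL)
]
# Two (question, answer) tuples are recovered from this response.
```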

parse_output(raw_output)[source]

Parse the raw model response into question-answer pairs according to output_pattern.

process_batched(samples, rank=None)[source]

Generate QA samples from a batch of text samples.
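To illustrate the overall flow, here is a minimal sketch of batched processing in which each input text may expand into several output samples, one per QA pair. This is not the op's actual implementation; the 'query'/'response' field names and the generate_and_parse callable (standing in for the model call plus parse_output) are assumptions:

```python
def process_batched(samples, generate_and_parse, max_num=None):
    # Each input text may expand into several QA samples.
    # max_num caps the pairs kept per text; None means no limit,
    # mirroring the max_num parameter described above.
    out = {'query': [], 'response': []}
    for text in samples['text']:
        pairs = generate_and_parse(text)  # model call + parsing
        if max_num is not None:
            pairs = pairs[:max_num]
        for question, answer in pairs:
            out['query'].append(question)
            out['response'].append(answer)
    return out

# A stub in place of the real model, for demonstration only.
stub = lambda text: [('Q1?', 'A1.'), ('Q2?', 'A2.')]
result = process_batched({'text': ['doc']}, stub, max_num=1)
```

With max_num=1, only the first QA pair of each text is kept, so the single input document yields one query/response sample.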