data_juicer.ops.mapper.generate_qa_from_examples_mapper module

class data_juicer.ops.mapper.generate_qa_from_examples_mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to generate question and answer pairs from examples. You should configure an empty dataset in your yaml config file: ``` generated_dataset_config:

type: ‘EmptyFormatter’ # use RayEmptyFormatter when enable ray length: ${The number of generated samples} feature_keys: ${text key}

``` The number of samples generated is determined by the length of the empty dataset.

DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例数据的输入和输出,按照你的理解,总结出相应规矩,然后写出一个新的【问题】和【回答】。注意,新生成的【问题】和【回答】需要满足如下要求:\n1. 生成的【问题】和【回答】不能与输入的【问题】和【回答】一致,但是需要保持格式相同。\n2. 生成的【问题】不一定要局限于输入【问题】的话题或领域,生成的【回答】需要正确回答生成的【问题】。\n3. 提供的【问题】和【回答】可能是多轮对话,生成的【问题】和【回答】也可以是多轮,但是需要保持格式相同。\n4. 生成的【问题】和【回答】必须成对出现,而且【问题】需要在【回答】之前。\n'
DEFAULT_INPUT_TEMPLATE = '{}'
DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据:\n{}'
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'
DEFAULT_OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'
__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Huggingface model ID.

  • seed_file – Path to the seed file in chatml format.

  • example_num – The number of selected examples. Randomly select N examples from “seed_file” and put them into prompt as QA examples.

  • similarity_threshold – The similarity score threshold between the generated samples and the seed examples. Range from 0 to 1. Samples with similarity score less than this threshold will be kept.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the input prompt. It must include one placeholder ‘{}’, which will be replaced by example_num formatted examples defined by example_template.

  • example_template – Template for formatting one QA example. It must include one placeholder ‘{}’, which will be replaced by one formatted qa_pair.

  • qa_pair_template – Template for formatting a single QA pair within each example. Must include two placeholders ‘{}’ for the question and answer.

  • output_pattern – Regular expression pattern to extract questions and answers from model response.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

build_input(qa_examples)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample