data_juicer.ops.mapper.generate_qa_from_examples_mapper module¶
- class data_juicer.ops.mapper.generate_qa_from_examples_mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]¶
Bases: Mapper
Mapper to generate question and answer pairs from examples. You should configure an empty dataset in your yaml config file:
```
generated_dataset_config:
  type: 'EmptyFormatter'  # use RayEmptyFormatter when Ray is enabled
  length: ${The number of generated samples}
  feature_keys: ${text key}
```
The number of samples generated is determined by the length of the empty dataset.
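For context, a full recipe using this op might look roughly as follows. This is a hypothetical sketch: the concrete values (length, feature key, seed file path) are illustrative placeholders, not defaults.

```
# Hypothetical recipe sketch; values are illustrative.
generated_dataset_config:
  type: 'EmptyFormatter'
  length: 100                # number of samples to generate
  feature_keys: ['text']

process:
  - generate_qa_from_examples_mapper:
      hf_model: 'Qwen/Qwen2.5-7B-Instruct'
      seed_file: 'path/to/seed_chatml.jsonl'  # placeholder path
      example_num: 3
      similarity_threshold: 0.7
```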
- DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例数据的输入和输出,按照你的理解,总结出相应规矩,然后写出一个新的【问题】和【回答】。注意,新生成的【问题】和【回答】需要满足如下要求:\n1. 生成的【问题】和【回答】不能与输入的【问题】和【回答】一致,但是需要保持格式相同。\n2. 生成的【问题】不一定要局限于输入【问题】的话题或领域,生成的【回答】需要正确回答生成的【问题】。\n3. 提供的【问题】和【回答】可能是多轮对话,生成的【问题】和【回答】也可以是多轮,但是需要保持格式相同。\n4. 生成的【问题】和【回答】必须成对出现,而且【问题】需要在【回答】之前。\n'¶
- DEFAULT_INPUT_TEMPLATE = '{}'¶
- DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据:\n{}'¶
- DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'¶
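Taken together, the default templates compose the generation prompt roughly as sketched below. The assembly order (QA pair → example → input) follows the template docstrings; the sample question and answer are illustrative.

```python
# Default templates copied from the class attributes above.
qa_pair_template = '【问题】\n{}\n【回答】\n{}\n'
example_template = '\n如下是一条示例数据:\n{}'
input_template = '{}'

# Format one QA pair, wrap it as one example, then build the input prompt.
qa_pair = qa_pair_template.format('什么是机器学习?', '机器学习是一种人工智能方法。')
example = example_template.format(qa_pair)
prompt = input_template.format(example)
```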
- DEFAULT_OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'¶
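The default output pattern can be exercised directly with Python's re module; re.DOTALL is needed so '.' spans the newlines inside each question and answer. The model response below is illustrative.

```python
import re

pattern = r'【问题】(.*?)【回答】(.*?)(?=【问题】|$)'
# Illustrative model response containing two QA pairs.
response = (
    '【问题】\n什么是机器学习?\n【回答】\n机器学习是一种人工智能方法。\n'
    '【问题】\n什么是深度学习?\n【回答】\n深度学习是机器学习的一个分支。\n'
)
# re.DOTALL lets '.' match newlines; the lookahead stops each answer
# at the next question marker or at the end of the response.
matches = re.findall(pattern, response, re.DOTALL)
qa_pairs = [(q.strip(), a.strip()) for q, a in matches]
```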
- __init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]¶
Initialization method.
- Parameters:
hf_model -- Huggingface model ID.
seed_file -- Path to the seed file in chatml format.
example_num -- The number of examples to select. This many examples are randomly sampled from seed_file and inserted into the prompt as QA examples.
similarity_threshold -- The similarity score threshold between generated samples and seed examples, in the range [0, 1]. Only samples whose similarity score is below this threshold are kept, i.e. samples too similar to the seeds are discarded.
system_prompt -- System prompt for guiding the generation task.
input_template -- Template for building the input prompt. It must include one placeholder '{}', which will be replaced by example_num formatted examples defined by example_template.
example_template -- Template for formatting one QA example. It must include one placeholder '{}', which will be replaced by one formatted qa_pair.
qa_pair_template -- Template for formatting a single QA pair within each example. Must include two placeholders '{}' for the question and answer.
output_pattern -- Regular expression pattern to extract questions and answers from model response.
enable_vllm -- Whether to use vllm for inference acceleration.
model_params -- Parameters for initializing the model.
sampling_params -- Sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.
kwargs -- Extra keyword arguments.
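The similarity_threshold rule above can be summarized by a hypothetical helper (the real op computes similarity against the seed examples internally; this sketch only models the keep/drop decision):

```python
def keep_generated(similarity_score: float, similarity_threshold: float = 0.7) -> bool:
    """Hypothetical helper mirroring the similarity_threshold rule:
    a generated sample is kept only when its similarity to the seed
    examples is strictly below the threshold (novel enough)."""
    return similarity_score < similarity_threshold

# A near-duplicate of a seed example (score 0.92) is dropped,
# while a genuinely new sample (score 0.35) is kept.
```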