data_juicer.ops.mapper.optimize_prompt_mapper module¶
- class data_juicer.ops.mapper.optimize_prompt_mapper.OptimizePromptMapper(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', gen_num: Annotated[int, Gt(gt=0)] = 3, max_example_num: Annotated[int, Gt(gt=0)] = 3, keep_original_sample: bool = True, retry_num: int = 3, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, prompt_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]¶
Bases:
Mapper
Optimize prompts based on existing ones in the same batch.
This operator uses the existing prompts and newly optimized prompts as examples to generate better prompts. It supports using a Hugging Face model or an API for text generation. The operator can be configured to keep the original samples or replace them with the generated ones. The optimization process involves multiple retries if the generated prompt is empty. The operator operates in batch mode and can leverage vLLM for inference acceleration on CUDA devices.
- Uses existing and newly generated prompts to optimize future prompts.
- Supports both Hugging Face models and API-based text generation.
- Can keep or replace original samples with generated ones.
- Retries up to a specified number of times if the generated prompt is empty.
- Operates in batch mode and can use vLLM for acceleration on CUDA.
References: https://doc.agentscope.io/v0/en/build_tutorial/prompt_optimization.html
- DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例提示词,按照你的理解,总结出相应规矩,然后写出一个新的更好的提示词,以让模型更好地完成指定任务。注意,新生成的【提示词】需要满足如下要求:\n1. 生成的【提示词】不能与输入的【提示词】完全一致,但是需要保持格式类似。\n2. 生成的【提示词】相比于输入的【提示词】不能有很大的变化,更多应该是关键词、核心参数等方面的微调。\n3. 生成时只需生成带有【提示词】前缀的提示词,不需生成其他任何额外信息。\n'¶
- DEFAULT_INPUT_TEMPLATE = '{}'¶
- DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据:\n{}'¶
- DEFAULT_PROMPT_TEMPLATE = '【提示词】\n{}\n'¶
- DEFAULT_OUTPUT_PATTERN = '【提示词】(.*?)(?=【|$)'¶
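As a hedged illustration (not taken from the operator's source), the default output pattern can be applied with Python's re module; this sketch assumes matching with re.DOTALL so a prompt body may span multiple lines:

```python
import re

# Default pattern: captures the text after each 【提示词】 ("[Prompt]")
# marker, non-greedily, up to the next 【 marker or the end of the response.
DEFAULT_OUTPUT_PATTERN = r"【提示词】(.*?)(?=【|$)"

# A hypothetical model response containing two generated prompts.
response = (
    "【提示词】\nSummarize the input text.\n"
    "【提示词】\nSummarize the input text in one sentence.\n"
)

# Assumption: re.DOTALL lets '.' cross the newlines inside a prompt body.
matches = re.findall(DEFAULT_OUTPUT_PATTERN, response, flags=re.DOTALL)
prompts = [m.strip() for m in matches if m.strip()]
print(prompts)
# → ['Summarize the input text.', 'Summarize the input text in one sentence.']
```

If every parsed prompt is empty, the operator retries generation up to retry_num times.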
- __init__(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', gen_num: Annotated[int, Gt(gt=0)] = 3, max_example_num: Annotated[int, Gt(gt=0)] = 3, keep_original_sample: bool = True, retry_num: int = 3, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, prompt_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, is_hf_model: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]¶
Initialization method.
- Parameters:
api_or_hf_model -- API or Hugging Face model name.
gen_num -- The number of new prompts to generate.
max_example_num -- Maximum number of example prompts to include as context when generating new optimized prompts.
keep_original_sample -- Whether to keep the original samples. If set to False, the final dataset will contain only the generated prompts and the original samples will be removed. Defaults to True.
retry_num -- Number of times to retry generation if the parsed generated prompt is empty. Defaults to 3.
api_endpoint -- URL endpoint for the API.
response_path -- Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt -- System prompt for guiding the generation task.
input_template -- Template for building the input prompt. It must include one placeholder '{}', which will be replaced by up to max_example_num formatted examples defined by example_template.
example_template -- Template for formatting one prompt example. It must include one placeholder '{}', which will be replaced by one formatted prompt.
prompt_template -- Template for formatting a single prompt within each example. It must include one placeholder '{}', which will be replaced by the prompt text.
output_pattern -- Regular expression pattern used to extract the generated prompts from the model response.
enable_vllm -- Whether to use vllm for inference acceleration.
is_hf_model -- If True, use Transformers to load a Hugging Face or local LLM.
model_params -- Parameters for initializing the model.
sampling_params -- Sampling parameters for text generation, e.g. {'temperature': 0.9, 'top_p': 0.95}.
kwargs -- Extra keyword arguments.
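To show how the three templates relate, the sketch below composes the default templates by hand: each prompt is wrapped by prompt_template, each wrapped prompt by example_template, and the concatenated examples fill input_template's single placeholder. This composition order is an assumption for illustration, not the operator's actual code:

```python
# Default templates copied from the class attributes above.
INPUT_TEMPLATE = "{}"
EXAMPLE_TEMPLATE = "\n如下是一条示例数据:\n{}"  # "Here is one example:"
PROMPT_TEMPLATE = "【提示词】\n{}\n"             # "[Prompt]"

# Hypothetical example prompts drawn from the same batch.
example_prompts = [
    "Summarize the input text.",
    "Summarize the input text in one sentence.",
]

# Assumed composition: prompt -> prompt_template -> example_template,
# then all examples joined into input_template's placeholder.
examples = [
    EXAMPLE_TEMPLATE.format(PROMPT_TEMPLATE.format(p)) for p in example_prompts
]
input_prompt = INPUT_TEMPLATE.format("".join(examples))
print(input_prompt)
```

The resulting string is what gets sent to the model (after the system prompt), and the model's reply is then parsed with output_pattern.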