data_juicer.ops.mapper.optimize_qa_mapper module¶

class data_juicer.ops.mapper.optimize_qa_mapper.OptimizeQAMapper(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', is_hf_model: bool = True, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]¶

Bases: Mapper

Mapper to optimize question-answer pairs.

This operator refines and enhances the quality of question-answer pairs. It uses a Hugging Face model to generate more detailed and accurate questions and answers. The input is formatted using a template, and the output is parsed using a regular expression. The system prompt, input template, and output pattern can be customized. If VLLM is enabled, the operator accelerates inference on CUDA devices.

DEFAULT_SYSTEM_PROMPT = '请优化输入的问答对，使【问题】和【回答】都更加详细、准确。必须按照以下标记格式，直接输出优化后的问答对：\n【问题】\n优化后的问题\n【回答】\n优化后的回答'¶

DEFAULT_INPUT_TEMPLATE = '以下是原始问答对：\n{}'¶

DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'¶

DEFAULT_OUTPUT_PATTERN = '.*?【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'¶

__init__(api_or_hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', is_hf_model: bool = True, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]¶

Initialization method.

Parameters:

api_or_hf_model – API or huggingface model name.
is_hf_model – If true, use huggingface model. Otherwise, use API.
api_endpoint – URL endpoint for the API.
response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.
system_prompt – System prompt for guiding the optimization task.
input_template – Template for building the input for the model. Please make sure the template contains one placeholder ‘{}’, which corresponds to the question and answer pair generated by param qa_pair_template.
qa_pair_template – Template for formatting the question and answer pair. Please make sure the template contains two ‘{}’ to format question and answer.
output_pattern – Regular expression pattern to extract question and answer from model response.
try_num – The number of retry attempts when there is an API call error or output parsing error.
enable_vllm – Whether to use VLLM for inference acceleration.
model_params – Parameters for initializing the model.
sampling_params – Sampling parameters for text generation (e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}).
kwargs – Extra keyword arguments.

build_input(sample)[source]¶

parse_output(raw_output)[source]¶

process_single(sample, rank=None)[source]¶

For sample level, sample –> sample

Parameters:: sample – sample to process
Returns:: processed sample