data_juicer.ops.mapper.calibrate_qa_mapper module¶

class data_juicer.ops.mapper.calibrate_qa_mapper.CalibrateQAMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]¶

Bases: Mapper

Calibrates question-answer pairs based on reference text using an API model.

This operator uses a specified API model to calibrate question-answer pairs, making them more detailed and accurate. It constructs the input prompt by combining the reference text and the question-answer pair, then sends it to the API for calibration. The output is parsed to extract the calibrated question and answer. The operator retries the API call and parsing up to a specified number of times in case of errors. The default system prompt, input templates, and output pattern can be customized. The operator supports additional parameters for model initialization and sampling.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对【问题】和【回答】进行校准，使其更加详细、准确。\n按照以下格式输出：\n【问题】\n校准后的问题\n【回答】\n校准后的回答'¶

DEFAULT_INPUT_TEMPLATE = '{reference}\n{qa_pair}'¶

DEFAULT_REFERENCE_TEMPLATE = '【参考信息】\n{}'¶

DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'¶

DEFAULT_OUTPUT_PATTERN = '【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'¶

__init__(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]¶

Initialization method.

Parameters:

api_model – API model name.
api_endpoint – URL endpoint for the API.
response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.
system_prompt – System prompt for the calibration task.
input_template – Template for building the model input.
reference_template – Template for formatting the reference text.
qa_pair_template – Template for formatting question-answer pairs.
output_pattern – Regular expression for parsing model output.
try_num – The number of retry attempts when there is an API call error or output parsing error.
model_params – Parameters for initializing the API model.
sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}
kwargs – Extra keyword arguments.

build_input(sample)[source]¶

parse_output(raw_output)[source]¶

process_single(sample, rank=None)[source]¶

For sample level, sample –> sample

Parameters:: sample – sample to process
Returns:: processed sample