data_juicer.ops.mapper.extract_support_text_mapper module

class data_juicer.ops.mapper.extract_support_text_mapper.ExtractSupportTextMapper(api_model: str = 'gpt-4o', *, summary_key: str = 'event_description', support_text_key: str = 'support_text', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[源代码]

基类:Mapper

Extracts a supporting sub-text from the original text based on a given summary.

This operator uses an API model to identify and extract a segment of the original text that best matches the provided summary. It leverages a system prompt and input template to guide the extraction process. The extracted support text is stored in the specified meta field key. If the extraction fails or returns an empty string, the original summary is used as a fallback. The operator retries the extraction up to a specified number of times in case of errors.

DEFAULT_SYSTEM_PROMPT = '你将扮演一个文本摘录助手的角色。你的主要任务是基于给定的文章(称为“原文”)以及对原文某个部分的简短描述或总结(称为“总结”),准确地识别并提取出与该总结相对应的原文片段。\n要求:\n- 你需要尽可能精确地匹配到最符合总结内容的那部分内容\n- 如果存在多个可能的答案,请选择最贴近总结意思的那个\n- 下面是一个例子帮助理解这一过程:\n### 原文:\n《红楼梦》是中国古典小说四大名著之一,由清代作家曹雪芹创作。它讲述了贾宝玉、林黛玉等人的爱情故事及四大家族的兴衰历程。书中通过复杂的人物关系展现了封建社会的各种矛盾冲突。其中关于贾府内部斗争的部分尤其精彩,特别是王熙凤与尤二姐之间的争斗,生动描绘了权力争夺下的女性形象。此外,《红楼梦》还以其精美的诗词闻名,这些诗词不仅增添了文学色彩,也深刻反映了人物的性格特点和命运走向。\n\n### 总结:\n描述了书中的两个女性角色之间围绕权力展开的竞争。\n\n### 原文摘录:\n其中关于贾府内部斗争的部分尤其精彩,特别是王熙凤与尤二姐之间的争斗,生动描绘了权力争夺下的女性形象。'
DEFAULT_INPUT_TEMPLATE = '### 原文:\n{text}\n\n### 总结:\n{summary}\n\n### 原文摘录:\n'
__init__(api_model: str = 'gpt-4o', *, summary_key: str = 'event_description', support_text_key: str = 'support_text', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[源代码]

Initialization method. :param api_model: API model name. :param summary_key: The key name to store the input summary in the

meta field. It's "event_description" in default.

参数:
  • support_text_key -- The key name to store the output support text for the summary in the meta field. It's "support_text" in default.

  • api_endpoint -- URL endpoint for the API.

  • response_path -- Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt -- System prompt for the task.

  • input_template -- Template for building the model input.

  • try_num -- The number of retry attempts when there is an API call error or output parsing error.

  • drop_text -- If drop the text in the output.

  • model_params -- Parameters for initializing the API model.

  • sampling_params -- Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}

  • kwargs -- Extra keyword arguments.

process_single(sample, rank=None)[源代码]

For sample level, sample --> sample

参数:

sample -- sample to process

返回:

processed sample