data_juicer.ops.mapper.extract_event_mapper module

class data_juicer.ops.mapper.extract_event_mapper.ExtractEventMapper(api_model: str = 'gpt-4o', *, event_desc_key: str = 'event_description', relevant_char_key: str = 'relevant_characters', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[源代码]

基类:Mapper

Extract events and relevant characters in the text

DEFAULT_SYSTEM_PROMPT = '给定一段文本,对文本的情节进行分点总结,并抽取与情节相关的人物。\n要求:\n- 尽量不要遗漏内容,不要添加文本中没有的情节,符合原文事实\n- 联系上下文说明前因后果,但仍然需要符合事实\n- 不要包含主观看法\n- 注意要尽可能保留文本的专有名词\n- 注意相关人物需要在对应情节中出现\n- 只抽取情节中的主要人物,不要遗漏情节的主要人物\n- 总结格式如下:\n### 情节1:\n- **情节描述**: ...\n- **相关人物**:人物1,人物2,人物3,...\n### 情节2:\n- **情节描述**: ...\n- **相关人物**:人物1,人物2,...\n### 情节3:\n- **情节描述**: ...\n- **相关人物**:人物1,...\n...\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_OUTPUT_PATTERN = '\n        \\#\\#\\#\\s*情节(\\d+):\\s*\n        -\\s*\\*\\*情节描述\\*\\*\\s*:\\s*(.*?)\\s*\n        -\\s*\\*\\*相关人物\\*\\*\\s*:\\s*(.*?)(?=\\#\\#\\#|\\Z)\n    '
__init__(api_model: str = 'gpt-4o', *, event_desc_key: str = 'event_description', relevant_char_key: str = 'relevant_characters', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[源代码]

Initialization method. :param api_model: API model name. :param event_desc_key: The key name to store the event descriptions

in the meta field. It's "event_description" in default.

参数:
  • relevant_char_key -- The field name to store the relevant characters to the events in the meta field. It's "relevant_characters" in default.

  • api_endpoint -- URL endpoint for the API.

  • response_path -- Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt -- System prompt for the task.

  • input_template -- Template for building the model input.

  • output_pattern -- Regular expression for parsing model output.

  • try_num -- The number of retry attempts when there is an API call error or output parsing error.

  • drop_text -- If drop the text in the output.

  • model_params -- Parameters for initializing the API model.

  • sampling_params -- Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}

  • kwargs -- Extra keyword arguments.

parse_output(raw_output)[源代码]
process_batched(samples, rank=None)[源代码]