data_juicer.ops.mapper.extract_event_mapper module

class data_juicer.ops.mapper.extract_event_mapper.ExtractEventMapper(api_model: str = 'gpt-4o', *, event_desc_key: str = 'event_description', relevant_char_key: str = 'relevant_characters', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[源代码]

基类:Mapper

Extracts events and relevant characters from the text.

This operator uses an API model to summarize the text into multiple events and extract the relevant characters for each event. The summary and character extraction follow a predefined format. The operator retries the API call up to a specified number of times if there is an error. The extracted events and characters are stored in the meta field of the samples. If no events are found, the original samples are returned. The operator can optionally drop the original text after processing.

DEFAULT_SYSTEM_PROMPT = '给定一段文本,对文本的情节进行分点总结,并抽取与情节相关的人物。\n要求:\n- 尽量不要遗漏内容,不要添加文本中没有的情节,符合原文事实\n- 联系上下文说明前因后果,但仍然需要符合事实\n- 不要包含主观看法\n- 注意要尽可能保留文本的专有名词\n- 注意相关人物需要在对应情节中出现\n- 只抽取情节中的主要人物,不要遗漏情节的主要人物\n- 总结格式如下:\n### 情节1:\n- **情节描述**: ...\n- **相关人物**:人物1,人物2,人物3,...\n### 情节2:\n- **情节描述**: ...\n- **相关人物**:人物1,人物2,...\n### 情节3:\n- **情节描述**: ...\n- **相关人物**:人物1,...\n...\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_OUTPUT_PATTERN = '\n        \\#\\#\\#\\s*情节(\\d+):\\s*\n        -\\s*\\*\\*情节描述\\*\\*\\s*:\\s*(.*?)\\s*\n        -\\s*\\*\\*相关人物\\*\\*\\s*:\\s*(.*?)(?=\\#\\#\\#|\\Z)\n    '
__init__(api_model: str = 'gpt-4o', *, event_desc_key: str = 'event_description', relevant_char_key: str = 'relevant_characters', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[源代码]

Initialization method. :param api_model: API model name. :param event_desc_key: The key name to store the event descriptions

in the meta field. It's "event_description" in default.

参数:
  • relevant_char_key -- The field name to store the relevant characters to the events in the meta field. It's "relevant_characters" in default.

  • api_endpoint -- URL endpoint for the API.

  • response_path -- Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt -- System prompt for the task.

  • input_template -- Template for building the model input.

  • output_pattern -- Regular expression for parsing model output.

  • try_num -- The number of retry attempts when there is an API call error or output parsing error.

  • drop_text -- If drop the text in the output.

  • model_params -- Parameters for initializing the API model.

  • sampling_params -- Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}

  • kwargs -- Extra keyword arguments.

parse_output(raw_output)[源代码]
process_batched(samples, rank=None)[源代码]