data_juicer.ops.mapper.extract_entity_attribute_mapper module

class data_juicer.ops.mapper.extract_entity_attribute_mapper.ExtractEntityAttributeMapper(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = 'main_entities', attribute_key: str = 'attributes', attribute_desc_key: str = 'attribute_descriptions', support_text_key: str = 'attribute_support_texts', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extracts attributes of the given entities from the text.

DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定一段文本,从文本中总结{entity}的{attribute},并且从原文摘录最能说明该{attribute}的代表性示例。\n要求:\n- 摘录的示例应该简短。\n- 遵循如下的回复格式:\n# {entity}\n## {attribute}:\n...\n### 代表性示例摘录1:\n```\n...\n```\n### 代表性示例摘录2:\n```\n...\n```\n...\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_ATTR_PATTERN_TEMPLATE = '\\#\\#\\s*{attribute}:\\s*(.*?)(?=\\#\\#\\#|\\Z)'
DEFAULT_DEMON_PATTERN = '\\#\\#\\#\\s*代表性示例摘录(\\d+):\\s*```\\s*(.*?)```\\s*(?=\\#\\#\\#|\\Z)'
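As an illustration of how these templates might be filled in, the sketch below formats copies of DEFAULT_INPUT_TEMPLATE and DEFAULT_ATTR_PATTERN_TEMPLATE with Python's str.format. The helper names build_input and build_attr_pattern are hypothetical, and the use of str.format is an assumption based on the {text} and {attribute} placeholders; the operator itself may build its prompts differently.

```python
# Three backticks, built indirectly so this example contains no literal fence.
FENCE = "`" * 3

# Copies of the class defaults as displayed above.
DEFAULT_INPUT_TEMPLATE = "# 文本\n" + FENCE + "\n{text}\n" + FENCE + "\n"
DEFAULT_ATTR_PATTERN_TEMPLATE = r"\#\#\s*{attribute}:\s*(.*?)(?=\#\#\#|\Z)"


def build_input(text: str) -> str:
    """Hypothetical helper: wrap the raw text in the input template."""
    return DEFAULT_INPUT_TEMPLATE.format(text=text)


def build_attr_pattern(attribute: str) -> str:
    """Hypothetical helper: specialize the regex template for one attribute.

    The attribute is inserted as-is (an assumption), so regex metacharacters
    in the attribute name would change the meaning of the pattern.
    """
    return DEFAULT_ATTR_PATTERN_TEMPLATE.format(attribute=attribute)


if __name__ == "__main__":
    print(build_input("Alice works as a doctor."))
    print(build_attr_pattern("occupation"))
```

A custom attr_pattern_template would presumably need to keep the {attribute} placeholder so that it can be specialized in the same way.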
__init__(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = 'main_entities', attribute_key: str = 'attributes', attribute_desc_key: str = 'attribute_descriptions', support_text_key: str = 'attribute_support_texts', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • query_entities – List of entities to query.

  • query_attributes – List of attributes to query.

  • entity_key – The key name in the meta field to store the given main entity for attribute extraction. Defaults to ‘main_entities’.

  • attribute_key – The key name in the meta field to store the given attribute to be extracted. Defaults to ‘attributes’.

  • attribute_desc_key – The key name in the meta field to store the extracted attribute description. Defaults to ‘attribute_descriptions’.

  • support_text_key – The key name in the meta field to store the attribute support text extracted from the raw text. Defaults to ‘attribute_support_texts’.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt_template – System prompt template for the task. It is specialized with the given entity and attribute (the {entity} and {attribute} placeholders in the default template).

  • input_template – Template for building the model input.

  • attr_pattern_template – Pattern template for parsing the attribute from the output. It is specialized with the given attribute (the {attribute} placeholder in the default template).

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • kwargs – Extra keyword arguments.

  • demo_pattern – Pattern for parsing the demonstrations from the output that support the attribute.
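The default response_path, ‘choices.0.message.content’, reads as a dotted path into a nested API response. The sketch below shows one way such a path could be resolved against an OpenAI-style response dictionary. The function name resolve_path and the sample response are hypothetical, and treating numeric segments as list indices is an assumption; the library's own lookup may differ.

```python
from typing import Any


def resolve_path(response: Any, path: str = "choices.0.message.content") -> Any:
    """Hypothetical helper: walk a dotted path through nested dicts and lists."""
    current = response
    for part in path.split("."):
        if isinstance(current, list):
            # Assumption: a segment applied to a list is an integer index.
            current = current[int(part)]
        else:
            current = current[part]
    return current


if __name__ == "__main__":
    # Made-up response shaped like a chat-completion payload.
    fake_response = {"choices": [{"message": {"content": "# Alice\n..."}}]}
    print(resolve_path(fake_response))
```

If an endpoint returns a differently shaped payload, a different response_path would be needed to reach the generated text.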

parse_output(raw_output, attribute_name)[source]
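The parsing step can be illustrated with the two default patterns. The sketch below is not the library's parse_output. It is a hypothetical parse_output_sketch that applies copies of DEFAULT_ATTR_PATTERN_TEMPLATE and DEFAULT_DEMON_PATTERN to a made-up response in the format requested by the default system prompt (代表性示例摘录 means "representative example excerpt"). The re.DOTALL flag, the returned tuple shape, and the sample response are assumptions.

```python
import re
from typing import List, Tuple

# Three backticks, built indirectly so this example contains no literal fence.
FENCE = "`" * 3

# Copies of the class defaults as displayed above.
ATTR_PATTERN_TEMPLATE = r"\#\#\s*{attribute}:\s*(.*?)(?=\#\#\#|\Z)"
DEMO_PATTERN = (
    r"\#\#\#\s*代表性示例摘录(\d+):\s*"
    + FENCE + r"\s*(.*?)" + FENCE
    + r"\s*(?=\#\#\#|\Z)"
)


def parse_output_sketch(raw_output: str, attribute_name: str) -> Tuple[str, List[str]]:
    """Return (attribute description, supporting excerpts), or ("", []) if absent."""
    attr_pattern = ATTR_PATTERN_TEMPLATE.format(attribute=attribute_name)
    attr_match = re.search(attr_pattern, raw_output, re.DOTALL)
    if attr_match is None:
        return "", []
    description = attr_match.group(1).strip()
    # Each findall hit is (excerpt number, excerpt text); keep only the text.
    demos = [text.strip() for _, text in re.findall(DEMO_PATTERN, raw_output, re.DOTALL)]
    return description, demos


# Made-up model response following the default reply format.
SAMPLE_OUTPUT = "\n".join([
    "# Alice",
    "## occupation:",
    "Alice works as a doctor.",
    "### 代表性示例摘录1:",
    FENCE,
    "Alice treats patients.",
    FENCE,
    "### 代表性示例摘录2:",
    FENCE,
    "She runs a clinic.",
    FENCE,
]) + "\n"

if __name__ == "__main__":
    print(parse_output_sketch(SAMPLE_OUTPUT, "occupation"))
```

The lookahead `(?=\#\#\#|\Z)` stops each capture at the next third-level heading or at the end of the response, which is why the description does not swallow the excerpts that follow it.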
process_single(sample, rank=None)[source]

Processes a single sample (sample -> sample).

Parameters:

sample – sample to process

Returns:

processed sample
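The try_num parameter is described as the number of retry attempts on an API call error or an output parsing error. A minimal sketch of such a retry loop follows, under stated assumptions: extract_with_retries, call_api, and parse are hypothetical names, and treating an empty description or an empty excerpt list as a parsing failure is a guess. It does not reproduce the actual body of process_single.

```python
from typing import Callable, List, Tuple


def extract_with_retries(
    call_api: Callable[[], str],
    parse: Callable[[str], Tuple[str, List[str]]],
    try_num: int = 3,
) -> Tuple[str, List[str]]:
    """Call the API and parse its output, trying at most try_num times."""
    for _ in range(try_num):
        try:
            description, demos = parse(call_api())
        except Exception:
            # Assumption: any API or parsing exception triggers another attempt.
            continue
        if description and demos:
            return description, demos
    # All attempts failed: fall back to empty results.
    return "", []


if __name__ == "__main__":
    # Fake API that always returns the same text, and a trivial parser.
    result = extract_with_retries(lambda: "a doctor", lambda raw: (raw, ["excerpt"]))
    print(result)
```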