data_juicer.ops.mapper.extract_entity_attribute_mapper module
- class data_juicer.ops.mapper.extract_entity_attribute_mapper.ExtractEntityAttributeMapper(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = 'main_entities', attribute_key: str = 'attributes', attribute_desc_key: str = 'attribute_descriptions', support_text_key: str = 'attribute_support_texts', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)
Bases: Mapper
Extract attributes for given entities from the text.
- DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定一段文本,从文本中总结{entity}的{attribute},并且从原文摘录最能说明该{attribute}的代表性示例。\n要求:\n- 摘录的示例应该简短。\n- 遵循如下的回复格式:\n# {entity}\n## {attribute}:\n...\n### 代表性示例摘录1:\n```\n...\n```\n### 代表性示例摘录2:\n```\n...\n```\n...\n'
- DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
- DEFAULT_ATTR_PATTERN_TEMPLATE = '\\#\\#\\s*{attribute}:\\s*(.*?)(?=\\#\\#\\#|\\Z)'
- DEFAULT_DEMON_PATTERN = '\\#\\#\\#\\s*代表性示例摘录(\\d+):\\s*```\\s*(.*?)```\\s*(?=\\#\\#\\#|\\Z)'
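The two default patterns are easier to read when applied to a reply in the format the system prompt requests. The following is a minimal sketch, not the operator's actual parsing code: the pattern strings are copied from the constants as listed here, while the sample reply, the `re.DOTALL` flag, and the variable names are assumptions made for illustration.

```python
import re

# Triple backticks are assembled here so the snippet contains no literal fence.
FENCE = "`" * 3

# Copied from DEFAULT_ATTR_PATTERN_TEMPLATE and DEFAULT_DEMON_PATTERN as listed.
ATTR_PATTERN_TEMPLATE = r"\#\#\s*{attribute}:\s*(.*?)(?=\#\#\#|\Z)"
DEMO_PATTERN = (
    r"\#\#\#\s*代表性示例摘录(\d+):\s*"
    + FENCE + r"\s*(.*?)" + FENCE
    + r"\s*(?=\#\#\#|\Z)"
)

# Hypothetical model reply following the requested format:
# "# entity", "## attribute:", a summary, then numbered fenced excerpts.
response = "\n".join([
    "# Alice",
    "## personality:",
    "Brave and curious.",
    "### 代表性示例摘录1:",
    FENCE,
    "Alice jumped in.",
    FENCE,
    "### 代表性示例摘录2:",
    FENCE,
    "She asked why.",
    FENCE,
    "",
])

# Fill the attribute name into the template, then capture the summary text
# that sits between the "## attribute:" heading and the first "###" heading.
attr_pattern = ATTR_PATTERN_TEMPLATE.format(attribute="personality")
match = re.search(attr_pattern, response, re.DOTALL)  # DOTALL is an assumption
description = match.group(1).strip() if match else ""

# Each demo match yields (excerpt number, excerpt text).
demos = [text.strip() for _, text in re.findall(DEMO_PATTERN, response, re.DOTALL)]

print(description)
print(demos)
```

A custom `attr_pattern_template` presumably needs to keep the `{attribute}` placeholder and a capture group for the description, since the default relies on both.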
- __init__(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = 'main_entities', attribute_key: str = 'attributes', attribute_desc_key: str = 'attribute_descriptions', support_text_key: str = 'attribute_support_texts', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)
Initialization method.
- Parameters:
api_model – API model name.
query_entities – Entity list to be queried.
query_attributes – Attribute list to be queried.
entity_key – The key name in the meta field to store the given main entity for attribute extraction. Defaults to 'main_entities'.
attribute_key – The key name in the meta field to store the given attribute to be extracted. Defaults to 'attributes'.
attribute_desc_key – The key name in the meta field to store the extracted attribute description. Defaults to 'attribute_descriptions'.
support_text_key – The key name in the meta field to store the attribute support text extracted from the raw text. Defaults to 'attribute_support_texts'.
api_endpoint – URL endpoint for the API.
response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.
system_prompt_template – System prompt template for the task. It is filled in with the given entity and attribute.
input_template – Template for building the model input.
attr_pattern_template – Pattern template for parsing the attribute from the output. It is filled in with the given attribute.
try_num – The number of retry attempts when there is an API call error or output parsing error.
drop_text – Whether to drop the text in the output.
model_params – Parameters for initializing the API model.
sampling_params – Extra parameters passed to the API call, e.g. {'temperature': 0.9, 'top_p': 0.95}.
kwargs – Extra keyword arguments.
demo_pattern – Pattern for parsing the demonstrations from the output that support the attribute.
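To illustrate how `response_path` and `try_num` could work together, here is a minimal sketch under stated assumptions. It is not the library's implementation: `resolve_path`, `call_with_retries`, and `fake_api` are hypothetical helpers, and the response shape is only assumed to match the default path 'choices.0.message.content'.

```python
from typing import Any, Callable, Optional


def resolve_path(response: Any, path: str) -> Any:
    """Walk a dotted path such as 'choices.0.message.content' (hypothetical helper)."""
    current = response
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric parts index into lists
        else:
            current = current[part]  # other parts are dictionary keys
    return current


def call_with_retries(
    call: Callable[[], Any],
    parse: Callable[[Any], Optional[str]],
    try_num: int = 3,
) -> Optional[str]:
    """Retry on API-call or parsing failure, at most try_num times (hypothetical helper)."""
    if try_num <= 0:
        raise ValueError("try_num must be greater than 0")
    for _ in range(try_num):
        try:
            parsed = parse(call())
        except Exception:
            continue  # treat any failure as a reason to try again
        if parsed:
            return parsed
    return None  # every attempt failed or produced nothing usable


# Stand-in for the API: fails once, then returns a chat-style response.
attempts = []


def fake_api() -> dict:
    attempts.append(1)
    if len(attempts) == 1:
        raise RuntimeError("simulated API error")
    return {"choices": [{"message": {"content": "## age:\n12"}}]}


content = call_with_retries(
    fake_api,
    lambda r: resolve_path(r, "choices.0.message.content"),
    try_num=3,
)
print(content)
```

The sketch returns `None` once the retry budget is exhausted; how the operator itself reports exhausted retries is not stated in this reference.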