data_juicer.ops.mapper.extract_entity_attribute_mapper module
- class data_juicer.ops.mapper.extract_entity_attribute_mapper.ExtractEntityAttributeMapper(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = 'main_entities', attribute_key: str = 'attributes', attribute_desc_key: str = 'attribute_descriptions', support_text_key: str = 'attribute_support_texts', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)
Bases: Mapper
Extracts attributes for given entities from the text and stores them in the sample’s metadata.
This operator uses an API model to extract specified attributes for given entities from the input text. It constructs prompts from the provided templates and parses the model’s output to extract attribute descriptions and supporting text. The extracted data is stored in the sample’s metadata under the specified keys. If the required metadata fields already exist, the operator skips processing for that sample. If an API call or output parsing fails, the operator retries up to a specified number of times. If the system prompt, input template, or parsing patterns are not provided, the defaults are used.
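The retry behavior described above can be sketched as follows. This is a minimal illustration and not the operator's actual implementation: `extract_with_retries`, `call_model`, and `parse` are hypothetical names, and the fake model and parser exist only to make the example runnable without an API key.

```python
from typing import Callable, List, Tuple


def extract_with_retries(
    call_model: Callable[[str], str],
    parse: Callable[[str], Tuple[str, List[str]]],
    prompt: str,
    try_num: int = 3,
) -> Tuple[str, List[str]]:
    """Call the model up to try_num times and return the first usable parse."""
    desc: str = ""
    demos: List[str] = []
    for _ in range(try_num):
        try:
            raw = call_model(prompt)
            desc, demos = parse(raw)
            if desc and demos:
                break  # both fields were extracted, so stop retrying
        except Exception as exc:  # API error or parsing error
            print(f"attempt failed: {exc}")
    return desc, demos


# Hypothetical stand-ins for the API model and the output parser.
calls = {"n": 0}


def flaky_model(prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("simulated API error")
    return "summary|quote"


def simple_parse(raw: str) -> Tuple[str, List[str]]:
    desc, demo = raw.split("|")
    return desc, [demo]


result = extract_with_retries(flaky_model, simple_parse, "prompt", try_num=3)
```

Here the first call raises, the second succeeds, and the loop stops early. If every attempt failed, the sketch would return an empty description and an empty list.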
- DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定一段文本,从文本中总结{entity}的{attribute},并且从原文摘录最能说明该{attribute}的代表性示例。\n要求:\n- 摘录的示例应该简短。\n- 遵循如下的回复格式:\n# {entity}\n## {attribute}:\n...\n### 代表性示例摘录1:\n```\n...\n```\n### 代表性示例摘录2:\n```\n...\n```\n...\n'
- DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
- DEFAULT_ATTR_PATTERN_TEMPLATE = '\\#\\#\\s*{attribute}:\\s*(.*?)(?=\\#\\#\\#|\\Z)'
- DEFAULT_DEMON_PATTERN = '\\#\\#\\#\\s*代表性示例摘录(\\d+):\\s*```\\s*(.*?)```\\s*(?=\\#\\#\\#|\\Z)'
- __init__(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = 'main_entities', attribute_key: str = 'attributes', attribute_desc_key: str = 'attribute_descriptions', support_text_key: str = 'attribute_support_texts', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)
Initialization method.
- Parameters:
api_model – API model name.
query_entities – Entity list to be queried.
query_attributes – Attribute list to be queried.
entity_key – The key name in the meta field to store the given main entity for attribute extraction. Defaults to “main_entities”.
attribute_key – The key name in the meta field to store the given attribute to be extracted. Defaults to “attributes”.
attribute_desc_key – The key name in the meta field to store the extracted attribute description. Defaults to “attribute_descriptions”.
support_text_key – The key name in the meta field to store the attribute support text extracted from the raw text. Defaults to “attribute_support_texts”.
api_endpoint – URL endpoint for the API.
response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.
system_prompt_template – System prompt template for the task. It is filled in with the given entity and attribute, so it should contain the {entity} and {attribute} placeholders, as the default template does.
input_template – Template for building the model input.
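attr_pattern_template – Pattern template for parsing the attribute description from the output. It is filled in with the given attribute, so it should contain the {attribute} placeholder, as the default template does.
demo_pattern – Pattern for parsing the demonstrations that support the attribute from the output.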
try_num – The number of retry attempts when there is an API call error or output parsing error.
drop_text – Whether to drop the text in the output.
model_params – Parameters for initializing the API model.
sampling_params – Extra parameters passed to the API call, e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}.
kwargs – Extra keyword arguments.
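To make the role of the templates and parsing patterns concrete, the sketch below applies the default input template and the two default patterns, copied as shown in the class attributes above, to a hand-written model reply. It uses only the standard `re` module. The helper `parse_output`, the sample reply, and the use of the `re.DOTALL` flag are assumptions made for illustration; the operator's own parsing code may differ.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # three backticks, the fence used in the default templates

# Template and patterns copied as shown in the class attributes above.
INPUT_TEMPLATE = "# 文本\n" + FENCE + "\n{text}\n" + FENCE + "\n"
ATTR_PATTERN_TEMPLATE = r"\#\#\s*{attribute}:\s*(.*?)(?=\#\#\#|\Z)"
DEMO_PATTERN = (
    r"\#\#\#\s*代表性示例摘录(\d+):\s*" + FENCE
    + r"\s*(.*?)" + FENCE + r"\s*(?=\#\#\#|\Z)"
)


def parse_output(raw_output: str, attribute: str) -> Tuple[str, List[str]]:
    """Hypothetical helper: return (description, supporting excerpts)."""
    # Fill the attribute name into the pattern template before matching.
    attr_pattern = ATTR_PATTERN_TEMPLATE.format(attribute=attribute)
    # re.DOTALL (an assumption here) lets ".*?" span multiple lines.
    match = re.search(attr_pattern, raw_output, re.DOTALL)
    desc = match.group(1).strip() if match else ""
    demos = [
        text.strip()
        for _, text in re.findall(DEMO_PATTERN, raw_output, re.DOTALL)
    ]
    return desc, demos


# A hand-written reply in the format requested by the default system prompt.
sample_output = "\n".join([
    "# Alice",
    "## personality:",
    "Calm and witty.",
    "### 代表性示例摘录1:",
    FENCE,
    "Alice only smiled.",
    FENCE,
    "### 代表性示例摘录2:",
    FENCE,
    "She joked with the guard.",
    FENCE,
    "",
])

model_input = INPUT_TEMPLATE.format(text="Alice only smiled. She joked with the guard.")
description, support_texts = parse_output(sample_output, "personality")
```

A custom `attr_pattern_template` or `demo_pattern` would need to match whatever reply format a custom `system_prompt_template` asks the model to produce, which is why the three are usually changed together.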