data_juicer.ops.aggregator.entity_attribute_aggregator module

class data_juicer.ops.aggregator.entity_attribute_aggregator.EntityAttributeAggregator(api_model: str = 'gpt-4o', entity: str = None, attribute: str = None, input_key: str = 'event_description', output_key: str = 'entity_attribute', word_limit: Annotated[int, Gt(gt=0)] = 100, max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, example_prompt: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Aggregator

Summarizes a given attribute of an entity from a set of documents.

The operator extracts and summarizes the specified attribute of a given entity from the provided documents. It builds the request from a system prompt, an example prompt, and an input template, and formats the output as a markdown-style summary in which the entity and attribute are clearly labeled. The prompt asks for the summary to stay within a specified word limit (100 words by default). A Hugging Face tokenizer is used to respect the token budget, splitting the sub-documents when necessary. If the input key or other required fields are missing, the operator logs a warning and returns the sample unchanged. The resulting summary is stored in the batch metadata under the specified output key. The system prompt template, example prompt, input template, and output pattern can all be customized.

DEFAULT_SYSTEM_TEMPLATE = '给定与`{entity}`相关的一些文档,总结`{entity}`的`{attribute}`。\n要求:\n- 尽量使用原文专有名词\n- 联系上下文,自动忽略上下文不一致的细节错误\n- 只对文档中与`{entity}`的`{attribute}`有关的内容进行总结\n- 字数限制在**{word_limit}字以内**\n- 要求输出格式如下:\n# {entity}\n## {attribute}\n...\n{example}'
DEFAULT_EXAMPLE_PROMPT = '- 例如,根据相关文档总结`孙悟空`的`出身背景`,**100字**以内的样例如下:\n`孙悟空`的`出身背景`总结:\n# 孙悟空\n## 出身背景\n号称齐天大圣,花果山水帘洞的美猴王、西行取经队伍中的大师兄。师父是唐僧玄奘,曾拜菩提祖师学艺。亲生父母未知,自石头中孕育而生。自认斗战胜佛,最怕观世音菩萨和紧箍咒。\n'
DEFAULT_INPUT_TEMPLATE = '`{entity}`的相关文档:\n{sub_docs}\n\n`{entity}`的`{attribute}`总结:\n'
DEFAULT_OUTPUT_PATTERN_TEMPLATE = '\\#\\s*{entity}\\s*\\#\\#\\s*{attribute}\\s*(.*?)\\Z'
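
Taken together, these defaults compose the request and parse the response roughly as sketched below. The sketch is illustrative only: it fills the documented templates by hand and applies the default output pattern with Python's re module; the exact escaping and regex flags used inside the operator are implementation details and may differ.

    import re
    from data_juicer.ops.aggregator.entity_attribute_aggregator import (
        EntityAttributeAggregator,
    )

    # Fill the documented default templates by hand (illustrative values).
    entity, attribute, word_limit = '孙悟空', '出身背景', 100
    sub_docs = '文档1:孙悟空自石头中孕育而生……\n文档2:孙悟空曾拜菩提祖师学艺……'

    system_prompt = EntityAttributeAggregator.DEFAULT_SYSTEM_TEMPLATE.format(
        entity=entity,
        attribute=attribute,
        word_limit=word_limit,
        example=EntityAttributeAggregator.DEFAULT_EXAMPLE_PROMPT,
    )
    user_input = EntityAttributeAggregator.DEFAULT_INPUT_TEMPLATE.format(
        entity=entity, attribute=attribute, sub_docs=sub_docs
    )

    # Parse a model response with the default output pattern. The entity and
    # attribute here contain no regex metacharacters, so plain .format() works.
    pattern = EntityAttributeAggregator.DEFAULT_OUTPUT_PATTERN_TEMPLATE.format(
        entity=entity, attribute=attribute
    )
    response = '# 孙悟空\n## 出身背景\n号称齐天大圣,自石头中孕育而生,曾拜菩提祖师学艺。'
    match = re.search(pattern, response, re.DOTALL)
    summary = match.group(1).strip() if match else None
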
__init__(api_model: str = 'gpt-4o', entity: str = None, attribute: str = None, input_key: str = 'event_description', output_key: str = 'entity_attribute', word_limit: Annotated[int, Gt(gt=0)] = 100, max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, example_prompt: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • entity – The given entity.

  • attribute – The given attribute.

  • input_key – The input key in the meta field of the samples. It is “event_description” by default.

  • output_key – The output key in the aggregation field of the samples. It is “entity_attribute” by default.

  • word_limit – The word limit stated in the prompt to bound the output length.

  • max_token_num – The maximum total number of tokens across the sub-documents. No limit is applied if it is None.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt_template – The system prompt template.

  • example_prompt – The example part in the system prompt.

  • input_template – The input template.

  • output_pattern_template – The template of the output pattern used to parse the model response.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs – Extra keyword arguments.
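
A minimal construction sketch using only the parameters documented above; the model name, token budget, and sampling values are placeholders rather than recommended settings:

    from data_juicer.ops.aggregator.entity_attribute_aggregator import (
        EntityAttributeAggregator,
    )

    # Placeholder values throughout; adjust to your API provider and data.
    op = EntityAttributeAggregator(
        api_model='gpt-4o',              # API model name
        entity='孙悟空',                  # the entity to summarize
        attribute='出身背景',             # the attribute of that entity
        input_key='event_description',   # read from the samples' meta field
        output_key='entity_attribute',   # written to the batch metadata
        word_limit=100,                  # word limit stated in the prompt
        max_token_num=4096,              # cap on total sub-document tokens
        try_num=3,                       # retries on API or parsing errors
        sampling_params={'temperature': 0.9, 'top_p': 0.95},
    )
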

parse_output(response)[source]
attribute_summary(sub_docs, rank=None)[source]
process_single(sample=None, rank=None)[source]

Sample-level aggregation: a batched sample is reduced to a single sample. The input must be the output of some Grouper OP.

Parameters:

sample – batched sample to aggregate

Returns:

aggregated sample
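
As a usage illustration, the operator can be applied directly to a grouped sample, reusing the op instance constructed above. The nested layout below is an assumption for illustration only; real grouped samples are produced by an upstream Grouper OP and use Data-Juicer's internal meta field names.

    # Illustrative only: 'meta' and 'batch_meta' stand in for Data-Juicer's
    # internal field constants, and the grouped layout is assumed here.
    grouped_sample = {
        'meta': [
            {'event_description': '孙悟空自石头中孕育而生,在花果山称王。'},
            {'event_description': '孙悟空拜菩提祖师学艺,后大闹天宫。'},
        ],
    }

    aggregated = op.process_single(grouped_sample)
    # The markdown-style summary is stored in the batch metadata under the
    # output key, e.g. aggregated['batch_meta']['entity_attribute'].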