data_juicer.ops.aggregator.entity_attribute_aggregator module

class data_juicer.ops.aggregator.entity_attribute_aggregator.EntityAttributeAggregator(api_model: str = 'gpt-4o', entity: str = None, attribute: str = None, input_key: str = 'event_description', output_key: str = 'entity_attribute', word_limit: Annotated[int, Gt(gt=0)] = 100, max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, example_prompt: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[源代码]

基类:Aggregator

Return conclusion of the given entity's attribute from some docs.

DEFAULT_SYSTEM_TEMPLATE = '给定与`{entity}`相关的一些文档,总结`{entity}`的`{attribute}`。\n要求:\n- 尽量使用原文专有名词\n- 联系上下文,自动忽略上下文不一致的细节错误\n- 只对文档中与`{entity}`的`{attribute}`有关的内容进行总结\n- 字数限制在**{word_limit}字以内**\n- 要求输出格式如下:\n# {entity}\n## {attribute}\n...\n{example}'
DEFAULT_EXAMPLE_PROMPT = '- 例如,根据相关文档总结`孙悟空`的`出身背景`,**100字**以内的样例如下:\n`孙悟空`的`出身背景`总结:\n# 孙悟空\n## 出身背景\n号称齐天大圣,花果山水帘洞的美猴王、西行取经队伍中的大师兄。师父是唐僧玄奘,曾拜菩提祖师学艺。亲生父母未知,自石头中孕育而生。自认斗战胜佛,最怕观世音菩萨和紧箍咒。\n'
DEFAULT_INPUT_TEMPLATE = '`{entity}`的相关文档:\n{sub_docs}\n\n`{entity}`的`{attribute}`总结:\n'
DEFAULT_OUTPUT_PATTERN_TEMPLATE = '\\#\\s*{entity}\\s*\\#\\#\\s*{attribute}\\s*(.*?)\\Z'
__init__(api_model: str = 'gpt-4o', entity: str = None, attribute: str = None, input_key: str = 'event_description', output_key: str = 'entity_attribute', word_limit: Annotated[int, Gt(gt=0)] = 100, max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, example_prompt: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[源代码]

Initialization method. :param api_model: API model name. :param entity: The given entity. :param attribute: The given attribute. :param input_key: The input key in the meta field of the samples.

It is "event_description" in default.

参数:
  • output_key -- The output key in the aggregation field of the samples. It is "entity_attribute" in default.

  • word_limit -- Prompt the output length.

  • max_token_num -- The max token num of the total tokens of the sub documents. Without limitation if it is None.

  • api_endpoint -- URL endpoint for the API.

  • response_path -- Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt_template -- The system prompt template.

  • example_prompt -- The example part in the system prompt.

  • input_template -- The input template.

  • output_pattern_template -- The output template.

  • try_num -- The number of retry attempts when there is an API call error or output parsing error.

  • model_params -- Parameters for initializing the API model.

  • sampling_params -- Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}

  • kwargs -- Extra keyword arguments.

parse_output(response)[源代码]
attribute_summary(sub_docs, rank=None)[源代码]
process_single(sample=None, rank=None)[源代码]

For sample level, batched sample --> sample, the input must be the output of some Grouper OP.

参数:

sample -- batched sample to aggregate

返回:

aggregated sample