data_juicer.ops.aggregator.nested_aggregator module

class data_juicer.ops.aggregator.nested_aggregator.NestedAggregator(api_model: str = 'gpt-4o', input_key: str = 'event_description', output_key: str = None, max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, sub_doc_template: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Aggregator

Considering the limitation of input length, nested aggregate contents for each given number of samples.

DEFAULT_SYSTEM_PROMPT = '给定一些文档碎片,将这些文档整合成一个文档总结。\n要求:\n- 总结的长度与文档碎片的平均长度基本一致\n- 不要包含主观看法\n- 注意要尽可能保留文本的专有名词\n- 只输出文档总结不要输出其他内容\n- 参考如下样例:\n文档碎片:\n唐僧师徒四人行至白虎岭,遇上了变化多端的白骨精。\n\n文档碎片:\n白骨精首次变身少女送斋,被孙悟空识破打死,唐僧责怪悟空。\n\n文档碎片:\n妖怪再变老妇寻女,又被悟空击毙,师傅更加不满,念紧箍咒惩罚。\n\n文档碎片:\n不甘心的白骨精第三次化作老公公来诱骗,依旧逃不过金睛火眼。\n\n文档碎片:\n最终,在观音菩萨的帮助下,真相大白,唐僧明白了自己的误解。\n\n\n文档总结:\n唐僧师徒在白虎岭三遇白骨精变化诱惑,悟空屡次识破击毙妖怪却遭误解,最终观音相助真相大白。'
DEFAULT_INPUT_TEMPLATE = '{sub_docs}\n\n文档总结:\n'
DEFAULT_SUB_DOC_TEMPLATE = '文档碎片:\n{text}\n'
__init__(api_model: str = 'gpt-4o', input_key: str = 'event_description', output_key: str = None, max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, sub_doc_template: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method. :param api_model: API model name. :param input_key: The input key in the meta field of the samples.

It is “event_description” in default.

Parameters:
  • output_key – The output key in the aggregation field in the samples. It is same as the input_key in default.

  • max_token_num – The max token num of the total tokens of the sub documents. Without limitation if it is None.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – The system prompt.

  • sub_doc_template – The template for input text in each sample.

  • input_template – The input template.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(response)[source]
recursive_summary(sub_docs, rank=None)[source]
process_single(sample=None, rank=None)[source]

For sample level, batched sample –> sample, the input must be the output of some Grouper OP.

Parameters:

sample – batched sample to aggregate

Returns:

aggregated sample