data_juicer.ops.aggregator.nested_aggregator module
- class data_juicer.ops.aggregator.nested_aggregator.NestedAggregator(api_model: str = 'gpt-4o', input_key: str = 'event_description', output_key: str = None, max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, sub_doc_template: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)
Bases: Aggregator
Because model input length is limited, this operator aggregates content in a nested fashion, one group of a given number of samples at a time.
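The nesting idea can be illustrated with a minimal sketch. It is not the operator's actual implementation: the whitespace token counter, the greedy batching, and the function names `count_tokens`, `batch_by_budget`, and `nested_aggregate` are all assumptions made for illustration, and the `summarize` callback stands in for the LLM call.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def batch_by_budget(docs: List[str], max_token_num: int) -> List[List[str]]:
    """Greedily pack consecutive docs into batches that fit the token budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > max_token_num:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        batches.append(current)
    return batches


def nested_aggregate(
    docs: List[str],
    summarize: Callable[[List[str]], str],
    max_token_num: int,
) -> str:
    """Summarize batch by batch, then repeat on the summaries until one is left."""
    while len(docs) > 1:
        batches = batch_by_budget(docs, max_token_num)
        if len(batches) == len(docs):
            # No two docs fit in one batch; summarize everything at once
            # rather than loop forever.
            return summarize(docs)
        docs = [summarize(b) if len(b) > 1 else b[0] for b in batches]
    return docs[0] if docs else ""
```

Each round reduces the number of documents, so the loop ends once a single summary remains. Replacing `summarize` with a real model call and `count_tokens` with the model's tokenizer would bring the sketch closer to practical use.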
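- DEFAULT_SYSTEM_PROMPT = '给定一些文档碎片,将这些文档整合成一个文档总结。\n要求:\n- 总结的长度与文档碎片的平均长度基本一致\n- 不要包含主观看法\n- 注意要尽可能保留文本的专有名词\n- 只输出文档总结不要输出其他内容\n- 参考如下样例:\n文档碎片:\n唐僧师徒四人行至白虎岭,遇上了变化多端的白骨精。\n\n文档碎片:\n白骨精首次变身少女送斋,被孙悟空识破打死,唐僧责怪悟空。\n\n文档碎片:\n妖怪再变老妇寻女,又被悟空击毙,师傅更加不满,念紧箍咒惩罚。\n\n文档碎片:\n不甘心的白骨精第三次化作老公公来诱骗,依旧逃不过金睛火眼。\n\n文档碎片:\n最终,在观音菩萨的帮助下,真相大白,唐僧明白了自己的误解。\n\n\n文档总结:\n唐僧师徒在白虎岭三遇白骨精变化诱惑,悟空屡次识破击毙妖怪却遭误解,最终观音相助真相大白。'
In English: "Given some document fragments, merge them into a single document summary. Requirements: the summary should be about as long as the average fragment; do not include subjective opinions; preserve proper nouns from the text wherever possible; output only the summary and nothing else; follow the example below." The built-in example condenses five fragments about Tang Seng and his disciples meeting the White Bone Demon at White Tiger Ridge into a one-sentence summary.
- DEFAULT_INPUT_TEMPLATE = '{sub_docs}\n\n文档总结:\n'
In English: '{sub_docs}\n\nDocument summary:\n'
- DEFAULT_SUB_DOC_TEMPLATE = '文档碎片:\n{text}\n'
In English: 'Document fragment:\n{text}\n'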
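As a rough illustration of how the two templates could fit together, the sketch below fills `DEFAULT_SUB_DOC_TEMPLATE` once per sample and places the result into `DEFAULT_INPUT_TEMPLATE`. The helper name `build_input` and the `"\n"` joiner are assumptions, not the operator's confirmed behavior. The joiner was chosen because it reproduces the spacing of the example in the default system prompt.

```python
from typing import List

# Template values copied from the class attributes documented above.
DEFAULT_SUB_DOC_TEMPLATE = '文档碎片:\n{text}\n'
DEFAULT_INPUT_TEMPLATE = '{sub_docs}\n\n文档总结:\n'


def build_input(
    texts: List[str],
    sub_doc_template: str = DEFAULT_SUB_DOC_TEMPLATE,
    input_template: str = DEFAULT_INPUT_TEMPLATE,
) -> str:
    """Render each text as a sub-document, then wrap them in the input template."""
    # Assumption: rendered sub-documents are joined with a single newline.
    sub_docs = "\n".join(sub_doc_template.format(text=t) for t in texts)
    return input_template.format(sub_docs=sub_docs)


if __name__ == "__main__":
    print(build_input(["first fragment", "second fragment"]))
```

Custom `sub_doc_template` and `input_template` values would need to keep the `{text}` and `{sub_docs}` placeholders for this kind of formatting to work.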
- __init__(api_model: str = 'gpt-4o', input_key: str = 'event_description', output_key: str = None, max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, sub_doc_template: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)
Initialization method.
- Parameters:
api_model – API model name.
input_key – The input key in the meta field of the samples. Defaults to “event_description”.
output_key – The output key in the aggregation field of the samples. Defaults to the same value as input_key.
max_token_num – The maximum total number of tokens across the sub-documents. No limit if it is None.
api_endpoint – URL endpoint for the API.
response_path – Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt – The system prompt.
sub_doc_template – The template for input text in each sample.
input_template – The input template.
try_num – The number of retry attempts when there is an API call error or output parsing error.
model_params – Parameters for initializing the API model.
sampling_params – Extra parameters passed to the API call, e.g., {'temperature': 0.9, 'top_p': 0.95}.
kwargs – Extra keyword arguments.
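To make `response_path` and `try_num` more concrete, here is a minimal sketch of how a dotted path such as 'choices.0.message.content' might be resolved against a response, and how a call might be retried. The helpers `extract_by_path` and `call_with_retries` are hypothetical names, and raising after the last failed attempt is an assumption. The operator's real error handling may differ.

```python
from typing import Any, Callable, Optional


def extract_by_path(response: Any, response_path: str = "choices.0.message.content") -> Any:
    """Walk a nested dict/list structure following a dot-separated path."""
    node = response
    for part in response_path.split("."):
        # When the current node is a list, the segment is used as an integer
        # index; otherwise it is used as a dict key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def call_with_retries(call: Callable[[], Any], try_num: int = 3) -> Any:
    """Invoke `call` up to `try_num` times, returning the first parsed result."""
    last_error: Optional[Exception] = None
    for _ in range(try_num):
        try:
            return extract_by_path(call())
        except (KeyError, IndexError, TypeError, ValueError, RuntimeError) as err:
            # Covers both a failed call and a response that cannot be parsed.
            last_error = err
    # Assumption: surface the failure once all attempts are used up.
    raise RuntimeError(f"all {try_num} attempts failed") from last_error
```

Here `call` stands in for whatever function actually sends the request; any callable returning a parsed JSON-like response would fit.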