nested_aggregator
Aggregates nested content from multiple samples into a single summary.
This operator uses a recursive summarization approach to aggregate content from multiple samples. It splits the input text into sub-documents and produces a summary whose length stays close to the average length of the originals. The aggregation is performed by an API model, guided by a system prompt and templates. The operator retries the API call on errors and exposes several parameters for customizing the summarization process. The default system prompt and templates are provided in Chinese but can be overridden. A Hugging Face tokenizer is used for tokenization.
Type: aggregator
Tags: cpu, api, text
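The recursive summarization described above can be pictured with the simplified sketch below. This is an illustration, not the operator's implementation: `llm_summarize` stands in for the (retried) API call and `count_tokens` for the Hugging Face tokenizer; both names and the greedy grouping rule are assumptions made for this sketch.

```python
from typing import Callable, List, Optional


def recursive_aggregate(sub_docs: List[str],
                        llm_summarize: Callable[[List[str]], str],
                        count_tokens: Callable[[str], int],
                        max_token_num: Optional[int] = None) -> str:
    """Summarize sub-documents in token-limited groups, then recurse."""
    if not sub_docs:
        return ''
    if len(sub_docs) == 1:
        return sub_docs[0]
    if max_token_num is None:
        # No limit: aggregate everything in a single call.
        return llm_summarize(sub_docs)

    # Greedily pack sub-documents into groups that fit the token budget.
    groups, current, current_tokens = [], [], 0
    for doc in sub_docs:
        n = count_tokens(doc)
        if current and current_tokens + n > max_token_num:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += n
    groups.append(current)

    # Summarize each multi-document group; single-document groups pass through.
    partials = [llm_summarize(g) if len(g) > 1 else g[0] for g in groups]
    if len(partials) == len(sub_docs):
        # Budget too small to merge any pair of documents: finish with one
        # final call instead of recursing forever.
        return llm_summarize(partials)
    # Recurse on the partial summaries until a single summary remains.
    return recursive_aggregate(partials, llm_summarize, count_tokens,
                               max_token_num)
```

In the real operator, the grouping budget corresponds to the max_token_num parameter described in the table below.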
🔧 Parameter Configuration
name | type | default | desc
---|---|---|---
api_model | str | | API model name.
input_key | str | | The input key in the meta field of the samples. Defaults to 'event_description'.
output_key | str | | The output key in the aggregation field of the samples. It is the same as the input_key by default.
max_token_num | Optional[PositiveInt] | | The max token num of the total tokens of the sub documents. No limit if it is None.
api_endpoint | Optional[str] | | URL endpoint for the API.
response_path | Optional[str] | | Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt | Optional[str] | | The system prompt.
sub_doc_template | Optional[str] | | The template for the input text of each sample.
input_template | Optional[str] | | The input template.
try_num | PositiveInt | | The number of retry attempts when there is an API call error or output parsing error.
model_params | Dict | | Parameters for initializing the API model.
sampling_params | Dict | | Extra parameters passed to the API call, e.g. {'temperature': 0.9, 'top_p': 0.95}.
kwargs | | | Extra keyword arguments.
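The sketch below shows how these parameters might be passed when constructing the operator in Python. The import path is an assumption about the hosting library, and all values are placeholders; only api_model and max_token_num appear in the demonstrations that follow.

```python
# Hedged configuration sketch; the import path is an assumption and the
# values are placeholders, not recommended settings.
from data_juicer.ops.aggregator import NestedAggregator

op = NestedAggregator(
    api_model='qwen2.5-72b-instruct',      # API model name
    input_key='event_description',         # read from each sample's meta field
    output_key='event_description',        # written to the aggregation field
    max_token_num=1024,                    # token budget for the sub documents
    try_num=3,                             # retries on API or parsing errors
    sampling_params={'temperature': 0.9, 'top_p': 0.95},  # extra API call params
)
```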
📊 Effect demonstration
test_default_aggregator
NestedAggregator(api_model='qwen2.5-72b-instruct')
📥 input data
📤 output data
✨ explanation
This example demonstrates the default behavior of the NestedAggregator operator. It takes a list of event descriptions and generates a summary that captures the key points from all the input texts. The output keeps the original event descriptions and adds a summary to the batch metadata, providing a concise overview of the story. The summary is generated by the specified API model.
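A schematic of the data layout this describes is shown below. The contents are invented and the field names are simplified for readability; the framework's actual meta and batch-meta keys may be prefixed internal names.

```python
# Schematic only: contents are invented and field names are simplified.
batch = {
    'text': ['...', '...', '...'],
    # Per-sample metadata read by the operator (input_key='event_description').
    'meta': [
        {'event_description': 'Event description for sample 1 ...'},
        {'event_description': 'Event description for sample 2 ...'},
        {'event_description': 'Event description for sample 3 ...'},
    ],
    # Added by the operator: one aggregated summary for the whole batch.
    'batch_meta': {
        'event_description': 'A concise summary covering all three events ...',
    },
}
```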
test_max_token_num_1
NestedAggregator(api_model='qwen2.5-72b-instruct', max_token_num=2)
📥 input data
📤 output data
✨ explanation
This example demonstrates how the NestedAggregator behaves under an extreme token limit (max_token_num=2). The restrictive budget forces the recursive summarization to process documents in smaller, more granular groups. The result is a more detailed, fact-dense summary that preserves specific events and details (such as "fifteen years old defeating demons," "Lotus Tower," "shadow puppet show," "treating the dead") rather than a high-level narrative overview. The token limit changes the intermediate processing steps, producing summarization patterns that keep more granular information in the final output.
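To see why such a small budget changes the grouping, the toy function below packs sub-documents into groups under a token budget. The token counts are invented for illustration and the packing rule is a simplification of the operator's behavior.

```python
# Toy illustration of grouping under a token budget (not the operator's code).
def split_into_groups(token_counts, max_token_num):
    groups, current, total = [], [], 0
    for n in token_counts:
        if current and total + n > max_token_num:
            groups.append(current)
            current, total = [], 0
        current.append(n)
        total += n
    groups.append(current)
    return groups


doc_tokens = [120, 95, 140, 80]           # hypothetical sub-document sizes
print(split_into_groups(doc_tokens, 2))    # [[120], [95], [140], [80]] -> one doc per group
print(split_into_groups(doc_tokens, 300))  # [[120, 95], [140, 80]]     -> two larger groups
```

With a budget of 2 every group holds a single document, so merging happens over many small recursive rounds instead of one broad pass, which is why the resulting summary keeps more granular detail.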