key_value_grouper

Groups samples into batches based on values in specified keys.

This operator groups samples by the values of the given keys, which can be nested. If no keys are provided, it defaults to using the text key. It uses a naive grouping strategy to batch samples with identical key values. The resulting dataset is a list of batched samples, where each batch contains samples that share the same key values. This is useful for organizing data by specific attributes or features.
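
The grouping described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the operator's actual implementation: the helper names `get_nested` and `group_by_key_values` are hypothetical, nested keys are assumed to be written as dotted paths such as `meta.language`, and the default text field is assumed to be named `text`.

```python
from typing import Any, Dict, Hashable, List, Optional, Tuple

Sample = Dict[str, Any]


def get_nested(sample: Sample, dotted_key: str) -> Any:
    """Follow a dotted path such as 'meta.language' through nested dicts."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def group_by_key_values(
    samples: List[Sample],
    group_by_keys: Optional[List[str]] = None,
    text_key: str = "text",  # assumption: name of the default text field
) -> List[List[Sample]]:
    """Batch samples whose values match on every key in group_by_keys."""
    # Fall back to the text key when no keys are given.
    keys = group_by_keys or [text_key]
    groups: Dict[Tuple[Hashable, ...], List[Sample]] = {}
    for sample in samples:
        # The tuple of key values identifies the batch a sample belongs to.
        signature = tuple(get_nested(sample, key) for key in keys)
        groups.setdefault(signature, []).append(sample)
    # Dicts preserve insertion order, so batches appear in first-seen order.
    return list(groups.values())


samples = [
    {"text": "Today is Sunday and it's a happy day!", "meta": {"language": "en"}},
    {"text": "Welcome to Alibaba.", "meta": {"language": "en"}},
    {"text": "欢迎来到阿里巴巴!", "meta": {"language": "zh"}},
]
batches = group_by_key_values(samples, ["meta.language"])
print([len(batch) for batch in batches])
```

With no keys given, the sketch falls back to the text field, so samples are batched together only when their text is identical.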

Type: grouper

Tags: cpu, text

🔧 Parameter Configuration

| name | type | default | desc |
| --- | --- | --- | --- |
| group_by_keys | typing.Optional[typing.List[str]] | None | Group samples according to the values of these keys. |
| args | | '' | extra args |
| kwargs | | '' | extra args |

📊 Effect demonstration

test_key_value_grouper

KeyValueGrouper(['meta.language'])

📥 input data

Sample 1:
text: Today is Sunday and it's a happy day!
meta: {'language': 'en'}
Sample 2:
text: Welcome to Alibaba.
meta: {'language': 'en'}
Sample 3:
text: 欢迎来到阿里巴巴!
meta: {'language': 'zh'}

📤 output data

Sample 1: empty
en: ["Today is Sunday and it's a happy day!", 'Welcome to Alibaba.']
zh: ['欢迎来到阿里巴巴!']

✨ explanation

This example shows how the KeyValueGrouper operator groups input samples by the 'language' field nested under the 'meta' key. The two English texts are batched together and the Chinese text forms its own batch, so each batch in the resulting dataset contains texts in a single language.
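
The shape of the result in this example can be reproduced without the library. The snippet below is only a hedged illustration of the language-to-texts mapping listed under the output data; the variable names are hypothetical and it does not call KeyValueGrouper itself.

```python
from collections import defaultdict

# The three input samples from the demonstration.
samples = [
    {"text": "Today is Sunday and it's a happy day!", "meta": {"language": "en"}},
    {"text": "Welcome to Alibaba.", "meta": {"language": "en"}},
    {"text": "欢迎来到阿里巴巴!", "meta": {"language": "zh"}},
]

# Collect the texts under the value of the nested 'meta' -> 'language' field.
texts_by_language = defaultdict(list)
for sample in samples:
    texts_by_language[sample["meta"]["language"]].append(sample["text"])

result = dict(texts_by_language)
print(result)
```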