key_value_grouper

Groups samples into batches based on values in specified keys.

This operator groups samples by the values of the given keys, which may be nested (for example, `meta.language`). If no keys are provided, it groups by the text key. The grouping strategy is naive: samples with identical values for every key are placed in the same batch. The resulting dataset is a list of batched samples, one batch per distinct combination of key values. This is useful for organizing data by specific attributes or features.
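The grouping described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the operator's actual implementation: the helper names `nested_get` and `group_by_key_values`, the dot-separated key convention, and the `text_key` default of `"text"` are all hypothetical choices made for this sketch.

```python
from typing import Any, Dict, List, Optional, Tuple


def nested_get(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted path such as 'meta.language' through nested dicts."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def group_by_key_values(
    samples: List[Dict[str, Any]],
    group_by_keys: Optional[List[str]] = None,
    text_key: str = "text",  # assumed default text key for this sketch
) -> List[List[Dict[str, Any]]]:
    """Batch samples that share identical values for every grouping key."""
    keys = group_by_keys or [text_key]
    groups: Dict[Tuple[Any, ...], List[Dict[str, Any]]] = {}
    for sample in samples:
        # The tuple of key values identifies the batch; values must be hashable.
        signature = tuple(nested_get(sample, key) for key in keys)
        groups.setdefault(signature, []).append(sample)
    # Dicts preserve insertion order, so batches appear in first-seen order.
    return list(groups.values())


demo_samples = [
    {"text": "a", "meta": {"language": "en"}},
    {"text": "b", "meta": {"language": "zh"}},
    {"text": "c", "meta": {"language": "en"}},
]
batches = group_by_key_values(demo_samples, ["meta.language"])
```

In this sketch, `batches` holds two lists: the first with the two `en` samples and the second with the single `zh` sample. Calling the function without keys falls back to grouping by identical text.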

Type: grouper

Tags: cpu, text

🔧 Parameter Configuration

name	type	default	desc
`group_by_keys`	typing.Optional[typing.List[str]]	`None`	Keys whose values are used to group samples. Nested keys such as `__dj__stats__.text_len` are supported. Defaults to [self.text_key].
`args`		`''`	extra args
`kwargs`		`''`	extra args

📊 Effect demonstration

test_key_value_grouper

KeyValueGrouper(['meta.language'])

📥 input data

Sample 1: text

Today is Sunday and it's a happy day!

meta
language	en

Sample 2: text

Welcome to Alibaba.

meta
language	en

Sample 3: text

欢迎来到阿里巴巴！

meta
language	zh

📤 output data

Sample 1: empty

en
["Today is Sunday and it's a happy day!", 'Welcome to Alibaba.']
zh
['欢迎来到阿里巴巴！']

✨ explanation

This example shows how the KeyValueGrouper operator groups input samples by the 'language' field under the 'meta' key. The two English texts are batched together and the Chinese text forms its own batch, so each batch in the resulting dataset contains texts of a single language.
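To check the example by hand, the following standalone snippet rebuilds the language-to-texts view shown in the output data from the three input samples. It uses only the standard library and does not call the operator itself; the variable names are illustrative, and the plain-dict result is an assumption of this sketch rather than the operator's actual return type.

```python
from collections import defaultdict

# The three input samples from the demonstration above.
samples = [
    {"text": "Today is Sunday and it's a happy day!", "meta": {"language": "en"}},
    {"text": "Welcome to Alibaba.", "meta": {"language": "en"}},
    {"text": "欢迎来到阿里巴巴！", "meta": {"language": "zh"}},
]

# Collect the texts under each distinct value of meta.language.
texts_by_language = defaultdict(list)
for sample in samples:
    texts_by_language[sample["meta"]["language"]].append(sample["text"])

result = dict(texts_by_language)
```

Here `result` maps `en` to the two English sentences and `zh` to the single Chinese sentence, matching the grouping in the output data.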