# key_value_grouper Groups samples into batches based on values in specified keys. This operator groups samples by the values of the given keys, which can be nested. If no keys are provided, it defaults to using the text key. It uses a naive grouping strategy to batch samples with identical key values. The resulting dataset is a list of batched samples, where each batch contains samples that share the same key values. This is useful for organizing data by specific attributes or features. 根据指定键的值对样本进行分组。 该算子根据给定键的值对样本进行分组,这些键可以是嵌套的。如果没有提供键,则默认使用文本键。它使用一种简单的分组策略来将具有相同键值的样本分批。生成的数据集是一个批次样本列表,每个批次包含具有相同键值的样本。这对于按特定属性或特征组织数据非常有用。 Type 算子类型: **grouper** Tags 标签: cpu, text ## 🔧 Parameter Configuration 参数配置 | name 参数名 | type 类型 | default 默认值 | desc 说明 | |--------|------|--------|------| | `group_by_keys` | typing.Optional[typing.List[str]] | `None` | group samples according values in the keys. | | `args` | | `''` | extra args | | `kwargs` | | `''` | extra args | ## 📊 Effect demonstration 效果演示 ### test_key_value_grouper ```python KeyValueGrouper(['meta.language']) ``` #### 📥 input data 输入数据
Sample 1: text
Today is Sunday and it's a happy day!
meta{'language': 'en'}
Sample 2: text
Welcome to Alibaba.
meta{'language': 'en'}
Sample 3: text
欢迎来到阿里巴巴!
meta{'language': 'zh'}
#### 📤 output data 输出数据
Sample 1: empty
en["Today is Sunday and it's a happy day!", 'Welcome to Alibaba.']
zh['欢迎来到阿里巴巴!']
#### ✨ explanation 解释 This example demonstrates how the KeyValueGrouper operator groups input samples based on the 'language' field in the 'meta' key. The operator batches together all English and Chinese texts separately, resulting in a dataset where each batch contains texts of the same language. 这个例子展示了KeyValueGrouper算子如何根据'meta'键中的'language'字段对输入样本进行分组。算子将所有英文和中文文本分别归类在一起,从而生成一个数据集,其中每个批次包含相同语言的文本。 ## 🔗 related links 相关链接 - [source code 源代码](../../../data_juicer/ops/grouper/key_value_grouper.py) - [unit test 单元测试](../../../tests/ops/grouper/test_key_value_grouper.py) - [Return operator list 返回算子列表](../../Operators.md)