data_juicer.ops.grouper.key_value_grouper module

class data_juicer.ops.grouper.key_value_grouper.KeyValueGrouper(group_by_keys: List[str] | None = None, *args, **kwargs)[source]

Bases: Grouper

Groups samples into batches based on values in specified keys.

This operator groups samples by the values stored under the given keys; the keys may be nested. If no keys are provided, it groups by the text key. The grouping strategy is naive: samples with identical key values are placed in the same batch. The resulting dataset is a list of batched samples, where each batch contains the samples that share the same key values. This is useful for organizing data by a specific attribute or feature.
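The grouping idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the operator's actual implementation: `group_by_values` is a hypothetical helper, samples are assumed to be plain dicts, and only top-level keys are handled (nested-key lookup is left out for brevity).

```python
from typing import Any, Dict, Hashable, List, Tuple

Sample = Dict[str, Any]


def group_by_values(samples: List[Sample], keys: List[str]) -> List[List[Sample]]:
    """Hypothetical helper: bucket samples whose values under `keys` are identical."""
    buckets: Dict[Tuple[Hashable, ...], List[Sample]] = {}
    for sample in samples:
        # The tuple of values under all keys identifies the group.
        group_key = tuple(sample[k] for k in keys)
        buckets.setdefault(group_key, []).append(sample)
    # dicts preserve insertion order, so groups come out in first-seen order.
    return list(buckets.values())


samples = [
    {"text": "a", "lang": "en"},
    {"text": "b", "lang": "zh"},
    {"text": "c", "lang": "en"},
]
groups = group_by_values(samples, ["lang"])
```

Here `groups` holds two lists: the two "en" samples together, then the single "zh" sample. The sketch assumes the grouped values are hashable.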

__init__(group_by_keys: List[str] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • group_by_keys -- keys whose values are used to group samples. Nested keys such as "__dj__stats__.text_len" are supported. Defaults to [self.text_key].

  • args -- extra positional arguments

  • kwargs -- extra keyword arguments
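As a rough illustration of the `group_by_keys` parameter, the sketch below shows one way a nested key and the default could be resolved. It rests on assumptions: `get_nested` and `resolve_keys` are hypothetical helpers, a nested key is taken to be a dot-separated path, and `"text"` stands in for the operator's text key.

```python
from typing import Any, Dict, List, Optional


def get_nested(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Hypothetical helper: follow a dot-separated path through nested dicts."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def resolve_keys(group_by_keys: Optional[List[str]], text_key: str = "text") -> List[str]:
    """Hypothetical helper: fall back to the text key when no keys are given."""
    # "text" is an assumed stand-in for self.text_key.
    return group_by_keys if group_by_keys is not None else [text_key]


sample = {"text": "hello", "__dj__stats__": {"text_len": 5}}
text_len = get_nested(sample, "__dj__stats__.text_len")
default_keys = resolve_keys(None)
```

With this sample, `text_len` is read from the nested stats dict and `default_keys` contains only the assumed text key.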

process(dataset)[source]

Dataset --> dataset.

Parameters:

dataset -- input dataset

Returns:

dataset of batched samples.