data_juicer.ops.grouper.key_value_grouper module

class data_juicer.ops.grouper.key_value_grouper.KeyValueGrouper(group_by_keys: List[str] | None = None, *args, **kwargs)

Bases: Grouper

Groups samples into batches based on values in specified keys.

This operator groups samples by the values of the given keys, which can be nested. If no keys are provided, it defaults to using the text key. It uses a naive grouping strategy to batch samples with identical key values. The resulting dataset is a list of batched samples, where each batch contains samples that share the same key values. This is useful for organizing data by specific attributes or features.
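The grouping described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the helpers get_nested and group_by_key_values are hypothetical names, dotted keys are assumed to address nested dictionaries, missing keys are assumed to resolve to None, and key values are assumed to be hashable.

```python
from typing import Any, Dict, List


def get_nested(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dotted key such as "__dj__stats__.text_len".

    Hypothetical helper: returns None when any part of the path is missing.
    """
    current: Any = sample
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def group_by_key_values(
    samples: List[Dict[str, Any]], group_by_keys: List[str]
) -> List[List[Dict[str, Any]]]:
    """Naively batch samples whose values under all keys are identical."""
    groups: Dict[tuple, List[Dict[str, Any]]] = {}
    for sample in samples:
        # The tuple of key values identifies the group (values must be hashable).
        group_id = tuple(get_nested(sample, key) for key in group_by_keys)
        groups.setdefault(group_id, []).append(sample)
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return list(groups.values())


samples = [
    {"text": "a", "__dj__stats__": {"text_len": 1}},
    {"text": "bb", "__dj__stats__": {"text_len": 2}},
    {"text": "c", "__dj__stats__": {"text_len": 1}},
]
batches = group_by_key_values(samples, ["__dj__stats__.text_len"])
```

Here the two samples with text_len equal to 1 end up in one group and the remaining sample in another.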

__init__(group_by_keys: List[str] | None = None, *args, **kwargs)

Initialization method.

Parameters:
  • group_by_keys – keys whose values are used to group samples. Nested keys such as “__dj__stats__.text_len” are supported. Defaults to [self.text_key].

  • args – extra positional arguments

  • kwargs – extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.
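One common way to represent a "batched sample" is a single dictionary that maps each field to the list of values from the samples in the group. The sketch below illustrates that layout only as an assumption; this page does not state the exact structure the operator returns. The helper to_batched_sample is a hypothetical name.

```python
from typing import Any, Dict, List


def to_batched_sample(group: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Merge a list of samples into one dict of lists (assumed layout)."""
    batched: Dict[str, List[Any]] = {}
    for sample in group:
        for key, value in sample.items():
            # Append each field value in the order the samples appear.
            batched.setdefault(key, []).append(value)
    return batched


group = [
    {"text": "a", "lang": "en"},
    {"text": "c", "lang": "en"},
]
batched = to_batched_sample(group)
```

The sketch assumes every sample in a group carries the same fields; otherwise the per-field lists would differ in length.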