data_juicer.ops.grouper.key_value_grouper module

class data_juicer.ops.grouper.key_value_grouper.KeyValueGrouper(group_by_keys: List[str] | None = None, *args, **kwargs)

Bases: Grouper

Groups samples into batches based on values in specified keys.

This operator groups samples by the values of the given keys. Keys may refer to nested fields. If no keys are provided, it defaults to the text key. It uses a naive grouping strategy: samples with identical values for all of the given keys are placed in the same batch. The resulting dataset is a list of batched samples, where each batch contains the samples that share the same key values. This is useful for organizing data by specific attributes or features.
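The grouping described above can be illustrated with a minimal, library-independent sketch. It is not the operator's implementation: the helper names `group_by_values` and `to_batch` are hypothetical, and representing each batch as a dict of lists is an assumption made for illustration.

```python
from collections import defaultdict


def to_batch(members):
    """Turn a list of sample dicts into one dict of lists (assumed batch layout)."""
    return {field: [m[field] for m in members] for field in members[0]}


def group_by_values(samples, group_by_keys):
    """Group samples whose values under group_by_keys are identical."""
    groups = defaultdict(list)
    for sample in samples:
        # The tuple of key values identifies the group (values must be hashable).
        group_id = tuple(sample[key] for key in group_by_keys)
        groups[group_id].append(sample)
    # One batch per group, in first-seen order (dicts preserve insertion order).
    return [to_batch(members) for members in groups.values()]


samples = [
    {"text": "a", "lang": "en"},
    {"text": "b", "lang": "zh"},
    {"text": "c", "lang": "en"},
]
batches = group_by_values(samples, ["lang"])
```

Here the two samples with `lang == "en"` end up in one batch and the `zh` sample in another, so the output has one entry per distinct key value rather than one per input sample.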

__init__(group_by_keys: List[str] | None = None, *args, **kwargs)

Initialization method.

Parameters:

group_by_keys – the keys whose values are used to group samples. Nested keys such as “__dj__stats__.text_len” are supported. Defaults to [self.text_key].
args – extra positional arguments
kwargs – extra keyword arguments
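As a rough illustration of what a nested key such as “__dj__stats__.text_len” refers to, the sketch below resolves a dotted key against a nested sample dict. The helper `nested_get` is hypothetical and is not part of data_juicer; splitting the key on "." is an assumption based on the example key above.

```python
def nested_get(sample, dotted_key):
    """Follow a dotted key path through nested dicts and return the value."""
    value = sample
    for part in dotted_key.split("."):
        # Descend one level per path component; raises KeyError if missing.
        value = value[part]
    return value


sample = {"text": "hello", "__dj__stats__": {"text_len": 5}}
text_len = nested_get(sample, "__dj__stats__.text_len")
```

A key without a dot, such as `"text"`, is handled by the same loop as a path with a single component.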

process(dataset)

Dataset –> dataset.

Parameters: dataset – input dataset
Returns: dataset of batched samples.