data_juicer.ops.grouper

class data_juicer.ops.grouper.KeyValueGrouper(group_by_keys: List[str] | None = None, *args, **kwargs)

Bases: Grouper

Groups samples into batches based on values in specified keys.

This operator groups samples by the values of the given keys, which can be nested. If no keys are provided, it defaults to using the text key. It uses a naive grouping strategy to batch samples with identical key values. The resulting dataset is a list of batched samples, where each batch contains samples that share the same key values. This is useful for organizing data by specific attributes or features.
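The grouping described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `group_by_key_values` is a hypothetical helper, samples are assumed to be plain dictionaries with identical top-level keys, and only flat (non-nested) keys are handled.

```python
from collections import defaultdict


def group_by_key_values(samples, keys):
    """Group samples whose values for `keys` are identical (hypothetical helper)."""
    groups = defaultdict(list)
    for sample in samples:
        # The tuple of grouping values identifies the group.
        group_id = tuple(sample[k] for k in keys)
        groups[group_id].append(sample)

    # Convert each group (a list of dicts) into one batched sample (a dict of lists).
    batched = []
    for members in groups.values():
        batched.append({k: [m[k] for m in members] for k in members[0]})
    return batched


samples = [
    {"text": "a", "lang": "en"},
    {"text": "b", "lang": "zh"},
    {"text": "c", "lang": "en"},
]
batches = group_by_key_values(samples, ["lang"])
```

Here the two samples with `lang == "en"` end up in one batched sample and the `"zh"` sample in another, with groups in order of first appearance.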

__init__(group_by_keys: List[str] | None = None, *args, **kwargs)

Initialization method.

Parameters:

group_by_keys – keys whose values are used to group samples. Nested keys such as “__dj__stats__.text_len” are supported. Defaults to [self.text_key].
args – extra positional args
kwargs – extra keyword args
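A nested key such as “__dj__stats__.text_len” can be read as a dotted path into a nested dictionary. The sketch below assumes that the path is split on "." and resolved one level at a time; `nested_get` is a hypothetical helper for illustration and may differ from how the library resolves nested keys.

```python
def nested_get(sample, dotted_key):
    """Resolve a dotted key path in a nested dict (hypothetical helper)."""
    value = sample
    for part in dotted_key.split("."):
        # Descend one level per path component.
        value = value[part]
    return value


sample = {"text": "hello", "__dj__stats__": {"text_len": 5}}
text_len = nested_get(sample, "__dj__stats__.text_len")
```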

process(dataset)

Dataset –> dataset.

Parameters: dataset – input dataset
Returns: dataset of batched samples.

class data_juicer.ops.grouper.NaiveGrouper(*args, **kwargs)

Bases: Grouper

Group all samples in a dataset into a single batched sample.

This operator takes a dataset and combines all its samples into one batched sample. If the input dataset is empty, it returns an empty dataset. The resulting batched sample is a dictionary where each key corresponds to a list of values from all samples in the dataset.
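The behavior described above amounts to turning a list of dictionaries into a single dictionary of lists. A minimal sketch, assuming samples are plain dictionaries that all share the keys of the first sample (`batch_all` is a hypothetical helper, not the library's code):

```python
def batch_all(samples):
    """Combine all samples into one batched sample (hypothetical helper)."""
    if not samples:
        # An empty input yields an empty result.
        return []
    # One output sample: each key maps to the list of values across all samples.
    return [{key: [s[key] for s in samples] for key in samples[0]}]


samples = [{"text": "a", "id": 1}, {"text": "b", "id": 2}]
batched = batch_all(samples)
```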

__init__(*args, **kwargs)

Initialization method.

Parameters:

args – extra positional args
kwargs – extra keyword args

process(dataset)

Dataset –> dataset.

Parameters: dataset – input dataset
Returns: dataset of batched samples.

class data_juicer.ops.grouper.NaiveReverseGrouper(batch_meta_export_path=None, *args, **kwargs)

Bases: Grouper

Split batched samples into individual samples.

This operator processes a dataset by splitting each batched sample into individual samples. It also handles and optionally exports batch metadata.

- If a sample contains ‘batch_meta’, it is separated and can be exported to a specified path.
- The operator converts the remaining data from a dictionary of lists to a list of dictionaries, effectively unbatching the samples.
- If batch_meta_export_path is provided, the batch metadata is written to this file in JSON format, one entry per line.
- If no samples are present in the dataset, the original dataset is returned.
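The unbatching step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: `unbatch` and its `export_path` argument are hypothetical names, batched samples are assumed to be plain dictionaries whose non-metadata values are lists of equal length, and the metadata is assumed to be JSON-serializable.

```python
import json


def unbatch(batched_samples, export_path=None):
    """Split dict-of-lists batches into individual dicts (hypothetical helper)."""
    individual, metas = [], []
    for batch in batched_samples:
        batch = dict(batch)  # shallow copy so the caller's data is not mutated
        if "batch_meta" in batch:
            # Separate the batch-level metadata from the per-sample columns.
            metas.append(batch.pop("batch_meta"))
        keys = list(batch)
        # Walk the columns in parallel: one tuple of values per individual sample.
        for values in zip(*(batch[k] for k in keys)):
            individual.append(dict(zip(keys, values)))

    if export_path is not None:
        # Write one JSON entry per line.
        with open(export_path, "w", encoding="utf-8") as f:
            for meta in metas:
                f.write(json.dumps(meta) + "\n")
    return individual


batched = [{"text": ["a", "b"], "lang": ["en", "en"], "batch_meta": {"size": 2}}]
rows = unbatch(batched)
```

With `export_path` left as `None`, the metadata is simply dropped, mirroring the default described for batch_meta_export_path.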

__init__(batch_meta_export_path=None, *args, **kwargs)

Initialization method.

Parameters:

batch_meta_export_path – path to which the batch metadata is exported. If None, the batch metadata is dropped.
args – extra positional args
kwargs – extra keyword args

process(dataset)

Dataset –> dataset.

Parameters: dataset – input dataset of batched samples
Returns: dataset of individual (unbatched) samples.