data_juicer.ops.grouper
- class data_juicer.ops.grouper.KeyValueGrouper(group_by_keys: List[str] | None = None, *args, **kwargs)
Bases: Grouper
Groups samples into batches based on values in specified keys.
This operator groups samples by the values of the given keys, which can be nested. If no keys are provided, it defaults to using the text key. It uses a naive grouping strategy to batch samples with identical key values. The resulting dataset is a list of batched samples, where each batch contains samples that share the same key values. This is useful for organizing data by specific attributes or features.
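The grouping behaviour described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names (`get_nested`, `to_batched`, `group_by_key_values`) are hypothetical, the dot-separated notation for nested keys is an assumption, and the sketch assumes that key values are hashable and that each batch is a dictionary of lists.

```python
from collections import defaultdict
from typing import Any, Dict, List


def get_nested(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Look up a value by a dot-separated key path (assumed notation)."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def to_batched(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of dicts into one dict of lists (assumes shared keys)."""
    keys = list(samples[0].keys())
    return {k: [s[k] for s in samples] for k in keys}


def group_by_key_values(
    samples: List[Dict[str, Any]], group_by_keys: List[str]
) -> List[Dict[str, List[Any]]]:
    """Batch together samples whose values for all given keys are identical."""
    groups: Dict[tuple, List[Dict[str, Any]]] = defaultdict(list)
    for sample in samples:
        # The tuple of key values identifies the group this sample belongs to.
        group_id = tuple(get_nested(sample, k) for k in group_by_keys)
        groups[group_id].append(sample)
    # Groups are emitted in the order their first member was seen.
    return [to_batched(members) for members in groups.values()]


samples = [
    {"text": "a", "meta": {"lang": "en"}},
    {"text": "b", "meta": {"lang": "zh"}},
    {"text": "c", "meta": {"lang": "en"}},
]
batches = group_by_key_values(samples, ["meta.lang"])
```

Here `batches` holds two batched samples: one for the samples whose `meta.lang` is `"en"` and one for `"zh"`.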
- class data_juicer.ops.grouper.NaiveGrouper(*args, **kwargs)
Bases: Grouper
Group all samples in a dataset into a single batched sample.
This operator takes a dataset and combines all its samples into one batched sample. If the input dataset is empty, it returns an empty dataset. The resulting batched sample is a dictionary where each key corresponds to a list of values from all samples in the dataset.
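As a rough illustration of the resulting shape, the sketch below batches a list of dictionaries into a single dictionary of lists. The function name `naive_group` is hypothetical, and the sketch assumes every sample has the same keys as the first one; it is not the operator's actual code.

```python
from typing import Any, Dict, List


def naive_group(samples: List[Dict[str, Any]]) -> List[Dict[str, List[Any]]]:
    """Combine all samples into one batched sample (a dict of lists)."""
    if not samples:
        # An empty input yields an empty dataset.
        return []
    keys = list(samples[0].keys())
    batched = {k: [s[k] for s in samples] for k in keys}
    # The dataset now contains exactly one batched sample.
    return [batched]


grouped = naive_group([
    {"text": "a", "id": 1},
    {"text": "b", "id": 2},
])
# grouped == [{"text": ["a", "b"], "id": [1, 2]}]
```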
- class data_juicer.ops.grouper.NaiveReverseGrouper(batch_meta_export_path=None, *args, **kwargs)
Bases: Grouper
Split batched samples into individual samples.
This operator processes a dataset by splitting each batched sample into individual samples. It also handles and optionally exports batch metadata.
  - If a sample contains 'batch_meta', it is separated and can be exported to a specified path.
  - The operator converts the remaining data from a dictionary of lists to a list of dictionaries, effectively unbatching the samples.
  - If batch_meta_export_path is provided, the batch metadata is written to this file in JSON format, one entry per line.
  - If no samples are present in the dataset, the original dataset is returned.
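The unbatching steps above can be approximated with the following sketch. It is a simplified stand-in rather than the operator's implementation: `naive_reverse_group` is a hypothetical name, only the `batch_meta` key and the `batch_meta_export_path` parameter come from the description, and the sketch assumes all lists in a batched sample have equal length.

```python
import json
from typing import Any, Dict, List, Optional


def naive_reverse_group(
    batched_samples: List[Dict[str, Any]],
    batch_meta_export_path: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Split batched samples (dicts of lists) into individual samples."""
    if not batched_samples:
        # Nothing to unbatch: return the input unchanged.
        return batched_samples

    samples: List[Dict[str, Any]] = []
    batch_metas: List[Any] = []
    for batched in batched_samples:
        batched = dict(batched)  # shallow copy so the input is not mutated
        if "batch_meta" in batched:
            # Batch-level metadata is set aside, not copied into each sample.
            batch_metas.append(batched.pop("batch_meta"))
        keys = list(batched.keys())
        # Dict of lists -> list of dicts, one dict per position in the lists.
        for values in zip(*(batched[k] for k in keys)):
            samples.append(dict(zip(keys, values)))

    if batch_meta_export_path is not None:
        # One JSON object per line (JSON Lines style).
        with open(batch_meta_export_path, "w", encoding="utf-8") as f:
            for meta in batch_metas:
                f.write(json.dumps(meta, ensure_ascii=False) + "\n")
    return samples


unbatched = naive_reverse_group([
    {"text": ["a", "b"], "id": [1, 2], "batch_meta": {"source": "demo"}},
])
# unbatched == [{"text": "a", "id": 1}, {"text": "b", "id": 2}]
```

Passing a path as `batch_meta_export_path` would additionally write `{"source": "demo"}` as a single line to that file.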