data_juicer.ops.grouper package

Submodules

data_juicer.ops.grouper.key_value_grouper module

class data_juicer.ops.grouper.key_value_grouper.KeyValueGrouper(group_by_keys: List[str] | None = None, *args, **kwargs)[source]

Bases: Grouper

Group samples into batched samples according to the values of the given keys.

__init__(group_by_keys: List[str] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • group_by_keys – keys whose values determine the groups. Nested keys such as “__dj__stats__.text_len” are supported. Defaults to [self.text_key].

  • args – extra positional arguments

  • kwargs – extra keyword arguments

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.
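The grouping described above can be illustrated with a small pure-Python sketch. It is not the Data-Juicer implementation: the helpers `nested_get`, `to_batched`, and `group_by_key_values` are hypothetical names introduced here, samples are assumed to be plain dicts, and the values under the grouping keys are assumed to be hashable. A "batched sample" is modelled as one dict whose fields hold lists, one entry per grouped sample.

```python
from typing import Any, Dict, List


def nested_get(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dot-separated key path (hypothetical helper) through nested dicts."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def to_batched(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a non-empty list of dicts into one dict of lists."""
    return {key: [s[key] for s in samples] for key in samples[0]}


def group_by_key_values(
    samples: List[Dict[str, Any]], group_by_keys: List[str]
) -> List[Dict[str, List[Any]]]:
    """Group samples that share the same values under every key in group_by_keys."""
    groups: Dict[tuple, List[Dict[str, Any]]] = {}
    for sample in samples:
        # The tuple of looked-up values identifies the group (values must be hashable).
        group_id = tuple(nested_get(sample, key) for key in group_by_keys)
        groups.setdefault(group_id, []).append(sample)
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return [to_batched(group) for group in groups.values()]


samples = [
    {"text": "a", "__dj__stats__": {"text_len": 1}},
    {"text": "bb", "__dj__stats__": {"text_len": 2}},
    {"text": "c", "__dj__stats__": {"text_len": 1}},
]
batched = group_by_key_values(samples, ["__dj__stats__.text_len"])
print([b["text"] for b in batched])
```

In this sketch the two samples with `text_len` equal to 1 end up in the same batched sample, and the remaining sample forms a group of its own.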

data_juicer.ops.grouper.naive_grouper module

class data_juicer.ops.grouper.naive_grouper.NaiveGrouper(*args, **kwargs)[source]

Bases: Grouper

Group all samples into one batched sample.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra positional arguments

  • kwargs – extra keyword arguments

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.
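As a rough illustration of what "one batched sample" means, the sketch below collapses a list of samples into a single dict of lists. It is an assumption-laden stand-in, not the library's code: `group_all` is a hypothetical name, samples are assumed to be plain dicts sharing the same fields, and an empty input is assumed to yield an empty result.

```python
from typing import Any, Dict, List


def group_all(samples: List[Dict[str, Any]]) -> List[Dict[str, List[Any]]]:
    """Collapse every sample into a single batched sample (hypothetical helper)."""
    if not samples:
        # Assumption: nothing to batch, so return an empty result.
        return []
    # Each field of the batched sample collects that field from every input sample.
    batched = {key: [s[key] for s in samples] for key in samples[0]}
    return [batched]


result = group_all([{"text": "a"}, {"text": "bb"}, {"text": "c"}])
print(result)
```

The result holds exactly one element regardless of how many samples were passed in, which is the contrast with key-based grouping.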

Module contents

class data_juicer.ops.grouper.NaiveGrouper(*args, **kwargs)[source]

Bases: Grouper

Group all samples into one batched sample.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra positional arguments

  • kwargs – extra keyword arguments

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.

class data_juicer.ops.grouper.KeyValueGrouper(group_by_keys: List[str] | None = None, *args, **kwargs)[source]

Bases: Grouper

Group samples into batched samples according to the values of the given keys.

__init__(group_by_keys: List[str] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • group_by_keys – keys whose values determine the groups. Nested keys such as “__dj__stats__.text_len” are supported. Defaults to [self.text_key].

  • args – extra positional arguments

  • kwargs – extra keyword arguments

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.