data_juicer.ops package

Subpackages

Submodules

data_juicer.ops.base_op module

data_juicer.ops.base_op.convert_list_dict_to_dict_list(samples)[source]
data_juicer.ops.base_op.convert_dict_list_to_list_dict(samples)[source]
data_juicer.ops.base_op.convert_arrow_to_python(method)[source]
data_juicer.ops.base_op.catch_map_batches_exception(method)[source]

For batched-map sample-level fault tolerance.
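The description above is brief, so the following is a minimal sketch of what sample-level fault tolerance for a batched map can look like. It is not the library's implementation: the decorator name `tolerate_batch_errors` and the choice to replace a failed batch with an empty batch that keeps the same keys are assumptions made for illustration.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def tolerate_batch_errors(method):
    """Hypothetical decorator: swallow errors raised while mapping a batch."""

    @functools.wraps(method)
    def wrapper(samples, *args, **kwargs):
        try:
            return method(samples, *args, **kwargs)
        except Exception as exc:  # broad on purpose: one bad batch must not stop the run
            logger.error("Batch failed in %s: %s", method.__name__, exc)
            # Assumption: a failed batch becomes an empty batch with the
            # same columns, so the downstream schema stays consistent.
            return {key: [] for key in samples}

    return wrapper


@tolerate_batch_errors
def upper_text(samples):
    # Raises AttributeError if any text is not a string (for example None).
    return {**samples, "text": [t.upper() for t in samples["text"]]}


good = upper_text({"text": ["a", "b"]})
bad = upper_text({"text": ["a", None]})
```

Here `good` holds the upper-cased batch, while `bad` comes back as an empty batch instead of raising.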

data_juicer.ops.base_op.catch_map_single_exception(method, return_sample=True)[source]

For single-map sample-level fault tolerance. The input is expected to be a batch with batch_size = 1.
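The batch_size = 1 expectation is easier to see with the two data layouts side by side. The sketch below uses hypothetical stand-ins (`to_dict_list`, `to_list_dict`) for the conversion helpers listed at the top of this module. Their behavior is inferred from their names only, so treat this as an assumption rather than a description of the library.

```python
def to_dict_list(samples):
    """List of per-sample dicts -> dict of per-field lists (columnar batch)."""
    keys = samples[0].keys() if samples else []
    return {key: [s[key] for s in samples] for key in keys}


def to_list_dict(batch):
    """Dict of per-field lists -> list of per-sample dicts."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*batch.values())]


def strip_text(sample):
    # A single-sample function: one dict in, one dict out.
    return {**sample, "text": sample["text"].strip()}


# A batch of exactly one sample in columnar form.
batch_of_one = {"text": ["  hello  "], "id": [7]}

# Unwrap the single sample, process it, and wrap it again.
(sample,) = to_list_dict(batch_of_one)
result = to_dict_list([strip_text(sample)])
```

The tuple unpacking `(sample,) = ...` fails loudly if the batch holds more or fewer than one sample, which mirrors the stated expectation.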

class data_juicer.ops.base_op.OP(*args, **kwargs)[source]

Bases: object

__init__(*args, **kwargs)[source]

Base class of operators.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed.

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

  • index_key – if not None, index the samples before processing

is_batched_op()[source]
process(*args, **kwargs)[source]
use_cuda()[source]
runtime_np()[source]
remove_extra_parameters(param_dict, keys=None)[source]

At the beginning of a mapper op's __init__, call self.remove_extra_parameters(locals()) to get the op's init parameter dict conveniently.
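A minimal sketch of this pattern, assuming the helper drops `self` and other non-argument entries from `locals()`. The function `drop_extra_parameters` and the class `DemoMapper` are hypothetical stand-ins, not the library's code.

```python
def drop_extra_parameters(param_dict, keys=None):
    """Hypothetical stand-in: keep only entries that look like init arguments."""
    if keys is None:
        # Assumption: by default, drop `self` and private/dunder names
        # such as `__class__`, which can appear in locals() inside methods.
        return {
            k: v for k, v in param_dict.items()
            if k != "self" and not k.startswith("_")
        }
    # Otherwise drop exactly the keys the caller names.
    return {k: v for k, v in param_dict.items() if k not in keys}


class DemoMapper:
    def __init__(self, lang="en", max_len=128):
        # Called first, before any other local variable is defined,
        # so locals() holds only `self` and the init arguments.
        self.init_params = drop_extra_parameters(locals())


op = DemoMapper(lang="zh")
```

Calling the helper as the first statement matters: any local variable defined earlier would also end up in the captured dict.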

add_parameters(init_parameter_dict, **extra_param_dict)[source]

Add parameters for each sample. extra_param_dict and init_parameter_dict must be kept unchanged.
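One way to satisfy the requirement that both input dicts stay unchanged is to merge into a copy. A minimal sketch under that assumption, using the hypothetical name `merge_parameters`:

```python
import copy


def merge_parameters(init_parameter_dict, **extra_param_dict):
    """Hypothetical stand-in: merge per-sample extras without mutating inputs."""
    # Deep-copy so nested values (lists, dicts) are not shared with the caller.
    merged = copy.deepcopy(init_parameter_dict)
    merged.update(copy.deepcopy(extra_param_dict))
    return merged


init_params = {"lang": "en", "tags": ["a"]}
per_sample = merge_parameters(init_params, sample_id=3)
per_sample["tags"].append("b")  # does not leak back into init_params
```

A shallow copy would not be enough here, because the nested `tags` list would still be shared between the two dicts.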

run(dataset)[source]
empty_history()[source]
class data_juicer.ops.base_op.Mapper(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that conducts data editing.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed.

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process_batched(samples, *args, **kwargs)[source]
process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
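As an illustration of the sample-to-sample contract, here is a minimal sketch of a mapper. To stay runnable without the library installed it does not subclass `Mapper`; the class `LowercaseMapper` and the default `text_key="text"` are assumptions.

```python
class LowercaseMapper:
    """Hypothetical mapper: lowercases the text field of each sample."""

    def __init__(self, text_key="text"):
        self.text_key = text_key

    def process_single(self, sample):
        # sample -> sample: edit the field and return the sample.
        sample[self.text_key] = sample[self.text_key].lower()
        return sample


mapper = LowercaseMapper()
processed = [
    mapper.process_single(s) for s in [{"text": "Hello"}, {"text": "WORLD"}]
]
```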

run(dataset, *, exporter=None, tracer=None)[source]
class data_juicer.ops.base_op.Filter(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that removes specific info.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

compute_stats_batched(samples, *args, **kwargs)[source]
process_batched(samples)[source]
compute_stats_single(sample, context=False)[source]

Compute stats for the sample, which are used as metrics to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out
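The two-step contract of a filter (first compute stats, then decide) can be sketched as follows. This is a standalone illustration, not a subclass of `Filter`: the class name, the `"stats"` field name, and the length thresholds are assumptions.

```python
class TextLengthFilter:
    """Hypothetical filter: keeps samples whose text length is within a range."""

    def __init__(self, min_len=3, max_len=10, text_key="text"):
        self.min_len = min_len
        self.max_len = max_len
        self.text_key = text_key

    def compute_stats_single(self, sample):
        # Store the metric on the sample so the decision step can read it.
        stats = sample.setdefault("stats", {})
        stats["text_len"] = len(sample[self.text_key])
        return sample

    def process_single(self, sample):
        # sample -> bool: True keeps the sample, False filters it out.
        return self.min_len <= sample["stats"]["text_len"] <= self.max_len


op = TextLengthFilter()
samples = [{"text": "hi"}, {"text": "hello"}, {"text": "far too long a text"}]
kept = [s for s in map(op.compute_stats_single, samples) if op.process_single(s)]
```

Separating the two steps means the computed stats stay on the sample and can be inspected or reused even for samples that are later dropped.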

run(dataset, *, exporter=None, tracer=None, reduce=True)[source]
class data_juicer.ops.base_op.Deduplicator(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that conducts deduplication.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

compute_hash(sample)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset –> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.
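A minimal sketch of the hash-then-deduplicate flow, assuming exact-match deduplication on the text field. The class `ExactTextDeduplicator`, the `"hash"` field name, and the use of SHA-256 are illustrative assumptions; the dataset is a plain list of dicts rather than a real dataset object.

```python
import hashlib


class ExactTextDeduplicator:
    """Hypothetical deduplicator: drops samples with identical text."""

    def __init__(self, text_key="text", hash_key="hash"):
        self.text_key = text_key
        self.hash_key = hash_key

    def compute_hash(self, sample):
        # sample -> sample with a hash value attached.
        text = sample[self.text_key].encode("utf-8")
        sample[self.hash_key] = hashlib.sha256(text).hexdigest()
        return sample

    def process(self, dataset, show_num=0):
        # dataset -> (deduplicated dataset, sampled duplicate pairs).
        first_seen, kept, dup_pairs = {}, [], []
        for sample in dataset:
            h = sample[self.hash_key]
            if h not in first_seen:
                first_seen[h] = sample
                kept.append(sample)
            elif len(dup_pairs) < show_num:
                # Record at most show_num (original, duplicate) pairs.
                dup_pairs.append((first_seen[h], sample))
        return kept, dup_pairs


op = ExactTextDeduplicator()
data = [op.compute_hash({"text": t}) for t in ["a", "b", "a", "a"]]
deduped, pairs = op.process(data, show_num=1)
```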

run(dataset, *, exporter=None, tracer=None, reduce=True)[source]
class data_juicer.ops.base_op.Selector(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that conducts selection at the dataset level.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

selected dataset.
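A minimal sketch of a dataset-level selection, assuming a top-k rule over a numeric field. The class `TopKSelector` and the `"score"` field are hypothetical, and the dataset is a plain list of dicts.

```python
class TopKSelector:
    """Hypothetical selector: keeps the k samples with the highest score."""

    def __init__(self, k=2, score_key="score"):
        self.k = k
        self.score_key = score_key

    def process(self, dataset):
        # dataset -> dataset: the decision needs the whole dataset at once,
        # because a sample's rank depends on every other sample.
        ranked = sorted(dataset, key=lambda s: s[self.score_key], reverse=True)
        return ranked[: self.k]


data = [
    {"id": 1, "score": 0.2},
    {"id": 2, "score": 0.9},
    {"id": 3, "score": 0.5},
]
selected = TopKSelector(k=2).process(data)
```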

run(dataset, *, exporter=None, tracer=None)[source]
class data_juicer.ops.base_op.Grouper(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that groups samples.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.

run(dataset, *, exporter=None, tracer=None)[source]
class data_juicer.ops.base_op.Aggregator(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that aggregates grouped samples.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process_single(sample)[source]

For sample level, batched sample –> sample. The input must be the output of a Grouper OP.

Parameters:

sample – batched sample to aggregate

Returns:

aggregated sample
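The dependency on a Grouper can be illustrated with a small group-then-aggregate sketch. Both classes are hypothetical stand-ins that do not subclass the library's bases, and the batched-sample layout (one dict of lists per group) is an assumption.

```python
from collections import defaultdict


class KeyGrouper:
    """Hypothetical grouper: batches samples that share a key value."""

    def __init__(self, group_key="topic"):
        self.group_key = group_key

    def process(self, dataset):
        # dataset -> dataset of batched samples (one dict of lists per group).
        groups = defaultdict(list)
        for sample in dataset:
            groups[sample[self.group_key]].append(sample)
        return [
            {key: [s[key] for s in members] for key in members[0]}
            for members in groups.values()
        ]


class JoinTextAggregator:
    """Hypothetical aggregator: collapses a batched sample into one sample."""

    def process_single(self, sample):
        # batched sample -> sample.
        return {"topic": sample["topic"][0], "text": " ".join(sample["text"])}


data = [
    {"topic": "x", "text": "a"},
    {"topic": "y", "text": "b"},
    {"topic": "x", "text": "c"},
]
batched = KeyGrouper().process(data)
aggregated = [JoinTextAggregator().process_single(b) for b in batched]
```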

run(dataset, *, exporter=None, tracer=None)[source]

data_juicer.ops.load module

data_juicer.ops.load.load_ops(process_list)[source]

Load op list according to the process list from config file.

Parameters:

process_list – A process list. Each item is an op name and its arguments.

Returns:

The op instance list.
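A minimal sketch of how a process list can be turned into op instances. It assumes each item is a single-entry mapping from op name to its keyword arguments; the registry `OP_REGISTRY`, the function `load_ops_sketch`, and both op classes are hypothetical.

```python
class LowercaseMapper:
    def __init__(self, text_key="text"):
        self.text_key = text_key


class TextLengthFilter:
    def __init__(self, min_len=0, max_len=100):
        self.min_len, self.max_len = min_len, max_len


# Hypothetical registry mapping op names to classes.
OP_REGISTRY = {
    "lowercase_mapper": LowercaseMapper,
    "text_length_filter": TextLengthFilter,
}


def load_ops_sketch(process_list):
    """Instantiate one op per process-list item, preserving order."""
    ops = []
    for item in process_list:
        # Assumption: each item is a single-entry mapping {op_name: args}.
        ((op_name, args),) = item.items()
        ops.append(OP_REGISTRY[op_name](**(args or {})))
    return ops


process_list = [
    {"lowercase_mapper": {}},
    {"text_length_filter": {"min_len": 5, "max_len": 50}},
]
ops = load_ops_sketch(process_list)
```

The `args or {}` guard lets an item carry `None` instead of an empty mapping, which is what a config entry with no arguments often parses to.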

data_juicer.ops.op_fusion module

data_juicer.ops.op_fusion.fuse_operators(ops, probe_res=None)[source]

Fuse the input ops list and return the fused ops list.

Parameters:
  • ops – the corresponding list of op objects.

  • probe_res – the probed speed for each OP from Monitor.

Returns:

a list of fused op objects.
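The source does not spell out the fusion strategy, so the sketch below shows only one plausible reading: runs of adjacent filters are collapsed into a single group while other ops stay in place. It ignores `probe_res`, and every name in it (`fuse_consecutive_filters`, `FusedGroup`, and the stand-in `Mapper` and `Filter` classes) is hypothetical.

```python
from itertools import groupby


class Mapper:  # hypothetical stand-in for the base class
    pass


class Filter:  # hypothetical stand-in for the base class
    pass


class FusedGroup:
    """Hypothetical container for a run of consecutive filters."""

    def __init__(self, filters):
        self.filters = list(filters)


def fuse_consecutive_filters(ops):
    """Replace each run of two or more adjacent filters with one FusedGroup."""
    fused = []
    for is_filter, run in groupby(ops, key=lambda op: isinstance(op, Filter)):
        run = list(run)
        if is_filter and len(run) > 1:
            fused.append(FusedGroup(run))
        else:
            # Non-filters and lone filters pass through unchanged.
            fused.extend(run)
    return fused


ops = [Mapper(), Filter(), Filter(), Mapper(), Filter()]
fused_ops = fuse_consecutive_filters(ops)
```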

data_juicer.ops.op_fusion.fuse_filter_group(original_filter_group)[source]

Fuse single filter group and return the fused filter group.

Parameters:

original_filter_group – the original filter group, including op definitions and objects.

Returns:

the fused definitions and objects of the input filter group.

class data_juicer.ops.op_fusion.FusedFilter(name: str, fused_filters: List)[source]

Bases: Filter

A fused operator for filters.

__init__(name: str, fused_filters: List)[source]

Initialization method.

Parameters:

fused_filters – a list of filters to be fused.

compute_stats_batched(samples, rank=None)[source]
process_batched(samples)[source]
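A minimal sketch of how a fused filter could behave, assuming it runs each member's stats computation in turn and keeps a sample only if every member keeps it. The classes `MinLenFilter`, `NoDigitFilter`, and `FusedFilterSketch` are hypothetical and do not subclass the library's `Filter`.

```python
class MinLenFilter:
    def __init__(self, min_len):
        self.min_len = min_len

    def compute_stats_batched(self, samples):
        for stats, text in zip(samples["stats"], samples["text"]):
            stats["text_len"] = len(text)
        return samples

    def process_batched(self, samples):
        return [s["text_len"] >= self.min_len for s in samples["stats"]]


class NoDigitFilter:
    def compute_stats_batched(self, samples):
        for stats, text in zip(samples["stats"], samples["text"]):
            stats["has_digit"] = any(ch.isdigit() for ch in text)
        return samples

    def process_batched(self, samples):
        return [not s["has_digit"] for s in samples["stats"]]


class FusedFilterSketch:
    """Hypothetical fused filter over a list of batched filters."""

    def __init__(self, name, fused_filters):
        self.name = name
        self.fused_filters = fused_filters

    def compute_stats_batched(self, samples):
        # Each member adds its own stats to the shared batch.
        for op in self.fused_filters:
            samples = op.compute_stats_batched(samples)
        return samples

    def process_batched(self, samples):
        # Assumption: a sample is kept only if every member keeps it.
        results = [op.process_batched(samples) for op in self.fused_filters]
        return [all(flags) for flags in zip(*results)]


fused = FusedFilterSketch("demo", [MinLenFilter(2), NoDigitFilter()])
batch = {"text": ["abc", "a1", "hello2", "x"], "stats": [{}, {}, {}, {}]}
keep = fused.process_batched(fused.compute_stats_batched(batch))
```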

Module contents

data_juicer.ops.load_ops(process_list)[source]

Load op list according to the process list from config file.

Parameters:

process_list – A process list. Each item is an op name and its arguments.

Returns:

The op instance list.

class data_juicer.ops.Filter(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that removes specific info.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

compute_stats_batched(samples, *args, **kwargs)[source]
process_batched(samples)[source]
compute_stats_single(sample, context=False)[source]

Compute stats for the sample, which are used as metrics to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample –> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out

run(dataset, *, exporter=None, tracer=None, reduce=True)[source]
class data_juicer.ops.Mapper(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that conducts data editing.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed.

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process_batched(samples, *args, **kwargs)[source]
process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

run(dataset, *, exporter=None, tracer=None)[source]
class data_juicer.ops.Deduplicator(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that conducts deduplication.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

compute_hash(sample)[source]

Compute hash values for the sample.

Parameters:

sample – input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)[source]

For doc-level, dataset –> dataset.

Parameters:
  • dataset – input dataset

  • show_num – number of traced samples used when the tracer is enabled.

Returns:

deduplicated dataset and the sampled duplicate pairs.

run(dataset, *, exporter=None, tracer=None, reduce=True)[source]
class data_juicer.ops.Selector(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that conducts selection at the dataset level.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

selected dataset.

run(dataset, *, exporter=None, tracer=None)[source]
class data_juicer.ops.Grouper(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that groups samples.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

dataset of batched samples.

run(dataset, *, exporter=None, tracer=None)[source]
class data_juicer.ops.Aggregator(*args, **kwargs)[source]

Bases: OP

__init__(*args, **kwargs)[source]

Base class that aggregates grouped samples.

Parameters:
  • text_key – the key name of field that stores sample texts to be processed

  • image_key – the key name of field that stores sample image list to be processed

  • audio_key – the key name of field that stores sample audio list to be processed

  • video_key – the key name of field that stores sample video list to be processed

  • query_key – the key name of field that stores sample queries

  • response_key – the key name of field that stores responses

  • history_key – the key name of field that stores history of queries and responses

process_single(sample)[source]

For sample level, batched sample –> sample. The input must be the output of a Grouper OP.

Parameters:

sample – batched sample to aggregate

Returns:

aggregated sample

run(dataset, *, exporter=None, tracer=None)[source]