data_juicer.ops

data_juicer.ops.load_ops(process_list)

Load the list of ops according to the process list from the config file.

Parameters: process_list – A process list. Each item is an op name and its arguments.
Returns: The list of op instances.
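
The shape of a process list can be illustrated with a minimal sketch that does not import data_juicer. It assumes each item is a single-key dict mapping an op name to its arguments. The registry `OP_REGISTRY`, the class `WordCountFilter`, and the helper `load_ops_sketch` are hypothetical stand-ins, not the library's implementation.

```python
# Hypothetical stand-in for an op class; not part of data_juicer.
class WordCountFilter:
    def __init__(self, min_words=1):
        self.min_words = min_words


# Hypothetical registry mapping op names to op classes.
OP_REGISTRY = {"word_count_filter": WordCountFilter}


def load_ops_sketch(process_list):
    """Build one op instance per item of the process list."""
    ops = []
    for item in process_list:
        # Assumption: each item is a single-key dict {op_name: args}.
        (op_name, args), = item.items()
        ops.append(OP_REGISTRY[op_name](**(args or {})))
    return ops


process_list = [{"word_count_filter": {"min_words": 3}}]
ops = load_ops_sketch(process_list)
```

The returned list preserves the order of the process list, so ops would run in the order they are configured.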

class data_juicer.ops.Filter(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that removes specific info.

Parameters:

text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses

compute_stats_batched(samples, *args, **kwargs)

process_batched(samples)

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as metrics to decide whether to filter this sample.

Parameters:

sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

For sample level, sample –> Boolean.

Parameters: sample – sample to decide whether to filter
Returns: True to keep the sample, False to filter it out
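
The two-step contract described above (compute stats first, then decide) might look like the following sketch. It is a plain-Python stand-in that does not subclass the real Filter. The class name `TextLengthFilterSketch`, the `min_len` parameter, and the `"stats"` field name are assumptions made for illustration.

```python
STATS_KEY = "stats"  # hypothetical field name for computed stats


class TextLengthFilterSketch:
    """Keeps samples whose text has at least `min_len` characters."""

    def __init__(self, text_key="text", min_len=5):
        self.text_key = text_key
        self.min_len = min_len

    def compute_stats_single(self, sample, context=False):
        # Store the metric on the sample and return the sample.
        stats = sample.setdefault(STATS_KEY, {})
        stats["text_len"] = len(sample[self.text_key])
        return sample

    def process_single(self, sample):
        # True keeps the sample, False filters it out.
        return sample[STATS_KEY]["text_len"] >= self.min_len


op = TextLengthFilterSketch(min_len=5)
samples = [{"text": "hi"}, {"text": "hello world"}]
samples = [op.compute_stats_single(s) for s in samples]
kept = [s for s in samples if op.process_single(s)]
```

Separating the stats computation from the keep/drop decision means the stats stay on the sample and can be inspected afterwards.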

run(dataset, *, exporter=None, tracer=None, reduce=True)

class data_juicer.ops.Mapper(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that conducts data editing.

Parameters:

text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses

process_batched(samples, *args, **kwargs)

process_single(sample)

For sample level, sample –> sample

Parameters: sample – sample to process
Returns: processed sample
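
A sample-to-sample edit of this kind could be sketched as follows. This is a plain-Python stand-in, not a subclass of the real Mapper; the class name `WhitespaceNormalizeMapperSketch` and the whitespace-collapsing behaviour are hypothetical choices for illustration.

```python
class WhitespaceNormalizeMapperSketch:
    """Collapses runs of whitespace in the text field."""

    def __init__(self, text_key="text"):
        self.text_key = text_key

    def process_single(self, sample):
        # Edit the text in place and return the same sample.
        sample[self.text_key] = " ".join(sample[self.text_key].split())
        return sample


op = WhitespaceNormalizeMapperSketch()
result = op.process_single({"text": "  a   b\n c "})
```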

run(dataset, *, exporter=None, tracer=None)

class data_juicer.ops.Deduplicator(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that conducts deduplication.

Parameters:

text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses

compute_hash(sample)

Compute hash values for the sample.

Parameters: sample – input sample
Returns: sample with computed hash value.

process(dataset, show_num=0)

For doc-level, dataset –> dataset.

Parameters:

dataset – input dataset
show_num – number of traced samples used when tracer is open.

Returns:

deduplicated dataset and the sampled duplicate pairs.
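
The hash-then-deduplicate flow above can be sketched with exact-match hashing. This stand-in does not subclass the real Deduplicator. It assumes the dataset is a plain list of dicts, and the class name `ExactTextDeduplicatorSketch`, the `"hash"` field name, and the choice of SHA-256 are all hypothetical.

```python
import hashlib

HASH_KEY = "hash"  # hypothetical field name for the computed hash


class ExactTextDeduplicatorSketch:
    """Drops samples whose text is byte-identical to an earlier one."""

    def __init__(self, text_key="text"):
        self.text_key = text_key

    def compute_hash(self, sample):
        text = sample[self.text_key].encode("utf-8")
        sample[HASH_KEY] = hashlib.sha256(text).hexdigest()
        return sample

    def process(self, dataset, show_num=0):
        seen = {}       # hash -> first sample with that hash
        dup_pairs = []  # at most show_num (first, duplicate) pairs
        deduped = []
        for sample in dataset:
            first = seen.get(sample[HASH_KEY])
            if first is None:
                seen[sample[HASH_KEY]] = sample
                deduped.append(sample)
            elif len(dup_pairs) < show_num:
                dup_pairs.append((first, sample))
        return deduped, dup_pairs


op = ExactTextDeduplicatorSketch()
dataset = [op.compute_hash({"text": t}) for t in ["a", "b", "a"]]
deduped, dup_pairs = op.process(dataset, show_num=1)
```

With `show_num=0` the sketch returns an empty list of pairs, which mirrors the idea that duplicate pairs are only sampled when tracing is wanted.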

run(dataset, *, exporter=None, tracer=None, reduce=True)

class data_juicer.ops.Selector(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that conducts selection at the dataset level.

Parameters:

text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses

process(dataset)

Dataset –> dataset.

Parameters: dataset – input dataset
Returns: selected dataset.
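
Because selection needs to see the whole dataset rather than one sample at a time, a dataset-level op could be sketched like this. The stand-in below does not subclass the real Selector. It assumes the dataset is a plain list of dicts, and the class name `TopKSelectorSketch` and its `field_key` and `k` parameters are hypothetical.

```python
class TopKSelectorSketch:
    """Keeps the k samples with the largest value in `field_key`."""

    def __init__(self, field_key="score", k=2):
        self.field_key = field_key
        self.k = k

    def process(self, dataset):
        # Ranking requires comparing samples across the whole dataset.
        ranked = sorted(dataset, key=lambda s: s[self.field_key], reverse=True)
        return ranked[: self.k]


op = TopKSelectorSketch(field_key="score", k=2)
selected = op.process([{"score": 0.2}, {"score": 0.9}, {"score": 0.5}])
```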

run(dataset, *, exporter=None, tracer=None)

class data_juicer.ops.Grouper(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that groups samples.

Parameters:

text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses

process(dataset)

Dataset –> dataset.

Parameters: dataset – input dataset
Returns: dataset of batched samples.
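
One way to picture a "dataset of batched samples" is a list in which each element maps every field to a list of values. The sketch below is a stand-in that does not subclass the real Grouper. The dict-of-lists layout, the class name `KeyValueGrouperSketch`, and the `group_key` parameter are assumptions made for illustration.

```python
from collections import defaultdict


class KeyValueGrouperSketch:
    """Groups samples that share the same value in `group_key`."""

    def __init__(self, group_key="lang"):
        self.group_key = group_key

    def process(self, dataset):
        groups = defaultdict(list)
        for sample in dataset:
            groups[sample[self.group_key]].append(sample)
        # Assumption: a batched sample maps each field to a list of values.
        return [
            {key: [s[key] for s in members] for key in members[0]}
            for members in groups.values()
        ]


op = KeyValueGrouperSketch(group_key="lang")
batched = op.process([
    {"lang": "en", "text": "a"},
    {"lang": "zh", "text": "b"},
    {"lang": "en", "text": "c"},
])
```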

run(dataset, *, exporter=None, tracer=None)

class data_juicer.ops.Aggregator(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that aggregates batched samples.

Parameters:

text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses

process_single(sample)

For sample level, batched sample –> sample. The input must be the output of a Grouper OP.

Parameters: sample – batched sample to aggregate
Returns: aggregated sample
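
Reducing a batched sample to a single sample could be sketched as below. The stand-in does not subclass the real Aggregator. It assumes a batched sample stores a list of values per field; the class name `JoinTextAggregatorSketch` and the `sep` parameter are hypothetical.

```python
class JoinTextAggregatorSketch:
    """Joins the texts of a batched sample into one text."""

    def __init__(self, text_key="text", sep="\n"):
        self.text_key = text_key
        self.sep = sep

    def process_single(self, sample):
        # Assumption: `sample` is batched, so each field holds a list.
        return {self.text_key: self.sep.join(sample[self.text_key])}


op = JoinTextAggregatorSketch()
aggregated = op.process_single({"text": ["a", "c"]})
```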

run(dataset, *, exporter=None, tracer=None)