data_juicer.ops

data_juicer.ops.load_ops(process_list)

Load the list of ops according to the process list from the config file.

Parameters:

process_list -- A process list. Each item is an op name and its arguments.

Returns:

The op instance list.
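
As an illustration of the "op name and its arguments" shape, the sketch below assumes each item of the process list is a single-key mapping from an op name to its keyword arguments. It is not the library's implementation: `ToyLengthFilter`, `TOY_REGISTRY`, and `toy_load_ops` are hypothetical names used only to show the idea.

```python
# Hypothetical stand-in for an op class; not part of data_juicer.
class ToyLengthFilter:
    def __init__(self, min_len=0, max_len=100):
        self.min_len = min_len
        self.max_len = max_len


# Hypothetical registry that maps op names to op classes.
TOY_REGISTRY = {"toy_length_filter": ToyLengthFilter}


def toy_load_ops(process_list):
    """Build one op instance per item of the process list."""
    ops = []
    for item in process_list:
        # Assumption: each item is a single-key dict of the form {op_name: args}.
        op_name, args = next(iter(item.items()))
        ops.append(TOY_REGISTRY[op_name](**(args or {})))
    return ops


process_list = [{"toy_length_filter": {"min_len": 10, "max_len": 50}}]
ops = toy_load_ops(process_list)
```

The returned list keeps the order of the process list, so the ops can be applied one after another.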

class data_juicer.ops.Filter(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class for ops that filter out samples based on computed stats.

Parameters:
  • text_key -- the key name of the field that stores the sample texts to be processed

  • image_key -- the key name of the field that stores the sample image list to be processed

  • audio_key -- the key name of the field that stores the sample audio list to be processed

  • video_key -- the key name of the field that stores the sample video list to be processed

  • image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed

  • query_key -- the key name of the field that stores the sample queries

  • response_key -- the key name of the field that stores the responses

  • history_key -- the key name of the field that stores the history of queries and responses

  • min_closed_interval -- whether the min_val of the specified filter range is a closed interval. It's True by default.

  • max_closed_interval -- whether the max_val of the specified filter range is a closed interval. It's True by default.

  • reversed_range -- whether to reverse the target range [min_val, max_val] to (-∞, min_val) or (max_val, +∞). It's False by default.
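
The interplay of the three range options can be sketched as a small standalone function. This is a minimal sketch of the semantics described above, not the library's code; `toy_keep_boolean` is a hypothetical name, and passing the options as keyword arguments is an assumption made to keep the example self-contained.

```python
def toy_keep_boolean(val, min_val=None, max_val=None,
                     min_closed_interval=True,
                     max_closed_interval=True,
                     reversed_range=False):
    """Return True if val falls inside the target range."""
    keep = True
    if min_val is not None:
        # Closed lower bound admits val == min_val; open bound does not.
        keep = keep and (val >= min_val if min_closed_interval else val > min_val)
    if max_val is not None:
        # Closed upper bound admits val == max_val; open bound does not.
        keep = keep and (val <= max_val if max_closed_interval else val < max_val)
    # Reversing turns [min_val, max_val] into (-inf, min_val) or (max_val, +inf).
    return (not keep) if reversed_range else keep


inside = toy_keep_boolean(10, min_val=1, max_val=10)
open_max = toy_keep_boolean(10, min_val=1, max_val=10, max_closed_interval=False)
reversed_out = toy_keep_boolean(0, min_val=1, max_val=10, reversed_range=True)
reversed_in = toy_keep_boolean(5, min_val=1, max_val=10, reversed_range=True)
```

Under these assumptions, a boundary value is kept only while the matching interval end is closed, and reversing the range keeps exactly the values the default range would drop.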

compute_stats_batched(samples, *args, **kwargs)

compute_stats_single(sample, context=False)

Compute stats for the sample. The stats are used as the metric that decides whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

get_keep_boolean(val, min_val=None, max_val=None)

process_batched(samples)

process_single(sample)

For sample level, sample --> Boolean.

Parameters:

sample -- sample to decide whether to filter

Returns:

true for keeping and false for filtering
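
The two sample-level steps, computing stats and then deciding whether to keep the sample, can be sketched with a toy class. This is only an illustration under stated assumptions: `ToyTextLengthFilter` is hypothetical and does not subclass `data_juicer.ops.Filter`, and the `"stats"` and `"text_len"` key names are made up for the example.

```python
class ToyTextLengthFilter:
    """Hypothetical filter that keeps samples whose text length is in range."""

    def __init__(self, text_key="text", min_len=1, max_len=20):
        self.text_key = text_key
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats_single(self, sample):
        # Store the metric in the sample; "stats" is an assumed key name.
        sample.setdefault("stats", {})["text_len"] = len(sample[self.text_key])
        return sample

    def process_single(self, sample):
        # True keeps the sample, False filters it out.
        return self.min_len <= sample["stats"]["text_len"] <= self.max_len


op = ToyTextLengthFilter()
samples = [{"text": "hello"}, {"text": ""}]
with_stats = [op.compute_stats_single(s) for s in samples]
kept = [s for s in with_stats if op.process_single(s)]
```

Separating the two steps means the stats remain in the sample after the decision, so they can be inspected or reused later.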

run(dataset, *, exporter=None, tracer=None, reduce=True)

class data_juicer.ops.Mapper(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that conducts data editing.

Parameters:
  • text_key -- the key name of the field that stores the sample texts to be processed

  • image_key -- the key name of the field that stores the sample image list to be processed

  • audio_key -- the key name of the field that stores the sample audio list to be processed

  • video_key -- the key name of the field that stores the sample video list to be processed

  • image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed

  • query_key -- the key name of the field that stores the sample queries

  • response_key -- the key name of the field that stores the responses

  • history_key -- the key name of the field that stores the history of queries and responses

process_batched(samples, *args, **kwargs)

process_single(sample)

For sample level, sample --> sample.

Parameters:

sample -- sample to process

Returns:

processed sample
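
A sample-to-sample edit can be sketched as follows. `ToyLowercaseMapper` is a hypothetical class that does not subclass `data_juicer.ops.Mapper`; it only illustrates the contract of taking one sample and returning the edited sample.

```python
class ToyLowercaseMapper:
    """Hypothetical mapper that lowercases the text field of a sample."""

    def __init__(self, text_key="text"):
        self.text_key = text_key

    def process_single(self, sample):
        # Edit the field named by text_key and return the same sample.
        sample[self.text_key] = sample[self.text_key].lower()
        return sample


op = ToyLowercaseMapper()
result = op.process_single({"text": "Hello World"})
```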

run(dataset, *, exporter=None, tracer=None)

class data_juicer.ops.Deduplicator(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that conducts deduplication.

Parameters:
  • text_key -- the key name of the field that stores the sample texts to be processed

  • image_key -- the key name of the field that stores the sample image list to be processed

  • audio_key -- the key name of the field that stores the sample audio list to be processed

  • video_key -- the key name of the field that stores the sample video list to be processed

  • image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed

  • query_key -- the key name of the field that stores the sample queries

  • response_key -- the key name of the field that stores the responses

  • history_key -- the key name of the field that stores the history of queries and responses

compute_hash(sample)

Compute hash values for the sample.

Parameters:

sample -- input sample

Returns:

sample with computed hash value.

process(dataset, show_num=0)

For doc-level, dataset --> dataset.

Parameters:
  • dataset -- input dataset

  • show_num -- number of traced samples used when tracer is open.

Returns:

deduplicated dataset and the sampled duplicate pairs.
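
The hash-then-deduplicate flow can be sketched with exact matching on an MD5 digest. This is an assumption-laden illustration rather than the library's method: `toy_compute_hash` and `toy_process` are hypothetical functions, the `"hash"` key name is made up, and a plain list stands in for the dataset.

```python
import hashlib


def toy_compute_hash(sample, text_key="text"):
    """Attach an MD5 digest of the text to the sample ("hash" is an assumed key)."""
    sample["hash"] = hashlib.md5(sample[text_key].encode("utf-8")).hexdigest()
    return sample


def toy_process(dataset, show_num=0):
    """Keep the first sample per hash and collect up to show_num duplicate pairs."""
    first_seen = {}
    deduplicated, dup_pairs = [], []
    for sample in dataset:
        digest = sample["hash"]
        if digest not in first_seen:
            first_seen[digest] = sample
            deduplicated.append(sample)
        elif len(dup_pairs) < show_num:
            # Record (kept sample, dropped duplicate) for tracing.
            dup_pairs.append((first_seen[digest], sample))
    return deduplicated, dup_pairs


dataset = [toy_compute_hash({"text": t}) for t in ["a", "b", "a"]]
deduplicated, dup_pairs = toy_process(dataset, show_num=1)
```

With `show_num=0` the sketch still deduplicates but returns no pairs, which mirrors the idea that pairs are only sampled when tracing is on.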

run(dataset, *, exporter=None, tracer=None, reduce=True)

class data_juicer.ops.Selector(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that conducts selection at the dataset level.

Parameters:
  • text_key -- the key name of the field that stores the sample texts to be processed

  • image_key -- the key name of the field that stores the sample image list to be processed

  • audio_key -- the key name of the field that stores the sample audio list to be processed

  • video_key -- the key name of the field that stores the sample video list to be processed

  • image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed

  • query_key -- the key name of the field that stores the sample queries

  • response_key -- the key name of the field that stores the responses

  • history_key -- the key name of the field that stores the history of queries and responses

process(dataset)

Dataset --> dataset.

Parameters:

dataset -- input dataset

Returns:

selected dataset.
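
Dataset-level selection differs from filtering in that the decision depends on the whole dataset, for example a ranking. The sketch below keeps the top-k samples by a score field; `toy_select_top_k`, the `"score"` field, and the use of a plain list as the dataset are all assumptions for illustration.

```python
def toy_select_top_k(dataset, field_key="score", k=2):
    """Return the k samples with the highest value in field_key."""
    ranked = sorted(dataset, key=lambda s: s[field_key], reverse=True)
    return ranked[:k]


dataset = [
    {"text": "a", "score": 0.2},
    {"text": "b", "score": 0.9},
    {"text": "c", "score": 0.5},
]
selected = toy_select_top_k(dataset)
```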

run(dataset, *, exporter=None, tracer=None)

class data_juicer.ops.Grouper(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that groups samples.

Parameters:
  • text_key -- the key name of the field that stores the sample texts to be processed

  • image_key -- the key name of the field that stores the sample image list to be processed

  • audio_key -- the key name of the field that stores the sample audio list to be processed

  • video_key -- the key name of the field that stores the sample video list to be processed

  • image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed

  • query_key -- the key name of the field that stores the sample queries

  • response_key -- the key name of the field that stores the responses

  • history_key -- the key name of the field that stores the history of queries and responses

process(dataset)

Dataset --> dataset.

Parameters:

dataset -- input dataset

Returns:

dataset of batched samples.
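
One way to picture a "batched sample" is a single dict whose values are lists gathered from the grouped samples. That shape is an assumption made for this sketch, and `toy_group_all` is a hypothetical function that puts every sample into one group.

```python
def toy_group_all(dataset):
    """Group all samples into a single batched sample (a dict of lists)."""
    if not dataset:
        return []
    # Assumption: every sample has the same keys as the first one.
    keys = list(dataset[0].keys())
    batched = {key: [sample[key] for sample in dataset] for key in keys}
    return [batched]


grouped = toy_group_all([{"text": "a"}, {"text": "b"}])
```

A real grouping op could produce several batched samples, for instance one per distinct value of some field; the single-group case is just the simplest instance.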

run(dataset, *, exporter=None, tracer=None)

class data_juicer.ops.Aggregator(*args, **kwargs)

Bases: OP

__init__(*args, **kwargs)

Base class that aggregates samples.

Parameters:
  • text_key -- the key name of the field that stores the sample texts to be processed

  • image_key -- the key name of the field that stores the sample image list to be processed

  • audio_key -- the key name of the field that stores the sample audio list to be processed

  • video_key -- the key name of the field that stores the sample video list to be processed

  • image_bytes_key -- the key name of the field that stores the sample image bytes list to be processed

  • query_key -- the key name of the field that stores the sample queries

  • response_key -- the key name of the field that stores the responses

  • history_key -- the key name of the field that stores the history of queries and responses

process_single(sample)

For sample level, batched sample --> sample. The input must be the output of a Grouper OP.

Parameters:

sample -- batched sample to aggregate

Returns:

aggregated sample
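
The batched-sample-to-sample step can be sketched by joining the texts of one batched sample. The dict-of-lists input shape and the joining strategy are assumptions, and `toy_aggregate` is a hypothetical function rather than a library call.

```python
def toy_aggregate(batched_sample, text_key="text"):
    """Collapse a batched sample (dict of lists) into one sample."""
    # Join the grouped texts with a space; other keys are dropped in this sketch.
    return {text_key: " ".join(batched_sample[text_key])}


batched_sample = {"text": ["first part.", "second part."]}
aggregated = toy_aggregate(batched_sample)
```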

run(dataset, *, exporter=None, tracer=None)