data_juicer.core.adapter module

class data_juicer.core.adapter.Adapter(cfg: Namespace)[source]

Bases: object

MAX_BATCH_SIZE = 10000
__init__(cfg: Namespace)[source]
static execute_and_probe(dataset, operators, sample_interval=0.5)[source]

Process the input dataset and probe related information for each OP in the specified operator list.

For now, we support the following targets to probe:
  • “resource”: resource utilization for each OP.

  • “speed”: average processing speed for each OP.

The probe result is a list whose items are the probe results for the corresponding OPs.
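The “speed” target can be sketched as follows. This is a hypothetical, simplified stand-in, not the real implementation: it only times each OP over the dataset, whereas the actual method also samples resource utilization roughly every sample_interval seconds while an OP runs.

```python
import time

def execute_and_probe_sketch(dataset, operators, sample_interval=0.5):
    # Hypothetical sketch of the "speed" probe target: run each OP over
    # the dataset and record its average throughput (samples per second).
    probe_results = []
    for op in operators:
        start = time.monotonic()
        dataset = [op(sample) for sample in dataset]
        elapsed = max(time.monotonic() - start, 1e-9)
        probe_results.append({"op": op.__name__,
                              "speed": len(dataset) / elapsed})
    return probe_results
```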

static take_batch(dataset, config)[source]

Split the dataset into batches based on configuration and load factor.

Parameters:
  • dataset – The dataset to be split

  • config – Configuration settings, including batch size

Returns:

An iterator of batches
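The splitting behavior can be sketched like this. The dict-based config and the default batch size here are illustrative stand-ins (the real method takes the configuration Namespace); the 10000 cap mirrors the class's MAX_BATCH_SIZE.

```python
def take_batch_sketch(dataset, config, max_batch_size=10000):
    # Hypothetical sketch: yield successive batches whose size comes from
    # the config, capped by MAX_BATCH_SIZE (10000 in the Adapter class).
    bs = min(config.get("batch_size", 1000), max_batch_size)
    for i in range(0, len(dataset), bs):
        yield dataset[i:i + bs]
```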

adapt_workloads(dataset, operators)[source]

Manage the scheduling and load balancing for the dataset processing.

Parameters:
  • dataset – The dataset that needs to be processed

  • operators – Operators in the data recipe
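The overall flow combines the probe and batching steps above. The following is a hypothetical, heavily simplified orchestration sketch: the hard-coded probe result and the util_th parameter are stand-ins, not the real Adapter internals.

```python
def adapt_workloads_sketch(dataset, operators, util_th=0.9):
    # Hypothetical sketch: probe a small batch to estimate per-OP load,
    # derive a safe batch size for each OP, then process the full dataset
    # batch by batch.
    probe = [{"cpu_util": 0.3} for _ in operators]  # stand-in probe result
    batch_sizes = [max(int(util_th / max(r.values())), 1) for r in probe]
    for op, bs in zip(operators, batch_sizes):
        for i in range(0, len(dataset), bs):
            dataset[i:i + bs] = [op(x) for x in dataset[i:i + bs]]
    return dataset
```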

probe_small_batch(dataset, operators)[source]

Perform small batch pre-execution to probe available resources, current load and estimated OP speed, returning load factors and speed ranks for each OP.

Notice: the probe should be run with cache enabled to avoid removing the cache files of the input dataset.

Parameters:
  • dataset – The dataset to pre-execute small batch on

  • operators – The OP list to be pre-executed and probed

Returns:

A list of probe results for each OP, and the length of the probed data batch.

batch_size_strategy(load_analysis_res, base_bs=1, util_th=0.9)[source]

Decide the batch size for each OP according to its workload analysis result and the expected utilization threshold, guaranteeing that resource utilization will not exceed the threshold. For now, we only consider the bucket effect: the max batch size is decided by the max utilization across all resource types, excluding GPU utilization (which is decided by num_proc).
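The bucket rule described above can be sketched as follows. This is a hypothetical simplification: the per-OP dicts mapping resource names to measured utilizations are an assumed shape for the analysis result, not the real data structure.

```python
def batch_size_strategy_sketch(load_analysis_res, base_bs=1, util_th=0.9):
    # Hypothetical sketch of the "bucket" rule: each OP's batch size is
    # limited by its most-utilized resource (excluding GPU utilization),
    # scaled so the projected utilization stays under util_th.
    batch_sizes = []
    for res in load_analysis_res:
        # res maps resource name -> measured utilization at base_bs
        max_util = max(u for name, u in res.items() if name != "gpu_util")
        factor = util_th / max(max_util, 1e-9)
        batch_sizes.append(max(int(base_bs * factor), 1))
    return batch_sizes
```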

analyze_small_batch(dataset, current_state)[source]

Perform small batch analysis to probe the current OP-wise stats/meta distributions. The analyzed results will be stored in the directory {work_dir}/insight_mining.

Notice: the probe should be run with cache enabled to avoid removing the cache files of the input dataset.

Parameters:
  • dataset – The dataset to analyze small batch on

  • current_state – A string indicating the current state of the input dataset. It usually consists of the index of the OP that was just processed and the OP name, e.g. “1_text_length_filter”.

insight_mining(pval_th=0.05)[source]

Mine insights from the OP-wise analysis results. For now, we use a T-Test to check the significance of stats/meta changes before and after each OP's processing. If the p-value is less than the given threshold (usually 0.05), the stats/meta changes are considered significant. The insight mining results will be stored in the file {work_dir}/insight_mining/insight_mining.json.

Parameters:

pval_th – the p-value threshold for significance.
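The significance check can be sketched as below. This is a hypothetical approximation: it computes Welch's t statistic and a large-sample normal approximation for the two-sided p-value, whereas the real implementation uses a proper T-Test.

```python
import math
import statistics

def significant_change(before, after, pval_th=0.05):
    # Hypothetical sketch: Welch's t statistic with a standard-normal
    # approximation of the two-sided p-value (large samples only).
    m1, m2 = statistics.mean(before), statistics.mean(after)
    v1, v2 = statistics.variance(before), statistics.variance(after)
    se = math.sqrt(v1 / len(before) + v2 / len(after))
    if se == 0:
        return m1 != m2, (0.0 if m1 != m2 else 1.0)
    t = (m1 - m2) / se
    # two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return p < pval_th, p
```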