data_juicer.core package

Submodules

data_juicer.core.adapter module

data_juicer.core.analyzer module

data_juicer.core.data module

data_juicer.core.executor module

data_juicer.core.exporter module

data_juicer.core.monitor module

data_juicer.core.monitor.resource_monitor(mdict, interval)
class data_juicer.core.monitor.Monitor

Bases: object

Monitor resource utilization and other information during data processing.

Resource utilization dict (for each func):

    {
        'time': 10,
        'sampling interval': 0.5,
        'resource': [
            {
                'timestamp': xxx,
                'CPU count': xxx,
                'GPU free mem.': xxx,
                …
            },
            {
                'timestamp': xxx,
                'CPU count': xxx,
                'GPU free mem.': xxx,
                …
            },
        ]
    }

Based on the structure above, the resource utilization analysis result adds several extra fields at the first level:

    {
        'time': 10,
        'sampling interval': 0.5,
        'resource': […],
        'resource_analysis': {
            'GPU free mem.': {
                'max': xxx,
                'min': xxx,
                'avg': xxx,
            }
        }
    }

Only those fields in DYNAMIC_FIELDS will be analyzed.
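The analysis step described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name summarize_resource_util and the sample values are hypothetical, and the field set is copied from the DYNAMIC_FIELDS attribute documented on this page.

```python
from statistics import mean

# Copied from the DYNAMIC_FIELDS attribute documented on this page.
DYNAMIC_FIELDS = {
    "Available mem.", "CPU util.", "Free mem.", "GPU free mem.",
    "GPU used mem.", "GPU util.", "Mem. util.", "Used mem.",
}


def summarize_resource_util(resource_util_dict):
    """Hypothetical helper: return a copy with a 'resource_analysis' entry."""
    samples = resource_util_dict.get("resource", [])
    analysis = {}
    for field in DYNAMIC_FIELDS:
        # Collect this field's value from every sample that reports it.
        values = [s[field] for s in samples if field in s]
        if not values:
            continue
        analysis[field] = {
            "max": max(values),
            "min": min(values),
            "avg": mean(values),
        }
    result = dict(resource_util_dict)  # shallow copy; the input is untouched
    result["resource_analysis"] = analysis
    return result


# Made-up sample data in the documented layout.
example = {
    "time": 10,
    "sampling interval": 0.5,
    "resource": [
        {"timestamp": 0.0, "CPU count": 8, "GPU free mem.": 1000.0},
        {"timestamp": 0.5, "CPU count": 8, "GPU free mem.": 600.0},
    ],
}
summary = summarize_resource_util(example)
print(summary["resource_analysis"])
# {'GPU free mem.': {'max': 1000.0, 'min': 600.0, 'avg': 800.0}}
```

Note that 'CPU count' is skipped here because it is not listed in DYNAMIC_FIELDS.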

DYNAMIC_FIELDS = {'Available mem.', 'CPU util.', 'Free mem.', 'GPU free mem.', 'GPU used mem.', 'GPU util.', 'Mem. util.', 'Used mem.'}
__init__()
monitor_all_resources()

Detect the resource utilization of all distributed nodes.

static monitor_current_resources()

Detect the resource utilization of the current environment/machine. All “util.” values are ratios in the range [0.0, 1.0]. All “mem.” values are in MB.
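The unit conventions above can be illustrated with a small sketch. The helper normalize_snapshot and its inputs are hypothetical, and treating 1 MB as 2**20 bytes is an assumption; this page does not say which definition the library uses.

```python
import time

BYTES_PER_MB = 1024 ** 2  # assumption: "MB" means 2**20 bytes


def normalize_snapshot(cpu_percent, used_bytes, total_bytes):
    """Hypothetical helper: build one sample in the documented units."""
    return {
        "timestamp": time.time(),
        "CPU util.": cpu_percent / 100.0,         # percent -> ratio in [0.0, 1.0]
        "Used mem.": used_bytes / BYTES_PER_MB,   # bytes -> MB
        "Mem. util.": used_bytes / total_bytes,   # ratio in [0.0, 1.0]
    }


# Made-up raw readings: 37.5% CPU, 4 GiB used out of 16 GiB.
sample = normalize_snapshot(
    cpu_percent=37.5,
    used_bytes=4 * 1024 ** 3,
    total_bytes=16 * 1024 ** 3,
)
print(sample["CPU util."], sample["Used mem."], sample["Mem. util."])
# 0.375 4096.0 0.25
```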

static draw_resource_util_graph(resource_util_list, store_dir)
static analyze_resource_util_list(resource_util_list)

Analyze the resource utilization for a given resource util list. Compute {‘max’, ‘min’, ‘avg’} of resource metrics for each dict item.

static analyze_single_resource_util(resource_util_dict)

Analyze the resource utilization for a single resource util dict. Compute {‘max’, ‘min’, ‘avg’} of each resource metric.

static monitor_func(func, args=None, sample_interval=0.5)

Process the input dataset and probe related information for each OP in the specified operator list.

For now, we support the following targets to probe:

“resource”: resource utilization for each OP.
“speed”: average processing speed for each OP.

The probe result is a list and each item in the list is the probe result for each OP.
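The general idea of running a function while sampling resources at a fixed interval can be sketched as follows. This is a minimal stand-in under stated assumptions, not the library's code: monitor_func_sketch is a hypothetical name, it samples in a background thread, it records only a timestamp and the CPU count, and both the handling of args (dict as keyword arguments, list or tuple as positional arguments) and the returned (result, dict) pair are guesses made for illustration.

```python
import os
import threading
import time
from functools import partial


def monitor_func_sketch(func, args=None, sample_interval=0.5):
    """Hypothetical stand-in: run func while sampling in a background thread."""
    # Assumed argument handling; the real method may differ.
    if args is None:
        call = func
    elif isinstance(args, dict):
        call = partial(func, **args)
    elif isinstance(args, (list, tuple)):
        call = partial(func, *args)
    else:
        call = partial(func, args)

    samples = []
    stop = threading.Event()

    def sampler():
        # Take one sample immediately, then one per interval until stopped.
        while True:
            samples.append({"timestamp": time.time(), "CPU count": os.cpu_count()})
            if stop.wait(sample_interval):
                break

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    start = time.time()
    try:
        result = call()
    finally:
        elapsed = time.time() - start
        stop.set()
        thread.join()

    util = {"time": elapsed, "sampling interval": sample_interval, "resource": samples}
    return result, util


def slow_sum(n, delay):
    """Toy workload used only for this example."""
    time.sleep(delay)
    return sum(range(n))


result, util = monitor_func_sketch(
    slow_sum, args={"n": 10, "delay": 0.2}, sample_interval=0.05
)
print(result)  # 45
print(sorted(util))  # ['resource', 'sampling interval', 'time']
```

The returned dict follows the resource utilization layout documented for Monitor, so it could be passed to an analysis step that computes max, min and avg per field.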

data_juicer.core.ray_data module

data_juicer.core.ray_executor module

data_juicer.core.tracer module

Module contents