data_juicer.core.data

class data_juicer.core.data.DJDataset[source]

Bases: ABC

Base dataset class of Data-Juicer (DJ).

abstract process(operators, *, exporter=None, checkpointer=None, tracer=None) DJDataset[source]

Apply a list of operators to the dataset.

abstract schema() Schema[source]

Get dataset schema.

Returns:

Dataset schema containing column names and types

Return type:

Schema

abstract get(k: int) List[Dict[str, Any]][source]

Get k rows from the dataset.

Parameters:

k – Number of rows to take

Returns:

A list of rows from the dataset.

Return type:

List[Dict[str, Any]]

abstract get_column(column: str, k: int | None = None) List[Any][source]

Get values from a specific column/field, optionally limited to first k rows.

Parameters:
  • column – Name of the column to retrieve

  • k – Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError – If column doesn’t exist in dataset

  • ValueError – If k is negative
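The get_column contract described above (an optional row limit, KeyError for an unknown column, ValueError for a negative k) can be sketched with a small in-memory class. ListBackedDataset is a hypothetical stand-in and not part of data_juicer; how columns are detected and the order of the two checks are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


class ListBackedDataset:
    """Hypothetical in-memory stand-in; not a data_juicer class."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self._rows = rows
        # Assumption: every row shares the columns of the first row.
        self._columns = set(rows[0]) if rows else set()

    def get_column(self, column: str, k: Optional[int] = None) -> List[Any]:
        if column not in self._columns:
            raise KeyError(f"Column '{column}' not found in dataset")
        if k is not None and k < 0:
            raise ValueError(f"k must be non-negative, got {k}")
        # k=None keeps every row; otherwise keep only the first k rows.
        rows = self._rows if k is None else self._rows[:k]
        return [row[column] for row in rows]


ds = ListBackedDataset(
    [{"text": "a", "n": 1}, {"text": "b", "n": 2}, {"text": "c", "n": 3}]
)
print(ds.get_column("text"))    # ['a', 'b', 'c']
print(ds.get_column("n", k=2))  # [1, 2]
```

A concrete DJDataset subclass would be expected to honor the same contract on top of its own storage backend.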

class data_juicer.core.data.NestedDataset(*args, **kargs)[source]

Bases: Dataset, DJDataset

Enhanced HuggingFace-Dataset for better usability and efficiency.

__init__(*args, **kargs)[source]
schema() Schema[source]

Get dataset schema.

get(k: int) List[Dict[str, Any]][source]

Get k rows from the dataset.

get_column(column: str, k: int | None = None) List[Any][source]

Get column values from the HuggingFace dataset.

Parameters:
  • column – Name of the column to retrieve

  • k – Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError – If column doesn’t exist

  • ValueError – If k is negative

process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True)[source]

Apply a list of operators to the dataset.
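As a conceptual sketch only, applying a list of operators can be pictured as feeding the output of each operator into the next. Here operators are modeled as plain callables over a list of rows, which is an assumption; the real method works with Data-Juicer operator objects and also accepts work_dir, exporter, checkpointer, tracer, adapter and open_monitor, all of which are left out. The names apply_operators, lowercase_text and drop_short are hypothetical.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]
Operator = Callable[[Rows], Rows]  # hypothetical operator shape


def apply_operators(rows: Rows, operators: List[Operator]) -> Rows:
    # Run the operators in list order, chaining their outputs.
    for op in operators:
        rows = op(rows)
    return rows


def lowercase_text(rows: Rows) -> Rows:
    # Mapper-like step: rewrite the "text" field of every row.
    return [{**row, "text": row["text"].lower()} for row in rows]


def drop_short(rows: Rows) -> Rows:
    # Filter-like step: keep rows whose text has at least 3 characters.
    return [row for row in rows if len(row["text"]) >= 3]


result = apply_operators(
    [{"text": "Hello"}, {"text": "Hi"}],
    [lowercase_text, drop_short],
)
print(result)  # [{'text': 'hello'}]
```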

update_args(args, kargs, is_filter=False)[source]
map(*args, **kargs)[source]

Override the map function, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

filter(*args, **kargs)[source]

Override the filter function, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

select(*args, **kargs)[source]

Override the select function so that the selected samples can be accessed in a nested manner.

classmethod from_dict(*args, **kargs)[source]

Override the from_dict function, which is called by most from_xx constructors, so that the constructed dataset object is a NestedDataset.

add_column(*args, **kargs)[source]

Override the add_column function so that the processed samples can be accessed in a nested manner.

select_columns(*args, **kargs)[source]

Override the select_columns function so that the processed samples can be accessed in a nested manner.

remove_columns(*args, **kargs)[source]

Override the remove_columns function so that the processed samples can be accessed in a nested manner.
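The overrides above all refer to nested access without spelling out the mechanism. One plausible reading, stated here as an assumption, is that a field inside a nested sample can be addressed with a single dotted key. The helper nested_get below is hypothetical and only illustrates that idea on plain dicts.

```python
from typing import Any


def nested_get(sample: dict, dotted_key: str) -> Any:
    """Hypothetical helper: resolve 'a.b.c' against nested dicts."""
    # A key that exists literally takes precedence over dotted resolution.
    if dotted_key in sample:
        return sample[dotted_key]
    current: Any = sample
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            raise KeyError(dotted_key)
        current = current[part]
    return current


sample = {"text": "hi", "meta": {"lang": "en", "stats": {"len": 2}}}
print(nested_get(sample, "meta.lang"))       # en
print(nested_get(sample, "meta.stats.len"))  # 2
```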

cleanup_cache_files()[source]

Override the cleanup_cache_files function to clear both raw and compressed cache files.

static load_from_disk(*args, **kargs)[source]

Loads a dataset that was previously saved using save_to_disk from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Parameters:
  • dataset_path (path-like) – Path (e.g. “dataset/train”) or remote URI (e.g. “s3://my-bucket/dataset/train”) of the dataset directory where the dataset will be loaded from.

  • keep_in_memory (bool, defaults to None) – Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See the “improve performance” section of the datasets cache documentation for more details.

  • storage_options (dict, optional) –

    Key/value pairs to be passed on to the file-system backend, if any.

    Added in version 2.8.0.

Returns:

  • If dataset_path is a path of a dataset directory, the dataset requested.

  • If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.

Return type:

Dataset or DatasetDict

Example:

    >>> ds = load_from_disk("path/to/dataset/directory")

data_juicer.core.data.wrap_func_with_nested_access(f)[source]

Before calling the actual function f, wrap its args and kargs into nested ones.

Parameters:

f – function to be wrapped.

Returns:

wrapped function
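The wrapping pattern described here can be sketched as a decorator that converts dict arguments before calling f. NestedView and wrap_with_nested_access are hypothetical names, and resolving dotted keys is an assumed meaning of "nested"; the library's own wrapper may differ in both respects.

```python
import functools


class NestedView(dict):
    """Hypothetical dict subclass that also resolves dotted keys."""

    def __getitem__(self, key):
        if dict.__contains__(self, key):
            value = dict.__getitem__(self, key)
        else:
            # Walk the dotted path one level at a time.
            value = self
            for part in str(key).split("."):
                if not isinstance(value, dict) or not dict.__contains__(value, part):
                    raise KeyError(key)
                value = dict.__getitem__(value, part)
        # Wrap nested dicts so that further lookups stay nested-aware.
        return NestedView(value) if isinstance(value, dict) else value


def wrap_with_nested_access(f):
    @functools.wraps(f)
    def wrapper(*args, **kargs):
        # Convert plain dict arguments before handing them to f.
        args = [NestedView(a) if isinstance(a, dict) else a for a in args]
        kargs = {
            name: NestedView(v) if isinstance(v, dict) else v
            for name, v in kargs.items()
        }
        return f(*args, **kargs)

    return wrapper


@wrap_with_nested_access
def get_lang(sample):
    return sample["meta.lang"]


print(get_lang({"meta": {"lang": "en"}}))  # en
```

Because the conversion happens in the wrapper, the wrapped function itself can be written as if every sample already supported nested lookups.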

data_juicer.core.data.add_same_content_to_new_column(sample, new_column_name, initial_value=None)[source]

A helper function to speed up the add_column function. Apply map on this function in parallel instead of using add_column.

Parameters:
  • sample – a single sample to add this new column/field.

  • new_column_name – the name of this new column/field.

  • initial_value – the initial value of this new column/field.
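A minimal sketch of what such a per-sample helper plausibly does is shown below; the real function body is not given here, so add_same_content is a hypothetical reimplementation on plain dicts, and the built-in map stands in for a dataset-level map.

```python
from functools import partial
from typing import Any, Dict


def add_same_content(
    sample: Dict[str, Any], new_column_name: str, initial_value: Any = None
) -> Dict[str, Any]:
    # Write the same initial value into the new field of one sample.
    sample[new_column_name] = initial_value
    return sample


rows = [{"text": "a"}, {"text": "b"}]
# Bind the column name and value once, then map over every sample.
add_flag = partial(add_same_content, new_column_name="flag", initial_value=False)
out = list(map(add_flag, rows))
print(out)  # [{'text': 'a', 'flag': False}, {'text': 'b', 'flag': False}]
```

With a HuggingFace dataset, the same per-sample function would typically be passed to map together with fn_kwargs and num_proc, so that the new field is filled in by several worker processes.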