data_juicer.core.data

class data_juicer.core.data.DJDataset

Bases: ABC

Base dataset class of Data-Juicer (DJ).

abstract process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset: Process a list of operators on the dataset.

abstract schema() → Schema

Get dataset schema.

Returns: Dataset schema containing column names and types
Return type: Schema

abstract get(k: int) → List[Dict[str, Any]]

Get k rows from the dataset.

Parameters: k – Number of rows to take
Returns: A list of rows from the dataset.
Return type: List[Dict[str, Any]]

abstract get_column(column: str, k: int | None = None) → List[Any]

Get values from a specific column/field, optionally limited to the first k rows.

Parameters:

column – Name of the column to retrieve
k – Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:

KeyError – If column doesn’t exist in dataset
ValueError – If k is negative
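
The contract documented above for get and get_column can be illustrated with a small stand-alone sketch. It does not import data_juicer: RowDataset and InMemoryDataset are hypothetical names, and the list-backed storage is an assumption made only to show the documented KeyError and ValueError behaviour.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class RowDataset(ABC):
    """Hypothetical stand-in for the abstract interface described above."""

    @abstractmethod
    def get(self, k: int) -> List[Dict[str, Any]]:
        """Return the first k rows."""

    @abstractmethod
    def get_column(self, column: str, k: Optional[int] = None) -> List[Any]:
        """Return values of one column, optionally limited to k rows."""


class InMemoryDataset(RowDataset):
    """Hypothetical list-backed implementation, for illustration only."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows
        # Assumption: every row shares the keys of the first row.
        self._columns = set(rows[0]) if rows else set()

    def get(self, k: int) -> List[Dict[str, Any]]:
        # Assumption: the docs only specify ValueError for get_column;
        # rejecting a negative k here as well keeps slicing predictable.
        if k < 0:
            raise ValueError("k must be non-negative")
        return self._rows[:k]

    def get_column(self, column: str, k: Optional[int] = None) -> List[Any]:
        if k is not None and k < 0:
            raise ValueError("k must be non-negative")
        if column not in self._columns:
            raise KeyError(column)
        rows = self._rows if k is None else self._rows[:k]
        return [row[column] for row in rows]


ds = InMemoryDataset(
    [{"text": "a", "n": 1}, {"text": "b", "n": 2}, {"text": "c", "n": 3}]
)
first_two = ds.get(2)               # first two rows as dicts
texts = ds.get_column("text", k=2)  # first two values of "text"
all_n = ds.get_column("n")          # k=None returns every row
```

A concrete subclass such as NestedDataset is expected to honour the same contract, although its storage and validation details may differ from this sketch.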

class data_juicer.core.data.NestedDataset(*args, **kargs)

Bases: Dataset, DJDataset

An enhanced HuggingFace Dataset for better usability and efficiency.

__init__(*args, **kargs)

schema() → Schema: Get dataset schema.

get(k: int) → List[Dict[str, Any]]: Get k rows from the dataset.

get_column(column: str, k: int | None = None) → List[Any]

Get column values from HuggingFace dataset.

Parameters:

column – Name of the column to retrieve
k – Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:

KeyError – If column doesn’t exist
ValueError – If k is negative

process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True): Process a list of operators on the dataset.
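
As a rough illustration of what "process a list of operators" means, the sketch below applies operators one after another to a list of rows. It is not the library's implementation: the operator classes, their run method, and the tracer callback are all hypothetical stand-ins, and the exporter, checkpointer, adapter, and monitoring arguments are left out.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


class LowercaseText:
    """Hypothetical mapper-style operator: rewrites every row."""

    def run(self, rows: List[Row]) -> List[Row]:
        return [{**row, "text": row["text"].lower()} for row in rows]


class MinLength:
    """Hypothetical filter-style operator: drops rows with short text."""

    def __init__(self, min_len: int) -> None:
        self.min_len = min_len

    def run(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if len(row["text"]) >= self.min_len]


def process(
    rows: List[Row],
    operators: Any,
    tracer: Optional[Callable[[str, int], None]] = None,
) -> List[Row]:
    """Apply each operator in order, feeding its output to the next one."""
    # Accept a single operator as well as a list of operators.
    if not isinstance(operators, list):
        operators = [operators]
    for op in operators:
        rows = op.run(rows)
        if tracer is not None:
            # Hypothetical callback: report operator name and remaining rows.
            tracer(type(op).__name__, len(rows))
    return rows


trace = []
result = process(
    [{"text": "Hello World"}, {"text": "Hi"}],
    [LowercaseText(), MinLength(5)],
    tracer=lambda name, n: trace.append((name, n)),
)
```

Operator order matters here: each operator sees only the rows that the previous one returned.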

update_args(args, kargs, is_filter=False)

map(*args, **kargs): Override map, which most common operations call, so that processed samples can be accessed in a nested manner.

filter(*args, **kargs): Override filter, which most common operations call, so that processed samples can be accessed in a nested manner.

select(*args, **kargs): Override select so that selected samples can be accessed in a nested manner.

classmethod from_dict(*args, **kargs): Override from_dict, which most from_xx constructors call, so that the constructed dataset object is a NestedDataset.

add_column(*args, **kargs): Override add_column so that processed samples can be accessed in a nested manner.

select_columns(*args, **kargs): Override select_columns so that processed samples can be accessed in a nested manner.

remove_columns(*args, **kargs): Override remove_columns so that processed samples can be accessed in a nested manner.
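
The overrides above all exist to keep samples accessible in a nested manner. The following sketch assumes that this means resolving dotted keys such as "meta.lang" against nested dictionaries; nested_lookup is a hypothetical helper written for illustration, not a function exported by this module.

```python
from typing import Any, Mapping


def nested_lookup(sample: Mapping[str, Any], key: str) -> Any:
    """Resolve a dotted key such as "meta.lang" against nested mappings."""
    # An exact top-level key wins, so ordinary lookups behave as usual.
    if key in sample:
        return sample[key]
    value: Any = sample
    for part in key.split("."):
        # Stop as soon as the path leaves the nested structure.
        if not isinstance(value, Mapping) or part not in value:
            raise KeyError(key)
        value = value[part]
    return value


sample = {"text": "hi", "meta": {"lang": "en", "source": {"name": "web"}}}
lang = nested_lookup(sample, "meta.lang")
source_name = nested_lookup(sample, "meta.source.name")
text = nested_lookup(sample, "text")
```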

cleanup_cache_files(): Override cleanup_cache_files to clear both raw and compressed cache files.

static load_from_disk(*args, **kargs)

Loads a dataset that was previously saved using save_to_disk from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Parameters:

dataset_path (path-like) – Path (e.g. “dataset/train”) or remote URI (e.g. “s3://my-bucket/dataset/train”) of the dataset directory from which the dataset will be loaded.
keep_in_memory (bool, defaults to None) – Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See the “improve performance” section of the datasets cache documentation for details.
storage_options (dict, optional) – Key/value pairs to be passed on to the file-system backend, if any. Added in version 2.8.0.

Returns:

If dataset_path is a path of a dataset directory, the dataset requested.
If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.

Return type:

Dataset or DatasetDict

Example:

>>> ds = load_from_disk("path/to/dataset/directory")

data_juicer.core.data.wrap_func_with_nested_access(f)

Before calling the actual function f, wrap its args and kargs into nested ones.

Parameters: f – function to be wrapped.
Returns: wrapped function

data_juicer.core.data.add_same_content_to_new_column(sample, new_column_name, initial_value=None)

A helper function to speed up the add_column function. Apply map on this function in parallel instead of using add_column.

Parameters:

sample – a single sample to add this new column/field.
new_column_name – the name of this new column/field.
initial_value – the initial value of this new column/field.
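
A minimal sketch of the per-sample pattern described above, assuming the helper only assigns the initial value to the new field and returns the sample. The function add_constant_column is a hypothetical stand-in, and a plain list of dicts replaces the dataset so that the example runs without data_juicer or datasets installed.

```python
from typing import Any, Dict, List


def add_constant_column(
    sample: Dict[str, Any], new_column_name: str, initial_value: Any = None
) -> Dict[str, Any]:
    """Hypothetical stand-in: set one new field on a single sample."""
    sample[new_column_name] = initial_value
    return sample


rows: List[Dict[str, Any]] = [{"text": "a"}, {"text": "b"}]

# Each sample is handled independently, which is what makes the work
# easy to spread across processes; dict(row) copies the row first so
# the input list stays unchanged.
updated = [add_constant_column(dict(row), "flagged", False) for row in rows]
```

With a HuggingFace dataset, the same function would typically be passed to map together with fn_kwargs for the column name and initial value and num_proc for parallelism, rather than called in a list comprehension.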