data_juicer.core.data

class data_juicer.core.data.DJDataset [source]

Bases: ABC

Base dataset class for Data-Juicer (DJ).

contain_column(column: str) -> bool [source]

Check whether the dataset contains a specific column/field.

Parameters:

column -- Name of the column to check

Returns:

True if the dataset contains the column, False otherwise

Return type:

bool

abstractmethod get(k: int) -> List[Dict[str, Any]] [source]

Get k rows from the dataset.

Parameters:

k -- Number of rows to take

Returns:

A list of rows from the dataset.

Return type:

List[Dict[str, Any]]

abstractmethod get_column(column: str, k: int | None = None) -> List[Any] [source]

Get values from a specific column/field, optionally limited to the first k rows.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist in dataset

  • ValueError -- If k is negative
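The contract described above (first k rows, KeyError for an unknown column, ValueError for a negative k) can be illustrated with a minimal sketch over a plain list of dicts. This is not the Data-Juicer implementation: the standalone `get_column` function, the order of the two checks, and the treatment of an empty dataset are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def get_column(rows: List[Dict[str, Any]], column: str, k: Optional[int] = None) -> List[Any]:
    """Hypothetical stand-in: return values of `column`, optionally the first k rows only."""
    # Assumption: a negative k is rejected before the column lookup.
    if k is not None and k < 0:
        raise ValueError(f"k must be non-negative, got {k}")
    # Assumption: the column set is taken from the first row; an empty dataset has no columns.
    if not rows or column not in rows[0]:
        raise KeyError(f"Column '{column}' not found in dataset")
    selected = rows if k is None else rows[:k]
    return [row[column] for row in selected]


rows = [
    {"text": "a", "score": 0.1},
    {"text": "b", "score": 0.7},
    {"text": "c", "score": 0.4},
]
first_two = get_column(rows, "text", k=2)  # limited to the first two rows
all_scores = get_column(rows, "score")     # k=None returns every row
```

Passing `k=None` rather than a sentinel such as `-1` keeps "all rows" distinct from the invalid negative case that raises ValueError.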

abstractmethod process(operators, *, exporter=None, checkpointer=None, tracer=None) -> DJDataset [source]

Apply a list of operators to the dataset.

abstractmethod schema() -> Schema [source]

Get dataset schema.

Returns:

Dataset schema containing column names and types

Return type:

Schema

abstractmethod to_list() -> list [source]

Convert the current dataset to a Python list.
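To show how an abstract interface of this shape is typically implemented, here is a minimal in-memory sketch. It does not import data_juicer: `DatasetInterface` is a hypothetical stand-in that mirrors only a subset of the documented methods, `InMemoryDataset` is an illustrative subclass, and treating operators as plain callables on a row is an assumption (the exporter, checkpointer and tracer arguments are omitted).

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class DatasetInterface(ABC):
    """Hypothetical stand-in mirroring a subset of the documented abstract methods."""

    @abstractmethod
    def get(self, k: int) -> List[Row]: ...

    @abstractmethod
    def process(self, operators: List[Callable[[Row], Row]]) -> "DatasetInterface": ...

    @abstractmethod
    def to_list(self) -> list: ...


class InMemoryDataset(DatasetInterface):
    """Illustrative list-backed implementation of the stand-in interface."""

    def __init__(self, rows: List[Row]):
        self._rows = [dict(r) for r in rows]

    def get(self, k: int) -> List[Row]:
        return self._rows[:k]

    def process(self, operators: List[Callable[[Row], Row]]) -> "InMemoryDataset":
        rows = self._rows
        for op in operators:
            # Each operator receives a copy, so the source dataset stays unchanged.
            rows = [op(dict(row)) for row in rows]
        return InMemoryDataset(rows)

    def to_list(self) -> list:
        return [dict(r) for r in self._rows]

    def contain_column(self, column: str) -> bool:
        return bool(self._rows) and column in self._rows[0]


def strip_text(row: Row) -> Row:
    row["text"] = row["text"].strip()
    return row


def add_length(row: Row) -> Row:
    row["length"] = len(row["text"])
    return row


ds = InMemoryDataset([{"text": " Hello "}, {"text": "World"}])
processed = ds.process([strip_text, add_length])
```

Because `process` returns a new dataset, calls can be chained while the original object remains available for comparison.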

class data_juicer.core.data.NestedDataset(*args, **kargs) [source]

Bases: Dataset, DJDataset

An enhanced HuggingFace Dataset for better usability and efficiency.

__init__(*args, **kargs) [source]
add_column(*args, **kargs) [source]

Override the add_column function so that the processed samples can be accessed in a nested manner.

cleanup_cache_files() [source]

Override the cleanup_cache_files function to clear both raw and compressed cache files.

filter(*args, **kargs) [source]

Override the filter function, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

classmethod from_dict(*args, **kargs) [source]

Override the from_dict function, which is called by most from_xx constructors, so that the constructed dataset object is a NestedDataset.

get(k: int) -> List[Dict[str, Any]] [source]

Get k rows from the dataset.

get_column(column: str, k: int | None = None) -> List[Any] [source]

Get column values from the HuggingFace dataset.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist

  • ValueError -- If k is negative

static load_from_disk(*args, **kargs) [source]

Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Parameters:
  • dataset_path (path-like) -- Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset directory where the dataset will be loaded from.

  • keep_in_memory (bool, defaults to None) -- Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the [improve performance](../cache#improve-performance) section.

  • storage_options (dict, optional) --

    Key/value pairs to be passed on to the file-system backend, if any.

    Added in version 2.8.0.

Returns:

  • If dataset_path is a path of a dataset directory, the dataset requested.

  • If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.

Return type:

[Dataset] or [DatasetDict]

Example:

```py
>>> from datasets import load_from_disk
>>> ds = load_from_disk("path/to/dataset/directory")
```

map(*args, **kargs) [source]

Override the map function, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True) [source]

Apply a list of operators to the dataset.

remove_columns(*args, **kargs) [source]

Override the remove_columns function so that the processed samples can be accessed in a nested manner.

schema() -> Schema [source]

Get dataset schema.

select(*args, **kargs) [source]

Override the select function so that the selected samples can be accessed in a nested manner.

select_columns(*args, **kargs) [source]

Override the select_columns function so that the processed samples can be accessed in a nested manner.

to_list() -> list [source]

Returns the dataset as a Python list.

Returns:

list

Example:

```py
>>> ds.to_list()
```

update_args(args, kargs, is_filter=False) [source]

data_juicer.core.data.wrap_func_with_nested_access(f) [source]

Before calling the actual function f, wrap its args and kargs into nested ones.

Parameters:

f -- function to be wrapped.

Returns:

wrapped function
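The wrapping pattern can be sketched with a decorator that converts dict arguments into a nested-access type before calling f. This is not the Data-Juicer code: `NestedDict`, `wrap_with_nested_access`, and the reading of "nested" as dotted-key lookup (for example `sample["meta.lang"]`) are assumptions made for illustration.

```python
from functools import wraps
from typing import Any, Callable


class NestedDict(dict):
    """Hypothetical dict that also resolves dotted keys such as 'meta.lang'."""

    def __getitem__(self, key: Any) -> Any:
        # Plain keys (and non-string keys) use the normal dict lookup.
        if dict.__contains__(self, key) or not isinstance(key, str):
            return dict.__getitem__(self, key)
        # Otherwise walk the dotted path one level at a time.
        value: Any = self
        for part in key.split("."):
            if not isinstance(value, dict):
                raise KeyError(key)
            value = dict.__getitem__(value, part)
        return value


def _nest(obj: Any) -> Any:
    """Wrap plain dicts; leave every other object untouched."""
    if isinstance(obj, dict) and not isinstance(obj, NestedDict):
        return NestedDict(obj)
    return obj


def wrap_with_nested_access(f: Callable) -> Callable:
    """Hypothetical wrapper: nest positional and keyword arguments, then call f."""

    @wraps(f)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        nested_args = [_nest(a) for a in args]
        nested_kwargs = {name: _nest(v) for name, v in kwargs.items()}
        return f(*nested_args, **nested_kwargs)

    return wrapper


@wrap_with_nested_access
def get_language(sample):
    return sample["meta.lang"]


lang = get_language({"text": "hi", "meta": {"lang": "en"}})
```

Wrapping at call time means the user-supplied function needs no changes: it receives ordinary-looking samples that additionally accept dotted keys.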

data_juicer.core.data.add_same_content_to_new_column(sample, new_column_name, initial_value=None) [source]

A helper function to speed up the add_column function. Apply map with this function in parallel instead of calling add_column.

Parameters:
  • sample -- A single sample to which the new column/field is added

  • new_column_name -- Name of the new column/field

  • initial_value -- Initial value of the new column/field
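The per-sample pattern described here can be sketched with the standard library alone. The function `add_constant_column` below is a hypothetical stand-in with the same parameters, not the library's code, and the built-in `map` takes the place of a dataset-level map.

```python
from functools import partial
from typing import Any, Dict


def add_constant_column(sample: Dict[str, Any], new_column_name: str, initial_value: Any = None) -> Dict[str, Any]:
    """Hypothetical stand-in: set the same initial value on one sample and return it."""
    sample[new_column_name] = initial_value
    return sample


samples = [{"text": "a"}, {"text": "b"}]

# Bind the column name and value once, then map the one-argument function over the samples.
add_stats = partial(add_constant_column, new_column_name="stats", initial_value=0)
with_stats = list(map(add_stats, (dict(s) for s in samples)))
```

With a HuggingFace Dataset, a function of this shape would typically be passed to `Dataset.map` with the extra arguments supplied through `fn_kwargs` and parallelism through `num_proc`. Whether Data-Juicer calls it exactly that way is not stated in this section.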