Dataset | Twinkle

Basic Dataset Components

Mon, 01 Jan 0001 00:00:00 +0000

DatasetMeta

Open-source community datasets can be defined by three fields:

Dataset name: Represents the dataset ID, e.g., swift/self-cognition.
Subset name: A dataset may contain multiple subsets, and each subset may have a different format.
Subset split: Common splits include train/test, etc., used for training, validation, etc.

Using the Hugging Face community’s datasets library, you can see an example of loading a dataset:

from datasets import load_dataset
train_data = load_dataset("glue", "mrpc", split="train")

In Twinkle’s dataset input, the DatasetMeta class is used to express the input data format. This class contains:

@dataclass
class DatasetMeta:
 dataset_id: str
 subset_name: str = 'default'
 split: str = 'train'
 data_slice: Iterable = None

The first three fields correspond to the dataset name, subset name, and split respectively. The fourth field data_slice is the data range to be selected, for example:

dataset_meta = DatasetMeta(..., data_slice=range(100))

When using this class, developers don’t need to worry about data_slice going out of bounds. Twinkle will perform repeated sampling based on the dataset length.

Note: data_slice has no effect on streaming datasets.

Dataset

Twinkle’s Dataset is a lightweight wrapper around the actual dataset, including operations such as downloading, loading, mixing, preprocessing, and encoding.

Loading datasets

from twinkle.dataset import Dataset, DatasetMeta

dataset = Dataset(DatasetMeta(dataset_id='ms://swift/self-cognition', data_slice=range(1500)))

The ms:// prefix of the dataset represents downloading from the ModelScope community. If replaced with hf://, it will download from the Hugging Face community. If there is no prefix, it defaults to downloading from the Hugging Face community. You can also pass a local path:

from twinkle.dataset import Dataset, DatasetMeta

dataset = Dataset(DatasetMeta(dataset_id='my/custom/dataset.jsonl', data_slice=range(1500)))

If using a local path or a local file, please follow these instructions:

If you are using a local dataset file, pass a single file path (better to be an absolute path to avoid relative path errors), list is not supported.
If you are using a local dir, please make sure all files in the path have the same data structure and file extension.
We use datasets library to do data loading, check the support extensions .
Setting template

The Template component is responsible for converting string/image multimodal raw data into model input tokens. The dataset can set a Template to complete the encode process.

dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B', max_length=512)

The set_template method supports passing kwargs (such as max_length in the example) to be used as constructor parameters for Template.

Adding datasets

dataset.add_dataset(DatasetMeta(dataset_id='ms://xxx/xxx', data_slice=range(1000)))

add_dataset can add other datasets on top of existing datasets and subsequently call mix_dataset to mix them together.

Preprocessing data

The data preprocessing (ETL) process is an important workflow for data cleaning and standardization. For example:

{
 "query": "some query here",
 "response": "some response with extra info",
}

In this raw data, the response may contain non-standard information. Before starting training, the response needs to be filtered and fixed, and replaced with Twinkle’s standard format. So you can write a method to process the corresponding data:

from twinkle.data_format import Trajectory, Message
from twinkle.dataset import DatasetMeta
def preprocess_row(row):
 query = row['query']
 response = row['response']
 if not query or not response:
 return None
 # Fix response
 response = _do_some_fix_on_response(response)
 return Trajectory(
 messages=[
 Message(role='user', content=query),
 Message(role='assistant', content=response)
 ]
 )

dataset.map(preprocess_row, dataset_meta=DatasetMeta(dataset_id='ms://xxx/xxx'))

Tips:

Currently, the map interface of Dataset does not support batched=True mode

If a row has a problem, return None, and dataset.map will automatically filter empty rows

Different datasets may have different preprocessing methods, so an additional dataset_meta parameter needs to be passed. If the add_dataset method has not been called, i.e., there is only one dataset in the Dataset, this parameter can be omitted

Similarly, Dataset provides a filter method:

def filter_row(row):
 if ...:
 return False
 else:
 return True

dataset.filter(filter_row, dataset_meta=DatasetMeta(dataset_id='ms://xxx/xxx'))

Mixing datasets

After adding multiple datasets to the Dataset, you need to use mix_dataset to mix them.

dataset.mix_dataset()

Encoding dataset

Before inputting to the model, the dataset must go through tokenization and encoding to be converted into tokens. This process is usually completed by the tokenizer component. However, in current large model training processes, tokenizer is generally not used directly. This is because model training requires preparation of additional fields, and simply performing the tokenizer.encode process is not sufficient. In Twinkle, encoding the dataset is completed by the Template component. We have already described how to set up Template above. Now you can directly encode:

dataset.encode()

Dataset’s map, encode, filter, and other methods all use the map method of datasets, so you can use the corresponding parameters in the kwargs of the corresponding methods

The load_from_cache_file parameter defaults to False, because when this parameter is set to True, it can cause headaches when the dataset changes but training still uses the cache. If your dataset is large and updated infrequently, you can directly set it to True

encode does not need to specify DatasetMeta because after preprocessing, all datasets have the same format

encode tokenizes with a single process by default. For large datasets, enable multi-process parallelism via num_proc, e.g. dataset.encode(num_proc=8)

Getting data

Like ordinary datasets, Twinkle’s Dataset can use data through indexing.

trajectory = dataset[0]
length = len(dataset)

Remote execution support

The Dataset class is marked with the @remote_class decorator, so it can run in Ray:

dataset = Dataset(..., remote_group='actor_group')
# The following methods will run on Ray workers
dataset.map(...)

The Ray execution of the Dataset component is in first mode, meaning only one worker process runs and loads.

The overall dataset usage workflow is:

Construct the dataset, passing in the remote_group parameter if running in a Ray worker

Set template

Preprocess data

If multiple datasets are added, mix the data

Encode data

Lazy Loading Dataset

Mon, 01 Jan 0001 00:00:00 +0000

LazyDataset is a variant of Dataset that defers expensive operations (preprocessing, encoding) to __getitem__ time, preventing OOM for large or multimodal datasets.

Key Differences from Dataset

Operation	Dataset	LazyDataset
`map`	Executes immediately on all data	Records the operation, applies per-item in `__getitem__`
`filter`	Executes immediately	Executes immediately (same as Dataset, index mapping required)
`mix_dataset`	Merges datasets immediately	Records strategy, resolves indices lazily
`encode`	Encodes all data immediately	Records flag, encodes per-item in `__getitem__`

Lazy Map

When you call map, LazyDataset records the preprocessing function instead of applying it eagerly:

from twinkle.dataset import LazyDataset, DatasetMeta

dataset = LazyDataset(DatasetMeta(dataset_id='ms://xxx/xxx'))
dataset.add_dataset(DatasetMeta(dataset_id='ms://yyy/yyy'))

# Per-dataset preprocessing (before mix)
dataset.map(preprocess_fn_a, dataset_meta=DatasetMeta(dataset_id='ms://xxx/xxx'))
dataset.map(preprocess_fn_b, dataset_meta=DatasetMeta(dataset_id='ms://yyy/yyy'))

dataset.mix_dataset()

# Global preprocessing (after mix, applies to all items)
dataset.map(global_preprocess_fn)

Before mix: map is recorded per-dataset, so different datasets can have different preprocessing pipelines.
After mix: map is recorded globally and applies to all items regardless of source dataset.
All map operations are applied lazily in __getitem__ in the order they were registered.

Lazy Mix

mix_dataset supports two strategies:

dataset.mix_dataset(interleave=True) # Round-robin interleaving (default)
dataset.mix_dataset(interleave=False) # Concatenation

Interleave: Items cycle through datasets in round-robin order. Shorter datasets wrap around.
Concatenate: Items are accessed sequentially — all of dataset A, then all of dataset B.

Lazy Encode

Calling encode only marks the dataset for encoding. The actual template.encode() call happens inside __getitem__:

dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B', max_length=512)
dataset.encode()

Note: truncation_strategy='split' is not supported in LazyDataset because splitting may produce multiple outputs from a single item.

Eager Filter

Unlike other operations, filter executes immediately because it needs to build the index mapping of valid items upfront:

dataset.filter(filter_fn, dataset_meta=DatasetMeta(dataset_id='ms://xxx/xxx'))

Remote Execution

LazyDataset has the @remote_class decorator and can run in Ray workers, just like Dataset.

Fixed-Length Packing Dataset

Mon, 01 Jan 0001 00:00:00 +0000

Packing datasets are used to concatenate variable-length data to a specified length. For example:

The dataset contains 4 pieces of data with length 5, and the Template component’s max_length can accept a length of 10. The packing dataset will pre-fetch the data and concatenate it into 2 samples with length 10.

ABCDE
FGHIJ
KLMNO
PQRST

Will be converted to

ABCDEFGHIJ
KLMNOPQRST

Note that this concatenation occurs after encode, i.e., on the actual model input length. In the process, the dataset will perform the following operations:

Fetch buffer length samples
Encode these samples
Calculate based on the length of each sample using an automatic packing algorithm to find an optimal solution that minimizes the number of batches and makes the length of each sample closest to max_length
Add a position_ids field to distinguish different samples.

The final data format is similar to:

{
 "input_ids": [1,2,3,4,5,6,7,8,9,10],
 "position_ids": [0,1,2,3,4,0,1,2,3,4],
 ...
}

The use of the dataset has the following differences from Dataset:

Must set Template
After calling encode, you need to call the pack_dataset method for final packing

dataset.pack_dataset()

This dataset also has the @remote_class decorator and can run in Ray workers.

Streaming Dataset

Mon, 01 Jan 0001 00:00:00 +0000

Streaming datasets are used to load datasets in a streaming manner, generally used for ultra-large-scale datasets or multimodal datasets to save memory usage. Streaming datasets have no index and length, and can only be accessed through iterators.

Twinkle’s streaming dataset methods are the same as Dataset. However, since it does not provide __getitem__ and __len__ methods, streaming datasets need to use next for access:

from twinkle.dataset import IterableDataset, DatasetMeta

dataset = IterableDataset(DatasetMeta(...))
trajectory = next(dataset)

Streaming datasets also have the @remote_class decorator and can run in Ray workers.

Streaming Fixed-Length Packing Dataset

Mon, 01 Jan 0001 00:00:00 +0000

IterablePackingDataset is the same as PackingDataset, both used for automatic concatenation and packing of datasets. The difference is that IterablePackingDataset is adapted for streaming reading in large datasets or multimodal scenarios.

This dataset also requires an additional call to pack_dataset() to enable the packing process.

dataset.pack_dataset()

This dataset also has the @remote_class decorator and can run in Ray workers.