Data Loading Module¶
1. Overview¶
The data loading module provides a unified, flexible data loading interface that supports loading data from multiple data sources and converting them to standardized formats. This module is located in the rm_gallery/core/data/load/
directory.
2. Core Architecture¶
Design Patterns¶
Strategy Pattern: Supports different data loading strategies
FileDataLoadStrategy
: Local file loadingHuggingFaceDataLoadStrategy
: HuggingFace dataset loading
Registry Pattern: Dynamic registration and management of data converters
DataConverterRegistry
: Converter registry center- Supports runtime registration of new data format converters
Template Method Pattern: Unified data conversion interface
DataConverter
: Abstract converter base class- Various concrete converters implement specific format conversion logic
3. Supported Data Sources¶
Local Files¶
- Supported Formats: JSON (
.json
), JSONL (.jsonl
), Parquet (.parquet
) - Core Features:
- Automatic file type detection
- Batch file loading
- Recursive directory scanning
Hugging Face Datasets¶
- Data Source: Hugging Face Hub public datasets
- Core Features:
- Streaming data loading
- Flexible configuration options
- Support for dataset sharding
4. Built-in Data Converters¶
ChatMessageConverter (chat_message
)¶
Specifically handles chat conversation format data:
{
"messages": [
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I help you?"}
]
}
GenericConverter (*
)¶
Generic converter that automatically recognizes common fields:
{
"prompt": "User input", # Supported fields: question, input, text, instruction
"response": "Model reply" # Supported fields: answer, output, completion
}
Supported Benchmark Datasets¶
Currently built-in support for converters for the following benchmark datasets (located in rm_gallery/gallery/data/load/
):
- rewardbench
- rewardbench2
- helpsteer2
- prmbench
- rmbbenchmark_bestofn
- rmbbenchmark_pairwise
Each dataset has a corresponding dedicated converter that can correctly handle its specific data format and field structure.
5. Quick Start¶
Local File Loading¶
# Implementation by creating factory function
from rm_gallery.core.data.load.base import create_loader
from rm_gallery.core.data.build import create_builder
import rm_gallery.core.data # Core strategy registration
import rm_gallery.gallery.data # Extended strategy registration
config = {
"path": "../../../data/reward-bench-2/data/test-00000-of-00001.parquet",
"limit": 1000, # Limit the number of data items to load
}
# Create loading module
load_module = create_loader(
name="rewardbench2",
load_strategy_type="local",
data_source="rewardbench2",
config=config
)
# Create complete pipeline
pipeline = create_builder(
name="load_pipeline",
load_module=load_module
)
# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")
Successfully loaded 1000 data items
Hugging Face Dataset Loading¶
# Implementation by creating factory function
from rm_gallery.core.data.load.base import create_loader
from rm_gallery.core.data.build import create_builder
config = {
"huggingface_split": "test", # Dataset split (train/test/validation)
"limit": 1000, # Limit the number of data items to load
"streaming": False # Whether to use streaming loading
}
# Create loading module
load_module = create_loader(
name="allenai/reward-bench-2",
load_strategy_type="huggingface",
data_source="rewardbench",
config=config
)
# Create complete pipeline
pipeline = create_builder(
name="load_pipeline",
load_module=load_module
)
# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")
Data Export¶
Built-in data export capabilities supporting multiple format data export: jsonl, parquet, json, and splitting into training and test sets.
from rm_gallery.core.data.load.base import create_loader
from rm_gallery.core.data.build import create_builder
from rm_gallery.core.data.export import create_exporter
import rm_gallery.core.data # Core strategy registration
import rm_gallery.gallery.data # Extended strategy registration
config = {
"path": "../../../data/reward-bench-2/data/test-00000-of-00001.parquet",
"limit": 1000, # Limit the number of data items to load
}
# Create loading module
load_module = create_loader(
name="rewardbench2",
load_strategy_type="local",
data_source="rewardbench2",
config=config
)
export_module = create_exporter(
name="rewardbench2",
config={
"output_dir": "./exports",
"formats": ["jsonl"],
"split_ratio": {"train": 0.8, "test": 0.2}
}
)
# Create complete pipeline
pipeline = create_builder(
name="load_pipeline",
load_module=load_module,
export_module=export_module
)
# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")
2025-07-02 12:26:34.230 | INFO | rm_gallery.core.data.build:run:85 - Starting data build pipeline: load_pipeline 2025-07-02 12:26:34.232 | INFO | rm_gallery.core.data.build:run:97 - Stage: Loading 2025-07-02 12:26:34.669 | INFO | rm_gallery.core.data.load.base:_load_data_impl:392 - Loaded 1865 samples from file: ../../../data/reward-bench-2/data/test-00000-of-00001.parquet 2025-07-02 12:26:34.670 | INFO | rm_gallery.core.data.load.base:run:262 - Applied limit of 1000, final count: 1000 2025-07-02 12:26:34.670 | INFO | rm_gallery.core.data.load.base:run:276 - Successfully loaded 1000 items from rewardbench2 2025-07-02 12:26:34.673 | INFO | rm_gallery.core.data.build:run:99 - Loading completed: 1000 items 2025-07-02 12:26:34.674 | INFO | rm_gallery.core.data.build:run:97 - Stage: Export 2025-07-02 12:26:34.675 | INFO | rm_gallery.core.data.export:_split_dataset:381 - Individual split: 800 training samples, 200 test samples 2025-07-02 12:26:34.859 | INFO | rm_gallery.core.data.export:_export_jsonl:452 - Exported to JSONL: exports/rewardbench2_train.jsonl 2025-07-02 12:26:34.908 | INFO | rm_gallery.core.data.export:_export_jsonl:452 - Exported to JSONL: exports/rewardbench2_test.jsonl 2025-07-02 12:26:34.908 | INFO | rm_gallery.core.data.export:run:138 - Successfully exported 1000 samples to exports 2025-07-02 12:26:34.908 | INFO | rm_gallery.core.data.build:run:99 - Export completed: 1000 items 2025-07-02 12:26:34.909 | INFO | rm_gallery.core.data.build:run:101 - Pipeline completed: 1000 items processed
Successfully loaded 1000 data items
6. Data Output Format¶
BaseDataSet Structure¶
All loaded data is encapsulated as a BaseDataSet
object:
BaseDataSet(
name="dataset_name", # Dataset name
metadata={ # Metadata information
"source": "data_source",
"strategy_type": "local|huggingface",
"config": {...}
},
datasamples=[DataSample(...), ...] # List of standardized data samples
)
DataSample Structure¶
Each data sample is uniformly converted to DataSample
format:
DataSample(
unique_id="md5_hash_id", # Unique identifier for the data
input=[ # Input message list
ChatMessage(role="user", content="...")
],
output=[ # Output data list
DataOutput(answer=Step(...))
],
source="data_source_name", # Data source name
task_category="chat|qa|instruction_following|general", # Task category
metadata={ # Detailed metadata
"raw_data": {...}, # Raw data
"load_strategy": "ConverterName", # Converter used
"source_file_path": "...", # Source file path (local files)
"dataset_name": "...", # Dataset name (HF datasets)
"load_type": "local|huggingface" # Loading method
}
)
7. Custom Data Converters¶
If you need to support new data formats, you can create custom converters by following these steps:
Step 1: Implement Converter Class¶
Create a converter file in the rm_gallery/gallery/data/load/
directory:
from rm_gallery.core.data.load.base import DataConverter, DataConverterRegistry
@DataConverterRegistry.register("custom_format")
class CustomConverter(DataConverter):
"""Custom data format converter"""
def convert_to_data_sample(self, data_dict, source_info):
"""
Convert raw data to DataSample format
Args:
data_dict: Raw data dictionary
source_info: Data source information
Returns:
DataSample: Standardized data sample
"""
# Implement specific conversion logic
return DataSample(...)
Step 2: Register Converter¶
Import the converter in rm_gallery/gallery/data/__init__.py
to complete registration:
from rm_gallery.gallery.data.load.custom_format import CustomConverter