1. Overview

The data loading module provides a unified, flexible interface that supports loading data from multiple sources and converting it to a standardized format. The module is located in the rm_gallery/core/data/load/ directory.

2. Core Architecture

Design Patterns

  • Strategy Pattern: Supports different data loading strategies
    • FileDataLoadStrategy: Local file loading
    • HuggingFaceDataLoadStrategy: HuggingFace dataset loading
  • Registry Pattern: Dynamic registration and management of data converters
    • DataConverterRegistry: Converter registry center
    • Supports runtime registration of new data format converters
  • Template Method Pattern: Unified data conversion interface
    • DataConverter: Abstract converter base class
    • Concrete converters implement format-specific conversion logic
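The registry pattern described above can be sketched generically as a name-to-class mapping with a decorator for registration. This is an illustrative, self-contained sketch; the names `ConverterRegistry`, `ChatConverter`, and `GenericFallback` are hypothetical stand-ins, and the real DataConverterRegistry API may differ.

```python
class ConverterRegistry:
    """Maps a format name to a converter class (illustrative sketch)."""
    _converters = {}

    @classmethod
    def register(cls, name):
        """Class decorator that registers a converter under `name`."""
        def decorator(converter_cls):
            cls._converters[name] = converter_cls
            return converter_cls
        return decorator

    @classmethod
    def get(cls, name):
        # Fall back to a wildcard ("*") converter when no exact match exists
        return cls._converters.get(name, cls._converters.get("*"))


@ConverterRegistry.register("chat_message")
class ChatConverter:
    pass


@ConverterRegistry.register("*")
class GenericFallback:
    pass
```

Because registration happens at class-definition time, simply importing a module that defines a decorated converter is enough to make it available, which is why the examples below import rm_gallery.core.data and rm_gallery.gallery.data for their side effects.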

3. Supported Data Sources

Local Files

  • Supported Formats: JSON (.json), JSONL (.jsonl), Parquet (.parquet)
  • Core Features:
    • Automatic file type detection
    • Batch file loading
    • Recursive directory scanning
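File type detection and recursive scanning can be sketched with the standard library. This is a hypothetical illustration of the behavior described above, not the loader's actual implementation; `detect_file_type` and `scan_directory` are names introduced here.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".json", ".jsonl", ".parquet"}


def detect_file_type(path):
    """Return the data format implied by the file extension, or None."""
    suffix = Path(path).suffix.lower()
    return suffix.lstrip(".") if suffix in SUPPORTED_SUFFIXES else None


def scan_directory(root):
    """Recursively collect all supported data files under `root`."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```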

Hugging Face Datasets

  • Data Source: Hugging Face Hub public datasets
  • Core Features:
    • Streaming data loading
    • Flexible configuration options
    • Support for dataset sharding
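Streaming loading with a limit amounts to lazy iterator consumption: only `limit` records are ever materialized, which is what makes streaming practical for very large datasets (with the Hugging Face `datasets` library, `streaming=True` returns an iterable rather than a downloaded dataset). The sketch below is illustrative only; `take_limited` and `fake_stream` are hypothetical names, and the real HuggingFaceDataLoadStrategy may differ.

```python
from itertools import islice


def take_limited(stream, limit=None):
    """Consume at most `limit` items from a (possibly unbounded) stream."""
    return list(islice(stream, limit)) if limit is not None else list(stream)


def fake_stream():
    """Stand-in for a streaming dataset iterator."""
    i = 0
    while True:
        yield {"id": i}
        i += 1
```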

4. Built-in Data Converters

ChatMessageConverter (chat_message)

Specifically handles chat conversation format data:

{
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hello! How can I help you?"}
    ]
}
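One plausible way a converter might split such a record into input context and final reply is sketched below. This is a hedged illustration (the `split_chat` helper is hypothetical); the actual ChatMessageConverter may treat multi-turn conversations differently.

```python
def split_chat(record):
    """Separate a messages list into input turns and the final assistant reply."""
    messages = record["messages"]
    if messages and messages[-1]["role"] == "assistant":
        return messages[:-1], messages[-1]["content"]
    return messages, None


record = {
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hello! How can I help you?"},
    ]
}
inputs, reply = split_chat(record)
```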

GenericConverter (*)

Generic converter that automatically recognizes common fields:

{
    "prompt": "User input",      # Supported fields: question, input, text, instruction
    "response": "Model reply"    # Supported fields: answer, output, completion
}
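The field auto-recognition described above can be sketched as trying each candidate key in priority order and taking the first one present. This is illustrative only; `pick_field` and the exact key orderings are assumptions, not the GenericConverter's actual logic.

```python
PROMPT_KEYS = ("prompt", "question", "input", "text", "instruction")
RESPONSE_KEYS = ("response", "answer", "output", "completion")


def pick_field(record, candidates):
    """Return the value of the first candidate key found in `record`."""
    for key in candidates:
        if key in record:
            return record[key]
    return None


sample = {"question": "What is 2 + 2?", "answer": "4"}
prompt = pick_field(sample, PROMPT_KEYS)
response = pick_field(sample, RESPONSE_KEYS)
```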

Supported Benchmark Datasets

Converters for the following benchmark datasets are currently built in (located in rm_gallery/gallery/data/load/):

  • rewardbench
  • rewardbench2
  • helpsteer2
  • prmbench
  • rmbbenchmark_bestofn
  • rmbbenchmark_pairwise

Each dataset has a dedicated converter that handles its specific data format and field structure.

5. Quick Start

Local File Loading

# Create the loader via the factory functions
from rm_gallery.core.data.load.base import create_loader
from rm_gallery.core.data.build import create_builder
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extended strategy registration

config = {
    "path": "../../../data/reward-bench-2/data/test-00000-of-00001.parquet",
    "limit": 1000,  # Limit the number of data items to load
}

# Create loading module
load_module = create_loader(
    name="rewardbench2",
    load_strategy_type="local",
    data_source="rewardbench2",
    config=config
)
# Create complete pipeline
pipeline = create_builder(
    name="load_pipeline",
    load_module=load_module
)

# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")
Successfully loaded 1000 data items

Hugging Face Dataset Loading

# Create the loader via the factory functions
from rm_gallery.core.data.load.base import create_loader
from rm_gallery.core.data.build import create_builder
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extended strategy registration

config = {
    "huggingface_split": "test",        # Dataset split (train/test/validation)
    "limit": 1000,          # Limit the number of data items to load
    "streaming": False      # Whether to use streaming loading
}

# Create loading module
load_module = create_loader(
    name="allenai/reward-bench-2",
    load_strategy_type="huggingface",
    data_source="rewardbench",
    config=config
)
# Create complete pipeline
pipeline = create_builder(
    name="load_pipeline",
    load_module=load_module
)

# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")

Data Export

The module also includes built-in export capabilities: data can be written out as JSONL, Parquet, or JSON, and optionally split into training and test sets.

from rm_gallery.core.data.load.base import create_loader
from rm_gallery.core.data.build import create_builder
from rm_gallery.core.data.export import create_exporter
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extended strategy registration


config = {
    "path": "../../../data/reward-bench-2/data/test-00000-of-00001.parquet",
    "limit": 1000,  # Limit the number of data items to load
}

# Create loading module
load_module = create_loader(
    name="rewardbench2",
    load_strategy_type="local",
    data_source="rewardbench2",
    config=config
)

export_module = create_exporter(
    name="rewardbench2",
    config={
        "output_dir": "./exports",
        "formats": ["jsonl"],
        "split_ratio": {"train": 0.8, "test": 0.2}
    }
)
# Create complete pipeline
pipeline = create_builder(
    name="load_pipeline",
    load_module=load_module,
    export_module=export_module
)

# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")
2025-07-02 12:26:34.230 | INFO     | rm_gallery.core.data.build:run:85 - Starting data build pipeline: load_pipeline
2025-07-02 12:26:34.232 | INFO     | rm_gallery.core.data.build:run:97 - Stage: Loading
2025-07-02 12:26:34.669 | INFO     | rm_gallery.core.data.load.base:_load_data_impl:392 - Loaded 1865 samples from file: ../../../data/reward-bench-2/data/test-00000-of-00001.parquet
2025-07-02 12:26:34.670 | INFO     | rm_gallery.core.data.load.base:run:262 - Applied limit of 1000, final count: 1000
2025-07-02 12:26:34.670 | INFO     | rm_gallery.core.data.load.base:run:276 - Successfully loaded 1000 items from rewardbench2
2025-07-02 12:26:34.673 | INFO     | rm_gallery.core.data.build:run:99 - Loading completed: 1000 items
2025-07-02 12:26:34.674 | INFO     | rm_gallery.core.data.build:run:97 - Stage: Export
2025-07-02 12:26:34.675 | INFO     | rm_gallery.core.data.export:_split_dataset:381 - Individual split: 800 training samples, 200 test samples
2025-07-02 12:26:34.859 | INFO     | rm_gallery.core.data.export:_export_jsonl:452 - Exported to JSONL: exports/rewardbench2_train.jsonl
2025-07-02 12:26:34.908 | INFO     | rm_gallery.core.data.export:_export_jsonl:452 - Exported to JSONL: exports/rewardbench2_test.jsonl
2025-07-02 12:26:34.908 | INFO     | rm_gallery.core.data.export:run:138 - Successfully exported 1000 samples to exports
2025-07-02 12:26:34.908 | INFO     | rm_gallery.core.data.build:run:99 - Export completed: 1000 items
2025-07-02 12:26:34.909 | INFO     | rm_gallery.core.data.build:run:101 - Pipeline completed: 1000 items processed


Successfully loaded 1000 data items

6. Data Output Format

BaseDataSet Structure

All loaded data is encapsulated as a BaseDataSet object:

BaseDataSet(
    name="dataset_name",           # Dataset name
    metadata={                     # Metadata information
        "source": "data_source",
        "strategy_type": "local|huggingface",
        "config": {...}
    },
    datasamples=[DataSample(...), ...]   # List of standardized data samples
)

DataSample Structure

Each data sample is uniformly converted to DataSample format:

DataSample(
    unique_id="md5_hash_id",        # Unique identifier for the data
    input=[                         # Input message list
        ChatMessage(role="user", content="...")
    ],
    output=[                        # Output data list
        DataOutput(answer=Step(...))
    ],
    source="data_source_name",      # Data source name
    task_category="chat|qa|instruction_following|general",  # Task category
    metadata={                      # Detailed metadata
        "raw_data": {...},          # Raw data
        "load_strategy": "ConverterName",  # Converter used
        "source_file_path": "...",  # Source file path (local files)
        "dataset_name": "...",      # Dataset name (HF datasets)
        "load_type": "local|huggingface"   # Loading method
    }
)
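One plausible way a stable md5-based `unique_id` could be derived from a record's content is hashing a canonical serialization, so the same record always yields the same id regardless of key order. This is an illustrative sketch (the `make_unique_id` helper is hypothetical); the actual hashing scheme is internal to the loader.

```python
import hashlib
import json


def make_unique_id(record):
    """Hash a canonical JSON serialization of the record."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


uid = make_unique_id({"prompt": "Hello", "response": "Hi"})
```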

7. Custom Data Converters

If you need to support new data formats, you can create custom converters by following these steps:

Step 1: Implement Converter Class

Create a converter file in the rm_gallery/gallery/data/load/ directory:

from rm_gallery.core.data.load.base import DataConverter, DataConverterRegistry

@DataConverterRegistry.register("custom_format")
class CustomConverter(DataConverter):
    """Custom data format converter"""

    def convert_to_data_sample(self, data_dict, source_info):
        """
        Convert raw data to DataSample format

        Args:
            data_dict: Raw data dictionary
            source_info: Data source information

        Returns:
            DataSample: Standardized data sample
        """
        # Implement specific conversion logic
        return DataSample(...)

Step 2: Register Converter

Import the converter in rm_gallery/gallery/data/__init__.py to complete registration:

from rm_gallery.gallery.data.load.custom_format import CustomConverter