Data Module
1. Overview
The Data Module provides a complete data processing solution covering the full lifecycle from data loading and preprocessing through quality annotation to export. It supports multiple operation modes and lets you flexibly combine data processing components to fit a variety of scenarios.
2. System Architecture
The Data Module adopts a modular design consisting of five core components:
2.1. Load Module
- Supports local files and remote data sources (such as HuggingFace Hub)
- Supports multiple data formats: parquet, jsonl, json, etc.
- Built-in data source adapters: rewardbench, chatmessage, prmbench, etc.
- Supports data splitting and sampling limits
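To make the loading behavior concrete, here is a minimal sketch using the HuggingFace datasets library directly; the Data Module wraps this behind its adapter layer, and the local path below is a placeholder:

from datasets import load_dataset

# Local mode: load a parquet file (jsonl/json work via the "json" builder)
local_ds = load_dataset("parquet", data_files="/path/to/data.parquet", split="train")

# HuggingFace Hub mode: load a named dataset and split
hub_ds = load_dataset("allenai/reward-bench-2", split="train")

# Sampling limit: randomly keep at most 2000 examples for a quick preview
limited = hub_ds.shuffle(seed=42).select(range(min(2000, len(hub_ds))))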
2.2. Process Module
- Configurable data processing pipeline
- Built-in filters: text length filtering, conversation turn filtering, etc.
- Integrates advanced data-cleaning operators from data-juicer
- Supports custom processor extensions (see the sketch below)
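The module's actual extension interface is not documented above, so the following is only an illustrative sketch with hypothetical names; it mirrors the semantics of the built-in conversation_turn_filter (keep samples whose turn count lies within [min_turns, max_turns]):

# Hypothetical custom processor; the real registration API may differ,
# and the "messages" field name is an assumption about the sample schema.
def conversation_turn_filter(sample: dict, min_turns: int = 1, max_turns: int = 6) -> bool:
    """Keep a sample only if its conversation turn count is within bounds."""
    turns = len(sample.get("messages", []))
    return min_turns <= turns <= max_turns

samples = [{"messages": [{"role": "user", "content": "hi"}]}]
kept = [s for s in samples if conversation_turn_filter(s)]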
2.3. Annotation Module
- Deep integration with Label Studio annotation platform
- Supports multiple preset annotation templates
- Automatic project creation and configuration management
- Supports multi-user collaborative annotation
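Project creation is handled automatically by the module; for reference, the sketch below shows roughly what that automation amounts to using the official (pre-1.0) label-studio-sdk client, with a deliberately trivial labeling config rather than one of the preset templates:

from label_studio_sdk import Client  # pip install "label-studio-sdk<1.0"

ls = Client(url="http://localhost:8080", api_key="YOUR_LABEL_STUDIO_TOKEN")

# Create a project with a minimal labeling config
project = ls.start_project(
    title="Reward Bench Evaluation",
    label_config="""
    <View>
      <Text name="prompt" value="$prompt"/>
      <Choices name="preference" toName="prompt">
        <Choice value="response_a"/>
        <Choice value="response_b"/>
      </Choices>
    </View>""",
)

# Import tasks (one dict per sample); keys must match the $-variables above
project.import_tasks([{"prompt": "Which response is better?"}])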
2.4. Export Module
- Multi-format data export: jsonl, parquet, json
- Configurable dataset splitting (e.g., train/test sets; sketched below)
- Optionally preserves the original data directory structure
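As a rough sketch of what the export stage produces, the snippet below splits a dataset 80/20 and writes jsonl and parquet files with the HuggingFace datasets library (the paths and ratio mirror the configuration example in section 4):

import os
from datasets import Dataset

ds = Dataset.from_list([{"text": f"sample {i}"} for i in range(100)])

# Split into train/test with an 80/20 ratio (cf. split_ratio in the YAML config)
splits = ds.train_test_split(test_size=0.2, seed=42)

os.makedirs("./exports", exist_ok=True)
for name, split in splits.items():
    split.to_json(f"./exports/{name}.jsonl", lines=True)  # jsonl export
    split.to_parquet(f"./exports/{name}.parquet")         # parquet export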
2.5. Build Module
- Unified data pipeline orchestration
- Automatic inter-module data flow management
- Supports building pipelines from YAML configuration (sketched below)
- Supports pipeline reuse and extension
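Conceptually, YAML-based building reduces to the sketch below: parse the config, then hand each section to the matching module. All four stage functions here are hypothetical stand-ins for the real Load/Process/Annotation/Export components:

import yaml

# Hypothetical stand-ins; the real Data Module supplies these components.
def load_stage(cfg): return [{"text": "example"}]
def apply_processor(data, proc_cfg): return data
def annotate_stage(data, cfg): return data
def export_stage(data, cfg): print(f"exporting {len(data)} samples to {cfg['output_dir']}")

def run_pipeline(config_path: str) -> None:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    ds_cfg = config["dataset"]
    data = load_stage(ds_cfg["configs"])                  # load
    for proc in ds_cfg.get("processors", []):             # optional processing
        data = apply_processor(data, proc)
    if "annotation" in ds_cfg:                            # optional annotation
        data = annotate_stage(data, ds_cfg["annotation"])
    export_stage(data, ds_cfg["export"])                  # required export

run_pipeline("./examples/data/config.yaml")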
3. Operation Modes
The Data Module supports two modes of operation, Python Script Mode and YAML Configuration Mode, catering to different user preferences.
3.1. Python Script Mode (data_pipeline.py)
For the complete pipeline script, see ./examples/data/data_pipeline.py.
3.1.1. Basic Data Processing Flow
Execute the complete data processing pipeline: Data Loading → Data Processing → Data Export
# Process 100 sample data points
python data_pipeline.py --mode basic --limit 100
3.1.2. Complete Flow with Annotation
Execute the complete pipeline including manual annotation: Data Loading → Data Processing → Data Annotation → Data Export
# Requires Label Studio API Token
python data_pipeline.py --mode annotation --api-token YOUR_LABEL_STUDIO_TOKEN
3.1.3. Independent Module Testing Mode
Each module can also be tested individually:
- Load Only:
python data_pipeline.py --mode load-only
- Process Only:
python data_pipeline.py --mode process-only
- Export Only:
python data_pipeline.py --mode export-only
3.1.4. Annotation Data Export Mode
Export completed annotation data from Label Studio:
python data_pipeline.py --mode export-annotation \
--api-token YOUR_TOKEN \
--project-id PROJECT_ID
3.2. YAML Configuration Mode (data_from_yaml.py)
Run data pipelines from declarative YAML configuration files; this mode is better suited to batch processing and production environments:
python data_from_yaml.py --config ./examples/data/config.yaml
4. Configuration File Details
YAML Configuration File Structure
The YAML configuration file provides a declarative way to define the complete data processing flow. The full structure, annotated field by field, looks like this:
dataset:
  # Dataset basic information
  name: rewardbench2                 # Dataset name
                                     # local mode: custom name (e.g., rewardbench2)
                                     # huggingface mode: HF dataset name (e.g., allenai/reward-bench-2)

  # Data source configuration
  configs:
    type: local                      # Data source type:
                                     #   - local: local file system
                                     #   - huggingface: HuggingFace Hub
    source: rewardbench2             # Data source adapter identifier
                                     # Note: ensure the corresponding converter is registered
    path: /path/to/data.parquet     # Data file path (local mode only)
    huggingface_split: train         # Data split name (huggingface mode only)
                                     # Options: train, test, validation, etc.
    limit: 2000                      # Sample count limit (random sampling)
                                     # Useful for quick testing or data preview

  # Data processor configuration (optional)
  processors:
    # Conversation turn filter
    - type: filter
      name: conversation_turn_filter
      config:
        min_turns: 1
        max_turns: 6

    # Text length filter
    - type: filter
      name: text_length_filter
      config:
        min_length: 10
        max_length: 1000

    # data-juicer operator example
    - type: data_juicer
      name: character_repetition_filter
      config:
        rep_len: 10
        min_ratio: 0.0
        max_ratio: 0.5

  # Annotation configuration (optional)
  annotation:
    template_name: "rewardbench2"              # Annotation template name
    project_title: "Reward Bench Evaluation"   # Label Studio project title
    project_description: "Reward model evaluation using reward bench template from yaml"
    server_url: "http://localhost:8080"        # Label Studio server address
    api_token: "your_api_token_here"           # Label Studio API token

  # Export configuration (required)
  export:
    output_dir: ./examples/data/exports        # Export directory path
    formats: ["jsonl"]                         # Export format list
                                               # Supported: jsonl, parquet, json
    preserve_structure: true                   # Whether to keep the original directory structure
    split_ratio: {"train": 0.8, "test": 0.2}   # Dataset split ratio
                                               # Supports multiple splits (e.g., train/test);
                                               # comment out if no splitting is needed

  # Metadata configuration (optional)
  metadata:
    source: "rewardbench2"           # Data source identifier
    version: "1.0"                   # Data version (optional)
    description: "Sample dataset"    # Data description (optional)
5. Reference Resources
Official Documentation
- Label Studio Official Guide: https://labelstud.io/guide/
- Data-Juicer Project Documentation: https://github.com/modelscope/data-juicer
- HuggingFace Datasets: https://huggingface.co/docs/datasets/