📄 [2025-10-20] We introduce Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling, a training-free framework that automatically discovers interpretable evaluation criteria from preference data, achieving SOTA performance with just ~70 preference pairs (1.5% of the source data) while producing human-readable "Theme-Tips" rubric hierarchies.
🚀 Key Features
- 🎯 Training-Free: No parameter updates required - works with any pre-trained LLM
- 📊 Data Efficient: Achieves SOTA performance using only ~70 preference pairs (1.5% of source data)
- 🔍 Interpretable: Generates human-readable "Theme-Tips" rubric hierarchies
- ⚡ Fast Convergence: Information-theoretic selection rapidly identifies optimal rubric sets
- 🌐 Cross-Model: Rubrics generalize across different LLM architectures
- 🔄 Modular Pipeline: Separate generation, structuring, and analysis components
🎓 Overview
What is Auto-Rubric?
Auto-Rubric is an automated framework that learns to extract generalizable evaluation criteria (called rubrics) from preference data.
A rubric is an explicit evaluation criterion that specifies what aspects to focus on when assessing response quality. For example:
- "The better answer correctly identifies that the chessboard rotation issue stems from calculating the chessboard pattern using unrotated UV coordinates."
- "Prioritize factual accuracy and avoid unsupported claims by strictly adhering to the information explicitly presented in the source text."
Instead of manually writing rubrics or training a neural reward model, Auto-Rubric automatically discovers the underlying criteria that distinguish good responses from bad ones, using a Propose-Evaluate-Revise loop combined with information-theoretic selection (MCR²).
How Auto-Rubric Works
The Auto-Rubric pipeline consists of three main stages:
1. Rubric Generation (Propose-Evaluate-Revise; see the sketch after this list)
   - Propose: LLM generates candidate rubrics from preference pairs
   - Evaluate: Test rubrics against ground-truth preferences
   - Revise: Improve rubrics based on evaluation feedback
   - Iterate: Repeat until rubrics converge
2. MCR² Selection (Maximal Coding Rate Reduction)
   - Apply information-theoretic selection to maximize rubric diversity
   - Remove redundant or overlapping criteria
   - Select an optimal subset that covers diverse evaluation aspects
   - Achieve high performance with minimal rubrics
3. Theme-Tips Structuring
   - Organize rubrics into a hierarchical "Theme-Tips" format
   - Group related rubrics under semantic themes
   - Generate actionable tips for each theme
   - Produce a human-readable evaluation framework
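As a rough illustration, stage 1 can be sketched as the loop below. This is a minimal sketch, not the repository's implementation: the `propose`, `judge`, and `revise` callables are hypothetical stand-ins for the underlying LLM calls.

```python
from typing import Callable

def generate_rubrics(
    propose: Callable,   # drafts candidate rubrics from a preference pair (LLM call)
    judge: Callable,     # predicts the preferred response given rubrics (LLM call)
    revise: Callable,    # rewrites rubrics using evaluation feedback (LLM call)
    sample: dict,
    max_epochs: int = 10,
):
    """Propose-Evaluate-Revise loop for a single preference pair (illustrative)."""
    rubrics = propose(sample)                         # Propose
    for _ in range(max_epochs):
        predicted = judge(sample, rubrics)            # Evaluate against ground truth
        if predicted == sample["label"]:
            return rubrics                            # rubrics recover the preference
        rubrics = revise(sample, rubrics, predicted)  # Revise with feedback
    return rubrics                                    # best effort after max_epochs
```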
🚀 Quick Start
Navigate to the examples directory:
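For example, assuming the example scripts live in an `examples/` subdirectory (adjust the path to wherever the `run_*.sh` scripts sit in your checkout):

```bash
cd examples  # path is an assumption; use the directory containing the run_*.sh scripts
```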
Option 1: Complete Auto-Rubric Pipeline
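Run the complete pipeline script (run_autorubric.sh, configured under Quick Configuration below):

```bash
./run_autorubric.sh
```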
This will:
1. Generate rubrics from preference data
2. Apply MCR² selection for optimal rubric sets
3. Structure rubrics into Theme-Tips format
4. Export results to ./exports/{model_name}/
Option 2: Step-by-Step Pipeline
```bash
# Step 1: Generate Rubrics
./run_generator.sh

# Step 2: Structure into Theme-Tips
./run_structurer.sh

# Step 3: Analyze Performance
./run_analysis.sh
```
Quick Configuration
Edit the shell scripts to customize parameters:
run_autorubric.sh - Complete pipeline:
MODEL="qwen3-32b"
MAX_SAMPLES=200 # Adjust based on your data size
MAX_WORKERS=32 # Adjust based on your hardware
NUM_CATEGORIES=5 # Number of Theme-Tips categories
run_generator.sh - Rubric generation:
```bash
MAX_SAMPLES=200    # Number of samples to process
DOMAINS="general"  # Filter by domain (or set to "" for all)
BATCH_SIZE=500     # Batch size for processing
```
🏗️ Pipeline Components
1. Complete Auto-Rubric Pipeline (auto_rubric.py)
The integrated pipeline combining generation, MCR² selection, and structuring:
```bash
# Run complete pipeline
python auto_rubric.py \
    --data-path data/helpsteer3_preference_train.jsonl \
    --model qwen3-32b \
    --max-workers 32 \
    --enable-structuring True \
    --num-categories 5
```
Pipeline Stages:
1. Iterative Generation: Propose-Evaluate-Revise loop for rubric creation
2. MCR² Selection: Information-theoretic filtering for optimal rubric diversity (see the sketch below)
3. Theme-Tips Structuring: Hierarchical organization into interpretable categories
4. Export: Structured results ready for evaluation
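To make stage 2 concrete, the sketch below shows one plausible greedy MCR²-style selection over rubric embeddings: at each step it adds the rubric whose inclusion most increases the coding rate of the selected set, and stops once the gain stays low for several steps, mirroring the `--min-increment-threshold` and `--patience` options. This is an illustrative reading of the algorithm, not the repository's exact implementation.

```python
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Coding rate R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z^T Z) for an n x d matrix."""
    n, d = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z.T @ Z)
    return 0.5 * logdet

def select_rubrics(emb: np.ndarray, min_increment: float = 0.002,
                   patience: int = 2, max_total: int = 200) -> list:
    """Greedily pick rubric embeddings whose addition maximizes the coding-rate gain."""
    selected: list = []
    remaining = list(range(len(emb)))
    low_streak = 0
    while remaining and len(selected) < max_total:
        base = coding_rate(emb[selected]) if selected else 0.0
        gains = [coding_rate(emb[selected + [i]]) - base for i in remaining]
        best = int(np.argmax(gains))
        selected.append(remaining.pop(best))
        # Stop once the information gain stays below the threshold for `patience` steps.
        low_streak = low_streak + 1 if gains[best] < min_increment else 0
        if low_streak >= patience:
            break
    return selected
```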
2. Rubric Generation (generator.py)
Standalone rubric generation with checkpoint support:
```bash
# Generate rubrics with checkpointing
python generator.py \
    --data-path data/helpsteer3_preference_train.jsonl \
    --output-dir rubric_generation_output \
    --model qwen3-32b \
    --max-samples 200 \
    --batch-size 500 \
    --resume  # Resume from checkpoint if interrupted
```
Key Features:
- Checkpoint Support: Resume interrupted generation
- Batch Processing: Efficient parallel processing
- Domain Filtering: Focus on specific content domains
- Iterative Refinement: Multi-epoch improvement cycles
3. Rubric Structuring (structurer.py)
Transform raw rubrics into Theme-Tips format:
```bash
# Structure rubrics into themes
python structurer.py \
    --input rubric_generation_output/rubrics.json \
    --output rubric_structuring_results \
    --themes 5 \
    --model qwen3-32b
```
Output Format (Theme-Tips):
```text
Theme: Evaluate response accuracy and factual correctness
- Tip 1: Check for factual errors or misconceptions
- Tip 2: Verify claims against reliable sources
- Tip 3: Assess logical consistency of arguments
```
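One natural way to consume this output is to render each theme and its tips into the prompt of an LLM judge. The snippet below is a minimal sketch of that idea; the dictionary layout is hypothetical, not the exact schema of `ready_to_use_rubrics.json`:

```python
def render_rubric_prompt(themes: list) -> str:
    """Render Theme-Tips rubrics into an instruction block for an LLM judge."""
    lines = ["Judge the responses against the following criteria:"]
    for theme in themes:
        lines.append(f"Theme: {theme['theme']}")
        lines.extend(f"- Tip: {tip}" for tip in theme["tips"])
    return "\n".join(lines)

print(render_rubric_prompt([{
    "theme": "Evaluate response accuracy and factual correctness",
    "tips": ["Check for factual errors or misconceptions",
             "Verify claims against reliable sources"],
}]))
```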
4. Performance Analysis (analysis.py)
Comprehensive evaluation of rubric performance:
```bash
# Analyze rubric performance
python analysis.py \
    --rubrics rubric_structuring_results/ready_to_use_rubrics.json \
    --dataset data/helpsteer3_preference_valid.jsonl \
    --max-samples 100 \
    --max-workers 256 \
    --output rubric_analysis_results
```
Generated Metrics (coverage and precision are sketched in code below):
- Coverage: Percentage of samples where rubrics provide a clear preference
- Precision: Accuracy of rubric predictions vs. ground truth
- Contribution: Individual rubric impact on ensemble performance
- Ensemble Accuracy: Overall performance of the rubric set
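For intuition, coverage and precision can be derived from per-sample rubric verdicts roughly as follows. This is a sketch under the assumption that each verdict is "A", "B", or a tie; the actual analysis script may compute them differently.

```python
def coverage_and_precision(verdicts: list, labels: list) -> tuple:
    """verdicts: per-sample rubric decisions in {'A', 'B', 'tie'};
    labels: ground-truth preferred response in {'A', 'B'}."""
    decided = [(v, y) for v, y in zip(verdicts, labels) if v != "tie"]
    coverage = len(decided) / len(verdicts)  # share of samples with a clear preference
    precision = (sum(v == y for v, y in decided) / len(decided)) if decided else 0.0
    return coverage, precision

cov, prec = coverage_and_precision(["A", "tie", "B", "A"], ["A", "B", "B", "B"])
print(f"coverage={cov:.2f}, precision={prec:.2f}")  # coverage=0.75, precision=0.67
```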
⚙️ Configuration Guide
Complete Pipeline (auto_rubric.py)
| Parameter | Default | Description |
|---|---|---|
| --model | "qwen3-32b" | LLM model for all operations |
| --max-workers | 32 | Concurrent threads for parallel processing |
| --batch-size | 10 | Samples processed per batch |
| --max-epochs | 10 | Maximum refinement iterations per sample |
| --mcr-batch-size | 10 | MCR² selection batch size |
| --min-increment-threshold | 0.002 | Information gain stopping threshold |
| --patience | 2 | Consecutive low increments before stopping |
| --max-iterations | 50 | Maximum pipeline iterations |
| --max-total-rubrics | 200 | Final rubric set size limit |
| --enable-structuring | True | Enable Theme-Tips structuring |
| --num-categories | 5 | Number of Theme-Tips categories |
Rubric Generation (generator.py)
| Parameter | Default | Description |
|---|---|---|
| --data-path | Required | Path to preference dataset (JSONL) |
| --model | "qwen3-32b" | LLM model for generation |
| --max-samples | 200 | Maximum samples to process (-1 for all) |
| --domains | None | Filter by domain (e.g., "general", "multilingual") |
| --batch-size | 500 | Batch size for processing |
| --max-epochs | 10 | Maximum refinement epochs |
| --max-workers | 256 | Worker threads |
| --max-retries | 5 | Maximum retry attempts for LLM calls |
| --resume | Flag | Resume from checkpoint |
| --disable-checkpoint | Flag | Disable checkpoint saving |
Rubric Structuring (structurer.py)
| Parameter | Default | Description |
|---|---|---|
| --input | Required | Input rubrics JSON file |
| --output | "rubric_structuring_results" | Output directory |
| --model | "qwen3-32b" | LLM model for structuring |
| --themes | 5 | Number of themes to generate |
Performance Analysis (analysis.py)
| Parameter | Default | Description |
|---|---|---|
| --rubrics | Required | Path to rubrics JSON file |
| --dataset | "data/helpsteer3_preference_valid.jsonl" | Validation dataset |
| --model | "qwen3-32b" | Model for evaluation |
| --max-samples | 100 | Maximum samples for evaluation |
| --max-workers | 256 | Worker threads for parallel processing |
| --source-rubrics | Optional | Source rubrics for comparison |
📊 Data Format & Processing
Expected Input Format
Input preference data should be in JSONL format with the following structure:
```json
{
  "input": [{"role": "user", "content": "Your question here"}],
  "output": [
    {
      "answer": {
        "content": "Response A",
        "label": {"preference": "chosen", "is_preferred": true}
      }
    },
    {
      "answer": {
        "content": "Response B",
        "label": {"preference": "rejected", "is_preferred": false}
      }
    }
  ],
  "metadata": {
    "domain": "general",
    "overall_preference": 1,
    "individual_preference": [
      {"reasoning": "Response A is better because..."}
    ]
  }
}
```
Key Fields
- input: User query in message format
- output: List of response candidates (typically 2 for pairwise comparison)
- preference: "chosen" or "rejected" labels
- is_preferred: Boolean preference indicator
- domain: Content domain for filtering (e.g., "general", "multilingual", "math")
- overall_preference: Numeric preference (-1, 0, 1)
- individual_preference: Optional reasoning for preferences
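For a quick sanity check of your data, a preference pair can be extracted from each JSONL line along these lines. This is a minimal sketch based on the schema above; adjust the field access if your converter emits a different layout.

```python
import json

def load_pairs(path: str):
    """Yield (prompt, chosen, rejected) triples from a preference JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            prompt = sample["input"][-1]["content"]  # last user turn
            answers = [o["answer"] for o in sample["output"]]
            chosen = next(a["content"] for a in answers if a["label"]["is_preferred"])
            rejected = next(a["content"] for a in answers if not a["label"]["is_preferred"])
            yield prompt, chosen, rejected

for prompt, chosen, rejected in load_pairs("data/helpsteer3_preference_train.jsonl"):
    print(prompt[:60], "->", chosen[:60])
    break
```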
Data Loading & Conversion
For loading and converting data from various sources (HuggingFace datasets, local files, etc.), we provide a unified data loading framework. See the Data Loading Tutorial for comprehensive examples.
Quick Example - Load HelpSteer3 Preference Dataset:
```python
from rm_gallery.core.data.load.base import create_loader
from rm_gallery.core.data.build import create_builder
import rm_gallery.core.data
import rm_gallery.gallery.data

# Load HelpSteer3 preference data
config = {
    "path": "HelpSteer3/preference/train.jsonl",
    "limit": 1000
}

load_module = create_loader(
    name="helpsteer3_train",
    load_strategy_type="local",
    data_source="helpsteer3_preference",  # Uses HelpSteer3PreferenceConverter
    config=config
)

pipeline = create_builder(
    name="load_pipeline",
    load_module=load_module
)

result = pipeline.run()
print(f"Loaded {len(result)} samples")

# Each sample contains:
# - Multi-turn conversation input
# - Two response candidates with preference labels
# - Domain and language metadata
# - Overall preference scores (-3 to +3)
```
🔧 Advanced Usage
Checkpoint and Resume
The generation process supports checkpointing for long-running tasks:
```bash
# Enable resume in run_generator.sh
RESUME="--resume"

# Or disable checkpointing for faster processing
DISABLE_CHECKPOINT="--disable-checkpoint"
```
Checkpointing details:
- checkpoint_samples.jsonl: incremental progress saves
- Resume automatically skips already-processed samples (sketched below)
- Safe interruption with Ctrl+C
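The resume behavior can be pictured as follows. This is a sketch, not the repo's code: the checkpoint schema and the `sample_id` field are assumptions, and `process` is a stand-in for the per-sample generation step.

```python
import json
import os

def process(sample: dict) -> list:
    """Stand-in for the per-sample Propose-Evaluate-Revise step (hypothetical)."""
    return ["<rubric>"]

def load_processed_ids(path: str) -> set:
    """Collect ids of samples already written to the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["sample_id"] for line in f}  # field name assumed

def run_with_resume(samples: list, path: str = "checkpoint_samples.jsonl") -> None:
    done = load_processed_ids(path)
    with open(path, "a", encoding="utf-8") as ckpt:
        for sample in samples:
            if sample["sample_id"] in done:
                continue  # resume: skip work finished before the interruption
            record = {"sample_id": sample["sample_id"], "rubrics": process(sample)}
            ckpt.write(json.dumps(record) + "\n")
            ckpt.flush()  # each finished sample is durable, so Ctrl+C is safe
```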
Domain-Specific Generation
Filter training data by domain for specialized rubrics:
```bash
# In run_generator.sh, set domain filter
DOMAINS="general"  # or "multilingual", "math", etc.

# Or process all domains
DOMAINS=""
```
Custom Analysis
Compare different rubric sets:
```bash
# Compare structured vs. raw rubrics
python analysis.py \
    --rubrics rubric_structuring_results/ready_to_use_rubrics.json \
    --source-rubrics rubric_generation_output/rubrics.json \
    --output comparison_analysis
```
Note: This framework is designed for research and experimentation. For production deployment, conduct thorough validation on your specific use cases and datasets.