Overview
RM-Bench is a specialized benchmark for evaluating reward models of language models, with a focus on sensitivity to subtle content differences and to response style. Unlike benchmarks that rely on a single chosen/rejected comparison per prompt, RM-Bench evaluates models across multiple response styles and formats, providing a more nuanced assessment of reward model capabilities.
The benchmark introduces a unique 3x3 matrix evaluation approach that compares responses with different levels of style sophistication:
- Concise style: Brief, direct responses
- Detailed plain text: Comprehensive responses in plain format
- Detailed markdown: Well-formatted responses with markdown styling
Key Features
- Style-Aware Evaluation: Assesses model preferences across different response formats
- Multi-Domain Coverage: Includes chat, code, math, safety-refuse, and safety-response domains
- Sophisticated Metrics: Three accuracy types (hard, normal, easy) for comprehensive analysis
- Parallel Processing: Efficient evaluation with concurrent response comparisons
- Matrix-Based Analysis: 3x3 comparison matrix for detailed preference patterns
Quick Start
Step 1: Download Dataset
# Download the RM-Bench dataset
# Note: Ensure you have the dataset file at the specified path
mkdir -p data/benchmarks
cd data/benchmarks
git clone https://github.com/THU-KEG/RM-Bench.git
cd ../../
# Download total_dataset.json to data/benchmarks/RM-Bench/
# You can get the dataset from the official repository or other sources
Step 2: Verify Installation
# Check if the module can be imported
python -c "from rm_gallery.gallery.evaluation.rmbench import main; print('RM-Bench evaluation module loaded successfully')"
Step 3: Basic Usage
# Run evaluation on a sample dataset
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_results.json" \
--max_samples=10 \
--model="gpt-4o-mini"
Step 4: Check Results
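Once the run finishes, inspect the result file written in Step 3. A minimal sketch, assuming the result file follows the structure described in the Output Format section below:
import json

# Path assumed from the --result_path used in Step 3
with open("data/results/rmbench_results.json", "r") as f:
    results = json.load(f)

overall = results["overall_accuracy"]
print(f"Hard accuracy:    {overall['hard_acc']:.2%}")
print(f"Normal accuracy:  {overall['normal_acc']:.2%}")
print(f"Easy accuracy:    {overall['easy_acc']:.2%}")
print(f"Overall accuracy: {overall['overall_acc']:.2%}")
print(f"Valid samples:    {overall['valid_samples']}/{overall['total_samples']}")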
Installation and Environment Setup
Prerequisites
Before running the evaluation, you need a Python environment with the rm_gallery package installed, the RM-Bench dataset downloaded (see Quick Start above), and API access to the model you want to evaluate.
Environment Variables
Set up your API keys:
# For OpenAI models
export OPENAI_API_KEY="your-api-key"
# For other providers
export ANTHROPIC_API_KEY="your-anthropic-key"
export DEEPSEEK_API_KEY="your-deepseek-key"
Verify Setup
# Test model connection
python -c "from rm_gallery.core.model.openai_llm import OpenaiLLM; llm = OpenaiLLM(model='gpt-4o-mini'); print('Model connection successful')"
Usage Examples
Basic Evaluation
Evaluate a small subset of the RM-Bench dataset:
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_basic.json" \
--max_samples=50 \
--model="gpt-4o-mini" \
--max_workers=4
Full Dataset Evaluation
Run evaluation on the complete dataset:
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_full.json" \
--max_samples=1000 \
--model="gpt-4o" \
--max_workers=8
High-Performance Evaluation
For large-scale evaluation with maximum parallelism:
# Each sample's 9 comparisons run in parallel internally
# With 16 workers, theoretical max parallelism = 16 samples * 9 comparisons = 144 concurrent API calls
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_performance.json" \
--max_samples=2000 \
--model="claude-3-5-sonnet-20241022" \
--max_workers=16
Model Comparison
Compare different models on the same dataset:
# Evaluate GPT-4o
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_gpt4o.json" \
--model="gpt-4o" \
--max_samples=200
# Evaluate Claude
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_claude.json" \
--model="claude-3-5-sonnet-20241022" \
--max_samples=200
# Evaluate Qwen
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_qwen.json" \
--model="qwen3-32b" \
--max_samples=200
Configuration Parameters
Command Line Arguments
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str | "data/benchmarks/RM-Bench/total_dataset.json" | Path to RM-Bench dataset file |
| result_path | str | "data/results/rmbench.json" | Output file path for results |
| max_samples | int | 10 | Maximum number of samples to evaluate |
| model | str/dict | "qwen3-32b" | Model identifier or configuration dictionary |
| max_workers | int | 8 | Number of parallel processing workers |
Advanced Configuration
For advanced usage with custom model parameters:
# Custom model configuration
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_custom.json" \
--model='{"model": "gpt-4o", "temperature": 0.1, "max_tokens": 2048, "timeout": 120}' \
--max_samples=100 \
--max_workers=6
Dataset Structure
The RM-Bench dataset follows a specific format:
{
    "id": "unique_sample_identifier",
    "prompt": "Input prompt given to the model",
    "chosen": [
        "resp_1",  // Chosen response with concise style
        "resp_2",  // Chosen response with detailed plain text style
        "resp_3"   // Chosen response with detailed markdown style
    ],
    "rejected": [
        "resp_1",  // Rejected response with concise style
        "resp_2",  // Rejected response with detailed plain text style
        "resp_3"   // Rejected response with detailed markdown style
    ],
    "domain": "chat"  // Domain: chat, code, math, safety-refuse, safety-response
}
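Before running a full evaluation, it can help to sanity-check a downloaded dataset against this schema. A small validation sketch, assuming the file is a JSON list of samples in the format above:
import json

with open("data/benchmarks/RM-Bench/total_dataset.json", "r") as f:
    samples = json.load(f)

valid_domains = {"chat", "code", "math", "safety-refuse", "safety-response"}
for sample in samples:
    # Each sample should carry exactly three chosen and three rejected responses
    # (index 0 = concise, 1 = detailed plain, 2 = detailed markdown).
    assert len(sample["chosen"]) == 3, f"Unexpected chosen count in {sample['id']}"
    assert len(sample["rejected"]) == 3, f"Unexpected rejected count in {sample['id']}"
    assert sample["domain"] in valid_domains, f"Unknown domain in {sample['id']}"

print(f"Validated {len(samples)} samples")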
Style Categories
- Concise (Index 0): Brief, direct responses
- Detailed Plain (Index 1): Comprehensive responses in plain text format
- Detailed Markdown (Index 2): Well-formatted responses with markdown styling
Parallel Evaluation Architecture
Two-Level Parallelism
RM-Bench implements an optimized two-tier parallel evaluation system:
Level 1: Sample-Level Parallelism
- Multiple samples are processed concurrently
- Controlled by the max_workers parameter (default: 8)
- Each worker handles one complete sample evaluation
Level 2: Comparison-Level Parallelism
- Within each sample, all 9 comparisons (3×3 matrix) execute in parallel
- Automatically managed by internal thread pool
- Up to 9 concurrent API calls per sample
Performance Impact
Traditional Serial: Sample1[C1→C2→...→C9] → Sample2[C1→C2→...→C9] → ...
Optimized Parallel: Sample1[C1,C2,...,C9] | Sample2[C1,C2,...,C9] | ...
(9 concurrent comparison calls per sample)
Expected speedup: ~10x faster for typical workloads
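The rmbench.py script manages both levels internally; the sketch below only illustrates the nesting pattern with concurrent.futures. The evaluate_comparison function is a hypothetical stub, not part of the rm_gallery API:
from concurrent.futures import ThreadPoolExecutor

def evaluate_comparison(sample: dict, i: int, j: int) -> bool:
    # Hypothetical stub: in a real run this would call the judge model to decide
    # whether chosen response i is preferred over rejected response j.
    return True

def evaluate_sample(sample: dict) -> dict:
    # Level 2: the 9 cells of the 3x3 matrix run concurrently within one sample.
    with ThreadPoolExecutor(max_workers=9) as inner:
        futures = {(i, j): inner.submit(evaluate_comparison, sample, i, j)
                   for i in range(3) for j in range(3)}
        return {cell: future.result() for cell, future in futures.items()}

def evaluate_dataset(samples: list, max_workers: int = 8) -> list:
    # Level 1: multiple samples are processed concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as outer:
        return list(outer.map(evaluate_sample, samples))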
Configuration Example
# 8 samples in parallel, each with 9 concurrent comparisons
# Theoretical max: 8 × 9 = 72 concurrent API calls
python rm_gallery/gallery/evaluation/rmbench.py \
--max_workers=8 \
--max_samples=100
⚠️ Note: Higher parallelism may trigger API rate limits. Start with max_workers between 4 and 8 and adjust based on your API quota.
Evaluation Methodology
3x3 Matrix Comparison
RM-Bench uses a unique 3x3 matrix evaluation approach: rows index the style of the chosen response and columns index the style of the rejected response (0 = concise, 1 = detailed plain, 2 = detailed markdown).
Each cell (i, j) represents the comparison between chosen response i and rejected response j.
Accuracy Metrics
RM-Bench provides three distinct accuracy measurements:
- Hard Accuracy: Upper-right triangle of the matrix
  - Compares chosen responses with a less sophisticated style against rejected responses with a more sophisticated style
  - The most challenging scenario for reward models
- Normal Accuracy: Diagonal of the matrix
  - Compares chosen and rejected responses with the same style level
  - Standard preference evaluation
- Easy Accuracy: Lower-left triangle of the matrix
  - Compares chosen responses with a more sophisticated style against rejected responses with a less sophisticated style
  - The easiest scenario for reward models
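To make the triangle definitions concrete, the mapping from a cell position to its metric can be written as a small sketch (cell_results is a hypothetical mapping from (i, j) to whether the chosen response won that comparison):
def cell_category(i: int, j: int) -> str:
    # Row i = style index of the chosen response, column j = style index of the
    # rejected response (0 = concise, 1 = detailed plain, 2 = detailed markdown).
    if i < j:
        return "hard"    # chosen is less sophisticated than rejected
    if i == j:
        return "normal"  # same style level
    return "easy"        # chosen is more sophisticated than rejected

def triangle_accuracies(cell_results: dict) -> dict:
    buckets = {"hard": [], "normal": [], "easy": []}
    for (i, j), chosen_won in cell_results.items():
        buckets[cell_category(i, j)].append(chosen_won)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}

# Example: a sample where the chosen response wins every comparison
example = {(i, j): True for i in range(3) for j in range(3)}
print(triangle_accuracies(example))  # {'hard': 1.0, 'normal': 1.0, 'easy': 1.0}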
Output Format
Results Structure
The evaluation generates a JSON file with the following structure:
{
    "model": "gpt-4o-mini",
    "overall_accuracy": {
        "hard_acc": 0.65,
        "normal_acc": 0.78,
        "easy_acc": 0.85,
        "overall_acc": 0.76,
        "valid_samples": 95,
        "total_samples": 100,
        "acc_matrix": [
            [0.75, 0.68, 0.62],
            [0.82, 0.78, 0.71],
            [0.88, 0.85, 0.81]
        ]
    },
    "subset_accuracy": {
        "chat": {
            "hard_acc": 0.70,
            "normal_acc": 0.80,
            "easy_acc": 0.88
        },
        "code": {
            "hard_acc": 0.62,
            "normal_acc": 0.75,
            "easy_acc": 0.83
        }
    }
}
Metrics Explanation
- hard_acc: Accuracy when chosen has simpler style than rejected
- normal_acc: Accuracy when chosen and rejected have same style
- easy_acc: Accuracy when chosen has more sophisticated style than rejected
- overall_acc: Average of all matrix cells
- valid_samples: Number of successfully processed samples
- total_samples: Total number of input samples
- acc_matrix: 3x3 matrix showing detailed comparison results
Expected Output
When running the evaluation, you should see output similar to:
$ python rm_gallery/gallery/evaluation/rmbench.py --max_samples=10
INFO - Starting RM-Bench evaluation...
INFO - Loading dataset from: data/benchmarks/RM-Bench/total_dataset.json
INFO - Model: gpt-4o-mini
INFO - Processing 10 samples with 3x3 matrix comparisons...
INFO - Evaluation completed!
INFO - Results saved to: data/results/rmbench.json
INFO - Hard accuracy: 65.0%
INFO - Normal accuracy: 78.0%
INFO - Easy accuracy: 85.0%
INFO - Overall accuracy: 76.0%
INFO - Valid samples: 10/10
Domain-Specific Analysis
Domain Categories
RM-Bench covers five main domains:
- Chat: General conversational responses
- Code: Programming and technical content
- Math: Mathematical problem-solving
- Safety-refuse: Appropriate refusal of harmful requests
- Safety-response: Safe handling of sensitive topics
Per-Domain Evaluation
You can analyze results by domain:
import json
# Load results
with open("data/results/rmbench_results.json", "r") as f:
results = json.load(f)
# Analyze domain performance
for domain, metrics in results["subset_accuracy"].items():
print(f"Domain: {domain}")
print(f" Hard accuracy: {metrics['hard_acc']:.2%}")
print(f" Normal accuracy: {metrics['normal_acc']:.2%}")
print(f" Easy accuracy: {metrics['easy_acc']:.2%}")
print()
Integration with Other Benchmarks
Combining with JudgeBench and RMB
Run comprehensive evaluation across multiple benchmarks:
# Run RM-Bench evaluation
python rm_gallery/gallery/evaluation/rmbench.py \
--data_path="data/benchmarks/RM-Bench/total_dataset.json" \
--result_path="data/results/rmbench_results.json" \
--model="gpt-4o-mini"
# Run JudgeBench evaluation
python rm_gallery/gallery/evaluation/judgebench.py \
--data_path="data/benchmarks/JudgeBench/data/dataset=judgebench,response_model=gpt-4o-2024-05-13.jsonl" \
--result_path="data/results/judgebench_results.json" \
--model="gpt-4o-mini"
# Run RMB evaluation
python rm_gallery/gallery/evaluation/rmb.py \
--data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set" \
--result_path="data/results/rmb_results.json" \
--model="gpt-4o-mini"
Comparative Analysis
# Compare results across benchmarks
import json
benchmarks = ["rmbench", "judgebench", "rmb"]
results = {}
for benchmark in benchmarks:
    with open(f"data/results/{benchmark}_results.json", "r") as f:
        results[benchmark] = json.load(f)

# Print comparison
for benchmark, result in results.items():
    if benchmark == "rmbench":
        accuracy = result["overall_accuracy"]["overall_acc"]
    elif benchmark == "judgebench":
        accuracy = result["overall_accuracy"]["accuracy"]
    else:  # rmb
        accuracy = result["overall_accuracy"]["accuracy"]
    print(f"{benchmark}: {accuracy:.2%}")
Troubleshooting
Common Issues
- Dataset format errors
- Model API errors
- Memory issues with large datasets
- Invalid comparison matrix
Performance Optimization
- Two-Level Parallelism: The system uses a two-tier parallel architecture:
  - Sample-level parallelism: Controlled by the max_workers parameter (default: 8); processes multiple samples concurrently
  - Comparison-level parallelism: Each sample's 9 comparisons (3x3 matrix) execute in parallel automatically
- Batch Size: Adjust max_samples based on available memory
- Model Selection: Use efficient models for large-scale evaluation
- Throughput: With both levels of parallelism, evaluation speed increases significantly (up to 10x faster than serial processing)
Error Resolution
If you encounter evaluation errors:
- Check the 3x3 matrix completion rate
- Verify dataset sample format
- Confirm model response parsing
- Reduce concurrency if rate-limited
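For the first check, the completion rate can be read directly from the result file (same structure as in the Output Format section):
import json

with open("data/results/rmbench_results.json", "r") as f:
    overall = json.load(f)["overall_accuracy"]

# A low ratio usually means many samples failed response parsing or hit API errors.
print(f"Matrix completion rate: {overall['valid_samples']}/{overall['total_samples']}")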
Advanced Usage
Custom Template Modification
You can customize the evaluation template by modifying the RMBenchTemplate.format() method:
# Example: Add domain-specific evaluation criteria inside RMBenchTemplate
from typing import List, Optional

@classmethod
def format(cls, query: str, answers: List[str], domain: Optional[str] = None, **kwargs) -> str:
    # Add domain-specific instructions
    domain_instructions = {
        "code": "Focus on code correctness, efficiency, and readability.",
        "math": "Evaluate mathematical accuracy and solution clarity.",
        "safety": "Assess safety and appropriateness of responses."
    }
    instruction = domain_instructions.get(domain, "")
    # ... rest of template logic
Matrix Analysis
Analyze the 3x3 comparison matrix in detail:
import numpy as np
import json
# Load results
with open("data/results/rmbench_results.json", "r") as f:
results = json.load(f)
matrix = np.array(results["overall_accuracy"]["acc_matrix"])
# Analyze preference patterns
print("Comparison Matrix:")
print(matrix)
print(f"Hard accuracy (upper right): {np.sum(np.triu(matrix, 1)) / 3:.2%}")
print(f"Normal accuracy (diagonal): {np.mean(np.diag(matrix)):.2%}")
print(f"Easy accuracy (lower left): {np.sum(np.tril(matrix, -1)) / 3:.2%}")
Best Practices
- Start with Small Samples: Begin with 10-20 samples for testing
- Monitor Matrix Completion: Check valid_samples vs total_samples ratio
- Analyze by Domain: Review performance across different content types
- Compare Accuracy Types: Focus on hard_acc as the most challenging metric
- Use Appropriate Models: Ensure models can handle style differentiation
This tutorial provides a comprehensive guide to using the RM-Bench evaluation system for assessing reward models with style and subtlety considerations. The unique 3x3 matrix approach offers deeper insights into model preferences across different response formats and sophistication levels.