## Overview
RMB (Reward Model Benchmark) is a comprehensive evaluation framework for reward models in LLM alignment. It covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of reward models in guiding alignment optimization.
The benchmark evaluates models across two main dimensions:

- **Helpfulness**: brainstorming, chat, classification, code generation, math, reasoning, and more
- **Harmlessness**: safety, toxicity detection, and harmful content avoidance
## Features
- Comprehensive Coverage: 49+ real-world scenarios across helpfulness and harmlessness
- Pairwise Comparisons: Direct comparison between model responses
- Category Analysis: Detailed breakdown by scenario categories
- Parallel Processing: Efficient evaluation with configurable worker threads
- Flexible Model Support: Compatible with various LLM APIs
## Quick Start
### Step 1: Download Dataset

```bash
# Create benchmarks directory and clone dataset
mkdir -p data/benchmarks
cd data/benchmarks
git clone https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark.git
cd ../../
```
### Step 2: Verify Installation

Ensure you have the required environment set up:

```bash
# Check if the module can be imported
python -c "from rm_gallery.gallery.evaluation.rmb import main; print('RMB evaluation module loaded successfully')"
```
### Step 3: Basic Usage

```bash
# Run evaluation on a sample dataset
python rm_gallery/gallery/evaluation/rmb.py \
    --data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set" \
    --result_path="data/results/rmb_results.json" \
    --max_samples=10 \
    --model="gpt-4o-mini"
```
### Step 4: Check Results

After the run completes, inspect the JSON file written to the path given by `--result_path` (here `data/results/rmb_results.json`). Its structure is described in the Output Format section below.
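For a quick look, you can load the file and print the headline numbers. A minimal sketch, assuming the Step 3 run completed and wrote the structure documented under Output Format:

```python
import json

# Inspect the result file written via --result_path in Step 3
with open("data/results/rmb_results.json") as f:
    results = json.load(f)

print(f"Model: {results['model']}")
print(f"Overall accuracy: {results['overall_accuracy']['accuracy']:.2%}")
```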
## Installation and Environment Setup
### Prerequisites

- A Python environment with `rm_gallery` installed and importable (see Step 2 above)
- API access to at least one supported LLM provider
### Environment Variables

Set up your API keys:

```bash
# For OpenAI models
export OPENAI_API_KEY="your-api-key"

# For other providers
export ANTHROPIC_API_KEY="your-anthropic-key"
export DEEPSEEK_API_KEY="your-deepseek-key"
```
### Verify Setup

```bash
# Test model connection
python -c "from rm_gallery.core.model.openai_llm import OpenaiLLM; llm = OpenaiLLM(model='gpt-4o-mini'); print('Model connection successful')"
```
## Usage Examples
### Basic Evaluation

Evaluate a small subset of the RMB dataset:

```bash
python rm_gallery/gallery/evaluation/rmb.py \
    --data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set" \
    --result_path="data/results/rmb_basic.json" \
    --max_samples=50 \
    --model="gpt-4o-mini" \
    --max_workers=4
```
### Full Dataset Evaluation

Run a larger evaluation over the full Pairwise set (capped here at 1000 samples):

```bash
python rm_gallery/gallery/evaluation/rmb.py \
    --data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set" \
    --result_path="data/results/rmb_full.json" \
    --max_samples=1000 \
    --model="gpt-4o" \
    --max_workers=8
```
### Specific Category Evaluation

Evaluate only helpfulness scenarios:

```bash
python rm_gallery/gallery/evaluation/rmb.py \
    --data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set/Helpfulness" \
    --result_path="data/results/rmb_helpfulness.json" \
    --max_samples=200 \
    --model="claude-3-5-sonnet-20241022"
```
### Model Comparison

Compare different models on the same dataset:

```bash
# Evaluate GPT-4o
python rm_gallery/gallery/evaluation/rmb.py \
    --data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set" \
    --result_path="data/results/rmb_gpt4o.json" \
    --model="gpt-4o" \
    --max_samples=100

# Evaluate Claude
python rm_gallery/gallery/evaluation/rmb.py \
    --data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set" \
    --result_path="data/results/rmb_claude.json" \
    --model="claude-3-5-sonnet-20241022" \
    --max_samples=100
```
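Once both runs finish, you can compare them programmatically. A minimal sketch, assuming both result files follow the Output Format documented below:

```python
import json

# Load both result files from the runs above
with open("data/results/rmb_gpt4o.json") as f:
    gpt4o = json.load(f)
with open("data/results/rmb_claude.json") as f:
    claude = json.load(f)

# Overall accuracy side by side
print(f"gpt-4o: {gpt4o['overall_accuracy']['accuracy']:.2%}")
print(f"claude: {claude['overall_accuracy']['accuracy']:.2%}")

# Per-category gap, for categories present in both runs
for category in gpt4o["subset_accuracy"].keys() & claude["subset_accuracy"].keys():
    diff = (gpt4o["subset_accuracy"][category]["accuracy"]
            - claude["subset_accuracy"][category]["accuracy"])
    print(f"{category}: gpt-4o {diff:+.2%} vs claude")
```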
## Configuration Parameters

### Command Line Arguments
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_path` | str | `"data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set"` | Path to RMB dataset |
| `result_path` | str | `"data/results/rmb.json"` | Output file path |
| `max_samples` | int | `10` | Maximum number of samples to evaluate |
| `model` | str/dict | `"qwen3-32b"` | Model identifier or configuration |
| `max_workers` | int | `8` | Number of parallel processing workers |
### Advanced Configuration

For advanced usage, you can pass the model configuration as a dictionary:

```python
# Example custom model configuration
model_config = {
    "model": "gpt-4o",
    "temperature": 0.1,
    "max_tokens": 2048,
    "timeout": 60
}
```

On the command line, pass the same configuration as a JSON string:

```bash
# Run with custom configuration
python rm_gallery/gallery/evaluation/rmb.py \
    --model='{"model": "gpt-4o", "temperature": 0.1, "max_tokens": 2048}'
```
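If you prefer to drive the evaluation from Python, the entry point imported in Step 2 can presumably be called directly. This is a sketch only: it assumes `main()` accepts the same keyword arguments as the CLI flags, which is not confirmed here, so verify against the module source:

```python
from rm_gallery.gallery.evaluation.rmb import main

# Assumption: main() mirrors the CLI flags shown in the table above.
main(
    data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set",
    result_path="data/results/rmb_custom.json",
    max_samples=10,
    model={"model": "gpt-4o", "temperature": 0.1, "max_tokens": 2048},
)
```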
## Dataset Structure

The RMB dataset is organized as follows:

```text
RMB_dataset/
├── Pairwise_set/
│   ├── Helpfulness/
│   │   ├── Brainstorming/
│   │   ├── Chat/
│   │   ├── Classification/
│   │   ├── Code/
│   │   ├── Math/
│   │   └── ...
│   └── Harmlessness/
│       ├── Safety/
│       ├── Toxicity/
│       └── ...
└── Best_of_N_set/
    └── ...
```
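To check which categories exist in your local checkout, a small script can walk this tree (paths assumed from the Step 1 clone):

```python
from pathlib import Path

# Enumerate dimension/category folders under the Pairwise_set
root = Path("data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set")
for dimension in sorted(p for p in root.iterdir() if p.is_dir()):
    for category in sorted(p for p in dimension.iterdir() if p.is_dir()):
        print(f"{dimension.name}/{category.name}")
```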
## Output Format

### Results Structure

The evaluation generates a JSON file with the following structure:
```json
{
  "model": "gpt-4o-mini",
  "overall_accuracy": {
    "accuracy": 0.75,
    "correct_count": 75,
    "valid_samples": 100,
    "total_samples": 100,
    "choice_distribution": {
      "0": 45,
      "1": 55
    }
  },
  "subset_accuracy": {
    "Helpfulness/Brainstorming": {
      "accuracy": 0.80,
      "correct_count": 20,
      "valid_samples": 25,
      "total_samples": 25
    },
    "Harmlessness/Safety": {
      "accuracy": 0.70,
      "correct_count": 14,
      "valid_samples": 20,
      "total_samples": 20
    }
  }
}
```
### Metrics Explanation
- `accuracy`: Fraction of correct preference selections (`correct_count / valid_samples`)
- `correct_count`: Number of samples where the model chose the preferred response
- `valid_samples`: Number of successfully processed samples
- `total_samples`: Total number of input samples
- `choice_distribution`: Distribution of selected responses (`0` for the first response, `1` for the second)
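These fields are mutually consistent in the sample output above, which makes a quick sanity check possible:

```python
# Sanity-check the metric relationships using the numbers from the sample output
correct_count, valid_samples = 75, 100
choice_distribution = {"0": 45, "1": 55}

assert correct_count / valid_samples == 0.75               # matches "accuracy"
assert sum(choice_distribution.values()) == valid_samples  # every valid sample picked 0 or 1
```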
### Expected Output

When running the evaluation, you should see output similar to:

```text
$ python rm_gallery/gallery/evaluation/rmb.py --max_samples=10
INFO - Starting RMB evaluation...
INFO - Loading dataset from: data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set
INFO - Model: gpt-4o-mini
INFO - Processing 10 samples...
INFO - Evaluation completed!
INFO - Results saved to: data/results/rmb.json
INFO - Overall accuracy: 75.0%
INFO - Valid samples: 10/10
```
## Integration with Other Tools

### Combining with JudgeBench

You can run both RMB and JudgeBench evaluations for comprehensive assessment:
```bash
# Run RMB evaluation
python rm_gallery/gallery/evaluation/rmb.py \
    --data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set" \
    --result_path="data/results/rmb_results.json" \
    --model="gpt-4o-mini"

# Run JudgeBench evaluation
python rm_gallery/gallery/evaluation/judgebench.py \
    --data_path="data/benchmarks/JudgeBench/data/dataset=judgebench,response_model=gpt-4o-2024-05-13.jsonl" \
    --result_path="data/results/judgebench_results.json" \
    --model="gpt-4o-mini"
```
### Batch Processing

For processing multiple categories in batch:

```bash
#!/bin/bash
# batch_rmb_eval.sh

categories=(
    "Helpfulness/Brainstorming"
    "Helpfulness/Chat"
    "Helpfulness/Code"
    "Harmlessness/Safety"
)

for category in "${categories[@]}"; do
    echo "Processing category: $category"
    python rm_gallery/gallery/evaluation/rmb.py \
        --data_path="data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/Pairwise_set/$category" \
        --result_path="data/results/rmb_${category//\//_}.json" \
        --model="gpt-4o-mini" \
        --max_samples=50
done
```
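After the batch finishes, the per-category files can be rolled up into one summary. A minimal sketch, assuming each file follows the Output Format above:

```python
import glob
import json

# Summarize every per-category result file written by batch_rmb_eval.sh
for path in sorted(glob.glob("data/results/rmb_*.json")):
    with open(path) as f:
        results = json.load(f)
    overall = results.get("overall_accuracy", {})
    print(f"{path}: accuracy={overall.get('accuracy')}, "
          f"valid={overall.get('valid_samples')}/{overall.get('total_samples')}")
```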
## Troubleshooting

### Common Issues
1. **Dataset not found**
   ```bash
   # Ensure you've cloned the repository
   ls data/benchmarks/RMB-Reward-Model-Benchmark/RMB_dataset/
   ```
2. **Model API errors**
   ```bash
   # Check API key configuration
   echo $OPENAI_API_KEY
   ```
3. **Memory issues with large datasets**
   ```bash
   # Reduce max_samples and max_workers
   python rm_gallery/gallery/evaluation/rmb.py --max_samples=10 --max_workers=2
   ```
4. **Import errors**
   ```bash
   # Verify installation
   python -c "import rm_gallery.gallery.evaluation.rmb"
   ```
### Performance Optimization
- **Parallel Processing**: Increase `max_workers` based on your system capacity
- **Batch Size**: Adjust `max_samples` to balance memory usage and processing time
- **Model Selection**: Use faster models for initial testing, more capable models for final evaluation
### Error Resolution
If you encounter evaluation errors:
1. Check the error logs in the output
2. Verify dataset file integrity
3. Confirm model API accessibility
4. Reduce concurrency if rate-limited (see the backoff sketch below)
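If rate limits are the issue and lowering `max_workers` alone doesn't help, a generic retry-with-backoff wrapper is one option. This is a sketch only; the evaluation script may already retry internally:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.random())
```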
## Best Practices
1. **Start Small**: Begin with a small `max_samples` value for testing
2. **Monitor Resources**: Watch CPU and memory usage during evaluation
3. **Save Intermediate Results**: Use different output paths for different experiments
4. **Validate Results**: Review accuracy metrics and choice distributions
5. **Document Experiments**: Keep track of model configurations and results
## Advanced Features
### Custom Templates
The RMB evaluation uses structured templates for consistent evaluation. You can modify the `RMBTemplate.format()` method to implement custom evaluation criteria.
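As a rough illustration of the idea (hypothetical: the actual `RMBTemplate` interface is not documented here, so check the source for the real signature):

```python
# Hypothetical sketch; the real RMBTemplate signature may differ.
from rm_gallery.gallery.evaluation.rmb import RMBTemplate  # import path assumed

class StricterTemplate(RMBTemplate):
    @classmethod
    def format(cls, **kwargs):
        # Reuse the stock prompt, then append an extra evaluation criterion
        prompt = super().format(**kwargs)
        return prompt + "\nWhen responses are equally helpful, prefer the safer one."
```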
### Result Analysis

After evaluation, you can analyze results programmatically:

```python
import json

# Load results
with open("data/results/rmb_results.json", "r") as f:
    results = json.load(f)

# Analyze category performance
for category, metrics in results["subset_accuracy"].items():
    print(f"{category}: {metrics['accuracy']:.2%}")
```
This tutorial provides a comprehensive guide to using the RMB benchmark evaluation system. For additional support or questions, refer to the original RMB repository documentation.