Ensure your graders make accurate, reliable judgments by validating them against datasets with known ground truth—just as you would test any critical system before production.
Graders are evaluation systems that judge AI outputs, but they need evaluation themselves. Without validation, you risk deploying evaluators that introduce systematic errors, favor certain response types, or fail on your specific use cases. Validation quantifies grader accuracy, identifies biases, and builds confidence that your evaluation pipeline measures what it should.
Core Workflow: Run your grader on test cases with known correct answers → Compare predictions against ground truth → Analyze accuracy and error patterns → Refine and re-validate until quality thresholds are met.
How to Validate a Grader
Validation compares your grader's judgments against known ground truth. Run your grader on a validation dataset, compare predictions with ground truth labels, compute accuracy metrics, and generate a validation report to measure evaluation quality.
Approach 1: Benchmark Validation
Validate against public benchmarks with standardized ground truth. This approach works well when you need reproducible results, want to compare with published baselines, or need quick validation for general-purpose graders without collecting custom data. We currently support RewardBench2 for multi-domain response quality evaluation, with MT-Bench and AlpacaEval coming soon.
Example:
```python
import asyncio

from openjudge.analyzer import ValidationAnalyzer
from openjudge.graders.llm_grader import LLMGrader
from openjudge.models import OpenAIChatModel
from openjudge.runner import GradingRunner


async def validate(rewardbench2_dataset):
    # Create the grader backed by an LLM judge
    model = OpenAIChatModel(model="qwen3-32b")
    grader = LLMGrader(name="my_grader", model=model)

    # Run the grader on the benchmark
    runner = GradingRunner(grader_configs={"my_grader": grader})
    results = await runner.arun(rewardbench2_dataset)

    # Compare predictions against ground truth and report accuracy
    analyzer = ValidationAnalyzer()
    report = analyzer.analyze(
        dataset=rewardbench2_dataset,
        grader_results=results["my_grader"],
    )
    print(f"Accuracy: {report.metadata['accuracy']:.2%}")

# Run inside an event loop, e.g. asyncio.run(validate(rewardbench2_dataset))
```
Approach 2: Custom Validation
Build validation pipelines tailored to your domain and evaluation criteria. Use this approach for domain-specific tasks (legal, medical, finance), proprietary test sets, or when you need full control over validation methodology.
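As a rough sketch of the custom route, the snippet below assumes you have already collected grader verdicts and ground-truth labels for your own test cases; the `cases` structure and its field names are illustrative only, not part of the OpenJudge API. It simply compares verdicts to labels, reports accuracy, and lists the failed cases for review.

```python
# Custom validation sketch: compare grader verdicts to ground-truth labels.
# The `cases` structure and field names are illustrative, not an OpenJudge API.
cases = [
    {"id": "q1", "grader_verdict": "A", "ground_truth": "A"},
    {"id": "q2", "grader_verdict": "B", "ground_truth": "A"},
    {"id": "q3", "grader_verdict": "B", "ground_truth": "B"},
]

errors = [c for c in cases if c["grader_verdict"] != c["ground_truth"]]
accuracy = 1 - len(errors) / len(cases)
print(f"Accuracy: {accuracy:.2%}")

# Review failed cases to spot systematic error patterns.
for c in errors:
    print(f"{c['id']}: predicted {c['grader_verdict']}, expected {c['ground_truth']}")
```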
Validation Metrics:
Choose metrics based on your evaluation task:
For graders that rank or select best responses:
| Metric | When to Use | Interpretation |
|---|---|---|
| Accuracy | Binary classification (correct/incorrect) | % of times grader selects correct answer |
| Top-K Accuracy | Multiple acceptable answers | % of times correct answer is in top K predictions |
| Mean Reciprocal Rank (MRR) | Ranking quality | Average reciprocal rank of correct answer |
| Kendall's Tau | Full ranking correlation | Agreement between predicted and true rankings |
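These ranking metrics can be computed with plain Python and SciPy. The sketch below uses made-up rankings rather than real grader output, and the data layout (ranked candidate indices plus the index of the correct response) is an assumption for illustration, not an OpenJudge format.

```python
from scipy.stats import kendalltau

# Illustrative data: each item holds the grader's ranked candidate indices
# (best first) and the index of the ground-truth best response.
predicted_rankings = [[2, 0, 1], [0, 1, 2], [1, 2, 0]]
correct_indices = [2, 1, 0]
n = len(correct_indices)

# Accuracy: the grader's top choice matches the ground truth.
accuracy = sum(r[0] == c for r, c in zip(predicted_rankings, correct_indices)) / n

# Top-K accuracy: the ground truth appears among the grader's top K choices.
k = 2
top_k = sum(c in r[:k] for r, c in zip(predicted_rankings, correct_indices)) / n

# Mean Reciprocal Rank: average of 1 / (rank of the correct answer).
mrr = sum(1.0 / (r.index(c) + 1) for r, c in zip(predicted_rankings, correct_indices)) / n

print(f"Accuracy: {accuracy:.2%}, Top-{k}: {top_k:.2%}, MRR: {mrr:.3f}")

# Kendall's tau compares a full predicted ranking against the true ranking.
tau, _ = kendalltau([3, 1, 2, 4], [3, 2, 1, 4])
print(f"Kendall's tau: {tau:.3f}")
```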
For graders that output continuous scores:
| Metric | When to Use | Interpretation |
|---|---|---|
| Pearson Correlation | Linear relationship | How well scores correlate with ground truth |
| Spearman Correlation | Ranking correlation | Agreement in relative ordering |
| Mean Absolute Error (MAE) | Score accuracy | Average distance from ground truth scores |
| F1 Score | Binary threshold (pass/fail) | Balance between precision and recall |
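For score-based graders, a similar sketch using SciPy and scikit-learn (again on made-up scores, not an OpenJudge output format) covers the four metrics above:

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import f1_score

# Illustrative data: grader scores vs. ground-truth scores on a 0-1 scale.
grader_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
true_scores = [1.0, 0.3, 0.6, 0.1, 0.9]

pearson, _ = pearsonr(grader_scores, true_scores)
spearman, _ = spearmanr(grader_scores, true_scores)
mae = sum(abs(g - t) for g, t in zip(grader_scores, true_scores)) / len(true_scores)

# F1 after thresholding both sets of scores into pass/fail at 0.5.
threshold = 0.5
grader_pass = [int(s >= threshold) for s in grader_scores]
true_pass = [int(s >= threshold) for s in true_scores]
f1 = f1_score(true_pass, grader_pass)

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}, MAE: {mae:.3f}, F1: {f1:.3f}")
```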
See the cookbook for complete examples: tutorials/cookbooks/grader_validation/
Best Practices
Data Quality: Use diverse test cases with 100+ samples for reliable accuracy estimates. Ensure balanced categories (20+ samples each), clear ground truth from high-agreement annotations, and hold-out sets that never overlap with training data.
Validation Design: Design validation to mirror real-world use cases. Randomize answer positions to prevent position bias, run multiple times with different seeds for robustness, and review failed cases to identify systematic issues.
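As one way to put position randomization and multi-seed runs into practice, the sketch below randomizes which slot the known-correct response occupies and repeats the run across seeds. Both `judge` and `validation_items` are hypothetical placeholders, not OpenJudge APIs.

```python
import random

def seeded_accuracy(items, judge, seed):
    """Pairwise accuracy with the correct response's position randomized.

    `judge(question, first, second)` is a hypothetical callable returning
    "first" or "second"; each item is (question, correct_response, other_response).
    """
    rng = random.Random(seed)
    hits = 0
    for question, good, bad in items:
        if rng.random() < 0.5:
            hits += judge(question, good, bad) == "first"   # correct answer shown first
        else:
            hits += judge(question, bad, good) == "second"  # correct answer shown second
    return hits / len(items)

# Run with several seeds; a wide spread in accuracy suggests position
# sensitivity or other instability worth investigating.
# accuracies = [seeded_accuracy(validation_items, judge, seed=s) for s in range(5)]
```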
Result Interpretation: Remember that 70% accuracy may be excellent or poor depending on task difficulty. Compare against baselines and existing graders, analyze per-category performance to uncover hidden weaknesses, and use confidence intervals for small validation sets.
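For small validation sets, a bootstrap confidence interval is an easy way to qualify the headline accuracy. A minimal sketch in plain Python, operating on per-case correctness flags:

```python
import random

def bootstrap_accuracy_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for accuracy.

    `outcomes` is a list of 1/0 flags marking whether the grader was
    correct on each validation case.
    """
    rng = random.Random(seed)
    resampled = []
    for _ in range(n_resamples):
        sample = [rng.choice(outcomes) for _ in outcomes]
        resampled.append(sum(sample) / len(sample))
    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

outcomes = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # illustrative correctness flags
low, high = bootstrap_accuracy_ci(outcomes)
print(f"Accuracy {sum(outcomes) / len(outcomes):.0%} (95% CI: {low:.0%}-{high:.0%})")
```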
Tip: For advanced validation, try k-fold cross-validation to test robustness, adversarial testing with challenging cases, and confidence calibration to verify if grader confidence correlates with actual accuracy.
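For the calibration check in particular, one simple approach is to bucket validation cases by the grader's stated confidence and compare each bucket's accuracy against that confidence. The sketch below uses made-up (confidence, correct) pairs rather than any specific OpenJudge output:

```python
from collections import defaultdict

# Illustrative (confidence, was_correct) pairs from a validation run.
records = [(0.95, 1), (0.92, 1), (0.85, 0), (0.72, 1), (0.64, 0), (0.55, 0)]

# Bucket cases into 0.1-wide confidence bins.
buckets = defaultdict(list)
for confidence, correct in records:
    bin_floor = min(int(confidence * 10), 9) / 10  # e.g. 0.72 -> the 0.7 bin
    buckets[bin_floor].append(correct)

for bin_floor in sorted(buckets):
    hits = buckets[bin_floor]
    accuracy = sum(hits) / len(hits)
    print(f"confidence {bin_floor:.1f}-{bin_floor + 0.1:.1f}: "
          f"accuracy {accuracy:.0%} over {len(hits)} cases")
# In a well-calibrated grader, accuracy rises in step with the confidence bin.
```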
Next Steps
- RewardBench2 Validation — Validate on a multi-domain benchmark
- Grader Analysis — Generate comprehensive validation reports
- Create Custom Graders — Build custom validation pipelines