Extend OpenJudge beyond its built-in evaluators by creating custom graders or training reward models. Build domain-specific evaluation logic that plugs directly into OpenJudge's evaluation pipeline.
## Why Build Custom Graders?
While OpenJudge provides 50+ pre-built graders, custom graders enable you to evaluate industry-specific criteria (legal, medical, financial), implement proprietary scoring logic, and train models that learn from your preference data. They also help optimize costs by replacing expensive API judges with self-hosted models while maintaining consistent evaluation standards across applications.
## Building Approaches
OpenJudge supports three paths for creating custom graders, each optimized for different scenarios.
| Approach | Time to Deploy | Data Required | Best For | Cost Profile |
|---|---|---|---|---|
| Create Custom Graders | Minutes | None | Quick prototyping, domain-specific logic | Pay-per-query (API) or free (code-based) |
| Generate from Data | 1-4 hours | 50-500 examples | Iterative refinement, transparent rubrics | Medium setup + pay-per-query |
| Train Reward Models | 1-3 days | 1K-100K pairs | High-volume production (>1M queries/month) | High upfront, 10x lower per-query |
Use this decision tree to choose the right approach based on your data availability and requirements:
```text
START
  │
  ▼
┌─────────────────────┐
│   Have evaluation   │
│   data with labels? │
└────┬────────────┬───┘
     │            │
 YES │            │ NO
     │            │
     ▼            ▼
┌──────────────┐ ┌──────────────────┐
│   Want to    │ │ Need evaluation  │
│ train model? │ │      now?        │
└───┬──────┬───┘ └─────┬──────────┬─┘
    │      │           │          │
YES │      │ NO    YES │          │ NO
    │      │           │          │
    ▼      ▼           ▼          ▼
┌──────┐ ┌──────────┐ ┌────────┐ ┌──────────┐
│Train │ │Generator │ │Custom  │ │ Define   │
│Model │ │ (Rubric) │ │Graders │ │ criteria │
└──────┘ └──────────┘ └────────┘ └──────────┘
   │          │           │           │
   └──────────┴─────┬─────┴───────────┘
                    │
                    ▼
    ┌───────────────────────────────┐
    │  Use in evaluation pipeline   │
    │  (GradingRunner, batch eval)  │
    └───────────────────────────────┘
```
Choose based on your situation:
- Have labeled data + need automation? → Train a reward model
- Have data + need fast iteration? → Generate rubrics from data
- No data + need immediate results? → Create custom graders
### Approach 1: Create Custom Graders
Define evaluation logic using LLM judges or code-based functions, with no training required. LLM-based graders use models such as qwen3-32b with custom prompts for domain-specific criteria. Code-based graders implement deterministic logic: checking response length, keyword presence, format validation, or compliance requirements.
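As an illustration, here is a minimal sketch of a code-based grader enforcing a compliance requirement. The `GradeResult` shape and the `grade` method name are assumptions made for this example, not OpenJudge's documented interface:

```python
from dataclasses import dataclass

# Hypothetical result shape; OpenJudge's actual grader interface may differ.
@dataclass
class GradeResult:
    score: float  # normalized to [0.0, 1.0]
    reason: str   # human-readable explanation of the score

class ComplianceGrader:
    """Deterministic grader: enforces a length limit and a required disclaimer."""

    MAX_WORDS = 300
    REQUIRED_PHRASE = "not financial advice"

    def grade(self, response: str) -> GradeResult:
        if len(response.split()) > self.MAX_WORDS:
            return GradeResult(0.0, f"Response exceeds {self.MAX_WORDS} words")
        if self.REQUIRED_PHRASE not in response.lower():
            return GradeResult(0.5, f"Missing required phrase: {self.REQUIRED_PHRASE!r}")
        return GradeResult(1.0, "Length and compliance checks passed")

# Usage
result = ComplianceGrader().grade("This is not financial advice, but consider indexing.")
print(result.score, result.reason)
```

Because the logic is pure code, graders like this run for free, return identical scores on identical inputs, and are easy to unit-test alongside your application.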
Learn more: Create Custom Graders → | Built-in Graders →
### Approach 2: Generate Graders from Data
Automatically analyze evaluation data to create structured scoring rubrics. Provide 50-500 labeled examples, and the generator extracts patterns from them to build interpretable criteria. Generated graders produce explicit rubrics that explain each scoring decision, making them ideal for scenarios that require transparency and rapid refinement.
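To make the input and output shapes concrete, the sketch below shows one way rubric extraction could be prompted. The helper name, the 1-5 label scale, and the prompt wording are all illustrative assumptions, not OpenJudge's documented generator API:

```python
import json

def build_rubric_prompt(examples: list[dict]) -> str:
    """Assemble an LLM prompt that derives scoring criteria from labeled
    examples. Each example is assumed to look like:
    {"prompt": ..., "response": ..., "label": 1-5}."""
    high = [e for e in examples if e["label"] >= 4]
    low = [e for e in examples if e["label"] <= 2]
    return (
        "Derive 3-5 scoring criteria that distinguish the high-scoring "
        "responses from the low-scoring ones. Return one criterion per line.\n\n"
        f"HIGH-SCORING EXAMPLES:\n{json.dumps(high[:10], indent=2)}\n\n"
        f"LOW-SCORING EXAMPLES:\n{json.dumps(low[:10], indent=2)}"
    )
```

The extracted criteria then become the rubric an LLM judge scores against, which is what makes this approach transparent: every scoring decision traces back to an explicit, human-readable criterion you can inspect and edit.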
Learn more: Generate Graders from Data →
### Approach 3: Train Reward Models
Train neural networks on preference data to learn evaluation criteria automatically. Three training paradigms are supported: Bradley-Terry (preference pairs), Generative Pointwise (absolute scores), and Generative Pairwise (comparison decisions). Training requires 1K-100K examples and 1-3 days, but delivers highly consistent evaluation at 10x lower per-query cost, making it ideal for high-volume scenarios exceeding 1M queries per month.
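For intuition on the first paradigm, the Bradley-Terry objective trains the model to assign a higher scalar reward to the preferred response in each pair. A minimal PyTorch sketch of the loss (the reward model itself is abstracted away here; any module mapping a response to a scalar works):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response beats the rejected one:
    loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Rewards a reward model might produce for a batch of three preference pairs
chosen = torch.tensor([1.2, 0.7, 2.1])
rejected = torch.tensor([0.3, 0.9, 1.0])
print(bradley_terry_loss(chosen, rejected))  # backpropagate this during training
```

Minimizing this loss pushes the reward margin between chosen and rejected responses apart, so at inference time a single forward pass yields a score, with no per-query LLM judge call.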
## Next Steps
- Create Custom Graders — Build graders using LLM or code-based logic
- Generate Graders from Data — Auto-generate rubrics from labeled data
- Built-in Graders — Explore pre-built graders to customize
- Run Grading Tasks — Deploy graders at scale with batch workflows