Custom graders allow you to define precisely how you want to evaluate AI model responses when built-in evaluation tools don't meet your specific needs. This guide helps you build the right grader for your task by following a structured approach to grader design.
Tip
Before creating custom graders, review the Core Concepts to understand how graders fit into the OpenJudge ecosystem.
Understanding Your Evaluation Needs
Before diving into implementation, it's essential to clearly define what you want to evaluate and in what scenario. Consider whether you're measuring objective properties like length and keyword presence or subjective qualities such as helpfulness and coherence. Determine if you need absolute scores or relative rankings, and think about what constitutes a "good" response in your particular use case.
Depending on your objectives, evaluation can take several forms. For quality assessment, the focus might be on whether responses are factually accurate, effectively address the user's query, maintain a coherent and logical structure, or stay relevant to the topic at hand.
Compliance-focused evaluations serve a different purpose, ensuring that responses adhere to specific guidelines. This could mean verifying the correct format has been used, confirming that the content aligns with safety policies by avoiding harmful material, or simply checking that the model has followed all explicit instructions provided in the prompt.
In contrast, comparative evaluations are designed to rank or select from multiple options. This includes identifying the best-performing model among several candidates, ranking different responses to the same query by quality, or conducting A/B tests to see which version of a prompt yields superior results.
Choosing the Right Approach
Based on your evaluation needs, you'll need to choose both an evaluation approach (how to structure the evaluation) and an implementation method (how to execute the evaluation).
Evaluation Approaches
The Pointwise approach evaluates each response independently, resulting in a score or classification. It is particularly well-suited for measuring absolute quality, determining if a response meets a specific standard, assessing objective properties, or verifying compliance with fixed rules like formatting or policy guidelines.
Conversely, the Listwise approach is inherently comparative. It works by directly comparing multiple responses to the same query, producing a relative ranking. This method is the natural choice when your goal is to select the best candidate from a set of responses or perform a direct head-to-head comparison between models or prompts.
Implementation Methods
Code-Based graders rely on predefined, programmed logic and are most effective for objective assessments. They excel when evaluating quantifiable metrics like response length or keyword presence, where the criteria are clear and unambiguous. Their deterministic nature makes them highly reproducible and cost-effective, especially for high-volume evaluations.
LLM-Based graders leverage the language understanding capabilities of large models (such as GPT-4 or Qwen) to make nuanced judgments. They are ideal for subjective assessments that require an understanding of context and meaning, such as judging helpfulness, coherence, or overall quality. These graders are also the preferred choice when you need rich, detailed feedback and explanations for their scores.
Decision Guide
| Scenario | Approach | Method | Why |
|---|---|---|---|
| Objective properties (length, keywords) | Pointwise | Code-Based | Deterministic, fast, cost-effective |
| Subjective qualities (helpfulness, coherence) | Pointwise | LLM-Based | Handles nuanced judgments |
| Response comparison/selection | Listwise | Either | LLM-Based for quality insight, Code-Based for simplicity |
| High-volume evaluation | Either | Code-Based | Cost-effective at scale |
| Detailed feedback needed | Either | LLM-Based | Rich qualitative output |
You can combine approaches—using both LLM-Based and Code-Based graders—for comprehensive evaluation.
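For example, a simple way to blend the two is a weighted average of their scores. The sketch below is a hypothetical helper, not part of OpenJudge: it assumes you already have a GraderScore result (introduced later in this guide) from each grader, that its score and reason are accessible as attributes, and the 0.7 weight is an arbitrary choice.

```python
from openjudge.graders.schema import GraderScore

def combine_scores(llm_result: GraderScore, code_result: GraderScore,
                   llm_weight: float = 0.7) -> GraderScore:
    """Blend a subjective LLM-Based score with an objective Code-Based score.

    Hypothetical helper for illustration; attribute access and weighting are assumptions.
    """
    combined = llm_weight * llm_result.score + (1 - llm_weight) * code_result.score
    return GraderScore(
        name="combined_quality",
        score=combined,
        reason=f"LLM: {llm_result.reason} | Code: {code_result.reason}",
    )
```

In practice, the aggregators mentioned in the running section below serve this purpose; the helper simply illustrates the idea.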
Implementing Custom Graders
Once you've determined the appropriate approach and implementation method, you can begin developing your custom grader.
Essential Design Principles
When developing custom graders, ensure they are robust, maintainable, and effective by following core principles:
Core Design Principles
- Explicit Definitions: Establish clear input/output definitions and implement proper error handling
- Predictable Scoring: Use consistent score ranges:
  - Binary outcomes: 0.0 (failure) to 1.0 (success)
  - Graded evaluations: 0-1 or 1-5 scale
  - Rankings: Positive integers starting from 1 (highest rank)
from openjudge.graders.schema import GraderScore

async def evaluate_helpfulness(query: str, response: str) -> GraderScore:
    """Evaluate response helpfulness.

    Args:
        query: The original user query
        response: The model's response to evaluate

    Returns:
        GraderScore with score between 0.0-1.0 and explanation
    """
    try:
        # Your evaluation logic here
        return GraderScore(
            name="helpfulness_evaluator",
            score=calculate_helpfulness_score(query, response),
            reason="Evaluation successful"
        )
    except Exception as e:
        # Return a default score with error information
        return GraderScore(
            name="helpfulness_evaluator",
            score=0.0,
            reason=f"Evaluation failed: {str(e)}"
        )
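On the score ranges listed above, if a rubric is phrased on a 1-5 scale but the rest of your pipeline expects 0-1 values, a small normalization step keeps scoring predictable. This is a hypothetical convenience function, not an OpenJudge API:

```python
def rubric_to_unit_interval(rubric_score: int, low: int = 1, high: int = 5) -> float:
    """Map a 1-5 rubric score onto the 0.0-1.0 range (1 -> 0.0, 5 -> 1.0)."""
    if not low <= rubric_score <= high:
        raise ValueError(f"Rubric score {rubric_score} outside [{low}, {high}]")
    return (rubric_score - low) / (high - low)
```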
LLM-Based Grader Implementation
To create effective LLM-Based graders:
LLM Grader Components
- Role Definition: Establish the LLM as an expert evaluator
- Clear Instructions: Provide detailed guidance on what to evaluate and how to score
- Scoring Rubric: Define what each score means
- Output Format: Specify the exact JSON structure for responses
from openjudge.graders.llm_grader import LLMGrader
from openjudge.models.openai_chat_model import OpenAIChatModel

# Define your model
model = OpenAIChatModel(
    model="qwen3-32b",
    api_key="your-api-key"
)

# Create your grader with a well-engineered prompt
helpfulness_grader = LLMGrader(
    name="helpfulness_evaluator",
    mode="pointwise",
    model=model,
    template="""
You are an expert evaluator assessing the helpfulness of AI responses.

Instructions:
1. Consider accuracy, completeness, clarity, and relevance
2. Score 0.0 for completely unhelpful responses
3. Score 1.0 for exceptionally helpful responses
4. Score in between for partial helpfulness

Query: {query}
Response: {response}

Provide your response in JSON format:
{
  "score": <numerical_score_between_0_and_1>,
  "reason": "<brief_explanation_for_score>"
}
""",
    description="Evaluates how helpful a response is to the given query"
)
Tip
Incorporate examples of good and poor responses when possible to improve consistency.
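For instance, the helpfulness template above could carry one worked example. The example pair below is illustrative only, and the placeholder and brace conventions simply mirror the template shown earlier:

```python
# A variant of the helpfulness template with one worked example embedded.
# The example pair is illustrative; calibrate it to your own domain.
few_shot_template = """
You are an expert evaluator assessing the helpfulness of AI responses.

Instructions:
1. Consider accuracy, completeness, clarity, and relevance
2. Score 0.0 for completely unhelpful responses
3. Score 1.0 for exceptionally helpful responses
4. Score in between for partial helpfulness

Example:
Query: How do I reset my password?
Response: Click "Forgot password" on the login page and follow the emailed link.
Expected score: 0.9 (direct, accurate, slightly terse)

Query: {query}
Response: {response}

Provide your response in JSON format:
{
  "score": <numerical_score_between_0_and_1>,
  "reason": "<brief_explanation_for_score>"
}
"""
```

Pass this string as the template argument to LLMGrader exactly as in the previous example.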
Listwise LLM-Based Example: Response Comparator
For comparative evaluations, you can create graders that directly compare multiple responses:
# Create your comparison grader
comparison_grader = LLMGrader(
    name="response_comparator",
    mode="listwise",
    model=model,
    template="""
You are an expert judge comparing AI responses to the same query.

Instructions:
1. Compare overall quality, considering accuracy and helpfulness
2. Rank from best (1) to worst (2)
3. Explain your reasoning briefly

Query: {query}
Response 1: {response_1}
Response 2: {response_2}

Provide your response in JSON format:
{
  "rank": [<better_response_number>, <worse_response_number>],
  "reason": "<brief_explanation_for_ranking>"
}
""",
    description="Ranks two responses by quality"
)
Code-Based Grader Implementation
Code-Based Grader Best Practices
Effective Code-Based graders should have:
- Transparent Logic: Clear, understandable evaluation rules
- Modular Design: Separate concerns for maintainability
- Edge Case Handling: Robust error handling
- Consistent Scoring: Predictable score ranges
Pointwise Code-Based Example: Content Quality Checker
from openjudge.graders.function_grader import FunctionGrader
from openjudge.graders.schema import GraderScore

async def content_quality_checker(query: str, response: str) -> GraderScore:
    """Check content quality based on multiple criteria."""
    # Define quality criteria
    min_length = 20
    required_sections = ["introduction", "body", "conclusion"]

    # Check length
    length_score = min(len(response) / 100.0, 1.0)
    length_pass = len(response) >= min_length

    # Check for required sections
    section_scores = []
    for section in required_sections:
        section_found = section.lower() in response.lower()
        section_scores.append(1.0 if section_found else 0.0)
    section_score = sum(section_scores) / len(required_sections)

    # Calculate overall score
    overall_score = (length_score + section_score) / 2.0

    # Generate reason
    reasons = []
    if length_pass:
        reasons.append(f"Length OK ({len(response)} chars)")
    else:
        reasons.append(f"Too short ({len(response)} chars)")
    found_sections = [sec for i, sec in enumerate(required_sections) if section_scores[i] > 0]
    missing_sections = [sec for i, sec in enumerate(required_sections) if section_scores[i] == 0]
    if found_sections:
        reasons.append(f"Found sections: {', '.join(found_sections)}")
    if missing_sections:
        reasons.append(f"Missing sections: {', '.join(missing_sections)}")

    return GraderScore(
        name="content_quality_checker",
        score=overall_score,
        reason="; ".join(reasons)
    )

# Create the grader
content_quality_grader = FunctionGrader(
    func=content_quality_checker,
    name="content_quality",
    mode="pointwise"
)
Advanced Techniques
When developing Code-Based graders, consider the following techniques; several of them are combined in the sketch after this list:
- Compiled Regex: Use for complex pattern matching
- Weighted Scoring: Assign different weights to criteria
- Clear Thresholds: Define explicit pass/fail boundaries
- Metric Combination: Combine multiple simple metrics into complex evaluations
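The sketch below combines a compiled regex, weighted scoring, and an explicit threshold in one pointwise grader. It assumes the GraderScore and FunctionGrader APIs shown earlier; the bullet pattern, the 0.6/0.4 weights, and the 0.6 pass threshold are arbitrary illustrative choices:

```python
import re

from openjudge.graders.function_grader import FunctionGrader
from openjudge.graders.schema import GraderScore

# Compiled once at module load so it is reused across many evaluations
BULLET_PATTERN = re.compile(r"^\s*[-*]\s+\S+", re.MULTILINE)

async def format_compliance_checker(query: str, response: str) -> GraderScore:
    """Weighted check for bullet formatting and length, with an explicit pass threshold."""
    bullet_score = 1.0 if BULLET_PATTERN.search(response) else 0.0  # weight 0.6
    length_score = min(len(response) / 150.0, 1.0)                  # weight 0.4
    overall = 0.6 * bullet_score + 0.4 * length_score
    passed = overall >= 0.6  # explicit pass/fail boundary
    return GraderScore(
        name="format_compliance_checker",
        score=overall,
        reason=f"{'PASS' if passed else 'FAIL'} (bullets={bullet_score}, length={length_score:.2f})",
    )

# Wrap it the same way as the other Code-Based examples
format_compliance_grader = FunctionGrader(
    func=format_compliance_checker,
    name="format_compliance",
    mode="pointwise"
)
```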
Listwise Code-Based Example: Multi-factor Ranker
from openjudge.graders.function_grader import FunctionGrader
from openjudge.graders.schema import GraderRank

async def multi_factor_ranker(query: str, response_1: str, response_2: str) -> GraderRank:
    """Rank responses based on multiple factors."""

    def calculate_score(response):
        # Factor 1: Length (0-0.3 weight)
        length_score = min(len(response) / 200.0, 1.0) * 0.3
        # Factor 2: Keyword density (0-0.4 weight)
        keywords = ["accurate", "complete", "clear", "relevant"]
        keyword_count = sum(1 for kw in keywords if kw.lower() in response.lower())
        keyword_score = (keyword_count / len(keywords)) * 0.4
        # Factor 3: Structure indicators (0-0.3 weight)
        structure_indicators = [". ", "! ", "? ", "\n\n"]
        structure_count = sum(response.count(indicator) for indicator in structure_indicators)
        structure_score = min(structure_count / 10.0, 1.0) * 0.3
        return length_score + keyword_score + structure_score

    # Calculate scores
    score_1 = calculate_score(response_1)
    score_2 = calculate_score(response_2)

    # Rank based on scores
    if score_1 > score_2:
        rank = [1, 2]
        reason = f"Response 1 scored {score_1:.2f} vs Response 2 scored {score_2:.2f}"
    elif score_2 > score_1:
        rank = [2, 1]
        reason = f"Response 2 scored {score_2:.2f} vs Response 1 scored {score_1:.2f}"
    else:
        rank = [1, 2]  # Tie goes to first response
        reason = f"Both responses scored {score_1:.2f}"

    return GraderRank(
        name="multi_factor_ranker",
        rank=rank,
        reason=reason
    )

# Create the grader
multi_factor_grader = FunctionGrader(
    func=multi_factor_ranker,
    name="multi_factor_ranking",
    mode="listwise"
)
Validating Your Custom Graders
After implementing your custom grader, it's crucial to validate that it measures what you intend to measure and produces reliable, consistent results.
For comprehensive guidance on validating your graders and generating detailed validation reports, please refer to the Grader Analysis documentation. This document covers statistical analysis techniques for understanding grader behavior, validation against ground truth data, error analysis to identify specific weaknesses, and building comprehensive validation strategies.
The validation process helps you ensure your grader produces accurate results, measure consistency and reliability, identify potential biases in evaluation, and optimize grader performance based on empirical evidence.
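As a minimal taste of what validation against ground truth can look like, the snippet below compares a handful of grader scores with human labels. The data, the 0.5 threshold, and the metrics are placeholders; the Grader Analysis documentation covers the full tooling.

```python
# Hypothetical spot-check: compare grader scores against human labels.
# grader_scores and human_labels are placeholder data for illustration.
grader_scores = [0.9, 0.2, 0.7, 0.4]
human_labels = [1.0, 0.0, 1.0, 0.0]  # 1.0 = helpful, 0.0 = not helpful

# Agreement rate when both are thresholded at 0.5
agreement = sum(
    (g >= 0.5) == (h >= 0.5) for g, h in zip(grader_scores, human_labels)
) / len(human_labels)

# Mean absolute error between grader scores and labels
mae = sum(abs(g - h) for g, h in zip(grader_scores, human_labels)) / len(human_labels)

print(f"Agreement: {agreement:.0%}, MAE: {mae:.2f}")
```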
Running Your Custom Graders
Once you've built and validated your custom graders, you can run them using the GradingRunner. This component orchestrates the execution of multiple graders across your dataset, handles concurrency, transforms data as needed, and organizes the results for analysis.
When running graders, focus on configuring data mappers to connect your dataset fields with grader inputs, setting concurrency levels for optimal performance, combining results with aggregators for comprehensive scoring, and handling errors gracefully to prevent complete task failures.
Next Steps
- Generate Graders from Data — Automate grader creation from labeled examples
- Run Grading Tasks — Evaluate your models at scale
- Grader Analysis — Validate and analyze grader results