OpenJudge provides 50+ pre-built graders for evaluating AI responses across quality dimensions, agent behaviors, formats, and modalities. All graders are rigorously evaluated on benchmark datasets to ensure reliability and accuracy. For installation, environment setup, and running your first evaluation, see the Quick Start Guide.
Key Features
- Multi-Scenario Coverage: Extensive support for diverse domains including agent, text, code, math, and multimodal tasks via specialized graders. Each category provides multiple graders targeting different evaluation dimensions.
- Holistic Agent Evaluation: Beyond final outcomes, we assess the entire agent lifecycle, including trajectories and specific components such as Memory, Reflection, Tool Use, Planning, and Action Selection.
- Quality Assurance: Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation, and graders are continuously tested to ensure consistent and accurate results.
- Unified API Design: All graders follow a consistent interface with an `aevaluate()` method, returning standardized `GraderScore` objects with `score`, `reason`, and `metadata` fields for seamless integration (see the sketch after this list).
- Flexible Implementation: Choose between LLM-based graders for nuanced quality assessment or code-based graders for fast, deterministic, zero-cost evaluation, depending on your requirements.
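A minimal sketch of the unified interface, using `RelevanceGrader`. Only the `aevaluate()` method name and the `GraderScore` fields come from the description above; the import path, constructor, and keyword arguments are assumptions to illustrate the pattern.

```python
import asyncio

# Import path is an assumption; adjust to the installed OpenJudge layout.
from openjudge.graders import RelevanceGrader


async def main() -> None:
    grader = RelevanceGrader()  # LLM-based graders may also take a model/client argument

    # Keyword names here are illustrative; check the grader's actual signature.
    result = await grader.aevaluate(
        query="What is the capital of France?",
        response="Paris is the capital of France.",
    )

    # Every grader returns a standardized GraderScore.
    print(result.score)     # e.g. a 1-5 relevance rating
    print(result.reason)    # natural-language justification
    print(result.metadata)  # grader-specific details


asyncio.run(main())
```

Because every grader returns the same object shape, downstream aggregation and reporting code does not need per-grader branches.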
Available Graders
Choose the right grader for your evaluation needs. OpenJudge organizes graders by evaluation focus, making it easy to find the right one for a specific task.
Implementation Types
- LLM-Based graders: Nuanced quality assessment using LLM judges, suitable for subjective evaluation
- Code-Based graders: Fast, deterministic, zero-cost evaluation using algorithms (see the sketch below)
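Both implementation types expose the same `aevaluate()` interface; the difference is what happens underneath. A minimal sketch with the code-based `NumberAccuracyGrader`, assuming a hypothetical `tolerance` constructor argument and illustrative keyword names:

```python
import asyncio

from openjudge.graders import NumberAccuracyGrader  # assumed import path

# 'tolerance' is a hypothetical constructor argument for illustration.
grader = NumberAccuracyGrader(tolerance=1e-3)

# Runs locally and deterministically; no LLM call is made.
result = asyncio.run(grader.aevaluate(
    response="3.1416",
    reference="3.14159",
))
print(result.score)  # {0, 1}: 1 if the values agree within tolerance
```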
General Graders
Evaluate fundamental response quality including relevance, safety, and correctness. → Detailed Documentation
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `RelevanceGrader` | Evaluates how relevant a response is to the user's query | LLM-Based | 1-5 |
| `HallucinationGrader` | Detects fabricated information not supported by context | LLM-Based | 1-5 |
| `HarmfulnessGrader` | Identifies harmful, offensive, or inappropriate content | LLM-Based | 1-5 |
| `InstructionFollowingGrader` | Checks if response follows given instructions | LLM-Based | 1-5 |
| `CorrectnessGrader` | Verifies response matches reference answer | LLM-Based | 1-5 |
Agent Graders
Comprehensive evaluation for AI agents across the entire lifecycle. → Detailed Documentation
Action Graders
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `ActionAlignmentGrader` | Evaluates if agent actions align with goals | LLM-Based | {0, 1} |
| `ActionLoopDetectionGrader` | Detects repetitive action loops | Code-Based | {0, 1} |
Tool Graders
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `ToolSelectionGrader` | Evaluates appropriateness of tool selection | LLM-Based | 1-5 |
| `ToolCallAccuracyGrader` | Checks tool call correctness | LLM-Based | 1-5 |
| `ToolCallSequenceMatchGrader` | Validates tool call sequence | Code-Based | {0, 1} |
| `ToolCallSuccessGrader` | Checks if tool calls succeeded | LLM-Based | {0, 1} |
| `ToolParameterCheckGrader` | Validates tool parameters | LLM-Based | {0, 1} |
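A sketch of validating an agent's tool call order with the code-based `ToolCallSequenceMatchGrader`. The keyword names and the record format (plain tool names in call order) are assumptions; consult the detailed documentation for the expected schema.

```python
import asyncio

from openjudge.graders import ToolCallSequenceMatchGrader  # assumed import path

grader = ToolCallSequenceMatchGrader()

# Hypothetical record format: tool names listed in call order.
result = asyncio.run(grader.aevaluate(
    tool_calls=["search_flights", "book_flight", "send_confirmation"],
    reference_calls=["search_flights", "book_flight", "send_confirmation"],
))
print(result.score)  # {0, 1}: 1 when the sequence matches the reference
```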
Memory Graders
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `MemoryAccuracyGrader` | Evaluates accuracy of stored memories | LLM-Based | {0, 1} |
| `MemoryDetailPreservationGrader` | Checks if important details are preserved | LLM-Based | {0, 1} |
| `MemoryRetrievalEffectivenessGrader` | Evaluates memory retrieval quality | LLM-Based | {0, 1} |
Plan & Reflection Graders
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `PlanFeasibilityGrader` | Evaluates if plans are executable | LLM-Based | {0, 1} |
| `ReflectionAccuracyGrader` | Checks accuracy of agent reflections | LLM-Based | {0, 1} |
| `ReflectionOutcomeUnderstandingGrader` | Evaluates understanding of outcomes | LLM-Based | {0, 1} |
| `ReflectionProgressAwarenessGrader` | Checks awareness of task progress | LLM-Based | {0, 1} |
Observation Graders
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `ObservationInformationGainGrader` | Evaluates information gain from observations | Code-Based | [0, 1] |
Trajectory Graders
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `TrajectoryComprehensiveGrader` | Comprehensive trajectory evaluation | LLM-Based | {0, 1} |
Text Graders
Fast, deterministic text comparison using various algorithms. → Detailed Documentation
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `SimilarityGrader` | Text similarity with 15+ algorithms (BLEU, ROUGE, F1, etc.) | Code-Based | [0, 1] |
| `StringMatchGrader` | String matching (exact, prefix, suffix, regex, etc.) | Code-Based | {0, 1} |
| `NumberAccuracyGrader` | Compares numerical values with tolerance | Code-Based | {0, 1} |
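A sketch of text comparison with `SimilarityGrader`. The table above only states that 15+ algorithms (BLEU, ROUGE, F1, etc.) are supported; the `algorithm` parameter name and its value here are assumptions.

```python
import asyncio

from openjudge.graders import SimilarityGrader  # assumed import path

# 'algorithm' is a hypothetical parameter; the docs list BLEU, ROUGE, F1, etc.
grader = SimilarityGrader(algorithm="rouge_l")

result = asyncio.run(grader.aevaluate(
    response="The cat sat on the mat.",
    reference="A cat was sitting on the mat.",
))
print(result.score)  # [0, 1] similarity score
```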
Code Graders
Evaluate code quality, syntax, and execution. → Detailed Documentation
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `CodeExecutionGrader` | Executes code against test cases | Code-Based | [0, 1] |
| `SyntaxCheckGrader` | Validates Python syntax using AST | Code-Based | {0, 1} |
| `CodeStyleGrader` | Checks indentation and naming conventions | Code-Based | [0, 1] |
| `PatchSimilarityGrader` | Compares code patches using SequenceMatcher | Code-Based | [0, 1] |
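A sketch of running generated code against test cases with `CodeExecutionGrader`. The keyword names and test-case format are assumptions for illustration; the [0, 1] score range suggests pass-rate semantics.

```python
import asyncio

from openjudge.graders import CodeExecutionGrader  # assumed import path

grader = CodeExecutionGrader()

# Hypothetical test-case format: call expressions with expected outputs.
result = asyncio.run(grader.aevaluate(
    code="def add(a, b):\n    return a + b",
    test_cases=[
        {"input": "add(1, 2)", "expected": "3"},
        {"input": "add(-1, 1)", "expected": "0"},
    ],
))
print(result.score)  # [0, 1], e.g. the fraction of test cases passed
```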
Math Graders
Verify mathematical expressions and computations. → Detailed Documentation
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `MathExpressionVerifyGrader` | Verifies math expressions (LaTeX & plain) | Code-Based | {0, 1} |
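A sketch of checking mathematical equivalence with `MathExpressionVerifyGrader`, which per the table handles both LaTeX and plain expressions. Keyword names are assumptions.

```python
import asyncio

from openjudge.graders import MathExpressionVerifyGrader  # assumed import path

grader = MathExpressionVerifyGrader()

# LaTeX and plain-text forms of the same value should verify as equivalent.
result = asyncio.run(grader.aevaluate(
    response=r"\frac{1}{2}",
    reference="0.5",
))
print(result.score)  # {0, 1}: 1 if the expressions are mathematically equal
```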
Format Graders
Validate structured outputs and formatting. → Detailed Documentation
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `JsonValidatorGrader` | Validates JSON syntax | Code-Based | {0, 1} |
| `JsonMatchGrader` | Deep comparison of JSON structures | Code-Based | {0, 1} |
| `LengthPenaltyGrader` | Penalizes too short/long responses | Code-Based | ≤0 (penalty) |
| `NgramRepetitionPenaltyGrader` | Penalizes repetitive n-grams | Code-Based | ≤0 (penalty) |
| `ReasoningFormatGrader` | Checks `<think>` and `<answer>` tags | Code-Based | {0, 1} |
| `ReasoningToolCallFormatGrader` | Validates tool call format with JSON | Code-Based | {0, 1} |
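A sketch of deep JSON comparison with `JsonMatchGrader`. Keyword names are assumptions; a deep structural match is expected to ignore key order, unlike a plain string comparison.

```python
import asyncio

from openjudge.graders import JsonMatchGrader  # assumed import path

grader = JsonMatchGrader()

# Same structure, different key order: a deep match should still score 1.
result = asyncio.run(grader.aevaluate(
    response='{"name": "Ada", "age": 36}',
    reference='{"age": 36, "name": "Ada"}',
))
print(result.score)  # {0, 1}
```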
Multimodal Graders
Evaluate vision-language tasks and image quality. → Detailed Documentation
| Grader | Description | Type | Score Range |
|---|---|---|---|
| `ImageCoherenceGrader` | Evaluates image-text coherence | LLM-Based | {0, 1} |
| `ImageHelpfulnessGrader` | Assesses if images help understanding | LLM-Based | {0, 1} |
| `TextToImageGrader` | Evaluates text-to-image generation quality | LLM-Based | {0, 1} |
| `ImageEditingGrader` | Evaluates image editing quality | LLM-Based | {0, 1} |
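A sketch of an image-text coherence check with `ImageCoherenceGrader`. The accepted image input form (URL, file path, or raw bytes) and the keyword names are assumptions.

```python
import asyncio

from openjudge.graders import ImageCoherenceGrader  # assumed import path

grader = ImageCoherenceGrader()

# Image input form (URL, path, or bytes) is an assumption here.
result = asyncio.run(grader.aevaluate(
    text="A golden retriever playing in the snow.",
    image="https://example.com/retriever.jpg",
))
print(result.score)   # {0, 1}: 1 if the text and image are coherent
print(result.reason)  # the LLM judge's justification
```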
Next Steps
Explore Graders by Category:
- General Graders — Quality assessment (Relevance, Hallucination, Harmfulness, Instruction Following, Correctness)
- Agent Graders — Agent evaluation (Action, Tool, Memory, Plan, Reflection, Trajectory)
- Text Graders — Text similarity and matching (15+ algorithms)
- Code & Math Graders — Code execution and math verification
- Format Graders — Structure validation (JSON, Length, Repetition, Reasoning Format)
- Multimodal Graders — Vision and image tasks
Advanced Usage:
- Run Evaluation Tasks — Batch processing and reporting
- Create Custom Graders — Build domain-specific evaluators