Overview
Why OpenJudge?
OpenJudge is a unified framework designed to drive LLM and Agent application excellence through Holistic Evaluation and Quality Rewards.
Evaluation and reward signals are the cornerstones of application excellence. Holistic evaluation enables the systematic analysis of shortcomings to drive rapid iteration, while high-quality rewards provide the essential foundation for advanced optimization and fine-tuning.
OpenJudge unifies evaluation metrics and reward signals into a single, standardized Grader interface, offering pre-built graders, flexible customization, and seamless framework integration.
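For a sense of the programming model, here is a minimal, self-contained sketch of the "one grader, one normalized score" idea. The class names (KeywordCoverageGrader, GraderResult) and the async grade() method are illustrative assumptions, not the actual OpenJudge interface.

```python
# Hypothetical sketch of a unified grader call.
# Class/method names are illustrative assumptions, not the exact OpenJudge API.
import asyncio


class GraderResult:
    """Illustrative result container: a normalized score plus an explanation."""

    def __init__(self, score: float, reason: str = ""):
        self.score = score
        self.reason = reason


class KeywordCoverageGrader:
    """Toy grader: fraction of required keywords present in the response."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    async def grade(self, response: str) -> GraderResult:
        hits = [k for k in self.keywords if k in response.lower()]
        score = len(hits) / len(self.keywords) if self.keywords else 1.0
        return GraderResult(score, f"matched {hits}")


async def main():
    grader = KeywordCoverageGrader(["refund", "7 days"])
    result = await grader.grade("You can request a refund within 7 days.")
    print(result.score, result.reason)  # 1.0 matched ['refund', '7 days']


asyncio.run(main())
```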
Key Features
- Systematic & Quality-Assured Grader Library: Access 50+ production-ready graders, organized in a comprehensive taxonomy and rigorously validated for reliable performance.
  - Multi-Scenario Coverage: Extensive support for diverse domains including Agent, text, code, math, and multimodal tasks via specialized graders. Explore Supported Scenarios →
  - Holistic Agent Evaluation: Beyond final outcomes, we assess the entire lifecycle, including trajectories and specific components (Memory, Reflection, Tool Use). Agent Lifecycle Evaluation →
  - Quality Assurance: Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation. View Benchmark Datasets →
- Flexible Grader Building: Choose the build method that fits your requirements:
  - Customization: Easily extend or modify pre-defined graders to fit your specific needs. Custom Grader Development Guide →
  - Data-Driven Rubrics: Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) from your data. Automatic Rubric Generation Tutorial →
  - Training Judge Models: For large-scale and specialized scenarios, we are developing the capability to train dedicated Judge models. Support for SFT, Bradley-Terry models, and Reinforcement Learning workflows is on the way to help you build high-performance, domain-specific graders. 🚧 Coming Soon
- Easy Integration: We're actively building seamless connectors for mainstream observability platforms and training frameworks. Stay tuned! 🚧 Coming Soon
Quick Tutorials
Evaluate An AI Agent
Comprehensive evaluation for AI Agents: Learn to evaluate the full lifecycle (final response, trajectory, tool usage, plan, memory, reflection, and observation) using OpenJudge Graders.
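To make lifecycle evaluation concrete, the toy checks below score tool usage and response groundedness over a hand-written trajectory. The step schema and scoring rules are assumptions for illustration only; the tutorial covers the real graders.

```python
# Illustrative sketch of trajectory-level checks in the spirit of agent
# lifecycle grading. The data layout and names are assumptions, not OpenJudge's schema.
trajectory = [
    {"type": "plan", "content": "Look up the weather, then answer."},
    {"type": "tool_call", "name": "get_weather", "args": {"city": "Paris"}},
    {"type": "observation", "content": "18°C, partly cloudy"},
    {"type": "response", "content": "It is 18°C and partly cloudy in Paris."},
]

expected_tools = {"get_weather"}


def tool_usage_score(steps, expected):
    """Fraction of expected tools that were actually called."""
    called = {s["name"] for s in steps if s["type"] == "tool_call"}
    return len(called & expected) / len(expected) if expected else 1.0


def grounded_response_score(steps):
    """1.0 if the final response reuses content from an observation, else 0.0."""
    observations = [s["content"] for s in steps if s["type"] == "observation"]
    final = next(s["content"] for s in reversed(steps) if s["type"] == "response")
    return 1.0 if any(obs.split(",")[0] in final for obs in observations) else 0.0


print("tool usage:", tool_usage_score(trajectory, expected_tools))  # 1.0
print("groundedness:", grounded_response_score(trajectory))         # 1.0
```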
Build Rewards for Training
Construct High-Quality Reward Signals: Create robust reward functions for model and agent alignment by aggregating diverse graders with custom weighting and high-concurrency support.
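As a rough sketch of the aggregation idea, the snippet below combines two toy scoring functions into a single weighted reward. The grader functions and weights are assumptions, not OpenJudge components.

```python
# Hedged sketch: combining several grader scores into one reward with custom
# weights. The scoring functions here are toy stand-ins, not OpenJudge APIs.
def length_penalty(response: str, max_chars: int = 200) -> float:
    return 1.0 if len(response) <= max_chars else max_chars / len(response)


def keyword_score(response: str, keywords: list[str]) -> float:
    hits = sum(1 for k in keywords if k.lower() in response.lower())
    return hits / len(keywords) if keywords else 1.0


def reward(response: str) -> float:
    """Weighted sum of individual grader scores, normalized to [0, 1]."""
    weights = {"keywords": 0.7, "length": 0.3}
    scores = {
        "keywords": keyword_score(response, ["refund", "7 days"]),
        "length": length_penalty(response),
    }
    return sum(weights[name] * scores[name] for name in weights)


print(reward("Refunds are available within 7 days of purchase."))  # 1.0
```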
More Tutorials
Built-in Graders
Agent
Agent graders for evaluating various aspects of AI agent behavior. These graders assess action selection, tool usage, memory management, planning, reflection, and overall trajectory quality.
General Tasks
Assess fundamental capabilities such as instruction following, text quality, safety guardrails, and format.
Multimodal
Vision-language graders for evaluating AI responses involving images. These graders assess image-text coherence, image helpfulness, and text-to-image generation quality.
Math & Code
Specialized graders for evaluating code generation and mathematical problem-solving capabilities. These graders assess syntax correctness, execution results, code style, and mathematical expression accuracy.
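As a standalone illustration of the kind of check a math grader performs (not OpenJudge code), the snippet below tests whether two expressions are symbolically equivalent using sympy.

```python
# Standalone illustration of math-answer checking via symbolic equivalence
# (not OpenJudge code). Requires sympy.
from sympy import simplify, sympify


def math_equivalent(predicted: str, reference: str) -> bool:
    """True if the two expressions simplify to the same value."""
    try:
        return simplify(sympify(predicted) - sympify(reference)) == 0
    except Exception:
        return False


print(math_equivalent("2*x + 2*x", "4*x"))            # True
print(math_equivalent("(x+1)**2", "x**2 + 2*x + 1"))  # True
print(math_equivalent("3/4", "0.75"))                 # True
print(math_equivalent("2*x", "3*x"))                  # False
```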
Text
Algorithm-based graders for text similarity and matching. Fast, deterministic, and zero-cost evaluation using BLEU, ROUGE, F1, regex, and 15+ similarity algorithms.
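For intuition, here is a self-contained token-level F1 computation of the sort these algorithmic graders rely on; it is a plain illustration, not the library's implementation.

```python
# Standalone illustration of an algorithm-based text metric: token-level F1
# overlap between a prediction and a reference (not OpenJudge code).
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat on the mat", "the cat is on the mat"))  # ~0.83
```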
Format
Format validation graders for structured outputs. Validate JSON syntax, check length constraints, detect repetition, and verify reasoning tags for chain-of-thought.
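The checks below sketch what such validations amount to in plain Python (JSON validity, length bounds, presence of a reasoning block); they are illustrative, not OpenJudge's format graders.

```python
# Standalone sketch of format checks similar in spirit to format graders
# (not OpenJudge code): JSON validity, length bounds, and reasoning-tag presence.
import json
import re


def json_valid(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def within_length(text: str, min_chars: int = 1, max_chars: int = 2000) -> bool:
    return min_chars <= len(text) <= max_chars


def has_reasoning_tags(text: str) -> bool:
    """Check for a <think>...</think> block before the final answer."""
    return re.search(r"<think>.*?</think>", text, flags=re.DOTALL) is not None


output = '<think>The user asks for JSON.</think>{"answer": 42}'
print(json_valid(output.split("</think>")[-1]))  # True
print(within_length(output))                     # True
print(has_reasoning_tags(output))                # True
```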
Build Graders
Customization
Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader.
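Below is a hedged sketch of what a rule-based custom grader might look like; BaseGrader and the grade() signature are placeholders, so follow the Custom Grader Development Guide for the actual interface.

```python
# Hedged sketch of a rule-based custom grader. BaseGrader and the grade()
# signature are illustrative assumptions, not OpenJudge's actual interface.
from dataclasses import dataclass


@dataclass
class Score:
    value: float
    reason: str


class BaseGrader:
    """Stand-in for a framework base class."""

    def grade(self, response: str, **kwargs) -> Score:
        raise NotImplementedError


class NoApologyGrader(BaseGrader):
    """Explicit business rule: customer-support replies must not over-apologize."""

    BANNED = ("we apologize", "sorry for the inconvenience")

    def grade(self, response: str, **kwargs) -> Score:
        hits = [p for p in self.BANNED if p in response.lower()]
        if hits:
            return Score(0.0, f"contains banned phrases: {hits}")
        return Score(1.0, "no banned phrases")


print(NoApologyGrader().grade("Your order ships tomorrow.").value)  # 1.0
```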
Data-Driven Rubrics
Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation Rubrics from your annotated data and generate an LLM-based grader.
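The snippet below illustrates the workflow with hypothetical inputs and outputs: a few labeled examples in, explicit rubrics out. The data layout and the example rubrics are assumptions, not output produced by GraderGenerator.

```python
# Hypothetical illustration of the data-driven rubric workflow (not OpenJudge's
# actual API). You supply a handful of labeled examples; the generator distills
# them into explicit, white-box rubrics that an LLM-based grader then applies.
annotated_examples = [
    {"response": "Your refund will be processed within 7 business days.", "label": "good"},
    {"response": "idk, check the website", "label": "bad"},
    {"response": "Refunds take 7 days; I've opened a ticket for you.", "label": "good"},
]

# The kind of rubrics such a tool might distill from the examples above
# (illustrative output, not generated by OpenJudge):
generated_rubrics = [
    "States a concrete refund timeline (e.g., number of days).",
    "Uses complete, professional sentences; no filler like 'idk'.",
    "Offers a follow-up action such as a ticket or escalation when relevant.",
]

for i, rubric in enumerate(generated_rubrics, 1):
    print(f"Rubric {i}: {rubric}")
```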
Trainable Judge Model
Have massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short.
Integrations
Evaluation Frameworks
🚧 Work in Progress. Seamlessly connect with mainstream platforms like LangSmith and LangFuse. Streamline your evaluation pipelines and monitor agent performance with flexible APIs.
Training Frameworks
🚧 Work in Progress. Integrate directly into training frameworks such as VERL. Use Graders as high-quality reward functions for RLHF/RLAIF to align models effectively.
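As a generic sketch of the idea, the wrapper below exposes a toy grader score through a reward hook; the function signature is an assumption for illustration and not VERL's actual interface.

```python
# Generic sketch of exposing a grader score as an RL reward hook. The function
# signature is an assumption for illustration, not VERL's actual interface.
def grader_score(response: str, ground_truth: str) -> float:
    """Toy grader: exact match after whitespace/case normalization."""
    return 1.0 if response.strip().lower() == ground_truth.strip().lower() else 0.0


def compute_reward(prompt: str, response: str, ground_truth: str) -> float:
    """Reward hook a training loop could call for each sampled response."""
    return grader_score(response, ground_truth)


print(compute_reward("What is 2+2?", "4", "4"))     # 1.0
print(compute_reward("What is 2+2?", "five", "4"))  # 0.0
```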
Applications
Data Refinement
Automate the curation of high-quality datasets. Use Graders to filter, rank, and synthesize training data for Supervised Fine-Tuning (SFT).
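The sketch below shows the general pattern with a toy heuristic scorer standing in for a real grader: score the candidates, then keep only the high-scoring ones.

```python
# Illustrative sketch of grader-driven data curation for SFT (toy scorer,
# not OpenJudge code): score candidate samples, keep the top ones.
candidates = [
    {"prompt": "Explain HTTP caching.",
     "response": "HTTP caching stores responses so repeated requests are served "
                 "faster, controlled by headers like Cache-Control and ETag."},
    {"prompt": "Explain HTTP caching.",
     "response": "it caches stuff"},
    {"prompt": "Explain HTTP caching.",
     "response": "Caching reuses stored responses; Cache-Control sets freshness "
                 "and ETag enables validation."},
]


def quality_score(sample: dict) -> float:
    """Toy heuristic: reward mentions of key concepts and adequate length."""
    text = sample["response"].lower()
    concepts = sum(k in text for k in ("cache-control", "etag"))
    length_ok = 1.0 if len(text) > 40 else 0.0
    return 0.4 * length_ok + 0.3 * concepts


curated = [s for s in sorted(candidates, key=quality_score, reverse=True)
           if quality_score(s) >= 0.7]
print(len(curated), "samples kept for SFT")  # 2 samples kept for SFT
```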
Pairwise Evaluation
Compare and rank multiple model outputs using LLM-based pairwise comparisons. Compute win rates, generate win matrices, and identify the best-performing models.
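The standalone snippet below shows how pairwise judgments roll up into win rates and a win matrix; the judgments are hard-coded here, whereas in practice an LLM judge would produce them.

```python
# Standalone sketch of aggregating pairwise judgments into win rates and a win
# matrix (judgments are hard-coded; in practice an LLM judge supplies them).
from collections import defaultdict

models = ["model_a", "model_b", "model_c"]

# Each record: (model_x, model_y, winner) for one prompt.
judgments = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_c", "model_c"),
    ("model_b", "model_c", "model_c"),
    ("model_a", "model_b", "model_a"),
]

wins = defaultdict(lambda: defaultdict(int))  # wins[winner][loser] = count
totals = defaultdict(int)                     # comparisons each model took part in
for x, y, winner in judgments:
    wins[winner][y if winner == x else x] += 1
    totals[x] += 1
    totals[y] += 1

for m in models:
    win_count = sum(wins[m].values())
    print(f"{m}: win rate {win_count / totals[m]:.2f}")

print(dict(wins["model_c"]))  # row of the win matrix: who model_c beat, how often
```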
Running Graders
Run Grading Tasks
Orchestrate evaluations at scale with GradingRunner. Configure data mapping, control concurrency, and aggregate results from multiple graders into unified scores.
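Under stated assumptions (toy graders, asyncio for concurrency), the sketch below shows what such orchestration boils down to; GradingRunner's real configuration options differ, so consult its documentation.

```python
# Generic sketch of concurrent grading orchestration (toy graders, asyncio);
# GradingRunner's actual configuration differs -- see its documentation.
import asyncio


async def length_grader(sample: dict) -> float:
    return 1.0 if len(sample["response"]) >= 20 else 0.0


async def keyword_grader(sample: dict) -> float:
    return 1.0 if "refund" in sample["response"].lower() else 0.0


async def grade_sample(sample: dict, graders, weights) -> float:
    """Run all graders on one sample and aggregate into a weighted score."""
    scores = await asyncio.gather(*(g(sample) for g in graders))
    return sum(w * s for w, s in zip(weights, scores))


async def main():
    dataset = [
        {"response": "Refunds are issued within 7 business days."},
        {"response": "No."},
    ]
    graders, weights = [length_grader, keyword_grader], [0.5, 0.5]
    sem = asyncio.Semaphore(8)  # cap concurrency

    async def bounded(sample):
        async with sem:
            return await grade_sample(sample, graders, weights)

    results = await asyncio.gather(*(bounded(s) for s in dataset))
    print(results)  # [1.0, 0.0]


asyncio.run(main())
```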
Analyze Grader Results
Transform raw scores into actionable insights. Examine score distributions, measure consistency, and compare performance against ground truth labels.
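As a minimal example of this kind of analysis (not OpenJudge code), the snippet below computes summary statistics for a batch of scores and their agreement with binary ground-truth labels.

```python
# Standalone sketch of turning raw grader scores into summary statistics and
# agreement with ground-truth labels (not OpenJudge code).
from statistics import mean, stdev

scores = [0.9, 0.8, 0.2, 0.95, 0.1, 0.85]  # grader outputs in [0, 1]
ground_truth = [1, 1, 0, 1, 0, 1]          # human pass/fail labels

print(f"mean={mean(scores):.2f} stdev={stdev(scores):.2f}")

# Agreement: binarize grader scores at 0.5 and compare with human labels.
predictions = [1 if s >= 0.5 else 0 for s in scores]
accuracy = sum(p == g for p, g in zip(predictions, ground_truth)) / len(scores)
print(f"agreement with ground truth: {accuracy:.2f}")  # 1.00
```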