After successfully running grading tasks on your dataset, the next crucial step is to analyze the grader results to understand how well your AI models or agents are performing. Grader results analysis helps you gain insights from the evaluation results and generate comprehensive reports about model or agent performance.
Why Grader Results Analysis Matters
Grader results analysis is essential because it transforms raw evaluation scores into actionable insights about your AI models or agents. While running graders produces individual scores or rankings, analysis supplies the context needed to understand what those results mean for your model's real-world performance.
Through the analysis, you can:

- Identify model strengths and weaknesses across different evaluation dimensions
- Measure overall performance consistency and reliability
- Detect potential biases or blind spots in model behavior
- Generate data-driven insights for model improvement
More importantly, the analysis enables iterative optimization of the graders themselves. By analyzing the results, you can:

- Refine grader criteria based on observed performance patterns
- Adjust evaluation thresholds or parameters for better discrimination
- Compare different graders' effectiveness and select the most appropriate one for your use case
How to Do Grader Results Analysis
Grader results analysis approaches can be broadly categorized into two types based on data availability: statistical analysis without ground truth and comparative analysis with ground truth. Each approach offers unique insights into your model's performance characteristics.
Statistical Analysis Without Ground Truth
In many cases, you won't have reference labels to compare against. In these scenarios, statistical analysis helps you understand your model's behavior patterns. This approach focuses on examining the intrinsic properties of the scores produced by your graders, revealing patterns that might indicate strengths or weaknesses in model performance.
Statistical Analysis Example
Here's an example of analyzing your model's performance distribution:
```python
from openjudge.analyzer.statistical.distribution_analyzer import DistributionAnalyzer
from openjudge.runner.grading_runner import GradingRunner
from openjudge.graders.common.correctness import CorrectnessGrader
from openjudge.models.openai_chat_model import OpenAIChatModel

# Prepare dataset
dataset = [
    {"query": "What is AI?", "response": "AI is artificial intelligence."},
    {"query": "What is ML?", "response": "ML is machine learning."},
    {"query": "What is DL?", "response": "DL is deep learning."},
]

# Configure grader
grader_configs = {
    "correctness": {
        "grader": CorrectnessGrader(model=OpenAIChatModel("qwen3-32b")),
        "mapper": {"query": "query", "response": "response"},
    }
}

# Run graders on the dataset (as described in run_tasks.md).
# `arun` is a coroutine, so run this inside an async context
# (e.g. an `async def main()` driven by `asyncio.run(main())`).
runner = GradingRunner(grader_configs=grader_configs)
results = await runner.arun(dataset)

# Analyze score distribution to understand model performance
analyzer = DistributionAnalyzer()
report = analyzer.analyze(dataset, results["correctness"])

print(f"Mean score: {report.mean}")
print(f"Standard deviation: {report.stdev}")
print(f"Score range: {report.min_score} to {report.max_score}")
```
This distribution analysis helps you understand how your model performs across different inputs. If all scores cluster closely together, your model may have limited variability in its responses; if scores are spread widely, performance likely varies across inputs, which can reveal where your model excels or struggles.
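As a concrete follow-up, here is a minimal sketch that turns the report fields printed above into a rough verdict. The 0.15 spread threshold and the assumption of a 0-to-1 score scale are illustrative choices for this sketch, not OpenJudge defaults:

```python
# Rough, illustrative interpretation of the distribution report above.
# The 0.15 threshold assumes scores on a 0-to-1 scale; adjust it to the
# scale your grader actually uses.
LOW_SPREAD_THRESHOLD = 0.15

if report.stdev < LOW_SPREAD_THRESHOLD:
    # Tight clustering: either performance is consistent or the grader
    # is not discriminating between responses.
    print(f"Scores cluster tightly around {report.mean} (stdev={report.stdev})")
else:
    # Wide spread: inspect low-scoring samples to find weak spots.
    print(f"Scores vary widely: {report.min_score} to {report.max_score} (stdev={report.stdev})")
```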
Built-in Statistical Analysis
OpenJudge provides several built-in statistical analyzers for examining model performance without ground truth:
| Analyzer | Functionality |
|---|---|
| DistributionAnalyzer | Examines the distribution of scores across the dataset, including mean, standard deviation, min, and max values to understand the range and variability of model performance |
| ConsistencyAnalyzer | Evaluates how consistently your model performs when presented with similar inputs or when the same input is evaluated multiple times |
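The analyze-and-report pattern is the same across these analyzers. As a minimal sketch, assuming ConsistencyAnalyzer lives alongside DistributionAnalyzer and exposes the same `analyze(dataset, results)` signature (verify the module path and arguments against your OpenJudge version), and reusing the dataset and results from the example above:

```python
# Hypothetical usage sketch: the module path and call signature are assumed
# to mirror DistributionAnalyzer and may differ in your OpenJudge version.
from openjudge.analyzer.statistical.consistency_analyzer import ConsistencyAnalyzer

consistency_analyzer = ConsistencyAnalyzer()
consistency_report = consistency_analyzer.analyze(dataset, results["correctness"])
print(consistency_report)
```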
Comparative Analysis With Ground Truth
When you have reference labels, you can perform more comprehensive analysis by comparing model performance against known standards. This approach provides direct measurements of how well your model aligns with ground truth or expert judgment and enables calculation of standard performance metrics like precision, recall, and F1 scores.
Comparative analysis is particularly powerful because it gives you concrete measures of how well your model's outputs align with human judgment or other authoritative sources of quality assessment.
Comparative Analysis Example
Here's an example of comparing your model's performance with ground truth labels:
```python
from openjudge.analyzer.validation.accuracy_analyzer import AccuracyAnalyzer
from openjudge.runner.grading_runner import GradingRunner
from openjudge.graders.common.correctness import CorrectnessGrader
from openjudge.models.openai_chat_model import OpenAIChatModel

# Dataset with ground truth labels for comparison
dataset = [
    {"query": "What is AI?", "response": "AI is artificial intelligence.", "correct_label": 1},
    {"query": "What is ML?", "response": "ML is machine learning.", "correct_label": 1},
    {"query": "What is DL?", "response": "Wrong answer", "correct_label": 0},
]

# Configure and run grader (as before, `arun` must be awaited in an async context)
grader_configs = {
    "correctness": {
        "grader": CorrectnessGrader(model=OpenAIChatModel("qwen3-32b")),
        "mapper": {"query": "query", "response": "response"},
    }
}
runner = GradingRunner(grader_configs=grader_configs)
results = await runner.arun(dataset)

# Analyze accuracy against the ground truth labels
analyzer = AccuracyAnalyzer()
accuracy_report = analyzer.analyze(
    dataset=dataset,
    grader_results=results["correctness"],
    label_path="correct_label",  # Path to ground truth in your data
)

print(f"Overall accuracy: {accuracy_report.accuracy}")
```
This comparative analysis tells you the percentage of times your model's evaluation matched the ground truth, providing a baseline performance measure. While accuracy alone doesn't tell the whole story, it serves as a foundational metric that helps you understand the general alignment of your model with reference standards.
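To see where those mismatches come from, you can line the grader results up against the labels yourself. The sketch below assumes each entry in `results["correctness"]` exposes a numeric `score` attribute and that scores at or above 0.5 count as a positive prediction; both are assumptions to adapt to your grader's actual output:

```python
# Illustrative sketch: surface the samples where the grader disagrees with
# the ground truth. The `.score` attribute and the 0.5 cutoff are
# assumptions about the grader result objects; adapt them to your setup.
for sample, grader_result in zip(dataset, results["correctness"]):
    predicted = 1 if grader_result.score >= 0.5 else 0
    if predicted != sample["correct_label"]:
        print(f"Disagreement on query {sample['query']!r}: "
              f"grader score {grader_result.score}, label {sample['correct_label']}")
```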
Built-in Comparative Analysis
OpenJudge provides several built-in comparative analyzers for examining model performance with ground truth:
| Analyzer | Functionality |
|---|---|
| AccuracyAnalyzer | Measures the accuracy of your model's evaluation when ground truth labels are available for comparison |
| F1ScoreAnalyzer | Calculates F1 scores balancing precision and recall for comprehensive evaluation, particularly useful for imbalanced datasets |
| FalsePositiveAnalyzer | Identifies instances where the model incorrectly identifies positive cases, helping to understand over-estimation patterns |
| FalseNegativeAnalyzer | Identifies instances where the model fails to detect actual positive cases, helping to understand under-estimation patterns |
| PrecisionAnalyzer | Calculates precision of the model's positive predictions compared to actual positive cases |
| RecallAnalyzer | Calculates recall of the model's ability to identify all actual positive cases |
| CorrelationAnalyzer | Evaluates the correlation between different metrics or evaluation criteria to understand relationships in model performance |
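These analyzers can be combined on the same grader results to build a fuller picture than accuracy alone. The sketch below assumes they follow the same constructor and `analyze(dataset, grader_results, label_path)` pattern as AccuracyAnalyzer; the module paths are guesses modeled on the AccuracyAnalyzer import, so check both against your OpenJudge installation:

```python
# Hypothetical sketch: run several comparative analyzers over the same
# grader results. Module paths and signatures are assumed to follow the
# AccuracyAnalyzer pattern shown earlier; verify them before use.
from openjudge.analyzer.validation.precision_analyzer import PrecisionAnalyzer
from openjudge.analyzer.validation.recall_analyzer import RecallAnalyzer
from openjudge.analyzer.validation.f1_score_analyzer import F1ScoreAnalyzer

for analyzer in (PrecisionAnalyzer(), RecallAnalyzer(), F1ScoreAnalyzer()):
    report = analyzer.analyze(
        dataset=dataset,
        grader_results=results["correctness"],
        label_path="correct_label",
    )
    print(f"{type(analyzer).__name__}: {report}")
```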
Next Steps
- Validate Graders — Ensure graders make accurate judgments
- RewardBench2 Validation — Validate against a multi-domain benchmark
- Refine Data Quality — Improve model outputs using grader feedback