Format graders for evaluating structural and formatting aspects of AI responses. These graders validate JSON structures, check length constraints, detect repetition, and verify specific output formats like reasoning tags.

Overview

| Grader | Purpose | Type | Score Range | Key Use Case |
|---|---|---|---|---|
| JsonValidatorGrader | Validates JSON syntax | Code-Based | {0, 1} | JSON output validation |
| JsonMatchGrader | Deep comparison of JSON structures | Code-Based | {0, 1} | API response matching |
| LengthPenaltyGrader | Penalizes too short/long responses | Code-Based | ≤ 0 (penalty) | Control response length |
| NgramRepetitionPenaltyGrader | Penalizes repetitive n-grams | Code-Based | ≤ 0 (penalty) | Detect text repetition |
| ReasoningFormatGrader | Checks <think> and <answer> tags | Code-Based | {0, 1} | Chain-of-thought format |
| ReasoningToolCallFormatGrader | Validates tool call format with JSON | Code-Based | {0, 1} | Agent tool calls |
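
Because all of these graders are code-based and share the async aevaluate interface used throughout this page, they can be evaluated together. The sketch below runs two of them concurrently with asyncio.gather; the grader classes and arguments are the ones documented in this section, while the combination and the summed score are just one possible usage pattern, not a required API.

```python
import asyncio
from openjudge.graders.format.json.json_validator import JsonValidatorGrader
from openjudge.graders.format.length_penalty import LengthPenaltyGrader

async def main():
    response = '{"status": "ok", "items": [1, 2, 3]}'

    # Each grader evaluates independently; gather runs them concurrently.
    json_grader = JsonValidatorGrader()
    length_grader = LengthPenaltyGrader(min_length=10, max_length=200)

    json_result, length_result = await asyncio.gather(
        json_grader.aevaluate(response=response),
        length_grader.aevaluate(response=response),
    )

    # One simple combined signal: JSON validity (0/1) plus the length penalty (<= 0).
    combined = json_result.score + length_result.score
    print(f"JSON: {json_result.score}, Length: {length_result.score}, Combined: {combined}")

asyncio.run(main())
```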

JSON Validation

This category provides graders for validating JSON syntax and comparing JSON structures.

JsonValidatorGrader

Validates whether a response is valid JSON, ensuring structured outputs can be parsed correctly. Use this grader when you need to verify structured data generation, validate API responses, or enforce JSON output requirements in your AI systems.

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| response | str | Yes | The text to validate as JSON |

Scoring:

- 1.0: Valid JSON that can be parsed
- 0.0: Invalid JSON or parse error

Example:

```python
import asyncio
from openjudge.graders.format.json.json_validator import JsonValidatorGrader

async def main():
    grader = JsonValidatorGrader()

    # Valid JSON
    result = await grader.aevaluate(
        response='{"name": "Alice", "age": 30, "skills": ["Python", "AI"]}',
    )

    print(f"Score: {result.score}")   # 1.0
    print(f"Reason: {result.reason}") # "Valid JSON"

    # Invalid JSON
    result = await grader.aevaluate(
        response='{"name": "Alice", "age": 30',  # Missing closing brace
    )

    print(f"Score: {result.score}")   # 0.0
    print(f"Reason: {result.reason}") # Error message

asyncio.run(main())
```
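
One way to act on the score is to gate downstream parsing on it. The helper below is a hypothetical usage pattern built only from the documented grader API plus the standard library; it is not part of the grader itself.

```python
import asyncio
import json
from openjudge.graders.format.json.json_validator import JsonValidatorGrader

async def parse_if_valid(response: str):
    """Hypothetical helper: parse the response only when the grader accepts it."""
    grader = JsonValidatorGrader()
    result = await grader.aevaluate(response=response)
    if result.score == 1.0:
        return json.loads(response)  # Safe: the grader already confirmed it parses
    return None  # Caller can retry generation or fall back

print(asyncio.run(parse_if_valid('{"ok": true}')))  # {'ok': True}
print(asyncio.run(parse_if_valid('{"ok": true')))   # None (missing closing brace)
```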

JsonMatchGrader

Performs deep structural comparison of JSON objects by recursively validating that two JSON structures match according to configurable rules. This grader is ideal for comparing generated JSON outputs against ground truth, verifying API responses, evaluating structured data accuracy, and testing JSON generation quality.

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| reference_response | str | Yes | Reference JSON string |
| response | str | Yes | Generated JSON to compare |
| strict_order | bool | No | Whether list order matters (default: True) |
| ignore_extra_keys | bool | No | Ignore extra keys in the response (default: False) |

Scoring:

- 1.0: JSON structures match completely
- 0.0: Structures differ or parse error

Example:

```python
import asyncio
from openjudge.graders.format.json.json_match import JsonMatchGrader

async def main():
    # Strict matching
    grader = JsonMatchGrader(strict_order=True)

    result = await grader.aevaluate(
        reference_response='{"name": "Alice", "hobbies": ["reading", "swimming"]}',
        response='{"name": "Alice", "hobbies": ["reading", "swimming"]}',
    )

    print(f"Score: {result.score}")   # 1.0 - exact match

    # Order-independent matching
    grader = JsonMatchGrader(strict_order=False)

    result = await grader.aevaluate(
        reference_response='{"hobbies": ["reading", "swimming"]}',
        response='{"hobbies": ["swimming", "reading"]}',
    )

    print(f"Score: {result.score}")   # 1.0 - matches despite different order

    # Ignore extra keys
    grader = JsonMatchGrader(ignore_extra_keys=True)

    result = await grader.aevaluate(
        reference_response='{"name": "Alice"}',
        response='{"name": "Alice", "age": 30, "city": "NYC"}',
    )

    print(f"Score: {result.score}")   # 1.0 - extra keys ignored

asyncio.run(main())
```
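
The comparison rules can be pictured with the standalone sketch below. It illustrates the semantics described above (deep equality, optional order-insensitive lists, optional tolerance for extra keys) and is not JsonMatchGrader's actual implementation; the order-insensitive branch uses a greedy match, which is a simplification.

```python
import json

def json_match(reference: str, response: str,
               strict_order: bool = True, ignore_extra_keys: bool = False) -> bool:
    """Illustrative deep comparison; not the grader's source code."""
    def match(ref, res):
        if isinstance(ref, dict) and isinstance(res, dict):
            if not ignore_extra_keys and set(ref) != set(res):
                return False
            return all(k in res and match(v, res[k]) for k, v in ref.items())
        if isinstance(ref, list) and isinstance(res, list):
            if len(ref) != len(res):
                return False
            if strict_order:
                return all(match(a, b) for a, b in zip(ref, res))
            # Order-insensitive: each reference item must match some unused response item.
            unused = list(res)
            for item in ref:
                for i, candidate in enumerate(unused):
                    if match(item, candidate):
                        unused.pop(i)
                        break
                else:
                    return False
            return True
        return ref == res

    try:
        return match(json.loads(reference), json.loads(response))
    except json.JSONDecodeError:
        return False

print(json_match('{"hobbies": ["reading", "swimming"]}',
                 '{"hobbies": ["swimming", "reading"]}',
                 strict_order=False))  # True
```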

Length & Quality Control

This category provides graders for controlling output length and detecting repetitive patterns.

LengthPenaltyGrader

Applies penalties to responses that are too short or too long, helping you control output verbosity. By penalizing overly verbose outputs while requiring a minimum amount of content, it is valuable for training models to generate concise yet complete responses.

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| response | str | Yes | The text to evaluate |
| min_length | int | No | Minimum acceptable length (default: 10) |
| max_length | int | No | Maximum acceptable length (default: 1000) |
| penalty_rate | float | No | Penalty per character of violation (default: 0.01) |

Scoring:

- 0.0: Length within the acceptable range
- < 0.0: Negative penalty proportional to the length violation

Penalty calculation:

- If length < min_length: penalty = -(min_length - length) × penalty_rate
- If length > max_length: penalty = -(length - max_length) × penalty_rate

Example:

```python
import asyncio
from openjudge.graders.format.length_penalty import LengthPenaltyGrader

async def main():
    grader = LengthPenaltyGrader(
        min_length=50,
        max_length=200,
        penalty_rate=0.1,
    )

    # Acceptable length
    result = await grader.aevaluate(
        response="This response has an acceptable length that falls within the specified range.",
    )

    print(f"Score: {result.score}")   # 0.0 - no penalty
    print(f"Reason: {result.reason}") # "Length acceptable: 50 <= 83 <= 200"

    # Too short
    result = await grader.aevaluate(response="Short")

    print(f"Score: {result.score}")   # -4.5 = -(50-5) * 0.1
    print(f"Reason: {result.reason}") # "Too short: 5 < 50"

    # Too long
    long_text = "A" * 250
    result = await grader.aevaluate(response=long_text)

    print(f"Score: {result.score}")   # -5.0 = -(250-200) * 0.1
    print(f"Reason: {result.reason}") # "Too long: 250 > 200"

asyncio.run(main())
```
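
The penalty arithmetic can be reproduced with the small function below. It simply restates the documented formula in code, assuming character-based length, and is not the grader's implementation.

```python
def length_penalty(text: str, min_length: int, max_length: int, penalty_rate: float) -> float:
    """Restates the documented formula: 0.0 inside the range, negative outside."""
    length = len(text)
    if length < min_length:
        return -(min_length - length) * penalty_rate
    if length > max_length:
        return -(length - max_length) * penalty_rate
    return 0.0

print(length_penalty("Short", 50, 200, 0.1))    # -4.5, matching the example above
print(length_penalty("A" * 250, 50, 200, 0.1))  # -5.0
print(length_penalty("x" * 100, 50, 200, 0.1))  # 0.0
```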

NgramRepetitionPenaltyGrader

Detects and penalizes repetitive patterns in text using N-gram analysis with support for multiple languages and tokenization methods. This grader is essential for quality control of generated text, helping you detect repetitive content, train models to avoid repetition, and evaluate overall text diversity.

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| response | str | Yes | The text to analyze |
| n | int | No | N-gram size (default: 3) |
| penalty_threshold | float | No | Threshold for the hard penalty (default: 0.3) |
| penalty_rate | float | No | Penalty rate per repetition (default: 1.0) |
| use_soft_penalty | bool | No | Use soft penalty mode (default: False) |
| max_penalty | float | No | Maximum penalty value (default: -1.0) |
| tokenizer_type | str | No | Tokenizer type: tiktoken, jieba, or simple (default: tiktoken) |
| analyze_scope | str | No | Analyze "thought" or "full" text (default: full) |

Scoring:

- 0.0: No significant repetition detected
- < 0.0: Negative penalty proportional to the repetition rate

Example:

```python
import asyncio
from openjudge.graders.format.ngram_repetition_penalty import NgramRepetitionPenaltyGrader

async def main():
    # Hard threshold penalty
    grader = NgramRepetitionPenaltyGrader(
        n=3,
        penalty_threshold=0.3,
        penalty_rate=1.0,
    )

    # Diverse text
    result = await grader.aevaluate(
        response="The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs.",
    )

    print(f"Score: {result.score}")   # 0.0 or small penalty
    print(f"Metadata: {result.metadata['repetition_rate']}")

    # Repetitive text
    result = await grader.aevaluate(
        response="This is a test. This is a test. This is a test. This is a test.",
    )

    print(f"Score: {result.score}")   # Large negative penalty
    print(f"Repetition rate: {result.metadata['repetition_rate']:.2f}")

    # Soft penalty mode
    grader = NgramRepetitionPenaltyGrader(
        n=2,
        use_soft_penalty=True,
        max_penalty=-2.0,
        min_scaling=0.2,
    )

    result = await grader.aevaluate(
        response="Different words create different patterns without repetition here.",
    )

    print(f"Score: {result.score}")   # Gradual penalty

asyncio.run(main())
```
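
For intuition, the sketch below computes one common definition of an n-gram repetition rate (the fraction of n-grams that are duplicates) using a naive whitespace tokenizer. The library's actual tokenization and rate definition may differ, so treat this only as an illustration of why repetitive text scores poorly.

```python
def ngram_repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of n-grams that are duplicates, using naive whitespace tokenization."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(ngram_repetition_rate("The quick brown fox jumps over the lazy dog."))     # 0.0
print(ngram_repetition_rate("This is a test. This is a test. This is a test."))  # 0.6, heavily repetitive
```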

Reasoning Format Validation

This category provides graders for validating structured reasoning outputs and agent tool calls.

ReasoningFormatGrader

Validates that responses follow a specific reasoning format with <think> and <answer> tags, essential for chain-of-thought evaluation. Use this grader to enforce structured reasoning in your models, validate chain-of-thought (CoT) formatting, and ensure proper separation between the thinking process and final answers.

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| response | str | Yes | The text to validate |
| think_token | str | No | Thinking tag name (default: "think") |
| answer_token | str | No | Answer tag name (default: "answer") |

Scoring:

- 1.0: Both <think> and <answer> tags present
- 0.0: Missing one or both required tags

Example:

```python
import asyncio
from openjudge.graders.format.reasoning_format import ReasoningFormatGrader

async def main():
    grader = ReasoningFormatGrader()

    # Valid format
    result = await grader.aevaluate(
        response="""<think>
First, I need to analyze the problem.
The user is asking about Python benefits.
</think>

<answer>
Python is easy to learn, has extensive libraries, and strong community support.
</answer>"""
    )

    print(f"Score: {result.score}")   # 1.0
    print(f"Reason: {result.reason}") # "All format requirements met"

    # Invalid format - missing tags
    result = await grader.aevaluate(
        response="Python is a great programming language for beginners.",
    )

    print(f"Score: {result.score}")   # 0.0
    print(f"Reason: {result.reason}") # "Missing <think></think> tags; Missing <answer></answer> tags"

    # Custom tags
    grader = ReasoningFormatGrader(think_token="reasoning", answer_token="solution")

    result = await grader.aevaluate(
        response="<reasoning>My thought process</reasoning>\n<solution>Final answer</solution>",
    )

    print(f"Score: {result.score}")   # 1.0

asyncio.run(main())
```
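
If your pipeline needs to produce responses in this shape, a tiny formatting helper is enough. The helper below is hypothetical (not part of openjudge); its tag names mirror the grader's think_token and answer_token defaults.

```python
def format_reasoning(thought: str, answer: str,
                     think_token: str = "think", answer_token: str = "answer") -> str:
    """Hypothetical helper that wraps text in the tags ReasoningFormatGrader expects."""
    return (
        f"<{think_token}>\n{thought}\n</{think_token}>\n\n"
        f"<{answer_token}>\n{answer}\n</{answer_token}>"
    )

print(format_reasoning("Compare the two options step by step.", "Option A is cheaper overall."))
```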

ReasoningToolCallFormatGrader

Validates that responses from tool-calling agents follow the expected reasoning format: <think> tags combined with either <answer> or <tool_call> tags, with valid JSON inside each tool call. This grader is ideal for validating agent outputs, enforcing tool-calling formats, verifying function calls, and ensuring proper multi-step reasoning with tool use.

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| response | str | Yes | The text to validate |

Valid formats:

1. <think>...</think> + <answer>...</answer> - Reasoning with a final answer
2. <think>...</think> + <tool_call>...</tool_call> - Reasoning with tool calls

Tool call JSON requirements:

- Must contain a name field (function name)
- Must contain an arguments field (function arguments)

Scoring:

- 1.0: Valid format with proper tags and JSON structure
- 0.0: Invalid format, missing tags, or malformed JSON

Example:

```python
import asyncio
from openjudge.graders.format.reasoning_tool_format import ReasoningToolCallFormatGrader

async def main():
    grader = ReasoningToolCallFormatGrader()

    # Valid reasoning + answer format
    result = await grader.aevaluate(
        response="""<think>
The user wants to know the weather. I should provide the current information.
</think>

<answer>
The current temperature is 72°F with clear skies.
</answer>"""
    )

    print(f"Score: {result.score}")   # 1.0
    print(f"Reason: {result.reason}") # "Valid <think></think> + <answer></answer> format"

    # Valid reasoning + tool call format
    result = await grader.aevaluate(
        response="""<think>
I need to search for information about Python.
</think>

<tool_call>
{"name": "search", "arguments": {"query": "Python programming language"}}
</tool_call>"""
    )

    print(f"Score: {result.score}")   # 1.0
    print(f"Reason: {result.reason}") # "Valid <think></think> + <tool_call></tool_call> format with valid JSON"

    # Multiple tool calls
    result = await grader.aevaluate(
        response="""<think>
I need to gather data from multiple sources.
</think>

<tool_call>
{"name": "get_weather", "arguments": {"city": "New York"}}
</tool_call>

<tool_call>
{"name": "get_news", "arguments": {"topic": "technology"}}
</tool_call>"""
    )

    print(f"Score: {result.score}")   # 1.0
    print(f"Tool calls: {result.metadata['tool_call_count']}")  # 2

    # Invalid format - missing think tag
    result = await grader.aevaluate(
        response="<answer>Direct answer without thinking</answer>",
    )

    print(f"Score: {result.score}")   # 0.0
    print(f"Reason: {result.reason}") # "Missing <think></think> tags"

    # Invalid format - malformed JSON in tool call
    result = await grader.aevaluate(
        response="""<think>Searching</think>
<tool_call>
{invalid json}
</tool_call>"""
    )

    print(f"Score: {result.score}")   # 0.0
    print(f"Reason: {result.reason}") # "Invalid JSON format in <tool_call> tags"

asyncio.run(main())
```
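
The per-tool-call JSON requirement can be pictured with the sketch below, which extracts <tool_call> bodies with a regular expression and checks for the name and arguments fields. It illustrates the documented requirements rather than reproducing the grader's internals.

```python
import json
import re

def tool_calls_are_valid(response: str) -> bool:
    """Illustrative check: every <tool_call> body must be JSON with 'name' and 'arguments'."""
    bodies = re.findall(r"<tool_call>(.*?)</tool_call>", response, flags=re.DOTALL)
    if not bodies:
        return False
    for body in bodies:
        try:
            call = json.loads(body)
        except json.JSONDecodeError:
            return False
        if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
            return False
    return True

print(tool_calls_are_valid(
    '<think>Search first.</think>\n'
    '<tool_call>{"name": "search", "arguments": {"query": "Python"}}</tool_call>'
))  # True
print(tool_calls_are_valid('<think>Search.</think>\n<tool_call>{invalid json}</tool_call>'))  # False
```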

Next Steps