Automatically create evaluation graders instead of manually designing criteria. OpenJudge provides two approaches:
| Approach | Module | Data Required | Best For |
|---|---|---|---|
| Simple Rubric | `simple_rubric` | Task description only | Quick prototyping, when you have no labeled data |
| Iterative Rubric | `iterative_rubric` | Labeled preference data | Production quality, when you have training examples |
Key Benefits
- Save time — Eliminate manual rubric design
- Intelligent — Learn criteria from labeled data (Iterative) or task description (Simple)
- Consistent — Produce reproducible evaluation standards
- Scalable — Quickly prototype graders for new domains
What is a Rubric?
A rubric is a set of evaluation criteria that guides the LLM judge on how to score responses. Instead of asking the LLM to "rate this response," you provide specific criteria like:
Theme: Accuracy
- Tip: The response should contain factually correct information
- Tip: Claims should be verifiable and not contradict established knowledge
Theme: Completeness
- Tip: The response should address all parts of the question
- Tip: Important details should not be omitted
Why rubrics matter:
- Without rubrics, LLM judges are inconsistent and subjective
- With rubrics, evaluations become reproducible and explainable
- The challenge: manually writing good rubrics is time-consuming and requires domain expertise
The solution: Automatically extract these criteria from your task description (Simple Rubric) or labeled data (Iterative Rubric).
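To make this concrete, here is a minimal sketch of how Theme-Tips criteria can be turned into judge instructions. The `Rubric` dataclass and `build_judge_prompt` helper are hypothetical illustrations, not part of the OpenJudge API; the generators described below create and inject rubrics for you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rubric:
    # Hypothetical Theme-Tips container for illustration only (not the OpenJudge schema).
    theme: str
    tips: List[str]


def build_judge_prompt(query: str, response: str, rubrics: List[Rubric]) -> str:
    # Ground the LLM judge in explicit criteria instead of asking it to "rate this response".
    criteria = "\n".join(
        f"Theme: {r.theme}\n" + "\n".join(f"- Tip: {tip}" for tip in r.tips)
        for r in rubrics
    )
    return (
        "Evaluate the response against these criteria:\n"
        f"{criteria}\n\n"
        f"Query: {query}\nResponse: {response}\n"
        "Give a score with a short justification."
    )


rubrics = [
    Rubric("Accuracy", ["The response should contain factually correct information"]),
    Rubric("Completeness", ["The response should address all parts of the question"]),
]
print(build_judge_prompt("What does HTTP stand for?", "HyperText Transfer Protocol.", rubrics))
```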
When to Use Each Approach
Simple Rubric (Zero-Shot)
Use when you have a clear task description but no labeled data.
Use Simple Rubric When
- You need to quickly prototype a grader
- You have no labeled preference or scored data
- Your task is well-defined and you can describe it clearly
- You want to get started immediately without data collection
Limitations
- Quality depends on task description clarity
- May not capture domain-specific nuances
- Less accurate than data-driven approaches
Iterative Rubric (Data-Driven)
Use when you have labeled preference data and want production-quality graders.
Use Iterative Rubric When
- You have labeled evaluation data (preference pairs or scored responses)
- Manual rubric design is too time-consuming or subjective
- Your evaluation criteria are implicit and hard to articulate
- You need high accuracy for production use
Don't Use When
- You have no labeled data (use Simple Rubric instead)
- Your criteria are already well-defined and documented
- Simple Code-Based evaluation is sufficient
Simple Rubric: Zero-Shot Generation
Generate evaluation rubrics from task descriptions without any labeled data. The system uses an LLM to create relevant evaluation criteria based on your task context.
How It Works
- Provide task description — Describe what your system does
- Add context — Optionally provide usage scenario and sample queries
- Generate rubrics — LLM creates evaluation criteria automatically
- Create grader — Rubrics are injected into an LLMGrader
Quick Example
import asyncio
from openjudge.generator.simple_rubric import (
    SimpleRubricsGenerator,
    SimpleRubricsGeneratorConfig
)
from openjudge.models import OpenAIChatModel
from openjudge.graders.schema import GraderMode

async def main():
    config = SimpleRubricsGeneratorConfig(
        grader_name="translation_quality_grader",
        model=OpenAIChatModel(model="qwen3-32b"),
        grader_mode=GraderMode.POINTWISE,
        task_description="English to Chinese translation assistant for technical documents. Generate rubrics in English.",
        scenario="Users need accurate, fluent translations of technical content. Please respond in English.",
        min_score=0,
        max_score=5,
    )

    generator = SimpleRubricsGenerator(config)
    grader = await generator.generate(
        dataset=[],
        sample_queries=[
            "Translate: 'Machine learning is a subset of AI.'",
            "Translate: 'The API endpoint returned an error.'",
        ]
    )
    return grader

grader = asyncio.run(main())
Inspect Generated Rubrics
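Continuing from the Quick Example, you can print what was generated. This assumes the simple generator attaches its rubrics to the returned grader the same way the iterative generator does (via `grader.kwargs`); the exact storage key may differ between versions.

```python
# Print whatever the generator attached to the grader (assumed to include the rubrics).
for key, value in grader.kwargs.items():
    print(f"{key}: {value}")
```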
Output (Example):
1. Accuracy: Whether the translation correctly conveys the technical meaning of the original English text
2. Fluency: Whether the translated Chinese is grammatically correct and natural-sounding
3. Technical Appropriateness: Whether the terminology used in the translation is appropriate for a technical context
4. Consistency: Whether similar terms or phrases are consistently translated throughout the response
Evaluate Responses
result = await grader.aevaluate(
    query="Translate: 'The database query returned an error.'",
    response="数据库查询返回了一个错误。"
)
print(result)
Output:
GraderScore(
    name='translation_quality_grader',
    reason="The translation is accurate and correctly conveys the technical meaning of the original English text. The Chinese sentence is grammatically correct and natural-sounding, making it fluent. The terminology used ('数据库查询' for 'database query', '返回了一个错误' for 'returned an error') is appropriate for a technical context. Additionally, the terms are consistently translated throughout the response.",
    score=5.0
)
Simple Rubric Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `grader_name` | `str` | required | Name for the generated grader |
| `model` | `BaseChatModel` | required | LLM for generation and evaluation |
| `grader_mode` | `GraderMode` | `POINTWISE` | `POINTWISE` or `LISTWISE` |
| `task_description` | `str` | `""` | Description of the task |
| `scenario` | `str` | `None` | Optional usage context |
| `language` | `LanguageEnum` | `EN` | Language for prompts (EN or ZH) |
| `min_score` | `int` | `0` | Minimum score (pointwise only) |
| `max_score` | `int` | `1` | Maximum score (pointwise only) |
| `default_rubrics` | `List[str]` | `[]` | Fallback rubrics if generation fails |
| `max_retries` | `int` | `3` | Retry attempts for LLM calls |
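For illustration, the sketch below exercises the optional parameters from the table. The grader name and task text are made up for this example; only parameters documented above are used.

```python
from openjudge.generator.simple_rubric import SimpleRubricsGeneratorConfig
from openjudge.graders.schema import GraderMode
from openjudge.models import OpenAIChatModel

# Hypothetical configuration showing the optional parameters from the table above.
config = SimpleRubricsGeneratorConfig(
    grader_name="support_reply_grader",
    model=OpenAIChatModel(model="qwen3-32b"),
    grader_mode=GraderMode.POINTWISE,
    task_description="Customer support assistant for a SaaS billing product.",
    scenario="Agents need polite, accurate answers that resolve the ticket in one reply.",
    min_score=0,
    max_score=5,
    default_rubrics=[  # used as a fallback if rubric generation fails
        "The response should directly address the customer's question",
        "The response should be polite and professional",
    ],
    max_retries=3,  # retry attempts for LLM calls
)
```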
Iterative Rubric: Data-Driven Generation
Learn evaluation rubrics from labeled preference data. This approach is based on the paper *Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling*.
How It Works
Two-stage approach (sketched in pseudo-code below):
- Infer query-specific rubrics — For each labeled example, the system proposes criteria that explain why one response is better than another
- Generalize to core set — Similar rubrics are merged and organized into a compact, non-redundant "Theme-Tips" structure
Data efficiency: Using just 70 preference pairs, this method enables smaller models to match or outperform fully-trained judge models.
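Conceptually, the two stages look roughly like the pseudo-code below. This is an illustration of the idea only, not OpenJudge's implementation; `propose_rubrics_for` and `merge_into_themes` are stand-ins for the LLM calls the generator performs internally.

```python
from collections import defaultdict
from typing import Dict, List


def propose_rubrics_for(example: Dict) -> List[str]:
    # Stand-in for an LLM call that explains why the labeled example received its
    # score or ranking (e.g. "prefers explanations that mention edge cases").
    return [f"Criterion inferred from: {example['query'][:40]}..."]


def merge_into_themes(rubrics: List[str], n_themes: int) -> Dict[str, List[str]]:
    # Stand-in for an LLM call that merges near-duplicate criteria into a compact,
    # non-redundant Theme-Tips core set.
    themes: Dict[str, List[str]] = defaultdict(list)
    for i, rubric in enumerate(rubrics):
        themes[f"Theme {i % n_themes + 1}"].append(rubric)
    return dict(themes)


def generate_rubrics(dataset: List[Dict], categories_number: int = 5) -> Dict[str, List[str]]:
    # Stage 1: infer query-specific rubrics for each labeled example.
    query_specific: List[str] = []
    for example in dataset:
        query_specific.extend(propose_rubrics_for(example))
    # Stage 2: generalize the pool into a small set of themes.
    return merge_into_themes(query_specific, n_themes=categories_number)
```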
Choose Your Evaluation Mode
| Mode | Config Class | Use Case | Data Format | Output |
|---|---|---|---|---|
| Pointwise | `IterativePointwiseRubricsGeneratorConfig` | Score individual responses (e.g., 1-5 rating) | `label_score` | score, reason |
| Pairwise | `IterativeListwiseRubricsGeneratorConfig` | Compare two responses (A vs B) | `label_rank` (2 items) | ranking, reason |
| Listwise | `IterativeListwiseRubricsGeneratorConfig` | Rank multiple responses (e.g., A > B > C) | `label_rank` | ranking, reason |
Pairwise vs Listwise
Pairwise is a special case of Listwise with exactly 2 responses. Use the same `IterativeListwiseRubricsGeneratorConfig` for both.
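For the listwise case, a training item simply carries more than two responses. A minimal illustrative item, using the same field names as the examples below (the content itself is made up):

```python
# Illustrative listwise training item: three responses ranked from best to worst.
listwise_item = {
    "query": "Summarize the main idea of the paragraph above.",
    "responses": [
        "A faithful summary that covers every key point.",      # rank 1 (best)
        "A short summary that misses one supporting detail.",   # rank 2
        "An off-topic sentence unrelated to the paragraph.",    # rank 3 (worst)
    ],
    "label_rank": [1, 2, 3],  # smaller rank = better response
}
```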
Complete Example: Build a Code Review Grader (Pointwise)
Let's walk through a complete example: building a grader that evaluates code explanation quality.
Step 1: Prepare Labeled Data
dataset = [
    {
        "query": "Explain what this Python code does: `[x**2 for x in range(10)]`",
        "response": "This is a list comprehension that creates a list of squares from 0 to 81. It iterates through numbers 0-9 and squares each one.",
        "label_score": 5  # Excellent: accurate, complete, clear
    },
    {
        "query": "Explain what this Python code does: `lambda x: x * 2`",
        "response": "It's a function.",
        "label_score": 2  # Poor: too brief, missing key details
    },
    {
        "query": "Explain what this Python code does: `dict.get(key, default)`",
        "response": "This method retrieves a value from a dictionary. If the key exists, it returns the associated value. If not, it returns the default value instead of raising a KeyError.",
        "label_score": 5  # Excellent: explains behavior and edge case
    },
    {
        "query": "Explain what this Python code does: `async def fetch(): ...`",
        "response": "This defines an asynchronous function named fetch.",
        "label_score": 3  # Okay: correct but lacks depth
    },
    # ... more examples (recommend 50-100)
]
Step 2: Configure and Run the Generator
import asyncio
from openjudge.generator.iterative_rubric.generator import (
    IterativeRubricsGenerator,
    IterativePointwiseRubricsGeneratorConfig
)
from openjudge.generator.iterative_rubric.query_rubric_generator import (
    POINTWISE_EVALUATION_TEMPLATE
)
from openjudge.models import OpenAIChatModel

async def main():
    # Configure the generator
    config = IterativePointwiseRubricsGeneratorConfig(
        grader_name="code_explanation_grader",
        model=OpenAIChatModel(model="qwen3-32b"),
        custom_evaluation_prompt=POINTWISE_EVALUATION_TEMPLATE,
        min_score=1,
        max_score=5,
        query_specific_generate_number=2,
        enable_categorization=True,
        categories_number=3
    )

    # Generate the grader
    generator = IterativeRubricsGenerator(config)
    grader = await generator.generate(dataset)
    return grader

grader = asyncio.run(main())
Step 3: Inspect the Generated Rubrics
After generation, you can inspect the learned rubrics (stored in `grader.kwargs`):
Output (Generated Rubrics):
Rubric 1:
Theme: Correct Identification and Explanation of Code Type and Purpose
- Tip1: Accurately identifies the type of Python code (e.g., list comprehension, lambda function, asynchronous function) and explains its purpose.
- Tip2: Provides a basic explanation of what the code does, such as generating a list of squared numbers or doubling an input.
Rubric 2:
Theme: Detailed Functional Understanding and Contextual Clarity
- Tip1: Describes the range of values processed by the code and the resulting output values.
- Tip2: Includes additional context or examples that clarify how the code operates in practice, such as explaining `dict.get()` behavior or providing usage examples for lambda functions.
Rubric 3:
Theme: Depth of Explanation for Advanced Concepts
- Tip1: Explains advanced features like asynchronous execution using `async def`, non-blocking behavior, and typical use cases (e.g., with `await` or event loops).
- Tip2: Clearly differentiates between correct and incomplete explanations, assigning higher scores for depth and accuracy.
These rubrics are automatically injected into the evaluation prompt when the grader runs.
Step 4: Evaluate New Responses
Now use the grader to evaluate new code explanations:
async def evaluate_new_response():
    result = await grader.aevaluate(
        query="Explain what this Python code does: `map(str, [1, 2, 3])`",
        response="The map function applies str to each element in the list, converting integers to strings. The result is a map object that yields '1', '2', '3' when iterated."
    )
    print(result)

asyncio.run(evaluate_new_response())
Output:
GraderScore(
    name='code_explanation_grader',
    score=4.0,
    reason="The response correctly identifies the code as using the `map` function and accurately explains its purpose of converting integers to strings. It also describes the resulting output as a map object that yields string values when iterated, which demonstrates functional understanding. However, it lacks additional context or examples (e.g., how to convert the map object into a list for immediate use) that could further clarify how the code operates in practice, slightly limiting its depth."
)
Step 5: Compare Good vs. Bad Responses
# Evaluate a poor response
result = await grader.aevaluate(
    query="Explain what this Python code does: `map(str, [1, 2, 3])`",
    response="It maps things."
)
print(result)
Output:
GraderScore(
    name='code_explanation_grader',
    score=1.0,
    reason="The response 'It maps things.' is extremely vague and lacks any meaningful explanation of what the code does. It fails to identify the type of Python code (e.g., the use of the `map` function), explain its purpose, describe the transformation applied, or provide any contextual clarity. The explanation is insufficient and does not meet even the most basic criteria for understanding or explaining the code."
)
Complete Example: Build a Code Solution Comparator (Pairwise)
Let's build a grader that compares two code implementations and determines which solution is better. This is useful for code review, interview assessment, or selecting the best implementation from multiple candidates.
Step 1: Prepare Pairwise Data
dataset = [
    {
        "query": "Write a function to check if a string is a palindrome.",
        "responses": [
            """def is_palindrome(s):
    s = s.lower().replace(" ", "")
    return s == s[::-1]""",
            """def is_palindrome(s):
    for i in range(len(s)):
        if s[i] != s[len(s)-1-i]:
            return False
    return True"""
        ],
        "label_rank": [1, 2]  # Solution 1 is better: concise, handles case/spaces
    },
    {
        "query": "Implement a function to find the maximum element in a list.",
        "responses": [
            """def find_max(lst):
    if not lst:
        return None
    return max(lst)""",
            """def find_max(lst):
    m = lst[0]
    for x in lst:
        if x > m:
            m = x
    return m"""
        ],
        "label_rank": [1, 2]  # Solution 1 is better: handles edge case, uses built-in
    },
    {
        "query": "Write a function to merge two sorted lists into one sorted list.",
        "responses": [
            """def merge(a, b):
    return sorted(a + b)""",
            """def merge(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result"""
        ],
        "label_rank": [2, 1]  # Solution 2 is better: O(n) vs O(n log n), proper merge algorithm
    },
    {
        "query": "Implement a function to count word frequency in a text.",
        "responses": [
            """def word_freq(text):
    words = text.lower().split()
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return freq""",
            """from collections import Counter
def word_freq(text):
    return Counter(text.lower().split())"""
        ],
        "label_rank": [2, 1]  # Solution 2 is better: idiomatic, uses standard library
    },
    # ... more examples (recommend 50-100)
]
Label Format
- `label_rank` contains the rank for each solution (smaller = better)
- `[1, 2]` means the first solution is better than the second
- `[2, 1]` means the second solution is better than the first
Step 2: Configure and Run the Generator
import asyncio
from openjudge.generator.iterative_rubric.generator import (
    IterativeRubricsGenerator,
    IterativeListwiseRubricsGeneratorConfig  # Use Listwise config for pairwise
)
from openjudge.generator.iterative_rubric.query_rubric_generator import (
    LISTWISE_EVALUATION_TEMPLATE
)
from openjudge.models import OpenAIChatModel

async def main():
    # Configure the generator (Pairwise uses Listwise config)
    config = IterativeListwiseRubricsGeneratorConfig(
        grader_name="code_solution_comparator",
        model=OpenAIChatModel(model="qwen3-32b"),
        custom_evaluation_prompt=LISTWISE_EVALUATION_TEMPLATE,
        query_specific_generate_number=2,
        enable_categorization=True,
        categories_number=3
    )

    # Generate the grader
    generator = IterativeRubricsGenerator(config)
    grader = await generator.generate(dataset)
    return grader

grader = asyncio.run(main())
Step 3: Inspect the Generated Rubrics
Output (Generated Rubrics):
Rubric 1:
Theme: Correctness and Edge Case Handling
- Tip1: The solution must produce correct output for all valid inputs, including edge cases such as empty lists or lists of unequal lengths.
- Tip2: Proper handling of special characters, case insensitivity, and whitespace should be integrated seamlessly without requiring manual preprocessing.
Rubric 2:
Theme: Algorithmic Efficiency and Optimality
- Tip1: Preference is given to solutions with optimal time complexity (e.g., O(n + m) for merging sorted lists rather than O((n + m) log(n + m)) via sorting).
- Tip2: Efficient use of built-in functions or idiomatic Python constructs contributes to better performance and clarity.
Rubric 3:
Theme: Code Readability and Conciseness
- Tip1: Solutions should be concise, using minimal lines of code while maintaining clarity and avoiding unnecessary complexity.
- Tip2: Use of standard library features (e.g., collections.Counter) demonstrates good coding practices and familiarity with Python's ecosystem.
Step 4: Compare New Code Solutions
async def compare_solutions():
    result = await grader.aevaluate(
        query="Write a function to remove duplicates from a list while preserving order.",
        responses=[
            """def remove_duplicates(lst):
    seen = set()
    result = []
    for x in lst:
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result""",
            """def remove_duplicates(lst):
    return list(set(lst))"""
        ]
    )
    print(result)

asyncio.run(compare_solutions())
Output:
GraderRank(
    name='code_solution_comparator',
    rank=[1, 2],
    reason="Response 1 is the best because it correctly removes duplicates while preserving the order of elements. It uses a set for O(1) lookups and a list to maintain insertion order, which ensures correctness and optimal time complexity (O(n)). It also handles edge cases like empty lists gracefully. Response 2 is worse because converting a list to a set and back to a list does not preserve order, violating the problem's requirement. While it is concise and efficient in terms of code length, it fails on correctness for ordered preservation, making it inferior."
)
Iterative Rubric Configuration Reference
Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `grader_name` | `str` | required | Name for the generated grader |
| `model` | `BaseChatModel` | required | LLM to use for generation and evaluation |
| `language` | `LanguageEnum` | `EN` | Language for prompts (EN or ZH) |
| `enable_categorization` | `bool` | `False` | Group similar rubrics into categories |
| `categories_number` | `int` | `5` | Target number of categories |
| `query_specific_generate_number` | `int` | `1` | Rubrics to generate per training sample |
Pointwise-Specific Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `min_score` | `int` | `0` | Minimum score value |
| `max_score` | `int` | `1` | Maximum score value |
Evaluation Prompt
| Parameter | Type | Default | Description |
|---|---|---|---|
| `custom_evaluation_prompt` | `PromptTemplate` | `None` | Evaluation prompt template |
Built-in Templates
Use templates from `openjudge.generator.iterative_rubric.query_rubric_generator`:
- `POINTWISE_EVALUATION_TEMPLATE` for scoring
- `LISTWISE_EVALUATION_TEMPLATE` for ranking
Choosing Between Simple and Iterative Rubric
| Scenario | Recommended Approach |
|---|---|
| Quick prototype, no data | Simple Rubric |
| Production grader with labeled data | Iterative Rubric |
| Well-defined task, need fast setup | Simple Rubric |
| Complex domain, implicit criteria | Iterative Rubric |
| < 50 labeled examples | Simple Rubric (or collect more data) |
| 50-100+ labeled examples | Iterative Rubric |
Workflow Recommendation
- Start with Simple Rubric for quick prototyping
- Collect preference data during initial deployment
- Upgrade to Iterative Rubric when you have 50+ labeled examples (see the sketch below)
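If you log feedback during that initial deployment, converting it into the training format used above is a one-liner. A minimal sketch, assuming hypothetical logged `(query, response, score)` records on a 1-5 scale:

```python
# Hypothetical feedback log collected during initial deployment.
collected_feedback = [
    ("How do I reset my password?", "Click 'Forgot password' on the login page.", 5),
    ("How do I reset my password?", "Contact support.", 2),
]

# Convert to the pointwise format expected by IterativeRubricsGenerator.generate().
dataset = [
    {"query": query, "response": response, "label_score": score}
    for query, response, score in collected_feedback
]
```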
Tips
Data Quality (Iterative Rubric)
Good Practices
- Clear preference signals (good vs. bad is obvious)
- Diverse query types covering your use case
- Consistent labeling standards
Avoid
- Ambiguous cases where labels are debatable
- Noisy or contradictory labels
Task Description Quality (Simple Rubric)
Good Practices
- Be specific about what your system does
- Include the target audience or use case
- Mention key quality dimensions you care about
- Provide representative sample queries
Avoid
- Vague descriptions like "chatbot" or "assistant"
- Missing context about the domain
- No sample queries (the LLM needs examples)
Parameter Tuning (Iterative Rubric)
| Goal | Recommended Settings |
|---|---|
| Fast prototyping | query_specific_generate_number=1, enable_categorization=False |
| Small dataset (50-100) | query_specific_generate_number=1, enable_categorization=False |
| Medium dataset (≤100) | query_specific_generate_number=2-3, enable_categorization=True, categories_number=5 |
| Large dataset (>100) | query_specific_generate_number=1, enable_categorization=True, categories_number=5 |
Next Steps
- Run Grading Tasks — Execute your generated graders at scale
- Grader Analysis — Analyze and validate grader results
- Create Custom Graders — Build graders manually with custom logic