In OpenJudge, Data Refinement refers to the process of enhancing model outputs by leveraging feedback from Graders. Rather than focusing on data quality for its own sake, we use Graders to evaluate model responses and iteratively improve them through targeted feedback. This guide demonstrates how Graders enable this improvement process.

Graders as Intelligent Evaluators

Graders serve as intelligent evaluators, returning structured feedback that guides each round of output improvement.

What is Data Refinement with Graders

Data Refinement in OpenJudge is fundamentally about improving model outputs through iterative feedback. Graders act as automated critics that evaluate model responses and provide actionable feedback, which can then be used to generate better responses.

The refinement cycle (see the loop sketch after this list):

  1. Generate initial response
  2. Evaluate with Graders
  3. Receive structured feedback
  4. Improve response based on feedback
  5. Re-evaluate (repeat as needed)

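Conceptually, the cycle is a short loop. The sketch below is illustrative only: the generate, evaluate, and refine callables are hypothetical stand-ins for your model call, Grader call, and feedback-driven regeneration, not OpenJudge APIs.

# A minimal sketch of the refinement cycle (hypothetical callables).
def refine_until_good(query, generate, evaluate, refine,
                      max_rounds=3, target_score=0.8):
    response = generate(query)                    # 1. initial response
    for _ in range(max_rounds):
        result = evaluate(query, response)        # 2-3. grade, get feedback
        if result.score >= target_score:
            return response
        response = refine(query, response, result.reason)  # 4. improve
    return response                               # 5. re-evaluated each round
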
How to Implement Data Refinement with Graders

To illustrate how Graders work in practice, consider a scenario where we want to improve the quality of responses generated by a language model.

Step 1: Start with Initial Response

Initially, we might have a query and a basic response that lacks detail or accuracy:

# Sample data that needs refinement
sample = {
    "query": "Explain quantum computing in simple terms",
    "response": "It's about computers that are really fast."
}

Step 2: Define an Evaluation Grader

We can then define a Grader that evaluates the quality of this response:

from openjudge.graders.llm_grader import LLMGrader
from openjudge.models.openai_chat_model import OpenAIChatModel

# Initialize our evaluation model
evaluation_model = OpenAIChatModel(model="qwen3-32b", api_key="your-api-key")

# Create a grader that evaluates response quality
quality_grader = LLMGrader(
    model=evaluation_model,
    name="quality_evaluator",
    template="""
    Evaluate the quality of the following response to the given query.

    Query: {query}
    Response: {response}

    Consider factors like accuracy, completeness, clarity, and helpfulness.
    Provide a score from 0.0 to 1.0 and detailed feedback for improvement.

    Return your evaluation as JSON:
    {{
        "score": <your score from 0.0 to 1.0>,
        "reason": "<your detailed feedback>"
    }}
    """
)

Step 3: Evaluate Initial Response

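The call itself might look like the sketch below; the exact evaluation method on LLMGrader (its name, whether it is sync or async) may differ, so check the OpenJudge API reference.

# Sketch: evaluate the sample with the grader. The exact method name and
# signature on LLMGrader may differ -- consult the API reference.
grader_result = quality_grader.evaluate(
    query=sample["query"],
    response=sample["response"],
)
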
Running this Grader on our sample returns a result in the standardized GraderScore format:

{
  "name": "quality_evaluator",
  "score": 0.3,
  "reason": "The response is overly simplistic and lacks key details about quantum computing concepts such as superposition and entanglement. It doesn't explain how quantum computing differs from classical computing or mention practical applications.",
  "metadata": {}
}
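
Assuming GraderScore exposes its fields as attributes (Step 4 below relies on this), the score and feedback can be read directly:

print(grader_result.score)   # 0.3
print(grader_result.reason)  # "The response is overly simplistic..."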

Understanding the Feedback

The low score (0.3) and detailed reason clearly indicate what's missing:

  • Key concepts (superposition, entanglement)
  • Differentiation from classical computing
  • Practical applications

Step 4: Generate Improved Response

With this feedback, we can now generate an improved response. In an automated refinement process, we might construct a new prompt that incorporates the feedback:

# Build a refinement prompt from the Step 3 feedback (grader_result)
improved_prompt = f"""
Original query: {sample['query']}
Previous response: {sample['response']}

Feedback on previous response: {grader_result.reason}

Please provide a more detailed and accurate response that addresses the feedback.
"""

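To turn this prompt into an actual response, send it to a generation model. The snippet below uses the standard openai client purely as an illustration; any OpenAI-compatible endpoint (such as one serving qwen3-32b) would work.

from openai import OpenAI

# Illustration only: send the refinement prompt to an OpenAI-compatible
# endpoint and take the reply as the candidate improved response.
client = OpenAI(api_key="your-api-key")
completion = client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": improved_prompt}],
)
improved_text = completion.choices[0].message.content
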
# For this walkthrough, assume the model returns the following improved response:
improved_response = {
    "query": "Explain quantum computing in simple terms",
    "response": "Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously, thanks to principles like superposition and entanglement. Unlike classical computers that use bits (0 or 1), quantum computers can process complex calculations much faster for certain problems. While still emerging technology, they show promise in fields like cryptography, drug discovery, and optimization problems."
}

Step 5: Re-evaluate Improved Response

Running the same Grader on this improved response yields:

{
  "name": "quality_evaluator",
  "score": 0.8,
  "reason": "Response provides a much clearer explanation of quantum computing fundamentals including qubits, superposition, and entanglement. It contrasts quantum with classical computing and mentions real-world applications. Could be slightly improved by simplifying some technical terms for a truly 'simple' explanation.",
  "metadata": {}
}

Significant Improvement

The score improved from 0.3 to 0.8 by addressing the feedback points:

  • ✓ Added key concepts (qubits, superposition, entanglement)
  • ✓ Contrasted with classical computing
  • ✓ Included practical applications

This demonstrates the core data refinement process: evaluate → feedback → improve → re-evaluate, leading to progressively better model outputs.

Key Benefits of Grader-Based Refinement

Data Refinement in OpenJudge centers on using Graders to improve model outputs through iterative feedback. By treating Graders as intelligent critics that guide response improvement, you can systematically enhance the quality of AI-generated content.

Advantages of this approach:

Benefit      Description
-----------  -----------------------------------------------
Structured   Graders provide consistent evaluation criteria
Scalable     Automated feedback works across large datasets
Flexible     Works with any model type or domain
Iterative    Enables continuous improvement cycles

Best Practices

  • Start with clear evaluation criteria in your grader templates
  • Use multiple graders to evaluate different quality aspects (see the sketch after this list)
  • Iterate 2-3 rounds; improvements typically diminish after that
  • Track score improvements across rounds to measure progress
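
As a sketch of the multiple-graders practice: build one grader per quality dimension and combine their results. Here clarity_grader and accuracy_grader are hypothetical and would be constructed like quality_grader above, each with its own template; the evaluation call mirrors the sketch in Step 3.

# Hypothetical graders, each built like quality_grader with its own
# template; the evaluation call follows the Step 3 sketch.
graders = [quality_grader, clarity_grader, accuracy_grader]

results = [
    g.evaluate(query=sample["query"], response=sample["response"])
    for g in graders
]

overall_score = sum(r.score for r in results) / len(results)  # simple mean
combined_feedback = "\n".join(f"- {r.name}: {r.reason}" for r in results)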

Next Steps