In OpenJudge, Data Refinement refers to the process of enhancing model outputs by leveraging feedback from Graders. Rather than focusing on data quality for its own sake, we use Graders to evaluate model responses and iteratively improve them through targeted feedback. This guide demonstrates how Graders enable this improvement process.

Graders as Intelligent Evaluators

Graders serve as intelligent evaluators, returning structured feedback that guides each round of output improvement.

What is Data Refinement with Graders

Data Refinement in OpenJudge is fundamentally about improving model outputs through iterative feedback. Graders act as automated critics that evaluate model responses and provide actionable feedback, which can then be used to generate better responses.

The refinement cycle (see the loop sketch after this list):

  1. Generate initial response
  2. Evaluate with Graders
  3. Receive structured feedback
  4. Improve response based on feedback
  5. Re-evaluate (repeat as needed)

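Conceptually, the cycle is a short loop. The sketch below is illustrative only: the generate, evaluate, and refine callables are hypothetical stand-ins for your model call, Grader call, and feedback-driven regeneration, not OpenJudge APIs.

# A minimal sketch of the refinement cycle (hypothetical callables).
def refine_until_good(query, generate, evaluate, refine,
                      max_rounds=3, target_score=0.8):
    response = generate(query)                    # 1. initial response
    for _ in range(max_rounds):
        result = evaluate(query, response)        # 2-3. grade, get feedback
        if result.score >= target_score:
            return response
        response = refine(query, response, result.reason)  # 4. improve
    return response                               # 5. re-evaluated each round
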
How to Implement Data Refinement with Graders

To illustrate how Graders work in practice, consider a scenario where we want to improve the quality of responses generated by a language model.

Step 1: Start with Initial Response

Initially, we might have a query and a basic response that lacks detail or accuracy:

# Sample data that needs refinement
sample = {
    "query": "Explain quantum computing in simple terms",
    "response": "It's about computers that are really fast."
}

Step 2: Define an Evaluation Grader

We can then define a Grader that evaluates the quality of this response:

from openjudge.graders.llm_grader import LLMGrader
from openjudge.models.openai_chat_model import OpenAIChatModel

# Initialize our evaluation model
evaluation_model = OpenAIChatModel(model="qwen3-32b", api_key="your-api-key")

# Create a grader that evaluates response quality
quality_grader = LLMGrader(
    model=evaluation_model,
    name="quality_evaluator",
    template="""
    Evaluate the quality of the following response to the given query.

    Query: {query}
    Response: {response}

    Consider factors like accuracy, completeness, clarity, and helpfulness.
    Provide a score from 0.0 to 1.0 and detailed feedback for improvement.

    Return your evaluation as JSON:
    {{
        "score": <your score from 0.0 to 1.0>,
        "reason": "<your detailed feedback>"
    }}
    """
)

Step 3: Evaluate Initial Response

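The call itself might look like the sketch below; the exact evaluation method on LLMGrader (its name, whether it is sync or async) may differ, so check the OpenJudge API reference.

# Sketch: evaluate the sample with the grader. The exact method name and
# signature on LLMGrader may differ -- consult the API reference.
grader_result = quality_grader.evaluate(
    query=sample["query"],
    response=sample["response"],
)
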
Running this Grader on our sample returns a result in the standardized GraderScore format:

{
  "name": "quality_evaluator",
  "score": 0.3,
  "reason": "The response is overly simplistic and lacks key details about quantum computing concepts such as superposition and entanglement. It doesn't explain how quantum computing differs from classical computing or mention practical applications.",
  "metadata": {}
}
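
Assuming GraderScore exposes its fields as attributes (Step 4 below relies on this), the score and feedback can be read directly:

print(grader_result.score)   # 0.3
print(grader_result.reason)  # "The response is overly simplistic..."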

Understanding the Feedback

The low score (0.3) and detailed reason clearly indicate what's missing:

  • Key concepts (superposition, entanglement)
  • Differentiation from classical computing
  • Practical applications

Step 4: Generate Improved Response

With this feedback, we can now generate an improved response. In an automated refinement process, we might construct a new prompt that incorporates the feedback:

# Build a refinement prompt from the Step 3 feedback (grader_result)
improved_prompt = f"""
Original query: {sample['query']}
Previous response: {sample['response']}

Feedback on previous response: {grader_result.reason}

Please provide a more detailed and accurate response that addresses the feedback.
"""

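To turn this prompt into an actual response, send it to a generation model. The snippet below uses the standard openai client purely as an illustration; any OpenAI-compatible endpoint (such as one serving qwen3-32b) would work.

from openai import OpenAI

# Illustration only: send the refinement prompt to an OpenAI-compatible
# endpoint and take the reply as the candidate improved response.
client = OpenAI(api_key="your-api-key")
completion = client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": improved_prompt}],
)
improved_text = completion.choices[0].message.content
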
# For this walkthrough, assume the model returns the following improved response:
improved_response = {
    "query": "Explain quantum computing in simple terms",
    "response": "Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously, thanks to principles like superposition and entanglement. Unlike classical computers that use bits (0 or 1), quantum computers can process complex calculations much faster for certain problems. While still emerging technology, they show promise in fields like cryptography, drug discovery, and optimization problems."
}

Step 5: Re-evaluate Improved Response

Running the same Grader on this improved response yields:

{
  "name": "quality_evaluator",
  "score": 0.8,
  "reason": "Response provides a much clearer explanation of quantum computing fundamentals including qubits, superposition, and entanglement. It contrasts quantum with classical computing and mentions real-world applications. Could be slightly improved by simplifying some technical terms for a truly 'simple' explanation.",
  "metadata": {}
}

Significant Improvement

The score improved from 0.3 to 0.8 by addressing the feedback points:

  • ✓ Added key concepts (qubits, superposition, entanglement)
  • ✓ Contrasted with classical computing
  • ✓ Included practical applications

This demonstrates the core data refinement process: evaluate → feedback → improve → re-evaluate, leading to progressively better model outputs.

Key Benefits of Grader-Based Refinement

Data Refinement in OpenJudge centers on using Graders to improve model outputs through iterative feedback. By treating Graders as intelligent critics that guide response improvement, you can systematically enhance the quality of AI-generated content.

Advantages of this approach:

Benefit      Description
-----------  -----------------------------------------------
Structured   Graders provide consistent evaluation criteria
Scalable     Automated feedback works across large datasets
Flexible     Works with any model type or domain
Iterative    Enables continuous improvement cycles

Best Practices

  • Start with clear evaluation criteria in your grader templates
  • Use multiple graders to evaluate different quality aspects (see the sketch after this list)
  • Iterate 2-3 rounds; improvements typically diminish after that
  • Track score improvements across rounds to measure progress
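
As a sketch of the multiple-graders practice: build one grader per quality dimension and combine their results. Here clarity_grader and accuracy_grader are hypothetical and would be constructed like quality_grader above, each with its own template; the evaluation call mirrors the sketch in Step 3.

# Hypothetical graders, each built like quality_grader with its own
# template; the evaluation call follows the Step 3 sketch.
graders = [quality_grader, clarity_grader, accuracy_grader]

results = [
    g.evaluate(query=sample["query"], response=sample["response"])
    for g in graders
]

overall_score = sum(r.score for r in results) / len(results)  # simple mean
combined_feedback = "\n".join(f"- {r.name}: {r.reason}" for r in results)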

Next Steps