Contribute to OpenJudge
Welcome! OpenJudge is an open-source reward model platform. Your contributions help make AI alignment and evaluation more accessible to the community.
Ways to Contribute
We welcome contributions of all kinds:
- Bug Fixes: Identify and resolve issues
- New Features: Add graders, training methods, or integrations
- Documentation: Improve guides and examples
- Examples: Share practical use cases and tutorials
Set Up Development Environment
```bash
git clone https://github.com/YOUR_USERNAME/OpenJudge.git
cd OpenJudge
pip install -e ".[dev]"
python -c "from openjudge.graders.common import RelevanceGrader; print('✓ Installation successful')"
```
Follow Code Standards
Python Coding Standards
Naming:
- snake_case for functions/variables
- PascalCase for classes
- UPPER_CASE for constants
Requirements:
- Use type hints for all parameters and returns
- Include docstrings (Args, Returns, Example)
- Handle exceptions with context
- Validate inputs early
- Optimize for readability
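As a quick illustration (the helper below is hypothetical, not part of OpenJudge), these conventions combine like so:

```python
MAX_SCORE: float = 1.0  # UPPER_CASE for constants


def normalize_score(raw_score: float, max_score: float = MAX_SCORE) -> float:
    """Clamp a raw score into the valid range.

    Args:
        raw_score: Unbounded score produced by an evaluator.
        max_score: Upper bound of the valid score range.

    Returns:
        The score clamped to [0.0, max_score].

    Example:
        >>> normalize_score(1.7)
        1.0
    """
    # Validate inputs early and fail with context.
    if max_score <= 0:
        raise ValueError(f"max_score must be positive, got {max_score}")
    return min(max(raw_score, 0.0), max_score)
```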
Testing Guidelines
Before submitting, verify your changes work:
```bash
# Verify installation
python -c "from openjudge.graders.common import RelevanceGrader; print('✓ Works')"

# Run tests (optional)
pytest tests/ -v
```
- Manual testing is recommended — Test your changes with real examples
- Automated tests — We'll help you add tests during PR review if needed
- Focus on functionality — Make sure your code works for your use case
To ensure the quality and reliability of the OpenJudge project, all key components require proper testing. LLMGraders in particular must pass both unit testing and quality testing.
Unit Testing
Unit tests verify individual components of your code. Each component should include tests for initialization, successful operation, edge cases, and error handling.
```python
import pytest
from unittest.mock import AsyncMock

from openjudge.graders.category.grader_module import YourGrader


@pytest.mark.unit
class TestYourGrader:
    def test_initialization(self):
        """Test successful initialization with required parameters"""
        ...

    @pytest.mark.asyncio
    async def test_successful_operation(self):
        """Test successful operation with valid inputs"""
        ...

    @pytest.mark.asyncio
    async def test_error_handling(self):
        """Test graceful error handling when dependencies fail"""
        ...
```
All external services (especially LLM APIs) must be mocked to enable offline testing.
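For example, an LLM-backed grader can be tested entirely offline by swapping its model for an `AsyncMock`. The `achat` method and the raw string return value below are assumptions for illustration; mock whatever interface your grader actually awaits:

```python
import pytest
from unittest.mock import AsyncMock

from openjudge.graders.category.grader_module import YourGrader  # hypothetical grader


@pytest.mark.unit
@pytest.mark.asyncio
async def test_aevaluate_with_mocked_model():
    # Assumed interface: the grader awaits a single async call on its model client.
    mock_model = AsyncMock()
    mock_model.achat.return_value = '{"score": 0.9, "reason": "Relevant answer"}'

    grader = YourGrader(model=mock_model)
    result = await grader.aevaluate(query="What is 2 + 2?", response="4")

    # No network traffic occurred; only the mocked client was called.
    assert 0.0 <= result.score <= 1.0
```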
Quality Testing
For LLMGraders, quality testing ensures the grader correctly evaluates response quality by comparing against human annotations or expected standards.
Environment Configuration
Quality tests require real API keys and detect them via environment variables:
```python
import os

import pytest

# Check for required environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_BASE_URL = os.getenv("OPENAI_BASE_URL")
RUN_QUALITY_TESTS = bool(OPENAI_API_KEY and OPENAI_BASE_URL)


# Skip tests if API keys are not configured
@pytest.mark.skipif(not RUN_QUALITY_TESTS, reason="Requires API keys to run quality tests")
@pytest.mark.quality
class TestGraderQuality:
    @pytest.fixture
    def dataset(self):
        """Load dataset for quality testing"""
        ...

    @pytest.fixture
    def model(self):
        """Create a model instance for testing"""
        ...

    @pytest.mark.asyncio
    async def test_discriminative_power_with_runner(self, dataset, model):
        """Test discriminative power of the model against a dataset"""
        ...

    @pytest.mark.asyncio
    async def test_consistency_with_runner(self, dataset, model):
        """Test consistency of the model against a dataset"""
        ...
```
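As a rough sketch of what a discriminative-power check does: it verifies that the grader scores a known-better response above a known-worse one. A test inside the class above might look like this (the dataset structure and `YourGrader` are hypothetical):

```python
    @pytest.mark.asyncio
    async def test_better_response_scores_higher(self, dataset, model):
        # Hypothetical dataset of {"query", "better", "worse"} samples.
        grader = YourGrader(model=model)
        for sample in dataset:
            good = await grader.aevaluate(query=sample["query"], response=sample["better"])
            bad = await grader.aevaluate(query=sample["query"], response=sample["worse"])
            # A useful grader discriminates: the better response earns the higher score.
            assert good.score > bad.score
```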
Contributing Graders
When contributing a new grader, follow these guidelines:
Grader Categories
Place your grader in the appropriate category:
| Category | Path | Purpose |
|---|---|---|
| Common | `openjudge/graders/common/` | General-purpose (Relevance, Hallucination) |
| Agent | `openjudge/graders/agent/` | Agent evaluation |
| Code | `openjudge/graders/code/` | Code evaluation |
| Format | `openjudge/graders/format/` | Format validation |
| Text | `openjudge/graders/text/` | Text similarity |
| Multimodal | `openjudge/graders/multimodal/` | Vision/image |
Grader Specifications
Grader Requirements
Code:
- Inherit from BaseGrader
- Use type hints for all parameters
- Include clear docstring with usage example
- Return GraderScore with score (0.0-1.0) and reason
- Use naming: MyGrader (class), my_grader.py (file), "my_grader" (name)
Documentation:
- Add to `docs/built_in_graders/` with When to use, Parameters, Scoring criteria, and Example
- Use `qwen3-32b` as the model name in examples

Testing:
- Manually test that your grader works
- We'll help with automated tests during PR review
Grader Method Parameter Naming Convention
To ensure consistency and usability, the `aevaluate` method of Graders should follow the common parameter naming conventions below (an illustrative signature follows the table):
| Parameter Name | Description | Alternative Names (Not Recommended) |
|---|---|---|
| `query` | User's original question or instruction | `user_query`, `input_query`, `question`, `instruction` |
| `response` | Model-generated answer or output | `answer`, `generated`, `candidate`, `output`, `generated_text`, `prediction`, `generated_response`, `completion` |
| `reference_response` | Standard answer or expected output | `expected`, `reference` |
| `context` | Additional context required for evaluation | `input`, `input_text`, `text` |
| `tool_definitions` | Available tool definitions list | `tool_definition` |
| `tool_calls` | Actual tool call records | `generated_tool_call` |
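For example, a grader that also accepts a gold answer and extra context might expose a signature like this (illustrative only; `GraderScore` comes from `openjudge.graders.schema`, as in the template below):

```python
async def aevaluate(
    self,
    query: str,                             # user's original question or instruction
    response: str,                          # model-generated answer being evaluated
    reference_response: str | None = None,  # standard or expected answer, if available
    context: str | None = None,             # additional context needed for evaluation
    **kwargs,
) -> GraderScore:
    ...
```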
Minimal Grader Template
```python
from openjudge.graders.base_grader import BaseGrader
from openjudge.graders.schema import GraderScore


class MyGrader(BaseGrader):
    """Evaluate [specific aspect].

    Args:
        model: LLM model for evaluation
    """

    def __init__(self, model, **kwargs):
        super().__init__(name="my_grader", **kwargs)
        self.model = model

    async def aevaluate(self, query: str, response: str, **kwargs) -> GraderScore:
        """Evaluate response quality."""
        # Your evaluation logic
        score = 0.8
        reason = "Evaluation explanation"
        return GraderScore(name=self.name, score=score, reason=reason)
```
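A minimal usage sketch (the `my_model` client and the `asyncio.run` entry point are assumptions, not part of the template):

```python
import asyncio

# Hypothetical: build the grader with whatever LLM client your project uses.
grader = MyGrader(model=my_model)

result = asyncio.run(
    grader.aevaluate(
        query="What is the capital of France?",
        response="Paris is the capital of France.",
    )
)
print(result.score, result.reason)
```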
Submit Your Contribution
- Create a Branch

    ```bash
    git checkout -b feat/your-feature-name
    ```

- Make Changes
    - Write clear, focused commits
    - Add tests for new features
    - Update relevant documentation

- Commit with Convention

    Format: `<type>(<scope>): <subject>`

    **Types:** `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`

    !!! info "Commit Rules"
        - Use imperative mood ("Add" not "Added")
        - Subject < 50 characters
        - Body wraps at 72 characters

    ```bash
    git commit -m "feat(grader): add code quality grader"
    git commit -m "fix(training): resolve memory leak in BT training"
    git commit -m "docs(guide): update quickstart tutorial"
    ```

- Push and Open PR

    ```bash
    git push origin feat/your-feature-name
    ```

    Open a Pull Request on GitHub with:

    - Clear description of changes
    - Link to related issues
    - Test results (if applicable)
Contribute Documentation
Documentation Guidelines
- Start with "What & Why" (1-2 sentences)
- Use short paragraphs and bullet lists
- Bold key terms for easy scanning
- Include complete examples with `qwen3-32b` as the model
- See the Documentation Style Guide for formatting details
Get Help
Need assistance? Here's how to reach us:
| Type | Where to Go |
|---|---|
| Report Bugs | Open an Issue |
| Propose Features | Start a Discussion |
| Contact Team | Check README for communication channels |
Before Starting Major Work
Before starting major work, open an issue to discuss your approach. This prevents duplicate efforts and ensures alignment with project goals.
Thank you for contributing to OpenJudge!