Get started with OpenJudge in 5 minutes. This guide walks you through installation, environment setup, and running your first evaluation.

Installation

# Install with standard dependencies from PyPI
pip install py-openjudge

# For development, install from source
git clone https://github.com/modelscope/OpenJudge.git
cd OpenJudge

# Install based on your needs:
pip install -e .        # Standard installation
pip install -e .[dev]   # With development dependencies
pip install -e .[verl]  # With VerL support for training scenarios

Tip: OpenJudge requires Python >=3.10 and <3.13. For best compatibility, we recommend Python 3.10 or 3.11.
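If you are not sure which interpreter your virtual environment uses, a quick check like the following (plain Python, nothing OpenJudge-specific) can catch version problems early:

import sys

# Fail fast if the interpreter is outside the supported range (>=3.10, <3.13)
assert (3, 10) <= sys.version_info[:2] < (3, 13), f"Unsupported Python: {sys.version}"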

Configure Environment

For LLM-based graders, you need to configure API credentials. OpenJudge uses the OpenAI-compatible API format.

# Set environment variables in your terminal

# OpenAI
export OPENAI_API_KEY="sk-your-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"

# DashScope (Qwen)
export OPENAI_API_KEY="sk-your-dashscope-key"
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

# Alternatively, pass credentials directly when creating the model
from openjudge.models import OpenAIChatModel

model = OpenAIChatModel(
    model="qwen3-32b",
    api_key="sk-your-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

Security Best Practice

Prefer environment variables: they keep API keys out of your source code and version control. The model will automatically use OPENAI_API_KEY and OPENAI_BASE_URL if they are set.
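If you prefer to keep credentials in a file rather than your shell profile, one common pattern (a sketch, not an OpenJudge feature) is a local .env file loaded with the third-party python-dotenv package before the model is created:

# .env (keep this file out of version control)
#   OPENAI_API_KEY=sk-your-api-key
#   OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

from dotenv import load_dotenv  # pip install python-dotenv
from openjudge.models import OpenAIChatModel

load_dotenv()  # copies the values from .env into os.environ

# No api_key/base_url arguments needed; the model reads them from the environment
model = OpenAIChatModel(model="qwen3-32b")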

Choose a Grader for Your Scenario

Suppose you're building a QA system and want to evaluate: Does the AI assistant's response actually answer the user's question?

This is a relevance evaluation task. OpenJudge provides the RelevanceGrader for exactly this purpose—it scores how well a response addresses the query on a 1-5 scale.

Your Scenario                                 Recommended Grader
Does the response answer the question?        RelevanceGrader
Is the response harmful or unsafe?            HarmfulnessGrader
Does the response follow instructions?        InstructionFollowingGrader
Is the response factually correct?            CorrectnessGrader
Does the response contain hallucinations?     HallucinationGrader

For a complete list of available graders, see Built-in Graders.
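Switching graders does not change the workflow: you construct the grader with a judge model and call its aevaluate() method. The import path below is an assumption inferred from the RelevanceGrader path used in this guide; check the Built-in Graders reference for the exact module:

from openjudge.models import OpenAIChatModel
# NOTE: module path assumed by analogy with openjudge.graders.common.relevance;
# see the Built-in Graders reference for the actual location.
from openjudge.graders.common.harmfulness import HarmfulnessGrader

model = OpenAIChatModel(model="qwen3-32b")
grader = HarmfulnessGrader(model=model)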

In this quickstart, we'll use RelevanceGrader to evaluate a QA response.

Prepare Your Data

Prepare a dictionary with query and response fields. These field names correspond to the input parameters of the grader's aevaluate() method:

data = {
    "query": "What is machine learning?",
    "response": "Machine learning is a subset of artificial intelligence that enables computers to learn patterns from data without being explicitly programmed. It uses algorithms to build models that can make predictions or decisions.",
}

Initialize Model and Grader

Create the LLM model and the RelevanceGrader to evaluate how well the response addresses the query:

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader

# Create the judge model (uses OPENAI_API_KEY and OPENAI_BASE_URL from env)
model = OpenAIChatModel(model="qwen3-32b")

# Create the grader
grader = RelevanceGrader(model=model)

What is a Grader?

A Grader is the core evaluation component in OpenJudge. It takes a query-response pair and returns a score with an explanation. Learn more in Core Concepts.

Run Evaluation

All graders use async/await. Evaluate your data with aevaluate():

import asyncio
from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader

async def main():
    # Initialize model and grader
    model = OpenAIChatModel(model="qwen3-32b")
    grader = RelevanceGrader(model=model)

    # Prepare data
    data = {
        "query": "What is machine learning?",
        "response": "Machine learning is a subset of artificial intelligence that enables computers to learn patterns from data without being explicitly programmed. It uses algorithms to build models that can make predictions or decisions.",
    }

    # Run evaluation
    result = await grader.aevaluate(**data)

    # Print result
    print(result)

asyncio.run(main())

Output:

GraderScore(
    name='relevance',
    score=5.0,
    reason="The response directly and clearly defines machine learning as a subset of artificial intelligence, explains its purpose (learning patterns from data without explicit programming), and mentions the use of algorithms to build predictive models. It is concise, on-topic, and fully addresses the user's question."
)

Understanding the Output

The RelevanceGrader returns a GraderScore object with the following fields:

Field     Description                                                      Example Value
name      Identifier of the grader                                         "relevance"
score     Relevance score from 1 (irrelevant) to 5 (perfectly relevant)    5.0
reason    LLM-generated explanation for the score                          "The response directly and clearly..."
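These fields can be read directly from the result object (attribute access is assumed here based on the printed GraderScore representation above):

print(result.name)    # "relevance"
print(result.score)   # 5.0
print(result.reason)  # the LLM's explanation for the score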

Score Interpretation:

  • 5 (Perfectly relevant): Response completely fulfills the query, accurately answering the question
  • 4 (Highly relevant): Response largely meets requirements, possibly missing minor details
  • 3 (Partially relevant): Response has some connection but doesn't fully meet requirements
  • 2 (Weakly relevant): Response has only weak connection, low practical value
  • 1 (Irrelevant): Response is completely unrelated or contains misleading information

In this example, the response received a score of 5 because it directly defines machine learning, explains the core mechanism, and provides relevant context—fully satisfying the user's query.
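In a real pipeline you would usually act on the numeric score, for example flagging low-relevance responses for review. Here is a minimal sketch that extends the main() example above; the helper name and the 4.0 cutoff are illustrative choices, not part of OpenJudge:

MIN_RELEVANCE = 4.0  # illustrative cutoff, not an OpenJudge convention

async def check_relevance(grader, data) -> bool:
    """Return True when the response is judged relevant enough to pass."""
    result = await grader.aevaluate(**data)
    if result.score < MIN_RELEVANCE:
        print(f"Low relevance ({result.score}): {result.reason}")
        return False
    return True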

Next Steps