Get started with OpenJudge in 5 minutes. This guide walks you through installation, environment setup, and running your first evaluation.
Installation
```bash
# Install with standard dependencies from PyPI
pip install py-openjudge
```

To install from source for development, clone the repository first:

```bash
git clone https://github.com/modelscope/OpenJudge.git
cd OpenJudge

# Install based on your needs:
pip install -e .        # Standard installation
pip install -e .[dev]   # With development dependencies
pip install -e .[verl]  # With VerL support for training scenarios
```
Tip: OpenJudge requires Python >=3.10 and <3.13. For best compatibility, we recommend Python 3.10 or 3.11.
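If you're unsure which interpreter your environment uses, a quick stdlib-only check (a convenience sketch, not part of OpenJudge) confirms it falls in the supported range:

```python
# Optional sketch: verify the interpreter satisfies OpenJudge's requirement (>=3.10, <3.13)
import sys

assert (3, 10) <= sys.version_info[:2] < (3, 13), f"Unsupported Python: {sys.version}"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} is supported.")
```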
Configure Environment
For LLM-based graders, you need to configure API credentials. OpenJudge uses the OpenAI-compatible API format.
Set environment variables in your terminal:

```bash
# OpenAI
export OPENAI_API_KEY="sk-your-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"

# DashScope (Qwen)
export OPENAI_API_KEY="sk-your-dashscope-key"
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
```

Or pass credentials directly when creating the model:

```python
from openjudge.models import OpenAIChatModel

model = OpenAIChatModel(
    model="qwen3-32b",
    api_key="sk-your-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
```
Security Best Practice
Environment variables keep credentials out of your code and are the more convenient option: the model automatically uses OPENAI_API_KEY and OPENAI_BASE_URL if they are set.
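If a grader later fails with an authentication error, a quick stdlib-only check (a sketch, not an OpenJudge utility) shows which credentials the current process actually sees:

```python
# Optional sketch: confirm the environment variables the model will pick up are present
import os

for var in ("OPENAI_API_KEY", "OPENAI_BASE_URL"):
    print(f"{var} is {'set' if os.getenv(var) else 'NOT set'}")
```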
Choose a Grader for Your Scenario
Suppose you're building a QA system and want to evaluate: Does the AI assistant's response actually answer the user's question?
This is a relevance evaluation task. OpenJudge provides the RelevanceGrader for exactly this purpose—it scores how well a response addresses the query on a 1-5 scale.
| Your Scenario | Recommended Grader |
|---|---|
| Does the response answer the question? | RelevanceGrader |
| Is the response harmful or unsafe? | HarmfulnessGrader |
| Does the response follow instructions? | InstructionFollowingGrader |
| Is the response factually correct? | CorrectnessGrader |
| Does the response contain hallucinations? | HallucinationGrader |
For a complete list of available graders, see Built-in Graders.
In this quickstart, we'll use RelevanceGrader to evaluate a QA response.
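The other graders in the table are used the same way as RelevanceGrader in the rest of this guide; only the class you import changes. The sketch below assumes HallucinationGrader follows the same constructor pattern and lives under a similar module path; confirm both in Built-in Graders.

```python
# Hypothetical sketch: swapping in a different grader from the table above.
# The import path and constructor for HallucinationGrader are assumptions; see Built-in Graders.
from openjudge.models import OpenAIChatModel
from openjudge.graders.common.hallucination import HallucinationGrader  # path assumed

model = OpenAIChatModel(model="qwen3-32b")
grader = HallucinationGrader(model=model)  # same pattern as RelevanceGrader below
```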
Prepare Your Data
Prepare a dictionary with query and response fields. These field names correspond to the input parameters of the grader's aevaluate() method:
```python
data = {
    "query": "What is machine learning?",
    "response": "Machine learning is a subset of artificial intelligence that enables computers to learn patterns from data without being explicitly programmed. It uses algorithms to build models that can make predictions or decisions.",
}
```
Initialize Model and Grader
Create the LLM model and the RelevanceGrader to evaluate how well the response addresses the query:
```python
from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader

# Create the judge model (uses OPENAI_API_KEY and OPENAI_BASE_URL from env)
model = OpenAIChatModel(model="qwen3-32b")

# Create the grader
grader = RelevanceGrader(model=model)
```
What is a Grader?
A Grader is the core evaluation component in OpenJudge. It takes a query-response pair and returns a score with an explanation. Learn more in Core Concepts.
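As a rough illustration of that contract (not OpenJudge's actual base class; see Core Concepts and Create Custom Graders for the real interface), a grader is anything that exposes an async aevaluate(query, response) method and returns a score together with an explanation:

```python
# Illustration only: the shape of the grader contract described above.
# Real OpenJudge graders return a GraderScore; this toy version returns a plain dict.
class ToyOverlapGrader:
    """Scores 5.0 if the query and response share any words, otherwise 1.0 (illustration only)."""

    async def aevaluate(self, query: str, response: str) -> dict:
        overlap = set(query.lower().split()) & set(response.lower().split())
        return {
            "name": "toy_overlap",
            "score": 5.0 if overlap else 1.0,
            "reason": f"Query/response word overlap: {sorted(overlap) or 'none'}.",
        }
```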
Run Evaluation
All graders use async/await. Evaluate your data with aevaluate():
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader


async def main():
    # Initialize model and grader
    model = OpenAIChatModel(model="qwen3-32b")
    grader = RelevanceGrader(model=model)

    # Prepare data
    data = {
        "query": "What is machine learning?",
        "response": "Machine learning is a subset of artificial intelligence that enables computers to learn patterns from data without being explicitly programmed. It uses algorithms to build models that can make predictions or decisions.",
    }

    # Run evaluation
    result = await grader.aevaluate(**data)

    # Print result
    print(result)


asyncio.run(main())
```
Output:
```text
GraderScore(
    name='relevance',
    score=5.0,
    reason="The response directly and clearly defines machine learning as a subset of artificial intelligence, explains its purpose (learning patterns from data without explicit programming), and mentions the use of algorithms to build predictive models. It is concise, on-topic, and fully addresses the user's question."
)
```
Understanding the Output
The RelevanceGrader returns a GraderScore object with the following fields:
| Field | Description | Example Value |
|---|---|---|
| `name` | Identifier of the grader | "relevance" |
| `score` | Relevance score from 1 (irrelevant) to 5 (perfectly relevant) | 5.0 |
| `reason` | LLM-generated explanation for the score | "The response directly and clearly..." |
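Continuing from the result returned by aevaluate() in the previous step, these fields can be read in downstream code; attribute access here is an assumption based on the printed repr above:

```python
# Reading GraderScore fields (attribute access assumed from the repr shown above)
if result.score >= 4.0:
    print(f"[{result.name}] pass ({result.score}): {result.reason}")
else:
    print(f"[{result.name}] needs review ({result.score}): {result.reason}")
```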
Score Interpretation:
- 5 (Perfectly relevant): Response completely fulfills the query, accurately answering the question
- 4 (Highly relevant): Response largely meets requirements, possibly missing minor details
- 3 (Partially relevant): Response has some connection but doesn't fully meet requirements
- 2 (Weakly relevant): Response has only weak connection, low practical value
- 1 (Irrelevant): Response is completely unrelated or contains misleading information
In this example, the response received a score of 5 because it directly defines machine learning, explains the core mechanism, and provides relevant context—fully satisfying the user's query.
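Because every grader is async, you can also score multiple examples concurrently with asyncio.gather. The sketch below reuses only pieces already shown in this guide, plus the assumption that GraderScore exposes score as an attribute (as the repr above suggests):

```python
# Sketch: evaluate several query/response pairs concurrently with the same grader.
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader


async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = RelevanceGrader(model=model)

    dataset = [
        {"query": "What is machine learning?",
         "response": "Machine learning lets computers learn patterns from data."},
        {"query": "What is machine learning?",
         "response": "Paris is the capital of France."},
    ]

    # Fire all evaluations at once; results come back in the same order as the inputs
    results = await asyncio.gather(*(grader.aevaluate(**item) for item in dataset))
    for item, result in zip(dataset, results):
        print(f"score={result.score}  response={item['response'][:40]}...")


asyncio.run(main())
```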
Next Steps
- Core Concepts — Understand graders, scoring modes, and result types
- Built-in Graders — Explore all available graders
- Create Custom Graders — Build your own evaluation logic