Vision-language graders for evaluating AI responses involving images. These graders assess image-text coherence, image helpfulness, and text-to-image generation quality.
## Overview
| Grader | Purpose | Type | Score Range | Key Use Case |
|---|---|---|---|---|
| ImageCoherenceGrader | Measures image-text alignment | LLM-Based | [0, 1] | Document generation, content QA |
| ImageHelpfulnessGrader | Evaluates image contribution to understanding | LLM-Based | [0, 1] | Educational content, tutorials |
| TextToImageGrader | Assesses generated image quality | LLM-Based | [0, 1] | Text-to-image model evaluation |
## Performance

Benchmark results using `qwen-vl-max` as the judge model:
| Grader | Samples | Preference Accuracy | Avg Score Diff |
|---|---|---|---|
| ImageCoherenceGrader | 20 | 75.00% | 0.23 |
| ImageHelpfulnessGrader | 20 | 80.00% | 0.18 |
| TextToImageGrader | 20 | 75.00% | 0.26 |
### Performance Metrics

Preference Accuracy measures how often the grader's scores agree with human-annotated preference labels. Higher is better.
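Concretely, preference accuracy is the fraction of annotated pairs where the grader scores the human-preferred response higher. A minimal sketch of both table metrics; the pair format is an illustrative assumption rather than an OpenJudge API, and Avg Score Diff is assumed here to be the mean gap between preferred and non-preferred scores:

```python
# Illustrative sketch, not part of OpenJudge.
# Each pair holds the grader's scores for the human-preferred ("chosen")
# and non-preferred ("rejected") response, both in [0, 1].
def preference_metrics(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    accuracy = sum(chosen > rejected for chosen, rejected in pairs) / len(pairs)
    avg_diff = sum(chosen - rejected for chosen, rejected in pairs) / len(pairs)
    return accuracy, avg_diff

# e.g. agreement on 15 of 20 pairs -> 75.00% preference accuracy
```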
## MLLMImage

All multimodal graders use `MLLMImage` to represent images. It supports both URL and base64 formats.
```python
from openjudge.graders.multimodal import MLLMImage

# From URL
image = MLLMImage(url="https://example.com/image.jpg")

# From base64
image = MLLMImage(base64="iVBORw0KGgoAAAANS...", format="png")
```
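To pass a local file, encode it to base64 first. A minimal sketch using only the standard library; the file path is a placeholder:

```python
import base64

from openjudge.graders.multimodal import MLLMImage

# Read a local image (placeholder path) and base64-encode its bytes
with open("sales_chart.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

image = MLLMImage(base64=encoded, format="png")
```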
## ImageCoherenceGrader
Evaluates how well images match and relate to their surrounding text context. Assesses whether images are appropriately placed and meaningfully connected to the content.
When to use:

- Document generation with embedded images
- Multimodal content quality assurance
- Educational material evaluation
- Technical documentation review
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| response | List[str \| MLLMImage] | Yes | Mixed list of text and images |
| max_context_size | int | No | Max characters from context (default: 500) |
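For instance, content with long passages between images can be evaluated with a smaller context window. A brief sketch; it assumes `max_context_size` is accepted by the grader's constructor, which may differ in your installed version:

```python
from openjudge.models import OpenAIChatModel
from openjudge.graders.multimodal import ImageCoherenceGrader

model = OpenAIChatModel(model="qwen-vl-max")

# Assumption: max_context_size is a constructor argument. It caps how many
# characters of surrounding text are considered per image (default: 500).
grader = ImageCoherenceGrader(model=model, max_context_size=200)
```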
What it evaluates:

- Semantic alignment between image and surrounding text
- Contextual relevance to preceding and following content
- Visual-text consistency
- Placement appropriateness
Scoring:

- 10: Perfect coherence, image perfectly illustrates text
- 7-9: Strong coherence with clear relationship
- 4-6: Some coherence but connection could be clearer
- 0-3: Weak or no coherence, image seems misplaced
> **Note:** The score is normalized to [0, 1]. For multiple images, the final score is the average across images.
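That is, a raw rubric score out of 10 maps to `raw / 10`, and per-image scores are averaged. A tiny illustration of the arithmetic (not library code):

```python
# Illustration only: raw 0-10 rubric scores for two images in one response
raw_scores = [9, 7]

# Normalize each score to [0, 1], then average across images
final_score = sum(s / 10 for s in raw_scores) / len(raw_scores)
print(final_score)  # 0.8
```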
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.multimodal import ImageCoherenceGrader, MLLMImage


async def main():
    model = OpenAIChatModel(model="qwen-vl-max")
    grader = ImageCoherenceGrader(model=model)
    result = await grader.aevaluate(
        response=[
            "Q3 sales increased by 25% compared to last quarter.",
            MLLMImage(url="https://example.com/sales_chart.jpg"),
            "This growth was primarily driven by new product launches.",
        ]
    )
    print(f"Score: {result.score}")  # 0.95 - image coherent with context
    print(f"Reason: {result.reason}")


asyncio.run(main())
```
## ImageHelpfulnessGrader
Evaluates how helpful images are in aiding readers' understanding of text. Goes beyond simple coherence to assess whether images provide genuine value and clarify concepts.
When to use:

- Educational content evaluation
- Technical documentation quality assurance
- Tutorial and how-to guide assessment
- Instructional design evaluation
- User manual review
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| response | List[str \| MLLMImage] | Yes | Mixed list of text and images |
| max_context_size | int | No | Max characters from context (default: 500) |
What it evaluates:

- Information enhancement beyond text
- Concept clarification
- Practical utility vs. decorative value
- Educational value
- Comprehension support
Scoring:

- 10: Extremely helpful, significantly enhances understanding
- 7-9: Very helpful, provides clear value
- 4-6: Somewhat helpful but limited value
- 0-3: Not helpful or redundant with text
> **Note:** The score is normalized to [0, 1]. For multiple images, the final score is the average across images.
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.multimodal import ImageHelpfulnessGrader, MLLMImage


async def main():
    model = OpenAIChatModel(model="qwen-vl-max")
    grader = ImageHelpfulnessGrader(model=model)
    result = await grader.aevaluate(
        response=[
            "The system architecture consists of three main layers.",
            MLLMImage(url="https://example.com/architecture_diagram.jpg"),
            "Each layer handles specific responsibilities.",
        ]
    )
    print(f"Score: {result.score}")  # 0.90 - diagram very helpful
    print(f"Reason: {result.reason}")


asyncio.run(main())
```
## TextToImageGrader
Evaluates AI-generated images from text prompts by measuring semantic consistency (prompt following) and perceptual quality (visual realism). Essential for text-to-image model evaluation.
When to use:

- Text-to-image model benchmarking (DALL-E, Stable Diffusion, etc.)
- Prompt engineering effectiveness evaluation
- Generative model quality control
- A/B testing different generation parameters
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| query | str | Yes | The text prompt used for generation |
| response | MLLMImage | Yes | The generated image to evaluate |
What it evaluates:

- Semantic Consistency: Image accurately reflects the prompt description
- Element Presence: All requested elements are included
- Visual Quality: Image looks natural and realistic
- Artifact Detection: No distortions, blur, or unnatural features
- Composition: Proper spatial arrangement and aesthetics
Scoring:
The final score combines two dimensions:

- Semantic Score (0-10): How well the image follows the prompt
- Perceptual Score (0-10): Naturalness and absence of artifacts

Formula: `score = sqrt(min_sc × min_pq) / 10`, where `min_sc` and `min_pq` are the minimum semantic and perceptual scores (the values exposed in `result.metadata`). The result falls in [0, 1].
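For example, a minimum semantic score of 9 and a minimum perceptual score of 8 combine to sqrt(9 × 8) / 10 ≈ 0.85. The arithmetic as a quick illustration (not library code):

```python
import math

# Illustration only: minimum semantic and perceptual scores on the 0-10 rubric
min_sc, min_pq = 9, 8

score = math.sqrt(min_sc * min_pq) / 10
print(round(score, 2))  # 0.85
```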
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.multimodal import TextToImageGrader, MLLMImage


async def main():
    model = OpenAIChatModel(model="qwen-vl-max")
    grader = TextToImageGrader(model=model)
    result = await grader.aevaluate(
        query="A fluffy orange cat sitting on a blue velvet sofa",
        response=MLLMImage(url="https://example.com/generated_cat.jpg"),
    )
    print(f"Score: {result.score}")  # 0.92 - excellent generation
    print(f"Reason: {result.reason}")

    # Access detailed per-dimension scores
    print(f"Semantic: {result.metadata['min_sc']}/10")
    print(f"Perceptual: {result.metadata['min_pq']}/10")


asyncio.run(main())
```
## Next Steps
- Code & Math Graders — Evaluate code generation and mathematical problem-solving
- Text Graders — Fast, deterministic text comparison using various algorithms
- Build Reward for Training — Combine multiple graders for RLHF rewards