Algorithm-based graders for evaluating text similarity, string matching, and numerical accuracy. These graders don't require LLMs—they rely purely on algorithms and rules, offering fast execution, zero cost, and deterministic results.
Overview
| Grader | Purpose | Type | Score Range | Key Use Case |
|---|---|---|---|---|
| SimilarityGrader | Compute text similarity (15+ algorithms) | Code-Based | [0, 1] | Translation, summarization, answer matching |
| StringMatchGrader | String pattern matching (9+ modes) | Code-Based | {0, 1} | Format validation, keyword detection |
| NumberAccuracyGrader | Numerical accuracy checks | Code-Based | [0, 1] | Math calculations, data reports |
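All three graders expose the same awaitable `aevaluate` interface, so they can be swapped or combined freely. A minimal sketch of the shared pattern (the constructors and call signature are taken from the examples below; the loop itself is illustrative):

```python
import asyncio

from openjudge.graders.text.number_accuracy import NumberAccuracyGrader
from openjudge.graders.text.similarity import SimilarityGrader
from openjudge.graders.text.string_match import StringMatchGrader


async def main():
    # Each grader exposes the same awaitable aevaluate() interface.
    graders = {
        "similarity (f1_score)": SimilarityGrader(algorithm="f1_score"),
        "string match (exact)": StringMatchGrader(algorithm="exact_match"),
        "number accuracy": NumberAccuracyGrader(tolerance=1e-6),
    }
    for name, grader in graders.items():
        result = await grader.aevaluate(
            reference_response="The answer is 42",
            response="The answer is 42",
        )
        # Identical texts should score 1.0 under all three graders.
        print(f"{name}: score={result.score}")

asyncio.run(main())
```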
SimilarityGrader
Unified text similarity grader supporting multiple mainstream similarity algorithms. Choose the most suitable algorithm based on your scenario.
When to use:

- Translation quality assessment (BLEU)
- Text summarization evaluation (ROUGE)
- Answer matching evaluation (F1 Score)
- Semantic similarity computation (Cosine)
- Fuzzy text matching (Fuzzy Match)
Supported Algorithms:
| Category | Algorithm | Description | Typical Use Case |
|---|---|---|---|
| N-gram Matching | `bleu` | Standard BLEU, sacrebleu implementation | Machine translation |
| | `sentence_bleu` | Sentence-level BLEU, NLTK implementation | Single-sentence translation |
| | `gleu` | Google BLEU, more lenient | Grammar correction |
| | `chrf` | Character-level F-score | Morphologically rich languages |
| Recall-Oriented | `rouge1` | Unigram recall | Content coverage |
| | `rouge2` | Bigram recall | Semantic coherence |
| | `rougeL` | Longest common subsequence | Summary quality |
| | `rouge3/4/5` | Higher-order N-grams | Long-text matching |
| Balanced Metrics | `f1_score` | Token-based F1 | Q&A systems |
| | `meteor` | Considers synonyms and word order | Comprehensive translation quality |
| Semantic Similarity | `cosine` | TF-IDF + cosine similarity | Document similarity |
| | `jaccard` | Set-based similarity | Keyword overlap |
| Fuzzy Matching | `fuzzy_match` | Levenshtein distance | Spelling tolerance |
| | `edit_distance` | Normalized edit distance | Text difference |
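Because the algorithm is selected by name, the same text pair can be scored under several metrics side by side. A sketch (constructor and call signature as in the examples below; the point is the spread of scores, not their exact values):

```python
import asyncio

from openjudge.graders.text.similarity import SimilarityGrader


async def main():
    reference = "Artificial intelligence is transforming the technology industry."
    candidate = "AI is changing the tech industry."
    # Score the same pair under several metrics to see how they differ.
    for algorithm in ["bleu", "rougeL", "f1_score", "cosine", "fuzzy_match"]:
        grader = SimilarityGrader(algorithm=algorithm)
        result = await grader.aevaluate(
            reference_response=reference,
            response=candidate,
        )
        print(f"{algorithm:12s} -> {result.score:.3f}")

asyncio.run(main())
```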
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `reference_response` | str | Yes | Reference text |
| `response` | str | Yes | Text to evaluate |
| `algorithm` | str | Yes | Algorithm name (see table above) |
| `normalize` | bool | No | Whether to normalize text (default True) |
| `case_sensitive` | bool | No | Whether matching is case-sensitive (default False) |
| `**kwargs` | Any | No | Algorithm-specific parameters |
Scoring:
- Score range: 0.0 - 1.0
- Specific meaning depends on chosen algorithm
- Generally: 1.0 = perfect match, 0.0 = no match
Examples:
BLEU Algorithm - Machine Translation Evaluation
```python
import asyncio

from openjudge.graders.text.similarity import SimilarityGrader


async def main():
    grader = SimilarityGrader(algorithm="bleu")
    result = await grader.aevaluate(
        reference_response="The cat is on the mat.",
        response="The cat sits on the mat.",
    )
    print(f"Score: {result.score}")  # partial n-gram overlap; exact value depends on smoothing
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
ROUGE-L Algorithm - Summarization Quality
```python
import asyncio

from openjudge.graders.text.similarity import SimilarityGrader


async def main():
    grader = SimilarityGrader(algorithm="rougeL")
    # Evaluate summarization quality
    result = await grader.aevaluate(
        reference_response="Artificial intelligence is transforming the technology industry.",
        response="AI is changing tech.",
    )
    print(f"Score: {result.score}")  # based on longest common subsequence
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
F1 Score Algorithm - Q&A System Evaluation
```python
import asyncio

from openjudge.graders.text.similarity import SimilarityGrader


async def main():
    grader = SimilarityGrader(algorithm="f1_score", normalize=True)
    result = await grader.aevaluate(
        reference_response="Paris is the capital of France",
        response="The capital of France is Paris",
    )
    print(f"Score: {result.score}")  # ~1.0 (same tokens, different order)
    print(f"Precision: {result.metadata['precision']}")
    print(f"Recall: {result.metadata['recall']}")

asyncio.run(main())
```
Cosine Similarity Algorithm - Semantic Similarity
```python
import asyncio

from openjudge.graders.text.similarity import SimilarityGrader


async def main():
    grader = SimilarityGrader(algorithm="cosine")
    result = await grader.aevaluate(
        reference_response="machine learning and artificial intelligence",
        response="AI and ML technologies",
        use_tfidf=True,
    )
    print(f"Score: {result.score}")
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Fuzzy Match Algorithm - Fuzzy Matching
```python
import asyncio

from openjudge.graders.text.similarity import SimilarityGrader


async def main():
    grader = SimilarityGrader(algorithm="fuzzy_match")
    # Fuzzy matching with spelling tolerance
    result = await grader.aevaluate(
        reference_response="Hello World",
        response="Helo Wrld",
        method="ratio",  # 'ratio', 'partial_ratio', 'token_sort_ratio'
        threshold=0.8,
    )
    print(f"Score: {result.score}")
    print(f"Matched: {result.metadata['matched']}")

asyncio.run(main())
```
Algorithm-Specific Parameters:

BLEU Series:

- `max_ngram_order` (int): Maximum N-gram order (default 4)
- `smooth_method` (str): Smoothing method (`exp`, `floor`, `add-k`)

ROUGE Series:

- `use_stemmer` (bool): Whether to use stemming (default True)
- `score_key` (str): Score type (`fmeasure`, `precision`, `recall`)

METEOR:

- `alpha` (float): Precision weight (default 0.9)
- `beta` (float): Recall weight (default 3.0)
- `gamma` (float): Chunking penalty (default 0.5)

Cosine:

- `use_tfidf` (bool): Whether to use TF-IDF (default True)
- `ngram_range` (tuple): N-gram range (default (1, 2))
- `max_features` (int): Maximum number of features (default None)
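Following the pattern of the `use_tfidf`, `method`, and `threshold` keyword arguments in the examples above, these options can be passed per call. A sketch using the ROUGE options from the list above:

```python
import asyncio

from openjudge.graders.text.similarity import SimilarityGrader


async def main():
    # ROUGE-1 precision instead of the default F-measure,
    # passing the algorithm-specific options per call.
    grader = SimilarityGrader(algorithm="rouge1")
    result = await grader.aevaluate(
        reference_response="The quick brown fox",
        response="The quick red fox",
        use_stemmer=False,
        score_key="precision",
    )
    print(f"ROUGE-1 precision: {result.score}")

asyncio.run(main())
```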
StringMatchGrader
Unified string matching grader supporting multiple matching patterns. Use for format validation, keyword detection, and pattern matching.
When to use:

- Format validation (email, phone numbers)
- Keyword detection
- Prefix/suffix checking
- Exact answer verification
- Regular expression matching
Supported Algorithms:
| Algorithm | Description | Return Value | Typical Use Case |
|---|---|---|---|
| `exact_match` | Exact string match | 1.0/0.0 | Answer verification |
| `prefix_match` | Check if response starts with text | 1.0/0.0 | Completion check |
| `suffix_match` | Check if response ends with text | 1.0/0.0 | Extension validation |
| `regex_match` | Regular expression matching | 1.0/0.0 | Format validation |
| `substring_match` | Substring containment | 1.0/0.0 | Keyword detection |
| `contains_all` | Contains all substrings | 0.0-1.0 | Multiple keywords |
| `contains_any` | Contains any substring | 1.0/0.0 | Alternative keywords |
| `word_overlap` | Word overlap ratio | 0.0-1.0 | Content coverage |
| `char_overlap` | Character overlap ratio | 0.0-1.0 | Character coverage |
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `reference_response` | str | Yes* | Reference text or pattern |
| `response` | str | Yes | Text to evaluate |
| `algorithm` | str | Yes | Algorithm name (see table above) |
| `case_sensitive` | bool | No | Whether matching is case-sensitive (default False) |
| `ignore_whitespace` | bool | No | Whether to ignore whitespace (default False) |
| `**kwargs` | Any | No | Algorithm-specific parameters |
Note
For the `contains_all` and `contains_any` algorithms, `reference_response` may be left empty; pass the substrings to check via the `substrings` parameter instead.
Scoring:
- Boolean algorithms: 1.0 (match) or 0.0 (no match)
- Overlap algorithms: 0.0 - 1.0 (overlap ratio)
Examples:
Exact Match - Answer Verification
```python
import asyncio

from openjudge.graders.text.string_match import StringMatchGrader


async def main():
    grader = StringMatchGrader(
        algorithm="exact_match",
        case_sensitive=False,
        ignore_whitespace=True,
    )
    result = await grader.aevaluate(
        reference_response="Paris",
        response="paris",
    )
    print(f"Score: {result.score}")  # 1.0
    print(f"Matched: {result.metadata['matched']}")  # True

asyncio.run(main())
```
Regular Expression - Format Validation
```python
import asyncio

from openjudge.graders.text.string_match import StringMatchGrader


async def main():
    grader = StringMatchGrader(algorithm="regex_match")

    # Validate email format
    result = await grader.aevaluate(
        reference_response=r"[\w.-]+@[\w.-]+\.\w+",
        response="user@example.com",
    )
    print(f"Score: {result.score}")  # 1.0
    print(f"Reason: {result.reason}")

    # Validate phone number format
    result = await grader.aevaluate(
        reference_response=r"\d{3}-\d{4}",
        response="My phone is 123-4567",
    )
    print(f"Score: {result.score}")  # 1.0

asyncio.run(main())
```
Keyword Detection - Contains All
```python
import asyncio

from openjudge.graders.text.string_match import StringMatchGrader


async def main():
    grader = StringMatchGrader(algorithm="contains_all", case_sensitive=False)

    # Check if the response contains all required keywords
    result = await grader.aevaluate(
        reference_response="",  # reference_response not used
        response="The quick brown fox jumps over the lazy dog",
        substrings=["fox", "dog", "jumps"],
    )
    print(f"Score: {result.score}")  # 1.0 - all keywords found
    print(f"Matched: {result.metadata['matched']}")  # True

    # Partial match
    result = await grader.aevaluate(
        reference_response="",
        response="The quick brown fox jumps over the lazy dog",
        substrings=["fox", "cat", "dog"],
    )
    print(f"Score: {result.score}")  # ~0.67 - 2/3 keywords found
    print(f"Missing: {result.metadata['missing_substrings']}")  # ['cat']

asyncio.run(main())
```
Keyword Detection - Contains Any
```python
import asyncio

from openjudge.graders.text.string_match import StringMatchGrader


async def main():
    grader = StringMatchGrader(algorithm="contains_any")
    # Check if the response contains any of the keywords
    result = await grader.aevaluate(
        reference_response="",
        response="The weather is sunny today",
        substrings=["sunny", "cloudy", "rainy"],
    )
    print(f"Score: {result.score}")  # 1.0
    print(f"Matched: {result.metadata['matched_substrings']}")  # ['sunny']

asyncio.run(main())
```
Prefix/Suffix Matching
```python
import asyncio

from openjudge.graders.text.string_match import StringMatchGrader


async def main():
    # Prefix matching
    prefix_grader = StringMatchGrader(algorithm="prefix_match")
    result = await prefix_grader.aevaluate(
        reference_response="Hello",
        response="Hello World",
    )
    print(f"Prefix Score: {result.score}")  # 1.0

    # Suffix matching
    suffix_grader = StringMatchGrader(algorithm="suffix_match")
    result = await suffix_grader.aevaluate(
        reference_response="World",
        response="Hello World",
    )
    print(f"Suffix Score: {result.score}")  # 1.0

asyncio.run(main())
```
Word Overlap - Content Coverage
```python
import asyncio

from openjudge.graders.text.string_match import StringMatchGrader


async def main():
    grader = StringMatchGrader(algorithm="word_overlap", case_sensitive=False)
    result = await grader.aevaluate(
        reference_response="the cat sat on the mat",
        response="the dog sat on the rug",
    )
    # Overlapping words: "the", "sat", "on" (3)
    # Unique words in reference_response: "the", "cat", "sat", "on", "mat" (5)
    print(f"Score: {result.score}")  # 0.6 (3/5)
    print(f"Overlap Ratio: {result.metadata['overlap_ratio']}")

asyncio.run(main())
```
Algorithm-Specific Parameters:

exact_match:

- `case_sensitive` (bool): Case-sensitive matching (default True)
- `ignore_whitespace` (bool): Ignore whitespace (default False)

regex_match:

- `pattern` (str): Regular expression pattern (can be supplied in place of `reference_response`)
- `case_sensitive` (bool): Case-sensitive matching (default True)

substring_match:

- `bidirectional` (bool): Bidirectional matching (default False)

contains_all/contains_any:

- `substrings` (List[str]): List of substrings to detect
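A sketch of these options in use, assuming (per the list above) that `pattern` may be supplied in place of `reference_response` and that `bidirectional` checks containment in both directions:

```python
import asyncio

from openjudge.graders.text.string_match import StringMatchGrader


async def main():
    # regex_match with the pattern passed as a keyword argument
    regex_grader = StringMatchGrader(algorithm="regex_match")
    result = await regex_grader.aevaluate(
        reference_response="",
        response="Order id: ABC-1234",
        pattern=r"[A-Z]{3}-\d{4}",
    )
    print(f"Regex score: {result.score}")

    # substring_match in both directions: also counts a match when the
    # response is contained in the reference
    sub_grader = StringMatchGrader(algorithm="substring_match")
    result = await sub_grader.aevaluate(
        reference_response="Hello World example",
        response="Hello World",
        bidirectional=True,
    )
    print(f"Substring score: {result.score}")

asyncio.run(main())
```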
NumberAccuracyGrader
Check numerical calculation accuracy by comparing numbers extracted from text.
When to use:

- Math calculation verification
- Data report accuracy
- Quantitative metric checking
- Numerical Q&A evaluation
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `response` | str | Yes | Text to evaluate |
| `reference_response` | str | Yes | Reference answer text |
| `tolerance` | float | No | Numerical tolerance (default 1e-6) |
Scoring:
- Score range: 0.0 - 1.0
- Calculation: correct numbers / total reference numbers
- 1.0: all numbers correct
- 0.5: half of the numbers correct
- 0.0: no correct numbers, or no numbers to compare
Examples:
Basic Numerical Verification
```python
import asyncio

from openjudge.graders.text.number_accuracy import NumberAccuracyGrader


async def main():
    grader = NumberAccuracyGrader(tolerance=1e-6)
    # Perfect match
    result = await grader.aevaluate(
        response="The result is 3.14159",
        reference_response="The result is 3.14159",
    )
    print(f"Score: {result.score}")  # 1.0
    print(f"Reason: {result.reason}")  # "Number accuracy: 1/1 numbers correct"

asyncio.run(main())
```
Multiple Number Verification
```python
import asyncio

from openjudge.graders.text.number_accuracy import NumberAccuracyGrader


async def main():
    grader = NumberAccuracyGrader(tolerance=0.01)
    result = await grader.aevaluate(
        response="Temperature readings: 25.5°C, 30.2°C, 28.7°C",
        reference_response="Expected values: 25.5°C, 30.0°C, 28.7°C",
    )
    # Number matching: 25.5 ✓, 30.2 ✗ (vs 30.0), 28.7 ✓
    print(f"Score: {result.score}")  # ~0.67 (2/3)
    print(f"Correct: {result.metadata['correct_numbers']}")  # 2
    print(f"Total: {result.metadata['total_reference_response_numbers']}")  # 3

asyncio.run(main())
```
Math Calculation Verification
```python
import asyncio

from openjudge.graders.text.number_accuracy import NumberAccuracyGrader


async def main():
    grader = NumberAccuracyGrader(tolerance=1e-6)
    # Verify calculation results
    result = await grader.aevaluate(
        response="Area = 78.54 square units, Perimeter = 31.42 units",
        reference_response="Area = 78.54, Perimeter = 31.42",
    )
    print(f"Score: {result.score}")  # 1.0
    print(f"Response Numbers: {result.metadata['response_numbers']}")
    print(f"Reference Numbers: {result.metadata['reference_response_numbers']}")

asyncio.run(main())
```
Custom Tolerance
```python
import asyncio

from openjudge.graders.text.number_accuracy import NumberAccuracyGrader


async def main():
    # Loose tolerance - for approximate calculations
    loose_grader = NumberAccuracyGrader(tolerance=0.1)
    result = await loose_grader.aevaluate(
        response="The value is approximately 3.14",
        reference_response="The exact value is 3.14159",
    )
    print(f"Score (loose): {result.score}")  # 1.0 (3.14 vs 3.14159 is within tolerance)

    # Strict tolerance - for high precision
    strict_grader = NumberAccuracyGrader(tolerance=1e-9)
    result = await strict_grader.aevaluate(
        response="The value is approximately 3.14",
        reference_response="The exact value is 3.14159",
    )
    print(f"Score (strict): {result.score}")  # 0.0 (exceeds strict tolerance)

asyncio.run(main())
```
How It Works:

1. Extract all numbers (integers and floats) from both texts
2. Compare the numbers in order of appearance
3. Use the specified tolerance to determine matches
4. Return the match ratio as the score
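The steps above amount to a simple extract-and-compare loop. A minimal conceptual sketch of that procedure (illustrative only; the regex and the zip-based pairing are assumptions, not the library's actual implementation):

```python
import re


def number_accuracy(response: str, reference: str, tolerance: float = 1e-6) -> float:
    # Extract integers and floats (including negatives) from both texts.
    pattern = r"-?\d+(?:\.\d+)?"
    got = [float(x) for x in re.findall(pattern, response)]
    expected = [float(x) for x in re.findall(pattern, reference)]
    if not expected:
        return 0.0  # nothing to compare against
    # Compare numbers pairwise, in order of appearance.
    correct = sum(1 for g, e in zip(got, expected) if abs(g - e) <= tolerance)
    return correct / len(expected)


print(number_accuracy("25.5, 30.2, 28.7", "25.5, 30.0, 28.7", tolerance=0.01))  # ~0.667
```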
Important Notes:

- Numbers are compared in order of appearance; their position within the surrounding text does not matter
- Supports negative numbers and floats
- Non-numeric text content is ignored
- Returns 0.0 if the reference text contains no numbers
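For instance, negative values and floats are extracted and compared just like positive integers (a small sketch; the expected score follows from the rules above):

```python
import asyncio

from openjudge.graders.text.number_accuracy import NumberAccuracyGrader


async def main():
    grader = NumberAccuracyGrader(tolerance=1e-6)
    result = await grader.aevaluate(
        response="Delta: -3.5, count: 7",
        reference_response="Expected delta -3.5 and count 7",
    )
    print(f"Score: {result.score}")  # 1.0 - both -3.5 and 7 match

asyncio.run(main())
```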
Best Practices
1. Choose Appropriate Normalization
```python
from openjudge.graders.text.similarity import SimilarityGrader
from openjudge.graders.text.string_match import StringMatchGrader

# Case-insensitive scenario
grader = SimilarityGrader(
    algorithm="f1_score",
    normalize=True,  # converts to lowercase
    case_sensitive=False,
)

# Strict format scenario
grader = StringMatchGrader(
    algorithm="exact_match",
    case_sensitive=True,
    ignore_whitespace=False,
)
```
2. Combine Multiple Algorithms for Comprehensive Evaluation
```python
from openjudge.graders.text.similarity import SimilarityGrader
from openjudge.graders.text.string_match import StringMatchGrader

# Evaluate both exactness and similarity (inside an async context)
exact_grader = StringMatchGrader(algorithm="exact_match")
fuzzy_grader = SimilarityGrader(algorithm="fuzzy_match")

exact_result = await exact_grader.aevaluate(reference_response="...", response="...")
fuzzy_result = await fuzzy_grader.aevaluate(reference_response="...", response="...")

# Decision logic: prioritize exact match, fall back to fuzzy
if exact_result.score == 1.0:
    final_score = 1.0
elif fuzzy_result.score > 0.8:
    final_score = 0.8
else:
    final_score = fuzzy_result.score
```
3. Tune Parameters Based on Data Characteristics
```python
from openjudge.graders.text.number_accuracy import NumberAccuracyGrader
from openjudge.graders.text.similarity import SimilarityGrader

# Scientific calculations - strict tolerance
scientific_grader = NumberAccuracyGrader(tolerance=1e-9)

# Engineering calculations - loose tolerance
engineering_grader = NumberAccuracyGrader(tolerance=0.01)

# Short text - low-order N-grams
short_grader = SimilarityGrader(algorithm="bleu", max_ngram_order=2)

# Long text - standard N-grams
long_grader = SimilarityGrader(algorithm="bleu", max_ngram_order=4)
```
4. Deep Analysis Using Metadata
```python
# Inside an async context
result = await grader.aevaluate(reference_response="...", response="...")

# Check detailed metrics
print(f"Score: {result.score}")
print(f"Precision: {result.metadata.get('precision', 'N/A')}")
print(f"Recall: {result.metadata.get('recall', 'N/A')}")
print(f"Algorithm: {result.metadata['algorithm']}")

# Adjust strategy based on metadata
if result.metadata.get('recall', 0) < 0.5:
    print("Warning: low recall - the response may be incomplete")
```
Performance Characteristics
| Grader | Avg Latency | Throughput | Memory | Thread-Safe |
|---|---|---|---|---|
| SimilarityGrader (BLEU) | < 1 ms | > 10K/s | Very low | ✓ |
| SimilarityGrader (ROUGE) | < 5 ms | > 5K/s | Low | ✓ |
| SimilarityGrader (Cosine) | < 10 ms | > 2K/s | Moderate | ✓ |
| StringMatchGrader | < 0.5 ms | > 20K/s | Very low | ✓ |
| NumberAccuracyGrader | < 1 ms | > 10K/s | Very low | ✓ |
Performance Note
Performance metrics are based on typical text length (100-500 tokens). Actual performance may vary based on text length and hardware configuration.
Next Steps
- Format Graders — Validate structured outputs and formatting
- Create Custom Graders — Build specialized text graders
- Build Reward for Training — Combine graders for RLHF rewards