Metrics | Twinkle

TrainMetric

Mon, 01 Jan 0001 00:00:00 +0000

TrainMetric measures the state of the training process. It reports the current learning rate, the current step, the total training time, the training speed, and similar values.

from twinkle.metric import TrainMetric
metric = TrainMetric()
metric.accumulate(None, None, lr=0.0001, step=10, gradient_accumulation_steps=16)
...
_metric = metric.calculate()

TrainMetric does not need device_mesh or process_group, and it does not use the inputs and outputs arguments, which is why the example above passes None for both.

LossMetric

LossMetric reports and evaluates the loss and grad_norm.

from twinkle.metric import LossMetric
from twinkle.data_format import InputFeature, ModelOutput
metric = LossMetric(device_mesh=..., process_group=...)
metric.accumulate(InputFeature(labels=...), ModelOutput(loss=...), grad_norm=...)
...
_metric = metric.calculate()

Accuracy

The Accuracy metric measures token-level accuracy during training.

from twinkle.metric import Accuracy
from twinkle.data_format import InputFeature, ModelOutput
metric = Accuracy(device_mesh=..., process_group=...)
metric.accumulate(InputFeature(labels=...), ModelOutput(logits=...))
...
_metric = metric.calculate()

Accuracy does not currently support List[InputFeature] as input, so it has not yet been adapted for Megatron.
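To make "token-level accuracy" concrete, here is a minimal pure-Python sketch that does not use Twinkle at all. It assumes that labels are already aligned with the logits and that masked positions carry the value -100; both are assumptions for illustration, not a description of how Accuracy is implemented.

```python
from typing import Sequence

IGNORE_INDEX = -100  # assumed mask value, following the common convention


def token_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int],
                   ignore_index: int = IGNORE_INDEX) -> float:
    """Fraction of non-masked positions where argmax(logits) equals the label."""
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked positions (e.g. prompt or padding) are skipped
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
        total += 1
    return correct / total if total else 0.0


logits = [
    [0.1, 2.0, 0.3],  # argmax is 1, label is 1: correct
    [1.5, 0.2, 0.1],  # argmax is 0, label is 2: wrong
    [0.0, 0.1, 3.0],  # argmax is 2, label is 2: correct
    [0.9, 0.8, 0.7],  # masked, not counted
]
labels = [1, 2, 2, IGNORE_INDEX]
accuracy = token_accuracy(logits, labels)  # 2 of 3 counted tokens are correct
```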

CompletionRewardMetric

The CompletionRewardMetric aggregates key statistics during RLHF training, including generation time, weight synchronization time, reward scores, and completion lengths.

from twinkle.metric import CompletionRewardMetric

metric = CompletionRewardMetric(device_mesh=..., process_group=...)

# Accumulate during training loop
metric.accumulate(
    inputs,
    outputs,
    generation_time=gen_time,
    weight_sync_time=sync_time,
    rewards=reward_values,
    completions=completion_texts,
)

# Calculate aggregated metrics
result = metric.calculate()
# result contains: generation_time, weight_sync_time, mean_reward, mean_completion_length, etc.

This metric is designed for GRPO and other RL training loops where monitoring generation quality and system performance is essential.

CompletionRewardMetric performs DP-aware aggregation, correctly averaging metrics across all data-parallel ranks.
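The source does not spell out how the cross-rank reduction is done. One common reading of "DP-aware" averaging is that each rank contributes a (sum, count) pair and the mean is taken over the combined totals, rather than averaging the per-rank means. The sketch below simulates two ranks with plain lists; the function names are hypothetical and no real collective communication is involved.

```python
from typing import List, Tuple


def local_stats(rewards: List[float]) -> Tuple[float, int]:
    """What a single rank would contribute: its local sum and sample count."""
    return sum(rewards), len(rewards)


def dp_aware_mean(per_rank: List[Tuple[float, int]]) -> float:
    """Mean over all samples on all ranks, from per-rank (sum, count) pairs."""
    total = sum(s for s, _ in per_rank)
    count = sum(n for _, n in per_rank)
    return total / count if count else 0.0


# Two simulated data-parallel ranks holding different numbers of completions.
rank_rewards = [[1.0, 1.0, 1.0], [0.0]]

per_rank = [local_stats(r) for r in rank_rewards]
global_mean = dp_aware_mean(per_rank)

# Averaging the per-rank means weights the single sample on rank 1 too heavily.
naive_mean = sum(sum(r) / len(r) for r in rank_rewards) / len(rank_rewards)
```

With uneven per-rank sample counts the two results differ, which is the case the (sum, count) form is meant to handle.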

DPOMetric

The DPOMetric tracks statistics specific to preference optimization during DPO training.

from twinkle.metric import DPOMetric

metric = DPOMetric(device_mesh=..., process_group=...)

# Accumulate after each forward pass
metric.accumulate(inputs, outputs, ref_outputs=ref_outputs)

# Calculate aggregated metrics
result = metric.calculate()

Tracked metrics:

chosen_logps: Average log-probability of chosen responses
rejected_logps: Average log-probability of rejected responses
ref_chosen_logps: Reference model log-probability of chosen responses
ref_rejected_logps: Reference model log-probability of rejected responses
rewards/chosen: Implicit reward for chosen responses
rewards/rejected: Implicit reward for rejected responses
accuracy: Fraction of pairs where chosen is preferred over rejected
margin: Average reward margin between chosen and rejected

DPOMetric performs DP-aware aggregation across all data-parallel ranks. It supports both interleaved and separate chosen/rejected batch formats.
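As a rough illustration of how the reward-related entries relate to the log-probabilities, the sketch below uses the standard DPO definition of the implicit reward, beta * (policy_logp - ref_logp). The beta value, the function name, and the choice to define accuracy by comparing implicit rewards are assumptions made for this example; the source does not state how DPOMetric computes them.

```python
from typing import Dict, List


def dpo_stats(chosen_logps: List[float], rejected_logps: List[float],
              ref_chosen_logps: List[float], ref_rejected_logps: List[float],
              beta: float = 0.1) -> Dict[str, float]:
    """Per-batch DPO statistics from sequence-level log-probabilities."""
    n = len(chosen_logps)
    # Implicit reward: beta * (policy log-prob - reference log-prob).
    rewards_chosen = [beta * (p - r) for p, r in zip(chosen_logps, ref_chosen_logps)]
    rewards_rejected = [beta * (p - r) for p, r in zip(rejected_logps, ref_rejected_logps)]
    margins = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]
    return {
        'rewards/chosen': sum(rewards_chosen) / n,
        'rewards/rejected': sum(rewards_rejected) / n,
        # A pair counts as correct when the chosen reward exceeds the rejected one.
        'accuracy': sum(1 for m in margins if m > 0) / n,
        'margin': sum(margins) / n,
    }


stats = dpo_stats(
    chosen_logps=[-10.0, -12.0],
    rejected_logps=[-11.0, -9.0],
    ref_chosen_logps=[-11.0, -11.0],
    ref_rejected_logps=[-10.0, -10.0],
    beta=0.5,
)
# First pair: chosen reward 0.5, rejected reward -0.5 (chosen preferred).
# Second pair: chosen reward -0.5, rejected reward 0.5 (rejected preferred).
```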

GRPOMetric

The GRPOMetric tracks policy optimization diagnostics during GRPO training, including KL divergence, clipping rates, entropy, and log-probability statistics.

Usage

from twinkle.metric import GRPOMetric

metric = GRPOMetric(
    device_mesh=device_mesh,
    process_group=process_group,
    epsilon=0.2,      # PPO clip range
    temperature=1.0,  # Sampling temperature for logp rescaling
    top_k_kl=10,      # Track top-K high-KL tokens per step
)

# During training loop
metric.accumulate(inputs, outputs, old_logps=old_logps, advantages=advantages)

# At log interval
results = metric.calculate()
# results: {
#     'train/policy_confidence': 0.85,
#     'train/mean_new_logp': -1.23,
#     'train/mean_old_logp': -1.30,
#     'train/logp_diff_mean': 0.07,
#     'train/approx_kl': 0.003,
#     'train/token_kl_max': 0.15,
#     'train/entropy': 2.1,
#     'train/clip_ratio': 0.02,
#     'train/clip_ratio_low': 0.01,
#     'train/clip_ratio_high': 0.01,
# }

Reported Metrics

Metric	Description
`train/policy_confidence`	exp(mean_new_logp); higher means the model is more confident
`train/mean_new_logp`	Average log-probability of generated tokens under the current policy
`train/mean_old_logp`	Average log-probability under the old policy (the old_logps passed to accumulate)
`train/logp_diff_mean`	Mean (new - old) log-probability difference
`train/approx_kl`	Schulman K3 estimator of KL(old || new)
`train/token_kl_max`	Maximum per-token KL across all ranks
`train/token_ratio_max`	Maximum importance weight across all ranks
`train/entropy`	Average token-level entropy
`train/clip_ratio`	Fraction of tokens clipped (low + high)
`train/clip_ratio_low`	Fraction clipped below (ratio < 1-ε, negative advantage)
`train/clip_ratio_high`	Fraction clipped above (ratio > 1+ε, positive advantage)
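The definitions in the table can be reproduced on a handful of tokens with plain Python. This is a sketch under stated assumptions, not GRPOMetric's implementation: it works on flat lists of per-token log-probabilities and advantages, skips ignore_index masking, temperature rescaling and cross-rank reduction, and omits entropy because that needs the full token distribution. The K3 estimator is written as (r - 1) - log r with r = pi_new / pi_old.

```python
import math
from typing import Dict, List, Optional


def grpo_diagnostics(new_logps: List[float], old_logps: List[float],
                     advantages: List[float], epsilon: float = 0.2,
                     epsilon_high: Optional[float] = None) -> Dict[str, float]:
    """Token-level diagnostics from per-token log-probs (single process)."""
    eps_high = epsilon if epsilon_high is None else epsilon_high
    n = len(new_logps)
    diffs = [new - old for new, old in zip(new_logps, old_logps)]
    ratios = [math.exp(d) for d in diffs]  # importance weights pi_new / pi_old
    # K3 estimator per token: (r - 1) - log r, which is always >= 0.
    token_kl = [(r - 1.0) - d for r, d in zip(ratios, diffs)]
    low = sum(1 for r, a in zip(ratios, advantages) if r < 1.0 - epsilon and a < 0)
    high = sum(1 for r, a in zip(ratios, advantages) if r > 1.0 + eps_high and a > 0)
    mean_new = sum(new_logps) / n
    return {
        'train/policy_confidence': math.exp(mean_new),
        'train/mean_new_logp': mean_new,
        'train/mean_old_logp': sum(old_logps) / n,
        'train/logp_diff_mean': sum(diffs) / n,
        'train/approx_kl': sum(token_kl) / n,
        'train/token_kl_max': max(token_kl),
        'train/token_ratio_max': max(ratios),
        'train/clip_ratio': (low + high) / n,
        'train/clip_ratio_low': low / n,
        'train/clip_ratio_high': high / n,
    }


# Four tokens whose importance ratios are (approximately) 1.0, 1.5, 0.5 and 1.1.
old_logps = [-1.0, -1.0, -1.0, -1.0]
target_ratios = [1.0, 1.5, 0.5, 1.1]
new_logps = [old + math.log(r) for old, r in zip(old_logps, target_ratios)]
advantages = [1.0, 1.0, -1.0, -1.0]

diagnostics = grpo_diagnostics(new_logps, old_logps, advantages, epsilon=0.2)
```

In this toy batch the second token is clipped high (ratio above 1.2 with positive advantage) and the third is clipped low (ratio below 0.8 with negative advantage), so each side accounts for a quarter of the tokens.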

Variants

GSPOMetric: Computes clip rate at sequence level (geometric-mean ratio per sequence)
CISPOMetric: Unconditional clip rate (not gated by advantage sign)
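For the sequence-level variant, the geometric mean of the per-token ratios equals exp(mean(new_logp - old_logp)). The sketch below shows that quantity and a clip rate computed over sequences. Gating by one advantage per sequence mirrors the token-level rule from the table above and is an assumption here, as are the function names; the source does not describe GSPOMetric's internals.

```python
import math
from typing import List


def sequence_ratio(new_logps: List[float], old_logps: List[float]) -> float:
    """Geometric mean of per-token ratios, i.e. exp(mean(new - old))."""
    diffs = [new - old for new, old in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def sequence_clip_rate(new_seqs: List[List[float]], old_seqs: List[List[float]],
                       advantages: List[float], epsilon: float = 0.2) -> float:
    """Fraction of sequences clipped, gated by the sign of the advantage."""
    clipped = 0
    for new, old, adv in zip(new_seqs, old_seqs, advantages):
        r = sequence_ratio(new, old)
        if (r < 1.0 - epsilon and adv < 0) or (r > 1.0 + epsilon and adv > 0):
            clipped += 1
    return clipped / len(new_seqs)


old_seqs = [[0.0, 0.0], [0.0, 0.0]]
new_seqs = [
    [math.log(2.0), math.log(0.5)],  # token ratios 2.0 and 0.5 cancel out
    [math.log(1.5), math.log(1.5)],  # both token ratios are 1.5
]
advantages = [1.0, 1.0]

first_ratio = sequence_ratio(new_seqs[0], old_seqs[0])
rate = sequence_clip_rate(new_seqs, old_seqs, advantages)
```

The first sequence is not clipped at sequence level even though both of its tokens fall outside the token-level range, which is the practical difference between the two granularities.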

Parameters

Parameter	Type	Default	Description
`epsilon`	float	0.2	Lower clip boundary
`epsilon_high`	float	None	Upper clip boundary (defaults to epsilon)
`temperature`	float	1.0	Rescale logps to T=1 before computing KL
`top_k_kl`	int	0	If > 0, record top-K high-KL token details
`ignore_index`	int	-100	Label value to mask out

EmbeddingMetric

The EmbeddingMetric tracks embedding quality during contrastive (InfoNCE) training. It reports anchor-positive cosine similarity statistics and in-batch negative similarity.

Usage

from twinkle.metric import EmbeddingMetric

metric = EmbeddingMetric(device_mesh=device_mesh, process_group=process_group)

# During training
metric.accumulate(inputs, outputs)

# At log interval
results = metric.calculate()
# results: {
#     'pos_sim': '0.8523',       # Mean anchor-positive cosine similarity
#     'pos_sim_min': '0.7102',   # Min across batch
#     'pos_sim_max': '0.9451',   # Max across batch
#     'neg_sim': '0.2134',       # Mean anchor-negative similarity
#     'loss': '0.3412',          # Average InfoNCE loss
#     'grad_norm': '1.234567',   # Gradient norm
# }

Reported Metrics

Metric	Description
`pos_sim`	Mean cosine similarity between anchors and their positives
`pos_sim_min`	Minimum anchor-positive similarity in the batch
`pos_sim_max`	Maximum anchor-positive similarity in the batch
`neg_sim`	Mean similarity between each anchor and the positives of the other anchors (in-batch negatives)
`loss`	Average contrastive loss value
`grad_norm`	Gradient norm (passed via kwargs)
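The similarity entries can be illustrated on a tiny batch with plain Python lists. This sketch assumes that row i of the positives belongs to anchor i, so the diagonal of the anchor-positive similarity matrix holds the positive pairs and the off-diagonal entries are the in-batch negatives. It returns floats rather than formatted strings and does no cross-rank gathering; the function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_stats(anchors: List[Vector], positives: List[Vector]) -> Dict[str, float]:
    """Positive and in-batch negative similarity statistics for one batch."""
    n = len(anchors)
    sim = [[cosine(a, p) for p in positives] for a in anchors]
    pos = [sim[i][i] for i in range(n)]  # matched pairs
    neg = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        'pos_sim': sum(pos) / n,
        'pos_sim_min': min(pos),
        'pos_sim_max': max(pos),
        'neg_sim': sum(neg) / len(neg) if neg else 0.0,
    }


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [1.0, 1.0]]
stats = similarity_stats(anchors, positives)
```

A widening gap between pos_sim and neg_sim is the usual sign that the contrastive objective is separating matched pairs from the rest of the batch.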

Cross-Rank Gathering

EmbeddingMetric performs an all_gather to compute similarity statistics across all DP ranks, providing a global view of embedding quality even under data-parallel training.

This metric pairs with InfonceLoss for embedding/retrieval training tasks.

Building Metrics

Metrics measure the training process and its results. The metric component is one of Twinkle's customizable components.

class Metric:

    def __init__(self, device_mesh, process_group, **kwargs):
        self.process_group = process_group
        self.device_mesh = device_mesh

    # Due to the existence of microbatch, the inputs to Metric may be a List
    def accumulate(self, inputs: 'Union[InputFeature, List[InputFeature]]', outputs: 'ModelOutput'):
        ...

    def calculate(self):
        ...

    def reset(self):
        ...

A metric cannot be passed in as a plain Callable, because it has two parts, accumulate and calculate, and must also support reset to clear its state. The device_mesh and process_group of the current DP group are passed in automatically when the metric is constructed, so that it can communicate across processes. In the actual implementation, the base class also provides a gather_results method that helps collect input results from the various processes.
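As a rough illustration of the accumulate / calculate / reset split described above, the sketch below defines a stand-in Metric base class, so that it runs without Twinkle installed, and a hypothetical MeanLossMetric subclass. The dict-style outputs, the extra **kwargs on accumulate, and the result keys are assumptions made for the example. Cross-process gathering via gather_results is left out.

```python
from typing import Any, Dict


class Metric:
    """Stand-in mirroring the interface above, so the sketch runs without Twinkle."""

    def __init__(self, device_mesh=None, process_group=None, **kwargs):
        self.process_group = process_group
        self.device_mesh = device_mesh

    def accumulate(self, inputs: Any, outputs: Any, **kwargs) -> None:
        raise NotImplementedError

    def calculate(self) -> Dict[str, float]:
        raise NotImplementedError

    def reset(self) -> None:
        raise NotImplementedError


class MeanLossMetric(Metric):
    """Hypothetical metric: mean and max loss since the last reset."""

    def __init__(self, device_mesh=None, process_group=None, **kwargs):
        super().__init__(device_mesh, process_group, **kwargs)
        self.reset()

    def accumulate(self, inputs, outputs, **kwargs):
        # Assumes outputs is dict-like with a scalar 'loss' entry.
        loss = float(outputs['loss'])
        self.total += loss
        self.count += 1
        self.max_loss = max(self.max_loss, loss)

    def calculate(self):
        if self.count == 0:
            return {}
        return {'mean_loss': self.total / self.count, 'max_loss': self.max_loss}

    def reset(self):
        # Zero out the running state so the next logging window starts clean.
        self.total = 0.0
        self.count = 0
        self.max_loss = float('-inf')


metric = MeanLossMetric()
for loss in (1.0, 3.0):
    metric.accumulate(None, {'loss': loss})
result = metric.calculate()
metric.reset()
```

Keeping the running state in accumulate and deferring the division to calculate is what lets the same object be reused across logging intervals, which a single stateless Callable could not do.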