Loss | Twinkle

InfoNCE Loss

Mon, 01 Jan 0001 00:00:00 +0000

The InfonceLoss implements contrastive learning with in-batch negatives and optional cross-rank gathering. It is designed for embedding/retrieval model training.

Usage

from twinkle.loss import InfonceLoss

loss_fn = InfonceLoss(
    temperature=0.1,
    use_batch=True,            # Enable in-batch negatives
    hard_negatives=7,          # Fix negative count per sample
    mask_fake_negative=True,   # Mask false negatives
    fake_neg_margin=0.1,       # Margin for false negative detection
)

model.set_loss(loss_fn)

Input Format

Each sample is laid out as anchor(1) + positive(1) + negatives(n) in a flat embedding tensor. inputs['labels'] is a 1-D mask in which 1 marks the anchor that starts each group and 0 marks every other position.

embeddings: [a0, p0, n0_1, n0_2, a1, p1, n1_1, n1_2, ...]
labels: [ 1, 0, 0, 0, 1, 0, 0, 0, ...]
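
As a rough illustration of this layout, the sketch below splits a flat sequence into (anchor, positive, negatives) groups using the mask. The helper name split_groups is hypothetical and is not part of Twinkle; strings stand in for embedding vectors.

```python
def split_groups(embeddings, labels):
    """Split a flat list of embeddings into (anchor, positive, negatives) groups.

    A label of 1 marks the anchor that starts a new group.
    """
    groups = []
    for emb, flag in zip(embeddings, labels):
        if flag == 1:
            groups.append([emb])       # start a new group with its anchor
        else:
            groups[-1].append(emb)     # positive first, then negatives
    return [(g[0], g[1], g[2:]) for g in groups]


# Strings stand in for embedding vectors.
embeddings = ["a0", "p0", "n0_1", "n0_2", "a1", "p1", "n1_1", "n1_2"]
labels = [1, 0, 0, 0, 1, 0, 0, 0]
groups = split_groups(embeddings, labels)
```

Each resulting tuple holds one anchor, its positive, and the list of its negatives.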

Parameters

Parameter	Type	Default	Description
`temperature`	float	0.1	Logit scaling factor
`use_batch`	bool	True	Use cross-sample in-batch negatives
`hard_negatives`	int	None	Fix per-sample negative count (truncate/upsample)
`mask_fake_negative`	bool	False	Mask logits > positive + margin
`fake_neg_margin`	float	0.1	Threshold for false negative masking
`include_qq`	bool	False	Add query-query similarity block
`include_dd`	bool	False	Add doc-doc similarity block
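
To make temperature, mask_fake_negative, and fake_neg_margin more concrete, here is a minimal single-anchor sketch of an InfoNCE loss over precomputed similarities. It is not Twinkle's implementation: the function info_nce is hypothetical, and it drops masked negatives from the candidate set, which is only one plausible way to apply the masking rule from the table.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.1,
             mask_fake_negative=False, fake_neg_margin=0.1):
    """InfoNCE for one anchor, given similarities to its positive and negatives."""
    if mask_fake_negative:
        # Drop negatives scoring above positive + margin (likely false negatives).
        neg_sims = [s for s in neg_sims if s <= pos_sim + fake_neg_margin]
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]  # equals -log softmax(logits)[0]


unmasked = info_nce(0.8, [0.2, 0.95])
masked = info_nce(0.8, [0.2, 0.95], mask_fake_negative=True)
```

In this example the negative with similarity 0.95 exceeds the positive (0.8) by more than the margin, so masking removes it and the loss drops.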

Cross-Rank Gathering

When use_batch=True and distributed training is active, embeddings are gathered from all DP ranks to maximize in-batch negative diversity. Only the local shard retains gradients.

Similarity Blocks

The loss supports three similarity blocks for comprehensive contrastive learning:

Q→D (default): Query to all documents — primary contrastive signal
Q→Q (include_qq=True): Query to all other queries — prevents query collapse
D→D (include_dd=True): Document to all other documents — Qwen3-Embedding style

Example: Embedding Training

from twinkle.loss import InfonceLoss
from twinkle.metric import EmbeddingMetric

# Configure model for embedding
model.set_loss(InfonceLoss(temperature=0.05, use_batch=True, include_qq=True))
model.set_metric(EmbeddingMetric(device_mesh=mesh, process_group=pg))

# Training loop
for batch in dataloader:
    model.forward_backward(batch)
    model.clip_grad_and_step()

Cross Entropy

Cross entropy is the most commonly used loss for supervised fine-tuning (SFT) and pre-training (PT). It trains the model to assign high probability to the label tokens.

class CrossEntropyLoss(Loss):

    def __init__(self, **kwargs):
        self.reduction = kwargs.get('reduction', 'mean')

    def __call__(self, inputs, outputs, **kwargs):
        import torch
        logits = outputs['logits'].view(-1, outputs['logits'].shape[-1])
        labels = inputs['labels'].view(-1)
        return torch.nn.CrossEntropyLoss(reduction=self.reduction)(logits, labels)

The reduction parameter can be passed at construction time. It accepts sum, mean, or none, the same values as the reduction argument of torch.nn.CrossEntropyLoss.

Transformers models currently use sum. This allows the number of valid tokens to be counted before optimizer.step, so that the loss is averaged per token at the gradient level.
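
The arithmetic behind this choice can be illustrated with plain numbers. The sketch below is not Twinkle code; the micro-batch values are made up, and None stands in for an ignored (padding) token. It compares dividing the summed loss by the total valid-token count against averaging per-micro-batch means.

```python
# Per-token losses for two micro-batches; None marks an ignored (padding) token.
micro_batches = [
    [2.0, 4.0, None, None],  # 2 valid tokens
    [1.0, 1.0, 1.0, 1.0],    # 4 valid tokens
]


def valid(losses):
    """Keep only the losses of tokens that count toward training."""
    return [x for x in losses if x is not None]


# reduction='sum': accumulate the summed loss and the valid-token count separately.
total_loss = sum(sum(valid(mb)) for mb in micro_batches)
total_tokens = sum(len(valid(mb)) for mb in micro_batches)
per_token_mean = total_loss / total_tokens

# reduction='mean' per micro-batch, then averaged: short batches are over-weighted.
mean_of_means = sum(sum(valid(mb)) / len(valid(mb)) for mb in micro_batches) / len(micro_batches)
```

Here the per-token mean is 10 / 6, while the mean of means is 2.0, because the micro-batch with only two valid tokens receives the same weight as the one with four.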

Chunked Cross Entropy

A memory-efficient variant of cross-entropy loss that processes the vocabulary dimension in chunks to reduce peak GPU memory usage.

from twinkle.loss import ChunkedCrossEntropyLoss

loss_fn = ChunkedCrossEntropyLoss(
    chunk_size=1024,   # vocabulary chunk size
    reduction='mean',
)

model.set_loss(loss_fn)

Parameters:

chunk_size: Number of vocabulary tokens to process per chunk (default: 1024)
reduction: Reduction mode — sum, mean, or none

The implementation uses a custom autograd function that splits the logit-to-loss computation into chunks along the vocabulary dimension. This avoids materializing the full [batch*seq_len, vocab_size] probability tensor, significantly reducing memory for large vocabularies.

This is useful when training models with large vocabularies, where standard cross-entropy can cause out-of-memory (OOM) errors.
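
The core idea of scanning the vocabulary in chunks can be sketched with a running log-sum-exp. The example below is a forward-pass illustration for a single token in pure Python; it is not the library's custom autograd function, and the name chunked_cross_entropy is hypothetical.

```python
import math


def chunked_cross_entropy(logits, target, chunk_size=1024):
    """Cross entropy for one token, scanning the vocabulary in chunks."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(logit - running_max) over chunks seen so far
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new maximum, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    log_z = running_max + math.log(running_sum)
    return log_z - logits[target]


logits = [0.5, -1.0, 2.0, 0.0, 3.0]
loss_chunked = chunked_cross_entropy(logits, target=2, chunk_size=2)
loss_full = chunked_cross_entropy(logits, target=2, chunk_size=len(logits))
```

Only one chunk of exponentials is held at a time, yet the result matches the single-pass computation up to floating-point rounding.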

DPO Loss

Direct Preference Optimization (DPO) and its variants are used for aligning models with human preferences without requiring a separate reward model.

DPOLoss

The standard DPO loss supports multiple loss types and an optional reference-free mode.

from twinkle.loss import DPOLoss

loss_fn = DPOLoss(
    loss_type='sigmoid',   # 'sigmoid', 'hinge', 'ipo', 'kto'
    beta=0.1,
    sft_weight=0.0,        # optional SFT regularization weight
    reference_free=False,
)

model.set_loss(loss_fn)

Parameters:

loss_type: DPO variant — sigmoid (default), hinge, ipo, or kto
beta: Temperature parameter controlling preference strength
sft_weight: Weight for an additional SFT loss term on chosen responses
reference_free: If True, skips reference model log-probabilities

The loss expects interleaved chosen/rejected pairs in the batch. It computes sequence-level log-probabilities and optimizes the policy to prefer chosen over rejected responses.
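
For intuition, the sigmoid variant can be written out for one chosen/rejected pair of sequence-level log-probabilities. This is a sketch of the published DPO formula, not Twinkle's code. The function name, the example numbers, and the assumed batch order (chosen first, then rejected, repeated per pair) are illustrative assumptions.

```python
import math


def dpo_sigmoid_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
                     beta=0.1, reference_free=False):
    """Sigmoid DPO loss for one pair of sequence-level log-probabilities."""
    if reference_free:
        ref_chosen = ref_rejected = 0.0  # ignore the reference model
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as softplus(-z) for numerical stability.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Assumed layout: [chosen_0, rejected_0, chosen_1, rejected_1, ...]
policy_logps = [-12.0, -15.0, -9.0, -9.5]
ref_logps = [-13.0, -14.0, -9.0, -9.0]
chosen, rejected = policy_logps[0::2], policy_logps[1::2]
ref_c, ref_r = ref_logps[0::2], ref_logps[1::2]
losses = [dpo_sigmoid_loss(c, r, rc, rr)
          for c, r, rc, rr in zip(chosen, rejected, ref_c, ref_r)]
```

The first pair has the larger policy-versus-reference margin in favor of the chosen response, so its loss is lower than that of the second pair.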

SimPOLoss

Simplified Preference Optimization that removes the need for a reference model by using length-normalized log-probabilities.

from twinkle.loss import SimPOLoss

loss_fn = SimPOLoss(beta=2.0, gamma=1.0)

Parameters:

beta: Scaling factor for the logit difference
gamma: Margin term added to preference gap
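
A minimal sketch of how beta and gamma interact, following the published SimPO objective with length-normalized (per-token average) log-probabilities. The function simpo_loss and its inputs are hypothetical, and Twinkle's implementation may differ in detail.

```python
import math


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one pair, given per-token log-probabilities."""
    # Length normalization: average log-probability per token.
    avg_chosen = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_rejected = sum(rejected_token_logps) / len(rejected_token_logps)
    z = beta * (avg_chosen - avg_rejected) - gamma
    # -log(sigmoid(z)) written as softplus(-z) for numerical stability.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Same per-token quality, different lengths: the scaled gap is zero,
# so only the margin gamma contributes.
tied = simpo_loss([-0.5, -0.5], [-0.5, -0.5, -0.5, -0.5])
better = simpo_loss([-0.2, -0.2], [-1.0, -1.0, -1.0])
```

Because of the margin, a tie between chosen and rejected still incurs a loss above log 2; the chosen response has to win by more than gamma / beta in average log-probability to push the loss below that level.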

CPOLoss

Contrastive Preference Optimization combines preference learning with behavior cloning.

from twinkle.loss import CPOLoss

loss_fn = CPOLoss(beta=0.1, cpo_alpha=1.0)

Parameters:

beta: Temperature for the preference loss
cpo_alpha: Weight of the behavior cloning (NLL) loss on chosen responses

ORPOLoss

Odds Ratio Preference Optimization unifies SFT and preference alignment in a single loss.

from twinkle.loss import ORPOLoss

loss_fn = ORPOLoss(beta=0.1)

The loss combines a standard NLL term on chosen responses with a log-odds-ratio penalty that pushes the model away from rejected responses.
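
The two terms can be sketched as follows, based on the published ORPO formulation. The sketch assumes average per-token log-probabilities (strictly below zero) and assumes that beta weights the odds-ratio term; orpo_loss is a hypothetical name and not the library's function.

```python
import math


def orpo_loss(chosen_token_logps, rejected_token_logps, beta=0.1):
    """ORPO loss for one pair: NLL on chosen plus a weighted odds-ratio penalty."""
    avg_c = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_r = sum(rejected_token_logps) / len(rejected_token_logps)

    def log_odds(avg_logp):
        # log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0.
        return avg_logp - math.log1p(-math.exp(avg_logp))

    z = log_odds(avg_c) - log_odds(avg_r)
    # -log(sigmoid(z)) written as softplus(-z) for numerical stability.
    odds_ratio_term = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    nll = -avg_c
    return nll + beta * odds_ratio_term


tied = orpo_loss([-0.5, -0.5], [-0.5, -0.5])
preferred = orpo_loss([-0.5, -0.5], [-2.0, -2.0])
```

With identical chosen and rejected likelihoods the penalty equals beta * log 2; it shrinks as the rejected response becomes less likely than the chosen one.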

All preference losses inherit shared utilities from PreferenceLossBase, including log-probability computation, chosen/rejected splitting, and sequence-level aggregation.

GKD Loss

Generalized Knowledge Distillation (GKD) loss uses Jensen-Shannon Divergence for distilling knowledge from a teacher model to a student model.

from twinkle.loss import GKDLoss

loss_fn = GKDLoss(
    teacher_mode='full',   # 'full', 'topk_local', 'topk_remote'
    beta=0.5,              # interpolation weight for JSD
    temperature=1.0,
)

model.set_loss(loss_fn)

Parameters:

teacher_mode: How teacher logits are obtained
- full: Full vocabulary logits from a local teacher model
- topk_local: Top-k logits from a local teacher with chunked computation for memory efficiency
- topk_remote: Top-k logits from a remote API teacher
beta: Interpolation weight between student and teacher distributions in JSD (0 = pure student, 1 = pure teacher)
temperature: Softmax temperature for both student and teacher distributions

The GKD loss implements chunked computation internally to reduce peak memory usage when working with large vocabularies.
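
As an illustration of the role of beta and temperature, the sketch below computes a generalized Jensen-Shannon divergence for a single token position. It assumes the mixture is beta * teacher + (1 - beta) * student, in line with the parameter description above. The helper names are hypothetical, and the library's exact weighting may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd(student_logits, teacher_logits, beta=0.5, temperature=1.0):
    student = softmax(student_logits, temperature)
    teacher = softmax(teacher_logits, temperature)
    # Assumed mixture: beta weights the teacher, (1 - beta) the student.
    mixture = [beta * t + (1 - beta) * s for s, t in zip(student, teacher)]
    return beta * kl(teacher, mixture) + (1 - beta) * kl(student, mixture)


same = generalized_jsd([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = generalized_jsd([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The divergence is zero when student and teacher agree and positive otherwise; at beta=0.5 it is symmetric in its two arguments.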

GKD is useful for training smaller student models that mimic the behavior of larger teacher models, and supports both local and remote teacher setups.

GRPO Loss

Group Relative Policy Optimization (GRPO) and its variants implement policy gradient losses with PPO-style clipping and KL regularization.

GRPOLoss

The standard GRPO loss with importance sampling, PPO clipping, and optional KL penalty.

from twinkle.loss import GRPOLoss

loss_fn = GRPOLoss(
    clip_range=0.2,
    beta=0.01,   # KL penalty coefficient
)

model.set_loss(loss_fn)

Parameters:

clip_range: PPO clipping range for importance weights (default: 0.2)
beta: KL divergence penalty coefficient. Set to 0 to disable KL regularization

The loss handles both standard batches and packed sequences (detected via position_ids). It computes per-token importance weights, applies PPO clipping, and optionally adds a KL penalty term against the reference policy.
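
The per-token computation described above can be sketched as below. This is a simplified scalar illustration rather than Twinkle's implementation: grpo_token_loss is a hypothetical name, and the specific KL estimator used here (exp(d) - d - 1 with d = ref_logp - logp) is an assumption, chosen because it is commonly paired with GRPO.

```python
import math


def grpo_token_loss(logp, old_logp, advantage, clip_range=0.2,
                    beta=0.0, ref_logp=None):
    """Clipped policy-gradient loss for one token, with optional KL penalty."""
    ratio = math.exp(logp - old_logp)  # importance weight
    clipped = min(max(ratio, 1 - clip_range), 1 + clip_range)
    # PPO-style pessimistic surrogate: take the smaller of the two objectives.
    surrogate = min(ratio * advantage, clipped * advantage)
    loss = -surrogate
    if beta > 0 and ref_logp is not None:
        # Assumed KL estimator; non-negative, zero when policy equals reference.
        delta = ref_logp - logp
        loss += beta * (math.exp(delta) - delta - 1)
    return loss


on_policy = grpo_token_loss(-1.0, -1.0, advantage=1.0)
clipped_gain = grpo_token_loss(-1.0, -1.5, advantage=1.0)
unclipped_penalty = grpo_token_loss(-1.0, -1.5, advantage=-1.0)
```

With a positive advantage the gain is capped at 1 + clip_range times the advantage, whereas with a negative advantage the larger ratio is kept, so moving further toward a bad token is never rewarded by the clipping.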

Variants

Twinkle provides several GRPO variants:

GSPOLoss

Sequence-level importance sampling variant that computes importance weights at the sequence level rather than token level.

from twinkle.loss import GSPOLoss
loss_fn = GSPOLoss(clip_range=0.2, beta=0.01)
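
The difference between token-level and sequence-level importance weights can be shown with a small sketch. The length-normalized form of the sequence ratio follows the published GSPO formulation and is an assumption about Twinkle's variant; both helper names are hypothetical.

```python
import math


def token_level_ratios(logps, old_logps):
    """One importance weight per token."""
    return [math.exp(n - o) for n, o in zip(logps, old_logps)]


def sequence_level_ratio(logps, old_logps):
    """One importance weight per sequence: geometric mean of the token ratios."""
    mean_diff = sum(n - o for n, o in zip(logps, old_logps)) / len(logps)
    return math.exp(mean_diff)


logps = [-1.0, -2.0]
old_logps = [-1.5, -1.5]
per_token = token_level_ratios(logps, old_logps)
per_sequence = sequence_level_ratio(logps, old_logps)
```

In this example the two token ratios deviate from 1 in opposite directions, while the sequence-level ratio is exactly 1, so clipping would act on the tokens individually but not on the sequence as a whole.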

SAPOLoss

Soft-gated Advantage Policy Optimization applies a sigmoid gate on the advantage to control the optimization direction.

from twinkle.loss import SAPOLoss
loss_fn = SAPOLoss(clip_range=0.2, beta=0.01, tau=1.0)

CISPOLoss

Clipped Importance Sampling Policy Optimization applies explicit clipping to importance weights before multiplying with advantages.

from twinkle.loss import CISPOLoss
loss_fn = CISPOLoss(clip_range=0.2, beta=0.01)

BNPOLoss

Batch-Normalized Policy Optimization normalizes per-token loss across the batch before aggregation.

from twinkle.loss import BNPOLoss
loss_fn = BNPOLoss(clip_range=0.2, beta=0.01)

DRGRPOLoss

Dynamic Ratio GRPO with fixed normalization that uses a fixed denominator for importance weight computation.

from twinkle.loss import DRGRPOLoss
loss_fn = DRGRPOLoss(clip_range=0.2, beta=0.01)

All GRPO variants share the same base pipeline for packed-sequence handling, log-probability alignment, and KL penalty computation. They differ primarily in how importance weights and advantages are combined.

MSE Loss

Mean Squared Error loss for regression-style training tasks.

from twinkle.loss import MSELoss

loss_fn = MSELoss()
model.set_loss(loss_fn)

MSELoss computes the mean squared error between model output logits and the target labels. It is useful for tasks such as reward model training or value function estimation.

Building New Loss

The loss base class in Twinkle is defined as:

class Loss:

    def __call__(self, inputs: InputFeature, outputs: ModelOutput, **kwargs):
        ...

The loss receives the model’s InputFeature as inputs and the model’s standard ModelOutput as outputs; additional kwargs can be passed through the model’s calculate_loss. Because Loss is simply a class with a __call__ method, a plain callable works as well:

def my_loss(inputs: InputFeature, outputs: ModelOutput, extra_data1: int, extra_data2: dict):
    ...
    return loss

Use it in the model like this:

model.set_loss(my_loss)
model.calculate_loss(extra_data1=10, extra_data2={})
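
For a concrete, self-contained picture of the calling convention, the toy callable below takes an extra keyword argument in the same way. Plain dicts and lists stand in for InputFeature, ModelOutput, and tensors; weighted_l1_loss and its weight argument are made up for illustration and are not part of Twinkle.

```python
def weighted_l1_loss(inputs, outputs, weight: float = 1.0):
    """Toy loss following the (inputs, outputs, **kwargs) convention."""
    preds, targets = outputs["logits"], inputs["labels"]
    # Mean absolute error, scaled by the extra keyword argument.
    return weight * sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


# Plain dicts stand in for InputFeature / ModelOutput here.
loss = weighted_l1_loss({"labels": [1.0, 2.0]}, {"logits": [1.5, 1.0]}, weight=2.0)
```

In a real setup the extra argument would be supplied through calculate_loss, in the same way as extra_data1 and extra_data2 above.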

You can also upload a loss to the ModelScope or Hugging Face Hub and load it dynamically by its identifier:

model.set_loss('ms://my_group/my_loss')

Please refer to the plugin documentation for specific details.