trinity.common.rewards.dapo_reward module

Reward function with the overlong reward shaping described in DAPO (https://arxiv.org/pdf/2503.14476).

class trinity.common.rewards.dapo_reward.MathDAPORewardFn(enable_overlong_penalty: bool | None = None, penalty_factor: float | None = None, max_response_length: int | None = None, cache_length: int | None = None)

Bases: RewardFn

A reward function that follows the DAPO definition for math tasks.

__init__(enable_overlong_penalty: bool | None = None, penalty_factor: float | None = None, max_response_length: int | None = None, cache_length: int | None = None) → None
compute_overlong_penalty(response_token)
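
The following is a minimal, standalone sketch of the soft overlong punishment as defined in the DAPO paper. It is not the Trinity implementation. The function name soft_overlong_penalty is hypothetical, the argument names mirror the constructor parameters above, and two points are assumptions: that penalty_factor acts as a simple multiplier on the paper's penalty, and that the response length is measured in tokens. In the paper, a response no longer than max_response_length - cache_length receives no penalty, a response inside the final cache_length tokens is penalized linearly, and a response beyond max_response_length receives the full penalty of -1.

```python
def soft_overlong_penalty(
    response_length: int,
    max_response_length: int,
    cache_length: int,
    penalty_factor: float = 1.0,
) -> float:
    """Sketch of DAPO's soft overlong punishment (hypothetical helper).

    Assumption: penalty_factor scales the paper's penalty, which lies in [-1, 0].
    """
    if cache_length <= 0 or cache_length > max_response_length:
        raise ValueError("cache_length must be in (0, max_response_length]")

    # Length up to which a response is not penalized at all.
    expected_length = max_response_length - cache_length

    if response_length <= expected_length:
        return 0.0
    if response_length <= max_response_length:
        # Linear ramp from 0 down to -1 across the cache interval.
        return (expected_length - response_length) / cache_length * penalty_factor
    # Beyond the hard limit: full penalty.
    return -penalty_factor


if __name__ == "__main__":
    for length in (100, 900, 1000, 1200):
        print(length, soft_overlong_penalty(length, 1000, 200))
```

Under these assumptions the penalty would be added to the correctness reward, so a correct but overly long answer scores lower than a correct answer of acceptable length.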