trinity.algorithm.add_strategy

Submodules

trinity.algorithm.add_strategy.add_strategy module

class trinity.algorithm.add_strategy.add_strategy.AddStrategy(writer: BufferWriter, **kwargs)[source]

Bases: ABC

__init__(writer: BufferWriter, **kwargs) None[source]
abstract async add(experiences: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • experiences (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]

abstract classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict
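
A minimal sketch of a concrete strategy, assuming that the strategy keeps its BufferWriter as self.writer and that the writer exposes an awaitable write method (both assumptions for illustration, not confirmed API):

```python
from typing import Dict, List, Tuple

from trinity.algorithm.add_strategy import AddStrategy


class PassThroughAddStrategy(AddStrategy):
    """Hypothetical strategy that writes every experience unchanged."""

    async def add(self, experiences: List["Experience"], step: int) -> Tuple[int, Dict]:
        # `self.writer.write(...)` being awaitable is an assumption for this sketch.
        await self.writer.write(experiences)
        return len(experiences), {"added": len(experiences), "step": step}

    @classmethod
    def default_args(cls) -> dict:
        return {}
```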

class trinity.algorithm.add_strategy.add_strategy.GroupAdvantageStrategy(writer: BufferWriter, **kwargs)[source]

Bases: AddStrategy

An example AddStrategy that calculates group advantages.

abstract group_experiences(exps: List[Experience]) Dict[str, List[Experience]][source]

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

abstract calculate_group_advantage(group_id: str, exps: List[Experience]) Tuple[List[Experience], Dict][source]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

async add(exps: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • exps (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]
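
To illustrate the grouping-then-scoring template that GroupAdvantageStrategy defines, here is a hedged sketch of a mean-baseline subclass; the group_by helper is documented below, while the reward and advantages fields on Experience are assumptions for illustration only:

```python
import statistics

from trinity.algorithm.add_strategy.add_strategy import GroupAdvantageStrategy, group_by


class MeanBaselineStrategy(GroupAdvantageStrategy):
    """Hypothetical subclass: advantage = reward minus the group mean."""

    def group_experiences(self, exps):
        # Group rollouts that belong to the same task.
        return group_by(exps, id_type="task")

    def calculate_group_advantage(self, group_id, exps):
        # `exp.reward` and `exp.advantages` are assumed Experience fields.
        mean_reward = statistics.fmean(exp.reward for exp in exps)
        for exp in exps:
            exp.advantages = exp.reward - mean_reward
        return exps, {"group_mean_reward": mean_reward}

    @classmethod
    def default_args(cls) -> dict:
        return {}
```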

class trinity.algorithm.add_strategy.add_strategy.GRPOAddStrategy(writer: BufferWriter, epsilon: float = 1e-06, **kwargs)[source]

Bases: GroupAdvantageStrategy

An example AddStrategy that calculates GRPO advantages.

__init__(writer: BufferWriter, epsilon: float = 1e-06, **kwargs) None[source]
group_experiences(exps)[source]

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) Tuple[List[Experience], Dict][source]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict
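
For reference, the group-relative advantage commonly associated with GRPO is shown below; that this implementation uses exactly this normalization, with epsilon stabilizing the denominator, is an inference from the epsilon parameter rather than a confirmed detail.

```latex
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G) + \epsilon}
```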

class trinity.algorithm.add_strategy.add_strategy.OPMDAddStrategy(writer: BufferWriter, opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs)[source]

Bases: GroupAdvantageStrategy

An example AddStrategy that calculates OPMD advantages.

__init__(writer: BufferWriter, opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs) None[source]
group_experiences(exps)[source]

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) Tuple[List[Experience], Dict][source]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict
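
A hedged usage sketch; the buffer_writer and experiences objects are assumed to be constructed elsewhere, so this is illustrative rather than a confirmed end-to-end example:

```python
import asyncio

from trinity.algorithm.add_strategy.add_strategy import OPMDAddStrategy


async def demo(buffer_writer, experiences):
    # Inspect the defaults, then construct the strategy with explicit values.
    print(OPMDAddStrategy.default_args())
    strategy = OPMDAddStrategy(buffer_writer, opmd_baseline="mean", tau=1.0)
    added, metrics = await strategy.add(experiences, step=0)
    print(f"added {added} experiences", metrics)

# asyncio.run(demo(buffer_writer, experiences))  # caller supplies both arguments
```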

class trinity.algorithm.add_strategy.add_strategy.RewardVarianceAddStrategy(writer: BufferWriter, variance_threshold: float = 0.0, **kwargs)[source]

Bases: AddStrategy

An example AddStrategy that filters experiences based on a reward variance threshold.

__init__(writer: BufferWriter, variance_threshold: float = 0.0, **kwargs) None[source]
async add(experiences: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • experiences (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict
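
The filtering idea can be sketched as follows; this is a hedged illustration of the concept rather than the actual implementation, and it assumes a reward field on Experience and grouping by task via the group_by helper documented below:

```python
import statistics

from trinity.algorithm.add_strategy.add_strategy import group_by


def keep_high_variance_groups(experiences, variance_threshold: float = 0.0):
    """Keep only task groups whose reward variance exceeds the threshold."""
    kept = []
    for _, group in group_by(experiences, id_type="task").items():
        rewards = [exp.reward for exp in group]  # assumed Experience field
        if len(rewards) > 1 and statistics.pvariance(rewards) > variance_threshold:
            kept.extend(group)
    return kept
```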

trinity.algorithm.add_strategy.add_strategy.group_by(experiences: List[Experience], id_type: Literal['task', 'run', 'step']) Dict[str, List[Experience]][source]

Group experiences by the specified ID type ('task', 'run', or 'step').
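
For example, assuming each Experience carries the corresponding task identifier:

```python
from trinity.algorithm.add_strategy.add_strategy import group_by

by_task = group_by(experiences, id_type="task")  # `experiences` collected upstream
for task_id, task_exps in by_task.items():
    print(task_id, len(task_exps))
```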

trinity.algorithm.add_strategy.correct_bias_add_strategy module

class trinity.algorithm.add_strategy.correct_bias_add_strategy.CorrectBiasAddStrategy(writer: BufferWriter, epsilon: float = 1e-06, rank_penalty: float = 0.25, **kwargs)[source]

Bases: GRPOAddStrategy

An AddStrategy with group advantage calculation that corrects for rank bias (https://arxiv.org/pdf/2506.02355).

__init__(writer: BufferWriter, epsilon: float = 1e-06, rank_penalty: float = 0.25, **kwargs) None[source]
calculate_group_advantage(group_id: str, exps: List[Experience]) Tuple[List[Experience], Dict][source]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict

trinity.algorithm.add_strategy.duplicate_add_strategy module

class trinity.algorithm.add_strategy.duplicate_add_strategy.DuplicateInformativeAddStrategy(writer: BufferWriter, variance_threshold: float = 0.0, **kwargs)[source]

Bases: AddStrategy

An AddStrategy that filters experiences based on reward variance and duplicates them to reach the target size. Ref: POLARIS (https://hkunlp.github.io/blog/2025/Polaris)

__init__(writer: BufferWriter, variance_threshold: float = 0.0, **kwargs) None[source]
async add(experiences: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • experiences (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict
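
The duplication step can be sketched as follows; it assumes filtering (as in the variance-based sketch above) has already produced informative_experiences, and is an illustration rather than the actual implementation:

```python
import itertools


def duplicate_to_target(informative_experiences, target_size: int):
    """Cycle the filtered experiences until the original batch size is restored."""
    if not informative_experiences:
        return []
    return list(itertools.islice(itertools.cycle(informative_experiences), target_size))
```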

trinity.algorithm.add_strategy.step_wise_add_strategy module

class trinity.algorithm.add_strategy.step_wise_add_strategy.StepWiseGRPOStrategy(writer: BufferWriter, epsilon: float = 1e-06, enable_step_norm: bool = False, **kwargs)[source]

Bases: AddStrategy

An example AddStrategy that broadcasts advantages from the last step to previous steps. Inspired by rLLM (https://github.com/rllm-org/rllm).

__init__(writer: BufferWriter, epsilon: float = 1e-06, enable_step_norm: bool = False, **kwargs) None[source]
calculate_group_advantage(exps: Dict[str, Experience]) Tuple[Dict[str, float], Dict[str, float]][source]

Calculate group advantage for a given group of experiences.

Parameters:

exps (Dict[str, Experience]) – One experience per run, keyed by run ID.

Returns:

A tuple containing the scores for each run and a dictionary of metrics for logging.

Return type:

Tuple[Dict[str, float], Dict[str, float]]

broadcast_advantages(run_exps: Dict[str, List[Experience]], scores: Dict[str, float]) Dict[str, List[Experience]][source]

Broadcast the calculated advantages to all previous steps in each run.

Parameters:
  • run_exps (Dict[str, List[Experience]]) – Experiences grouped by run ID.

  • scores (Dict[str, float]) – Calculated scores for each run.

Returns:

Updated experiences with advantages broadcasted.

Return type:

Dict[str, List[Experience]]

async add(exps: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • exps (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]

classmethod default_args() Dict[source]

Return the default configuration for this strategy.
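
A hedged sketch of the broadcast step performed by broadcast_advantages: each run's score is copied to every step of that run, and, when step normalization is enabled, divided by the run length. The advantages field on Experience and the exact meaning of enable_step_norm are assumptions for illustration.

```python
from typing import Dict, List


def broadcast_scores(
    run_exps: Dict[str, List["Experience"]],
    scores: Dict[str, float],
    enable_step_norm: bool = False,
) -> Dict[str, List["Experience"]]:
    """Copy each run-level score onto every step-level experience of that run."""
    for run_id, exps in run_exps.items():
        advantage = scores[run_id]
        if enable_step_norm:
            advantage /= max(len(exps), 1)  # assumed normalization by run length
        for exp in exps:
            exp.advantages = advantage  # assumed Experience field
    return run_exps
```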

Module contents

class trinity.algorithm.add_strategy.AddStrategy(writer: BufferWriter, **kwargs)[source]

Bases: ABC

__init__(writer: BufferWriter, **kwargs) None[source]
abstract async add(experiences: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • experiences (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]

abstract classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict

class trinity.algorithm.add_strategy.GRPOAddStrategy(writer: BufferWriter, epsilon: float = 1e-06, **kwargs)[source]

Bases: GroupAdvantageStrategy

An example AddStrategy that calculates GRPO advantages.

__init__(writer: BufferWriter, epsilon: float = 1e-06, **kwargs) None[source]
group_experiences(exps)[source]

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) Tuple[List[Experience], Dict][source]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict

class trinity.algorithm.add_strategy.OPMDAddStrategy(writer: BufferWriter, opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs)[source]

Bases: GroupAdvantageStrategy

An example AddStrategy that calculates OPMD advantages.

__init__(writer: BufferWriter, opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs) None[source]
group_experiences(exps)[source]

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) Tuple[List[Experience], Dict][source]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict

class trinity.algorithm.add_strategy.StepWiseGRPOStrategy(writer: BufferWriter, epsilon: float = 1e-06, enable_step_norm: bool = False, **kwargs)[source]

Bases: AddStrategy

An example AddStrategy that broadcasts advantages from the last step to previous steps. Inspired by rLLM (https://github.com/rllm-org/rllm).

__init__(writer: BufferWriter, epsilon: float = 1e-06, enable_step_norm: bool = False, **kwargs) None[source]
calculate_group_advantage(exps: Dict[str, Experience]) Tuple[Dict[str, float], Dict[str, float]][source]

Calculate group advantage for a given group of experiences.

Parameters:

exps (Dict[str, Experience]) – One experience per run, keyed by run ID.

Returns:

A tuple containing the scores for each run and a dictionary of metrics for logging.

Return type:

Tuple[Dict[str, float], Dict[str, float]]

broadcast_advantages(run_exps: Dict[str, List[Experience]], scores: Dict[str, float]) Dict[str, List[Experience]][source]

Broadcast the calculated advantages to all previous steps in each run.

Parameters:
  • run_exps (Dict[str, List[Experience]]) – Experiences grouped by run ID.

  • scores (Dict[str, float]) – Calculated scores for each run.

Returns:

Updated experiences with advantages broadcasted.

Return type:

Dict[str, List[Experience]]

async add(exps: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • exps (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]

classmethod default_args() Dict[source]

Return the default configuration for this strategy.

class trinity.algorithm.add_strategy.RewardVarianceAddStrategy(writer: BufferWriter, variance_threshold: float = 0.0, **kwargs)[source]

Bases: AddStrategy

An example AddStrategy that filters experiences based on a reward variance threshold.

__init__(writer: BufferWriter, variance_threshold: float = 0.0, **kwargs) None[source]
async add(experiences: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • experiences (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict

class trinity.algorithm.add_strategy.CorrectBiasAddStrategy(writer: BufferWriter, epsilon: float = 1e-06, rank_penalty: float = 0.25, **kwargs)[source]

Bases: GRPOAddStrategy

An AddStrategy with group advantage calculation that corrects for rank bias (https://arxiv.org/pdf/2506.02355).

__init__(writer: BufferWriter, epsilon: float = 1e-06, rank_penalty: float = 0.25, **kwargs) None[source]
calculate_group_advantage(group_id: str, exps: List[Experience]) Tuple[List[Experience], Dict][source]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict

class trinity.algorithm.add_strategy.DuplicateInformativeAddStrategy(writer: BufferWriter, variance_threshold: float = 0.0, **kwargs)[source]

Bases: AddStrategy

An AddStrategy that filters experiences based on reward variance and duplicates them to reach the target size. Ref: POLARIS (https://hkunlp.github.io/blog/2025/Polaris)

__init__(writer: BufferWriter, variance_threshold: float = 0.0, **kwargs) None[source]
async add(experiences: List[Experience], step: int) Tuple[int, Dict][source]

Add experiences to the buffer.

Parameters:
  • experiences (List[Experience]) – The experiences to be added.

  • step (int) – The current step number.

Returns:

A tuple containing the number of experiences added to the buffer and a dictionary of metrics for logging.

Return type:

Tuple[int, Dict]

classmethod default_args() dict[source]

Get the default arguments of the add strategy.

Returns:

The default arguments.

Return type:

dict