trinity.common.rewards.naive_dapo_score module
This module contains the naive DAPO reward function for math tasks. Adapted from LLM360/Reasoning360.
- trinity.common.rewards.naive_dapo_score.normalize_final_answer(final_answer: str) → str[source]
Normalize a final answer to a quantitative reasoning question.
- Parameters:
final_answer -- The answer string to normalize
- Returns:
Normalized answer string
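A minimal usage sketch, assuming the module is importable. The exact normalization rules (e.g. which LaTeX wrappers and formatting are stripped) are implementation details not documented here, so the output shown in the comment is an illustrative assumption:

```python
from trinity.common.rewards.naive_dapo_score import normalize_final_answer

# Canonicalize a model's final answer before comparing it to the ground truth.
# The exact transformations are implementation-defined; the expected value in
# the comment below is an assumption, not a guaranteed output.
raw_answer = r"$\frac{1}{2}$"
normalized = normalize_final_answer(raw_answer)
print(normalized)  # e.g. "\frac{1}{2}" with the surrounding $...$ stripped
```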
- trinity.common.rewards.naive_dapo_score.are_equal_under_sympy(ground_truth_normalized: str, given_normalized: str)[source]
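No docstring is attached to this function. Judging from grade_answer below, it tests whether sympy can simplify the difference of the two normalized expressions to 0. A minimal independent sketch of that technique follows; it is an assumption about the approach, not the module's actual implementation:

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def sympy_equal_sketch(ground_truth_normalized: str, given_normalized: str) -> bool:
    """Illustrative sketch: equal if sympy simplifies their difference to 0."""
    try:
        diff = parse_expr(ground_truth_normalized) - parse_expr(given_normalized)
        return sympy.simplify(diff) == 0
    except Exception:
        # Unparseable input is treated as "not provably equal".
        return False

print(sympy_equal_sketch("x**2 + 2*x + 1", "(x + 1)**2"))  # True
```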
- trinity.common.rewards.naive_dapo_score.split_tuple(expr: str)[source]
Split the elements of a tuple/interval, while correctly handling commas used as thousands separators in large numbers.
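A usage sketch of the comma handling. The return type is not annotated, so treating the result as a list of element strings, and whether thousands separators are preserved in those elements, are both assumptions:

```python
from trinity.common.rewards.naive_dapo_score import split_tuple

# A naive expr.split(",") would break "1,234" into "1" and "234".
# split_tuple is documented to handle such thousands separators; the
# exact return value shown in the comment is an assumption.
elements = split_tuple("(1,234, 56)")
print(elements)  # e.g. ["1,234", "56"], not ["1", "234", "56"]
```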
- trinity.common.rewards.naive_dapo_score.grade_answer(given_answer: str, ground_truth: str) → tuple[bool, str][source]
The answer is considered correct if (a) it normalizes to the same string as the ground truth answer, or (b) sympy can simplify the difference between the two expressions to 0.
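A usage sketch. Per the signature, the function returns a (correct, reason) tuple; whether a given pair actually grades as correct depends on the module's normalization and sympy rules, so the outcomes in the comments are assumptions:

```python
from trinity.common.rewards.naive_dapo_score import grade_answer

# Criterion (a): identical after normalization.
correct, reason = grade_answer(given_answer="1/2", ground_truth="1/2")
print(correct, reason)  # expected: True, plus a short reason string

# Criterion (b): sympy can simplify the difference "0.5 - 1/2" to 0,
# even though the strings do not normalize identically (assumed behavior).
correct, reason = grade_answer(given_answer="0.5", ground_truth="1/2")
print(correct, reason)
```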
- trinity.common.rewards.naive_dapo_score.compute_score(solution_str: str, ground_truth: str) → float[source]
Compute the reward score for a solution. This draws heavily from the LLM-as-judge and PRIME reward functions.
- Parameters:
solution_str -- The solution string
ground_truth -- The ground truth answer
extra_info -- dict with additional info for the score computation
- Returns:
Reward score (1.0 for correct, 0.0 for incorrect)
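A usage sketch, assuming the solution text carries its final answer in a \boxed{...} marker (a common convention in DAPO-style math grading, but an assumption about this module's extraction format):

```python
from trinity.common.rewards.naive_dapo_score import compute_score

# The function grades the extracted final answer against ground_truth and
# returns 1.0 for correct, 0.0 for incorrect (per the docstring above).
solution = r"Adding the two roots gives $x_1 + x_2 = 42$, so the answer is \boxed{42}."
score = compute_score(solution_str=solution, ground_truth="42")
print(score)  # 1.0 if graded correct, otherwise 0.0
```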