maxent_grpo.rewards
Copyright 2025 Liv d’Aliberti
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Reward utilities shared across MaxEnt-GRPO components.
Two layers of rewards are exposed:
maxent_grpo.rewards.basic: Lightweight registry used by the baseline GRPO trainer.
maxent_grpo.rewards.maxent: Re-exports the richer reward/statistics helpers used inside the MaxEnt runner.
- class maxent_grpo.rewards.RewardFunction(*args, **kwargs)
Bases: Protocol
Protocol describing batch reward functions.
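As a rough illustration of what a batch reward protocol can look like, the sketch below defines a hypothetical `BatchRewardFunction` protocol and a toy reward that satisfies it. The call signature (a list of completions plus keyword arguments, returning one float per completion) is an assumption inferred from the other signatures on this page, not the confirmed definition of `RewardFunction`.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class BatchRewardFunction(Protocol):
    """Hypothetical stand-in for RewardFunction: one score per completion."""

    def __call__(self, completions: List[Any], **kwargs: Any) -> List[float]: ...


def length_under_20(completions: List[Any], **kwargs: Any) -> List[float]:
    """Toy reward (not part of the package): 1.0 for short completions."""
    return [1.0 if len(str(c)) < 20 else 0.0 for c in completions]


scores = length_under_20(["short", "x" * 50])
conforms = isinstance(length_under_20, BatchRewardFunction)
```

Any plain function with a compatible signature satisfies a protocol of this shape, so reward authors do not need to subclass anything.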
- maxent_grpo.rewards.accuracy_reward(completions, answer, **_kwargs)
Open-R1-style accuracy reward (1.0 exact math match else 0.0).
This keeps compatibility with Open-R1 reward names while using the same canonicalization/extraction logic as
pure_accuracy_reward_math, including boxed answers and list-valued gold labels.
- maxent_grpo.rewards.binary_code_reward(completions, answer, **kwargs)
Binary wrapper around python_unit_test_reward for compatibility.
- maxent_grpo.rewards.boxed_accuracy_reward_math(completions, answer, **_kwargs)
Dr.GRPO-style binary reward based on boxed final answers.
A completion is rewarded when its final \boxed{...} (or <answer> for compatibility) matches one of the canonical gold answers. Plain unboxed answers do not receive credit.
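The boxed-answer rule can be sketched as follows. This is a minimal stand-in, not the package's implementation: `last_boxed` and `boxed_reward` are hypothetical names, completions are assumed to be plain strings, the `<answer>` compatibility path is omitted, and canonicalization is reduced to whitespace stripping.

```python
from typing import List, Optional, Sequence, Union


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the final \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never closed


def boxed_reward(
    completions: Sequence[str],
    answer: Sequence[Union[str, List[str]]],
) -> List[float]:
    """1.0 when the final boxed answer equals any gold answer, else 0.0."""
    rewards = []
    for text, gold in zip(completions, answer):
        golds = gold if isinstance(gold, list) else [gold]
        pred = last_boxed(text)
        # Assumption: real canonicalization is richer than strip().
        hit = pred is not None and pred.strip() in {g.strip() for g in golds}
        rewards.append(1.0 if hit else 0.0)
    return rewards


demo = boxed_reward(
    ["so \\boxed{\\frac{1}{2}}", "the answer is 4"],
    ["\\frac{1}{2}", ["4", "4.0"]],
)
```

The second completion states the right value but does not box it, so it earns no credit under this rule.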
- maxent_grpo.rewards.get_missing_boxed_answer_penalty_reward(penalty=-0.05)
Return a fixed penalty when no boxed answer is present in the completion.
- Parameters:
penalty (float)
- Return type:
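A factory of this kind is typically a closure over the penalty value. The sketch below uses a hypothetical name and assumes plain-string completions, a simple substring check for `\boxed{`, and a neutral 0.0 when a boxed answer is present; none of these details are confirmed by the documentation above.

```python
from typing import Any, Callable, List, Sequence


def make_missing_boxed_penalty(
    penalty: float = -0.05,
) -> Callable[..., List[float]]:
    """Hypothetical factory: penalize completions with no boxed answer."""

    def reward(completions: Sequence[str], **kwargs: Any) -> List[float]:
        # Assumption: 0.0 (no bonus) when a boxed answer exists.
        return [0.0 if "\\boxed{" in text else penalty for text in completions]

    return reward


penalty_fn = make_missing_boxed_penalty()
demo = penalty_fn(["\\boxed{3}", "the answer is 3"])
```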
- maxent_grpo.rewards.seed_paper_answer_tag_accuracy_reward_math(completions, answer, fast=False, **_kwargs)
Official OAT/SEED answer-tag reward with the same timeout wrapper as training.
- maxent_grpo.rewards.seed_paper_boxed_accuracy_reward_math(completions, answer, fast=False, **_kwargs)
Official OAT/SEED boxed reward with the same timeout wrapper as training.
- maxent_grpo.rewards.format_reward(completions, **_kwargs)
Open-R1-compatible strict think/answer formatting reward.
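A strict think/answer format check is usually a single anchored regular expression. The sketch below is illustrative only: the function name is hypothetical, completions are assumed to be plain strings, and the exact pattern (in particular how newlines around the tags are treated) is an assumption that may be looser than the real one.

```python
import re
from typing import Any, List, Sequence

# Assumed pattern: one think block followed by one answer block, nothing else.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$",
    re.DOTALL,
)


def strict_format_reward(completions: Sequence[str], **kwargs: Any) -> List[float]:
    """1.0 when the whole completion matches the think/answer layout."""
    return [1.0 if FORMAT_RE.match(text.strip()) else 0.0 for text in completions]


demo = strict_format_reward(
    [
        "<think>2 + 2 = 4</think>\n<answer>4</answer>",
        "<answer>4</answer>",
        "4",
    ]
)
```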
- maxent_grpo.rewards.get_reward_funcs(script_args, _ref_model=None, _tokenizer=None)
Resolve reward function callables from names.
- Parameters:
script_args (RewardConfig)
_ref_model (Optional['PreTrainedModel'])
_tokenizer (Optional['PreTrainedTokenizerBase'])
- Return type:
List['RewardFunction']
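Name-based resolution generally amounts to a dictionary lookup with a clear error for unknown names. The sketch below shows that pattern in isolation. `ToyRewardConfig`, its `reward_funcs` field, the registry contents, and `resolve_reward_funcs` are all hypothetical; the real `RewardConfig` fields are not documented on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToyRewardConfig:
    """Hypothetical config holding reward names in evaluation order."""

    reward_funcs: List[str] = field(default_factory=lambda: ["format"])


def toy_format(completions: List[str], **kwargs: Any) -> List[float]:
    return [0.0 for _ in completions]


def toy_accuracy(completions: List[str], **kwargs: Any) -> List[float]:
    return [1.0 for _ in completions]


# Assumed registry layout: reward name -> callable.
REGISTRY: Dict[str, Callable[..., List[float]]] = {
    "format": toy_format,
    "accuracy": toy_accuracy,
}


def resolve_reward_funcs(cfg: ToyRewardConfig) -> List[Callable[..., List[float]]]:
    """Map configured names to callables, failing loudly on unknown names."""
    missing = [name for name in cfg.reward_funcs if name not in REGISTRY]
    if missing:
        raise KeyError(f"Unknown reward function(s): {missing}")
    return [REGISTRY[name] for name in cfg.reward_funcs]


resolved = resolve_reward_funcs(ToyRewardConfig(["accuracy", "format"]))
```

Validating every name before building the list means a typo in the config is reported before any training step runs.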
- maxent_grpo.rewards.get_code_format_reward(language='python')
Return Open-R1-compatible code-format reward closure.
- Parameters:
language (str)
- Return type:
- maxent_grpo.rewards.get_cosine_scaled_reward(min_value_wrong=-1.0, max_value_wrong=-0.5, min_value_correct=0.5, max_value_correct=1.0, max_len=1000)
Return a length-scaled reward closure compatible with Open-R1 configs.
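To make the role of the five parameters concrete, the sketch below implements one common cosine length schedule with the same defaults. Whether this package uses exactly this formula is an assumption. The factory name is hypothetical, and the closure here takes a precomputed correctness flag and length rather than completions and gold answers.

```python
import math
from typing import Callable


def make_cosine_scaled_reward(
    min_value_wrong: float = -1.0,
    max_value_wrong: float = -0.5,
    min_value_correct: float = 0.5,
    max_value_correct: float = 1.0,
    max_len: int = 1000,
) -> Callable[[bool, int], float]:
    """Hypothetical factory for a cosine length-scaled reward."""

    def reward(is_correct: bool, length: int) -> float:
        # Assumption: lengths beyond max_len are clamped.
        progress = min(length, max_len) / max_len
        cosine = math.cos(progress * math.pi)  # 1 at length 0, -1 at max_len
        if is_correct:
            low, high = min_value_correct, max_value_correct
        else:
            # Bounds are swapped so short wrong answers are penalized most.
            low, high = max_value_wrong, min_value_wrong
        return low + 0.5 * (high - low) * (1.0 + cosine)

    return reward


cosine_reward = make_cosine_scaled_reward()
```

Under this schedule a correct answer decays from 1.0 toward 0.5 as it approaches `max_len`, while a wrong answer moves from -1.0 toward -0.5.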
- maxent_grpo.rewards.get_repetition_penalty_reward(ngram_size, max_penalty)
Return an Open-R1-style repetition penalty reward closure.
- Parameters:
- Return type:
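An n-gram repetition penalty can be sketched as the fraction of repeated n-grams scaled by `max_penalty`. The version below is an approximation under stated assumptions: the factory name is hypothetical, tokens are lowercased whitespace-separated words, completions are plain strings, and `max_penalty` is expected to be zero or negative.

```python
from typing import Any, Callable, List, Sequence


def make_repetition_penalty(
    ngram_size: int, max_penalty: float
) -> Callable[..., List[float]]:
    """Hypothetical factory: more repeated n-grams means a larger penalty."""
    if max_penalty > 0:
        raise ValueError("max_penalty should be zero or negative")

    def reward(completions: Sequence[str], **kwargs: Any) -> List[float]:
        rewards = []
        for text in completions:
            words = text.lower().split()
            if len(words) < ngram_size:
                rewards.append(0.0)  # too short to contain any n-gram
                continue
            ngrams = [
                tuple(words[i : i + ngram_size])
                for i in range(len(words) - ngram_size + 1)
            ]
            repeated_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
            rewards.append(repeated_fraction * max_penalty)
        return rewards

    return reward


demo = make_repetition_penalty(2, -1.0)(["a b c d", "a a a a", "a"])
```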
- maxent_grpo.rewards.len_reward(completions, answer, **_kwargs)
Length-based reward that discourages verbose incorrect outputs.
- maxent_grpo.rewards.pure_accuracy_math_correctness(completions, answer, *, allow_last_line_fallback=False)
Return binary correctness aligned with pure_accuracy_reward_math.
A completion is considered correct when either:
- <answer>...</answer> canonicalizes to the gold answer, or
- (optional) no extracted answer matched but the final non-empty line matches.
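The two correctness conditions can be sketched for a single completion as follows. The helper names are hypothetical, and `canon` is a deliberately crude stand-in (whitespace removal) for the package's canonicalization, which is not described here.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def canon(text: str) -> str:
    """Assumed canonicalization: drop all whitespace."""
    return "".join(text.split())


def is_correct(
    completion: str, gold: str, *, allow_last_line_fallback: bool = False
) -> bool:
    """Check the <answer> tag first, then optionally the last non-empty line."""
    extracted = ANSWER_RE.findall(completion)
    if extracted and canon(extracted[-1]) == canon(gold):
        return True
    if allow_last_line_fallback:
        lines = [ln for ln in completion.splitlines() if ln.strip()]
        return bool(lines) and canon(lines[-1]) == canon(gold)
    return False


tagged = is_correct("<think>...</think><answer> 42 </answer>", "42")
untagged_strict = is_correct("work shown\n42\n", "42")
untagged_fallback = is_correct("work shown\n42\n", "42", allow_last_line_fallback=True)
```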
- maxent_grpo.rewards.pure_accuracy_reward_math(completions, answer, **_kwargs)
Reward exact math matches with a small formatting bonus when wrong.
Correctness is detected from <answer>...</answer> and falls back to the last non-empty line (with known format tags stripped) when needed.
Reward scale for correct outputs:
- full <think>...</think><answer>...</answer> (4 distinct tags): 1.0
- otherwise: 0.5 * _tag_multiplier(tag_total, tag_unique)
Reward scale for incorrect outputs:
- full <think>...</think><answer>...</answer> (4 distinct tags): 0.05
- otherwise: 0.0
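The four-way reward scale above reduces to a small decision table. In the sketch below the function names are hypothetical, and because the behavior of `_tag_multiplier` is not documented on this page, its result is passed in as a plain number rather than recomputed.

```python
FULL_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def has_full_format(completion: str) -> bool:
    """Assumption: 'full' means all four distinct tags appear somewhere."""
    return all(tag in completion for tag in FULL_TAGS)


def scaled_reward(correct: bool, full_format: bool, tag_multiplier: float = 0.0) -> float:
    """Apply the documented scale; tag_multiplier is supplied by the caller."""
    if correct:
        return 1.0 if full_format else 0.5 * tag_multiplier
    return 0.05 if full_format else 0.0


table = {
    "correct_full": scaled_reward(True, True),
    "correct_partial": scaled_reward(True, False, tag_multiplier=0.5),
    "wrong_full": scaled_reward(False, True),
    "wrong_partial": scaled_reward(False, False),
}
```

The 0.05 entry is the "small formatting bonus when wrong": a well-formed but incorrect completion still scores slightly above a malformed one.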
- maxent_grpo.rewards.python_unit_test_reward(completions, answer, *, prompts=None, entry_point=None, **_kwargs)
Run local Python-unit-test rewards for MBPP/HumanEval/APPS payloads.
Supports:
- MBPP: answer is test_list (list of assert strings)
- HumanEval: answer is test code containing def check(...)
- APPS: answer is input_output payload with inputs/outputs
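For orientation, the three payload shapes might look like the literals below, followed by a toy scorer for the MBPP shape only. Everything here is illustrative: the example values and the `mbpp_pass_fraction` helper are hypothetical, the fractional score is an assumption, and the sketch runs code with bare `exec`, with no sandboxing or timeout.

```python
from typing import List

# Illustrative payloads matching the three documented shapes.
mbpp_answer = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
humaneval_answer = "def check(candidate):\n    assert candidate(1, 2) == 3\n"
apps_answer = {"inputs": ["1 2\n"], "outputs": ["3\n"]}


def mbpp_pass_fraction(code: str, test_list: List[str]) -> float:
    """Toy scorer: fraction of assert strings that pass against `code`.

    WARNING: exec on untrusted model output is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate function(s)
    except Exception:
        return 0.0
    passed = 0
    for test in test_list:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            pass  # a failing assert or runtime error counts as a miss
    return passed / len(test_list) if test_list else 0.0


good = mbpp_pass_fraction("def add(a, b):\n    return a + b", mbpp_answer)
buggy = mbpp_pass_fraction("def add(a, b):\n    return a - b", mbpp_answer)
```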
- maxent_grpo.rewards.reasoning_steps_reward(completions, **_kwargs)
Reward explicit step-by-step structure in natural language.
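One way to reward step-by-step structure is to count step markers and cap the score. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the marker pattern and the target of three steps are chosen for the example rather than taken from this package.

```python
import re
from typing import Any, List, Sequence

# Assumed markers: "Step N:", numbered lists, bullets, and ordinal words.
STEP_RE = re.compile(
    r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)",
    re.MULTILINE,
)


def step_structure_reward(
    completions: Sequence[str], target_steps: int = 3, **kwargs: Any
) -> List[float]:
    """Score rises linearly with marker count and saturates at 1.0."""
    return [
        min(1.0, len(STEP_RE.findall(text)) / target_steps)
        for text in completions
    ]


demo = step_structure_reward(
    ["First, expand.\nSecond, simplify.\nFinally, solve.", "Just the answer."]
)
```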
- maxent_grpo.rewards.tag_count_reward(completions, **_kwargs)
Open-R1-compatible partial credit based on expected tag counts.
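Partial credit by tag count can be sketched as a quarter point for each expected tag that appears exactly once. The function name is hypothetical, and the sketch ignores any newline placement that the real check may require around the tags.

```python
from typing import Any, List, Sequence

EXPECTED_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def tag_count_partial_credit(completions: Sequence[str], **kwargs: Any) -> List[float]:
    """0.25 per expected tag that occurs exactly once (assumed weighting)."""
    return [
        sum(0.25 for tag in EXPECTED_TAGS if text.count(tag) == 1)
        for text in completions
    ]


demo = tag_count_partial_credit(
    [
        "<think>reasoning</think><answer>4</answer>",
        "<answer>4</answer>",
        "4",
    ]
)
```

Unlike a strict format reward, this gives a gradient toward the full layout: a completion with only the answer tags still earns half credit.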
- maxent_grpo.rewards.uses_pure_accuracy_math_reward(reward_funcs)
Return True when any configured reward resolves to the pure math reward.
Modules