maxent_grpo.rewards.basic
Copyright 2025 Liv d’Aliberti
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Baseline-friendly reward registry used by GRPO training.
Functions
Return a named wrapper that pre-binds keyword args for a reward fn.
Mirror OAT's MATHOracle.get_reward timeout behavior for a single sample.
Canonicalize simple math answers for exact-match comparison.
Close the lazy OAT-parity reward pool for the current process.
Return (total tag count, unique tag count) for think/answer tags.
Extract APPS-style IO pairs from an
Return the last
Extract assistant text from common completion shapes.
Extract HumanEval-style test program body containing
Extract MBPP-style
Return user-facing prompt text from string/chat prompt shapes.
Extract a Python snippet from answer/code fences or return raw text.
Return canonical gold-answer candidates from scalar/list payloads.
Return parsed boxed/fbox payloads and their exclusive end offsets.
Load an official SEED/OAT reward function from the repo-local checkout.
Normalize strings/lists into comparable stripped lines.
Return whether predicted and expected outputs match after normalization.
Best-effort parse for list/dict payloads serialized as strings.
Return function name for HumanEval checks from explicit value or prompt.
Match the official symbolic stack before importing the grader module.
Return (answer_tag_match, fallback_last_line_match).
Execute a Python snippet in isolated mode and return stdout.
Execute APPS stdin/stdout tests and return pass fraction.
Execute HumanEval checks and return binary pass/fail score.
Execute MBPP assertions and return pass fraction in
Score one completion against MBPP/HumanEval/APPS payloads.
Load the official answer-tag reward function from the repo-local checkout.
Load the official boxed reward function from the repo-local checkout.
Resolve the repo-local official grader checkout used for OAT parity.
Return a process-local worker pool mirroring OAT's Pool(2) reward path.
Worker entrypoint for OAT-style reward calls in a separate process.
Return the paper eval site-packages dir when a repo-local env is available.
Return the reward multiplier for the observed tag counts.
Open-R1-style accuracy reward (1.0 exact math match else 0.0).
Binary wrapper around python_unit_test_reward for compatibility.
Dr.GRPO-style binary reward based on boxed final answers.
Open-R1-compatible strict think/answer formatting reward.
Return Open-R1-compatible code-format reward closure.
Return a length-scaled reward closure compatible with Open-R1 configs.
Return a fixed penalty when no boxed answer is present in the completion.
Return an Open-R1-style repetition penalty reward closure.
Resolve reward function callables from names.
Length-based reward that discourages verbose incorrect outputs.
Return binary correctness aligned with pure_accuracy_reward_math.
Reward exact math matches with a small formatting bonus when wrong.
Run local Python-unit-test rewards for MBPP/HumanEval/APPS payloads.
Reward explicit step-by-step structure in natural language.
Official OAT/SEED answer-tag reward with the same timeout wrapper as training.
Official OAT/SEED boxed reward with the same timeout wrapper as training.
Open-R1-compatible partial credit based on expected tag counts.
Trim a completion immediately after the first valid boxed answer.
Return
Classes
Minimal protocol describing the reward configuration interface.
Protocol describing batch reward functions.
- class maxent_grpo.rewards.basic.RewardFunction(*args, **kwargs)
Bases: Protocol
Protocol describing batch reward functions.
- class maxent_grpo.rewards.basic.RewardConfig(*args, **kwargs)
Bases: Protocol
Minimal protocol describing the reward configuration interface.
- maxent_grpo.rewards.basic.accuracy_reward(completions, answer, **_kwargs)
Open-R1-style accuracy reward (1.0 exact math match else 0.0).
This keeps compatibility with Open-R1 reward names while using the same canonicalization/extraction logic as pure_accuracy_reward_math, including boxed answers and list-valued gold labels.
- maxent_grpo.rewards.basic.format_reward(completions, **_kwargs)
Open-R1-compatible strict think/answer formatting reward.
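The exact pattern used by format_reward is not reproduced on this page. As a rough illustration only, the sketch below assumes the upstream Open-R1 layout (a think block followed by an answer block, each tag on its own line); the function name `strict_format_reward_sketch` and the regular expression are hypothetical stand-ins, not this module's code.

```python
import re

# Assumed pattern, modelled on the Open-R1 convention; the module's own regex may differ.
_STRICT_FORMAT = re.compile(
    r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$", re.DOTALL
)


def strict_format_reward_sketch(completions):
    """Return 1.0 for each completion that matches the strict layout, else 0.0."""
    rewards = []
    for completion in completions:
        # Chat-style completions are assumed to look like [{"role": ..., "content": ...}].
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(1.0 if _STRICT_FORMAT.match(text) else 0.0)
    return rewards
```

Under this assumed pattern, a completion that omits the think block or puts the tags on one line scores 0.0.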
- maxent_grpo.rewards.basic.get_reward_funcs(script_args, _ref_model=None, _tokenizer=None)
Resolve reward function callables from names.
- Parameters:
script_args (RewardConfig)
_ref_model (Optional['PreTrainedModel'])
_tokenizer (Optional['PreTrainedTokenizerBase'])
- Return type:
List['RewardFunction']
- maxent_grpo.rewards.basic.get_missing_boxed_answer_penalty_reward(penalty=-0.05)
Return a fixed penalty when no boxed answer is present in the completion.
- Parameters:
penalty (float)
- Return type:
- maxent_grpo.rewards.basic.boxed_accuracy_reward_math(completions, answer, **_kwargs)
Dr.GRPO-style binary reward based on boxed final answers.
A completion is rewarded when its final \boxed{...} (or <answer> for compatibility) matches one of the canonical gold answers. Plain unboxed answers do not receive credit.
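To make the "final boxed answer" rule concrete, here is a minimal sketch of brace-balanced extraction followed by an exact match against several gold candidates. The helper names `last_boxed_payload` and `boxed_match_sketch` are hypothetical, and the whitespace-only normalization is an assumption standing in for the module's canonicalizer; the `<answer>` compatibility path is omitted.

```python
def last_boxed_payload(text):
    """Return the contents of the final \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start < 0:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: return what lies between.
                return text[begin:i]
    return None  # unbalanced braces


def boxed_match_sketch(completion, gold_answers):
    """Binary score: 1.0 only when the final boxed payload equals a gold answer."""
    payload = last_boxed_payload(completion)
    if payload is None:
        return 0.0  # plain unboxed answers receive no credit

    def norm(s):
        # Assumed stand-in canonicalizer: drop all spaces.
        return s.strip().replace(" ", "")

    return 1.0 if any(norm(payload) == norm(g) for g in gold_answers) else 0.0
```

Counting brace depth, instead of using a simple regular expression, keeps nested payloads such as `\frac{1}{2}` intact.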
- maxent_grpo.rewards.basic.seed_paper_boxed_accuracy_reward_math(completions, answer, fast=False, **_kwargs)
Official OAT/SEED boxed reward with the same timeout wrapper as training.
- maxent_grpo.rewards.basic.seed_paper_answer_tag_accuracy_reward_math(completions, answer, fast=False, **_kwargs)
Official OAT/SEED answer-tag reward with the same timeout wrapper as training.
- maxent_grpo.rewards.basic.tag_count_reward(completions, **_kwargs)
Open-R1-compatible partial credit based on expected tag counts.
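How the partial credit is split is not spelled out here. The sketch below assumes the upstream Open-R1 scheme, in which each of four expected tag markers contributes 0.25 when it appears exactly once; `tag_count_sketch` and the marker strings are illustrative assumptions, not this module's code.

```python
# Assumed markers (Open-R1 convention): tags are expected on their own lines.
_EXPECTED_MARKERS = ("<think>\n", "\n</think>\n", "\n<answer>\n", "\n</answer>")


def tag_count_sketch(text):
    """Return 0.25 for every expected marker that occurs exactly once."""
    return 0.25 * sum(1 for marker in _EXPECTED_MARKERS if text.count(marker) == 1)
```

A completion that opens a think block but never closes it earns partial credit of 0.25 under this scheme, whereas a strict format reward would give it nothing.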
- maxent_grpo.rewards.basic.reasoning_steps_reward(completions, **_kwargs)
Reward explicit step-by-step structure in natural language.
- maxent_grpo.rewards.basic.get_cosine_scaled_reward(min_value_wrong=-1.0, max_value_wrong=-0.5, min_value_correct=0.5, max_value_correct=1.0, max_len=1000)
Return a length-scaled reward closure compatible with Open-R1 configs.
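The scaling formula is not given on this page. As a hedged illustration, the sketch below assumes the cosine schedule used by Open-R1: correct answers decay from max_value_correct to min_value_correct as length grows, and wrong answers rise from min_value_wrong to max_value_wrong. The function `cosine_scaled_value`, the clamp at max_len, and the unit of `length` are assumptions.

```python
import math


def cosine_scaled_value(
    is_correct,
    length,
    max_len=1000,
    min_value_wrong=-1.0,
    max_value_wrong=-0.5,
    min_value_correct=0.5,
    max_value_correct=1.0,
):
    """Map (correctness, length) to a reward along a half-cosine curve."""
    # Clamp so lengths beyond max_len saturate (an assumption of this sketch).
    progress = min(length, max_len) / max_len
    cosine = math.cos(progress * math.pi)  # 1.0 at length 0, -1.0 at max_len
    if is_correct:
        low, high = min_value_correct, max_value_correct
    else:
        # Swapped so that short wrong outputs get the harshest value.
        low, high = max_value_wrong, min_value_wrong
    return low + 0.5 * (high - low) * (1.0 + cosine)
```

With the default arguments, a correct completion of length 0 scores 1.0 and one at max_len scores 0.5, while a wrong completion moves from -1.0 to -0.5 over the same range.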
- maxent_grpo.rewards.basic.get_repetition_penalty_reward(ngram_size, max_penalty)
Return an Open-R1-style repetition penalty reward closure.
- Parameters:
- Return type:
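For orientation, a minimal sketch of an n-gram repetition penalty in the Open-R1 style follows. It assumes whitespace tokenization and a non-positive max_penalty, and scales the penalty by the fraction of repeated n-grams; `repetition_penalty_sketch` is a hypothetical name and the real closure may tokenize or validate differently.

```python
def repetition_penalty_sketch(text, ngram_size, max_penalty):
    """Return a value in [max_penalty, 0] that grows in magnitude with repetition."""
    if max_penalty > 0:
        raise ValueError("max_penalty is expected to be <= 0")
    if ngram_size < 1:
        raise ValueError("ngram_size must be >= 1")
    words = text.lower().split()
    if len(words) < ngram_size:
        return 0.0  # too short to form a single n-gram
    ngrams = [
        tuple(words[i : i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    ]
    # Fraction of n-grams that duplicate an earlier one.
    repeated_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
    return repeated_fraction * max_penalty
```

A text with no repeated n-grams scores zero, and a text that repeats a single n-gram throughout approaches max_penalty.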
- maxent_grpo.rewards.basic.len_reward(completions, answer, **_kwargs)
Length-based reward that discourages verbose incorrect outputs.
- maxent_grpo.rewards.basic.get_code_format_reward(language='python')
Return Open-R1-compatible code-format reward closure.
- Parameters:
language (str)
- Return type:
- maxent_grpo.rewards.basic.pure_accuracy_math_correctness(completions, answer, *, allow_last_line_fallback=False)
Return binary correctness aligned with pure_accuracy_reward_math.
A completion is considered correct when either:
- <answer>...</answer> canonicalizes to the gold answer, or
- (optional) no extracted answer matched but the final non-empty line matches.
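The two conditions above can be sketched as follows. This is a simplified illustration for a single completion and a single gold string: the canonicalizer `_canon`, the choice to use the last `<answer>` block, and the name `is_correct_sketch` are all assumptions, since the module's own helpers are not shown here.

```python
import re

_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
_TAG_RE = re.compile(r"</?(?:think|answer)>")


def _canon(text):
    """Hypothetical canonicalizer: trim, drop spaces and a trailing period."""
    return text.strip().replace(" ", "").rstrip(".")


def is_correct_sketch(completion, gold, allow_last_line_fallback=False):
    """Return True when the answer tag (or, optionally, the last line) matches gold."""
    answers = _ANSWER_RE.findall(completion)
    # Assumption: when several answer blocks exist, the last one counts.
    if answers and _canon(answers[-1]) == _canon(gold):
        return True
    if allow_last_line_fallback:
        # Strip known format tags, then compare the final non-empty line.
        stripped = (_TAG_RE.sub("", line).strip() for line in completion.splitlines())
        lines = [line for line in stripped if line]
        return bool(lines) and _canon(lines[-1]) == _canon(gold)
    return False
```

With the fallback disabled, an untagged completion is never counted as correct, which keeps the check strict by default.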
- maxent_grpo.rewards.basic.pure_accuracy_reward_math(completions, answer, **_kwargs)
Reward exact math matches with a small formatting bonus when wrong.
Correctness is detected from <answer>...</answer> and falls back to the last non-empty line (with known format tags stripped) when needed.
Reward scale for correct outputs:
- full <think>...</think><answer>...</answer> (4 distinct tags): 1.0
- otherwise: 0.5 * _tag_multiplier(tag_total, tag_unique)
Reward scale for incorrect outputs:
- full <think>...</think><answer>...</answer> (4 distinct tags): 0.05
- otherwise: 0.0
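The reward scale above reduces to a small decision table, sketched below. Because the definition of _tag_multiplier is not shown on this page, the sketch takes the multiplier as a caller-supplied function; `reward_scale_sketch` is a hypothetical name, and treating "4 distinct tags" as `tag_unique == 4` is an assumption.

```python
def reward_scale_sketch(is_correct, tag_total, tag_unique, tag_multiplier):
    """Combine correctness and tag counts into a scalar reward."""
    # Assumption: the full format is signalled by four distinct tags.
    full_format = tag_unique == 4
    if is_correct:
        if full_format:
            return 1.0
        # Partially formatted correct answers are capped at half credit.
        return 0.5 * tag_multiplier(tag_total, tag_unique)
    # Wrong answers earn only a small bonus for complete formatting.
    return 0.05 if full_format else 0.0
```

For example, with a placeholder multiplier such as `lambda total, unique: unique / 4`, a correct answer carrying two distinct tags would score 0.25.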
- maxent_grpo.rewards.basic.python_unit_test_reward(completions, answer, *, prompts=None, entry_point=None, **_kwargs)
Run local Python-unit-test rewards for MBPP/HumanEval/APPS payloads.
Supports:
- MBPP: answer is test_list (list of assert strings)
- HumanEval: answer is test code containing def check(...)
- APPS: answer is input_output payload with inputs/outputs
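As an illustration of the MBPP case only, the sketch below runs each assert string against a candidate solution in a separate isolated interpreter and reports the pass fraction. The name `mbpp_pass_fraction_sketch`, the per-assertion subprocess, and the timeout value are assumptions; this is not a sandbox and should not be pointed at untrusted code as-is.

```python
import subprocess
import sys


def mbpp_pass_fraction_sketch(code, test_list, timeout=10.0):
    """Return the fraction of assert strings that pass against code."""
    if not test_list:
        return 0.0
    passed = 0
    for assertion in test_list:
        program = code + "\n" + assertion + "\n"
        try:
            # -I runs the interpreter in isolated mode (no user site, no env vars).
            result = subprocess.run(
                [sys.executable, "-I", "-c", program],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            continue  # a hung test counts as a failure
        if result.returncode == 0:
            passed += 1
    return passed / len(test_list)
```

Running every assertion in its own process means one failing or hanging test cannot mask the results of the others, at the cost of interpreter start-up time per test.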
- maxent_grpo.rewards.basic.binary_code_reward(completions, answer, **kwargs)
Binary wrapper around python_unit_test_reward for compatibility.