maxent_grpo.rewards.basic

Copyright 2025 Liv d’Aliberti

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Baseline-friendly reward registry used by GRPO training.

Functions

_bind_reward_kwargs(reward_fn, /, **bound_kwargs)

Return a named wrapper that pre-binds keyword args for a reward fn.

_call_seed_paper_reward_oat_parity(...)

Mirror OAT's MATHOracle.get_reward timeout behavior for a single sample.

_canon_math(s)

Canonicalize simple math answers for exact-match comparison.

_close_seed_paper_reward_pool()

Close the lazy OAT-parity reward pool for the current process.

_count_format_tags(text)

Return (total tag count, unique tag count) for think/answer tags.

_extract_apps_cases(payload)

Extract APPS-style IO pairs from an input_output payload.

_extract_boxed_answer(text)

Return the last \boxed{...}/\fbox{...} payload when present.

_extract_content(comp)

Extract assistant text from common completion shapes.

_extract_humaneval_test(payload)

Extract HumanEval-style test program body containing check.

_extract_mbpp_tests(payload)

Extract MBPP-style assert tests from payload when present.

_extract_prompt_text(prompt)

Return user-facing prompt text from string/chat prompt shapes.

_extract_python_code(text)

Extract a Python snippet from answer/code fences or return raw text.

_gold_math_candidates(gold)

Return canonical gold-answer candidates from scalar/list payloads.

_iter_boxed_answers(text)

Return parsed boxed/fbox payloads and their exclusive end offsets.

_load_seed_paper_reward_fn(reward_name)

Load an official SEED/OAT reward function from the repo-local checkout.

_normalize_text_lines(value)

Normalize strings/lists into comparable stripped lines.

_outputs_match(predicted, expected)

Return whether predicted and expected outputs match after normalization.

_parse_answer_payload(raw)

Best-effort parse for list/dict payloads serialized as strings.

_parse_entry_point(prompt_text, explicit)

Return function name for HumanEval checks from explicit value or prompt.

_prepare_seed_paper_import_paths(repo_dir)

Match the official symbolic stack before importing the grader module.

_pure_accuracy_math_match_flags(text, gold)

Return (answer_tag_match, fallback_last_line_match).

_run_script(script, timeout_s)

Execute a Python snippet in isolated mode and return stdout.

_score_apps_code(code, cases, timeout_s)

Execute APPS stdin/stdout tests and return pass fraction.

_score_humaneval_code(code, test_program, ...)

Execute HumanEval checks and return binary pass/fail score.

_score_mbpp_code(code, tests, timeout_s)

Execute MBPP assertions and return pass fraction in [0, 1].

_score_python_unit_tests_sample(completion, ...)

Score one completion against MBPP/HumanEval/APPS payloads.

_seed_paper_answer_tag_reward_fn()

Load the official answer-tag reward function from the repo-local checkout.

_seed_paper_boxed_reward_fn()

Load the official boxed reward function from the repo-local checkout.

_seed_paper_repo_dir()

Resolve the repo-local official grader checkout used for OAT parity.

_seed_paper_reward_pool()

Return a process-local worker pool mirroring OAT's Pool(2) reward path.

_seed_paper_reward_worker(reward_name, text, ...)

Worker entrypoint for OAT-style reward calls in a separate process.

_seed_paper_site_packages_dir()

Return the paper eval site-packages dir when a repo-local env is available.

_tag_multiplier(tag_total, tag_unique)

Return the reward multiplier for the observed tag counts.

accuracy_reward(completions, answer, **_kwargs)

Open-R1-style accuracy reward (1.0 exact math match else 0.0).

binary_code_reward(completions, answer, **kwargs)

Binary wrapper around python_unit_test_reward for compatibility.

boxed_accuracy_reward_math(completions, ...)

Dr.GRPO-style binary reward based on boxed final answers.

format_reward(completions, **_kwargs)

Open-R1-compatible strict think/answer formatting reward.

get_code_format_reward([language])

Return Open-R1-compatible code-format reward closure.

get_cosine_scaled_reward([min_value_wrong, ...])

Return a length-scaled reward closure compatible with Open-R1 configs.

get_missing_boxed_answer_penalty_reward([...])

Return a fixed penalty when no boxed answer is present in the completion.

get_repetition_penalty_reward(ngram_size, ...)

Return an Open-R1-style repetition penalty reward closure.

get_reward_funcs(script_args[, _ref_model, ...])

Resolve reward function callables from names.

len_reward(completions, answer, **_kwargs)

Length-based reward that discourages verbose incorrect outputs.

pure_accuracy_math_correctness(completions, ...)

Return binary correctness aligned with pure_accuracy_reward_math.

pure_accuracy_reward_math(completions, ...)

Reward exact math matches with a small formatting bonus when wrong.

python_unit_test_reward(completions, answer, *)

Run local Python-unit-test rewards for MBPP/HumanEval/APPS payloads.

reasoning_steps_reward(completions, **_kwargs)

Reward explicit step-by-step structure in natural language.

seed_paper_answer_tag_accuracy_reward_math(...)

Official OAT/SEED answer-tag reward with the same timeout wrapper as training.

seed_paper_boxed_accuracy_reward_math(...[, ...])

Official OAT/SEED boxed reward with the same timeout wrapper as training.

tag_count_reward(completions, **_kwargs)

Open-R1-compatible partial credit based on expected tag counts.

truncate_after_first_boxed_answer(text)

Trim a completion immediately after the first valid boxed answer.

uses_pure_accuracy_math_reward(reward_funcs)

Return True when any configured reward resolves to pure math reward.

Classes

RewardConfig(*args, **kwargs)

Minimal protocol describing the reward configuration interface.

RewardFunction(*args, **kwargs)

Protocol describing batch reward functions.

class maxent_grpo.rewards.basic.RewardFunction(*args, **kwargs)

Bases: Protocol

Protocol describing batch reward functions.
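
A minimal sketch of what a batch reward protocol can look like. The class name `BatchReward` and its `__call__` signature are assumptions made for illustration; the actual `RewardFunction` signature is not shown on this page.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class BatchReward(Protocol):
    """Hypothetical stand-in for RewardFunction: one score per completion."""

    def __call__(self, completions: List[Any], **kwargs: Any) -> List[float]:
        ...


def constant_reward(completions: List[Any], **kwargs: Any) -> List[float]:
    # Ignores the completion content and returns 0.0 for every item.
    return [0.0 for _ in completions]


scores = constant_reward(["a", "b", "c"])
# Structural check only: runtime_checkable verifies that __call__ exists.
conforms = isinstance(constant_reward, BatchReward)
```

Any plain function that accepts a batch and extra keyword arguments satisfies such a protocol, which is why the reward functions on this page take `**_kwargs`.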

class maxent_grpo.rewards.basic.RewardConfig(*args, **kwargs)

Bases: Protocol

Minimal protocol describing the reward configuration interface.

reward_funcs: List[str]

maxent_grpo.rewards.basic.accuracy_reward(completions, answer, **_kwargs)

Open-R1-style accuracy reward (1.0 exact math match else 0.0).

This keeps compatibility with Open-R1 reward names while using the same canonicalization/extraction logic as pure_accuracy_reward_math, including boxed answers and list-valued gold labels.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.format_reward(completions, **_kwargs)

Open-R1-compatible strict think/answer formatting reward.
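
A sketch of a strict think/answer check, for illustration only. The regular expression below is modeled on the Open-R1 format reward and is an assumption; the pattern this module actually uses is not shown here, and `strict_format_scores` is a hypothetical name.

```python
import re
from typing import List

# Assumed pattern: each tag on its own line, nothing before or after.
STRICT_FORMAT = re.compile(
    r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$",
    re.DOTALL,
)


def strict_format_scores(texts: List[str]) -> List[float]:
    # 1.0 when the whole completion matches the layout, else 0.0.
    return [1.0 if STRICT_FORMAT.match(text) else 0.0 for text in texts]


good = "<think>\n2 + 2 = 4\n</think>\n<answer>\n4\n</answer>"
bad = "<answer>4</answer>"
scores = strict_format_scores([good, bad])
```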

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.get_reward_funcs(script_args, _ref_model=None, _tokenizer=None)

Resolve reward function callables from names.
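
The name-to-callable lookup can be pictured with a toy registry. Everything below (`TOY_REGISTRY`, `ToyConfig`, `resolve_reward_funcs`, and the two toy rewards) is hypothetical; only the `reward_funcs: List[str]` attribute comes from `RewardConfig`, and the error handling is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

RewardFn = Callable[..., List[float]]


def word_count_reward(completions: List[str], **_kwargs: Any) -> List[float]:
    # Toy reward: number of whitespace-separated words per completion.
    return [float(len(text.split())) for text in completions]


def non_empty_reward(completions: List[str], **_kwargs: Any) -> List[float]:
    # Toy reward: 1.0 for any completion with visible content.
    return [1.0 if text.strip() else 0.0 for text in completions]


TOY_REGISTRY: Dict[str, RewardFn] = {
    "word_count": word_count_reward,
    "non_empty": non_empty_reward,
}


@dataclass
class ToyConfig:
    # Mirrors the single attribute documented on RewardConfig.
    reward_funcs: List[str] = field(default_factory=list)


def resolve_reward_funcs(config: ToyConfig) -> List[RewardFn]:
    unknown = [name for name in config.reward_funcs if name not in TOY_REGISTRY]
    if unknown:
        raise KeyError(f"Unknown reward function(s): {unknown}")
    # Preserve the order given in the config.
    return [TOY_REGISTRY[name] for name in config.reward_funcs]


funcs = resolve_reward_funcs(ToyConfig(reward_funcs=["non_empty", "word_count"]))
scores = [fn(["two words", ""]) for fn in funcs]
```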

Parameters:
  • script_args (RewardConfig)

  • _ref_model (Optional['PreTrainedModel'])

  • _tokenizer (Optional['PreTrainedTokenizerBase'])

Return type:

List['RewardFunction']

maxent_grpo.rewards.basic.get_missing_boxed_answer_penalty_reward(penalty=-0.05)

Return a fixed penalty when no boxed answer is present in the completion.
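
A sketch of the closure pattern this factory suggests. `make_missing_boxed_penalty` is a hypothetical name, the plain substring test is a simplification, and returning 0.0 when a boxed answer is present is an assumption.

```python
from typing import Any, Callable, List


def make_missing_boxed_penalty(penalty: float = -0.05) -> Callable[..., List[float]]:
    def reward(completions: List[str], **_kwargs: Any) -> List[float]:
        # Assumption: completions with a boxed answer score 0.0, others get the penalty.
        return [0.0 if "\\boxed{" in text else penalty for text in completions]

    return reward


penalty_fn = make_missing_boxed_penalty()
scores = penalty_fn(["The answer is \\boxed{3}.", "The answer is 3."])
```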

Parameters:

penalty (float)

Return type:

RewardFunction

maxent_grpo.rewards.basic.boxed_accuracy_reward_math(completions, answer, **_kwargs)

Dr.GRPO-style binary reward based on boxed final answers.

A completion is rewarded when its final \boxed{...} (or <answer> for compatibility) matches one of the canonical gold answers. Plain unboxed answers do not receive credit.
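
Extracting the final boxed payload requires brace matching, because payloads such as \frac{1}{2} contain nested braces. The sketch below shows one way to do it; `last_boxed_payload` is a hypothetical helper, not the module's `_extract_boxed_answer`, and it ignores \fbox and the <answer> fallback.

```python
from typing import Optional


def last_boxed_payload(text: str) -> Optional[str]:
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # the opening brace of \boxed{ is already consumed
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces: treat as no boxed answer


payload = last_boxed_payload("first \\boxed{1}, then \\boxed{\\frac{1}{2}}")
```

Under the rule stated above, an unboxed completion yields `None` here and would receive no credit.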

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.seed_paper_boxed_accuracy_reward_math(completions, answer, fast=False, **_kwargs)

Official OAT/SEED boxed reward with the same timeout wrapper as training.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • fast (bool)

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.seed_paper_answer_tag_accuracy_reward_math(completions, answer, fast=False, **_kwargs)

Official OAT/SEED answer-tag reward with the same timeout wrapper as training.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • fast (bool)

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.tag_count_reward(completions, **_kwargs)

Open-R1-compatible partial credit based on expected tag counts.
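
One plausible reading of partial credit by tag count, sketched under the assumption that each of the four expected tags contributes 0.25 when it appears exactly once. This is a simplification of the Open-R1 scheme (which also checks surrounding newlines); `tag_count_score` is a hypothetical name.

```python
EXPECTED_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def tag_count_score(text: str) -> float:
    # 0.25 per expected tag that occurs exactly once in the completion.
    return sum(0.25 for tag in EXPECTED_TAGS if text.count(tag) == 1)


full = tag_count_score("<think>a</think><answer>b</answer>")
answer_only = tag_count_score("<answer>b</answer>")
duplicated = tag_count_score("<think><think>a</think><answer>b</answer>")
```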

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.reasoning_steps_reward(completions, **_kwargs)

Reward explicit step-by-step structure in natural language.
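
A sketch of how step-by-step structure can be detected with a regular expression. The marker list and the "three markers earn full credit" rule are assumptions modeled on Open-R1's reasoning-steps reward; `reasoning_steps_score` is a hypothetical name.

```python
import re

# Assumed step markers: "Step N:", numbered or bulleted lines, and transition words.
STEP_PATTERN = re.compile(
    r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)",
    re.MULTILINE,
)


def reasoning_steps_score(text: str) -> float:
    count = len(STEP_PATTERN.findall(text))
    # Assumption: three or more markers earn the full reward.
    return min(1.0, count / 3)


structured = reasoning_steps_score("First, expand.\nNext, simplify.\nFinally, solve.")
single = reasoning_steps_score("Step 1: guess the answer")
flat = reasoning_steps_score("The answer is four")
```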

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.get_cosine_scaled_reward(min_value_wrong=-1.0, max_value_wrong=-0.5, min_value_correct=0.5, max_value_correct=1.0, max_len=1000)

Return a length-scaled reward closure compatible with Open-R1 configs.
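
The sketch below shows the cosine length schedule as used in Open-R1, on the assumption that this closure follows the same formula. `cosine_scaled` is a hypothetical per-sample helper; clamping the length to max_len is an addition made here for safety.

```python
import math


def cosine_scaled(
    is_correct: bool,
    length: int,
    min_value_wrong: float = -1.0,
    max_value_wrong: float = -0.5,
    min_value_correct: float = 0.5,
    max_value_correct: float = 1.0,
    max_len: int = 1000,
) -> float:
    progress = min(length, max_len) / max_len  # clamp is an assumption
    cosine = math.cos(progress * math.pi)  # 1.0 at length 0, -1.0 at max_len
    if is_correct:
        low, high = min_value_correct, max_value_correct
    else:
        # Swapped on purpose: short wrong answers receive the harshest value.
        low, high = max_value_wrong, min_value_wrong
    return low + 0.5 * (high - low) * (1.0 + cosine)


short_correct = cosine_scaled(True, 0)
long_correct = cosine_scaled(True, 1000)
short_wrong = cosine_scaled(False, 0)
long_wrong = cosine_scaled(False, 1000)
```

With the default arguments, correct answers decay from 1.0 to 0.5 as they get longer, while wrong answers rise from -1.0 to -0.5.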

Parameters:
  • min_value_wrong (float)

  • max_value_wrong (float)

  • min_value_correct (float)

  • max_value_correct (float)

  • max_len (int)

Return type:

RewardFunction

maxent_grpo.rewards.basic.get_repetition_penalty_reward(ngram_size, max_penalty)

Return an Open-R1-style repetition penalty reward closure.
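
A sketch of an n-gram repetition penalty in the Open-R1 style, assuming the penalty scales with the share of repeated n-grams. `repetition_penalty` is a hypothetical per-sample helper, and the whitespace tokenization is a simplification.

```python
def repetition_penalty(text: str, ngram_size: int, max_penalty: float) -> float:
    words = text.lower().split()
    if len(words) < ngram_size:
        return 0.0  # too short to form a single n-gram
    ngrams = [
        tuple(words[i : i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    ]
    # 0.0 when every n-gram is unique; approaches 1.0 as repetition grows.
    scaling = 1.0 - len(set(ngrams)) / len(ngrams)
    return scaling * max_penalty


repetitive = repetition_penalty("a b a b a b", ngram_size=2, max_penalty=-1.0)
varied = repetition_penalty("the quick brown fox", ngram_size=2, max_penalty=-1.0)
```

In the first call, two of the five bigrams are unique, so the scaling is 0.6 and the penalty is about -0.6.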

Parameters:
  • ngram_size

  • max_penalty

Return type:

RewardFunction

maxent_grpo.rewards.basic.len_reward(completions, answer, **_kwargs)

Length-based reward that discourages verbose incorrect outputs.
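
One scheme with this property is the Kimi 1.5 length reward as implemented in Open-R1. The sketch below assumes that formula; whether this module uses exactly the same one is not stated here, and `length_rewards` is a hypothetical name that takes precomputed lengths and correctness flags.

```python
from typing import List


def length_rewards(lengths: List[int], correct: List[bool]) -> List[float]:
    min_len, max_len = min(lengths), max(lengths)
    if max_len == min_len:
        return [0.0] * len(lengths)  # no length signal within the batch
    rewards = []
    for length, is_correct in zip(lengths, correct):
        # lam runs from 0.5 (shortest in batch) down to -0.5 (longest in batch).
        lam = 0.5 - (length - min_len) / (max_len - min_len)
        # Wrong answers never earn a positive length reward.
        rewards.append(lam if is_correct else min(0.0, lam))
    return rewards


batch = length_rewards([10, 20, 30], [True, False, False])
```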

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[str])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.get_code_format_reward(language='python')

Return Open-R1-compatible code-format reward closure.

Parameters:

language (str)

Return type:

RewardFunction

maxent_grpo.rewards.basic.pure_accuracy_math_correctness(completions, answer, *, allow_last_line_fallback=False)

Return binary correctness aligned with pure_accuracy_reward_math.

A completion is considered correct when either:

  1. <answer>...</answer> canonicalizes to the gold answer, or

  2. (optional, when allow_last_line_fallback is True) no extracted answer matched, but the final non-empty line matches.
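
The two-stage check above can be sketched as follows. `canon` is a deliberately crude stand-in for `_canon_math`, `is_correct` is a hypothetical per-sample helper, and `gold` is assumed to be a single string rather than a list of candidates.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FORMAT_TAG_RE = re.compile(r"</?(?:think|answer)>")


def canon(value: str) -> str:
    # Crude canonicalization: drop all whitespace and lowercase.
    return "".join(value.split()).lower()


def is_correct(text: str, gold: str, allow_last_line_fallback: bool = False) -> bool:
    # Stage 1: compare the <answer>...</answer> payload.
    match = ANSWER_RE.search(text)
    if match and canon(match.group(1)) == canon(gold):
        return True
    # Stage 2 (optional): compare the final non-empty line, format tags removed.
    if allow_last_line_fallback:
        lines = [FORMAT_TAG_RE.sub("", line).strip() for line in text.splitlines()]
        lines = [line for line in lines if line]
        return bool(lines) and canon(lines[-1]) == canon(gold)
    return False


tagged = is_correct("<think>6 * 7</think><answer> 42 </answer>", "42")
untagged_strict = is_correct("6 * 7 is\n42", "42")
untagged_fallback = is_correct("6 * 7 is\n42", "42", allow_last_line_fallback=True)
```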

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • allow_last_line_fallback (bool)

Return type:

List[bool]

maxent_grpo.rewards.basic.pure_accuracy_reward_math(completions, answer, **_kwargs)

Reward exact math matches with a small formatting bonus when wrong.

Correctness is detected from <answer>...</answer> and falls back to the last non-empty line (with known format tags stripped) when needed.

Reward scale for correct outputs:
  • full <think>...</think><answer>...</answer> (4 distinct tags): 1.0

  • otherwise: 0.5 * _tag_multiplier(tag_total, tag_unique)

Reward scale for incorrect outputs:
  • full <think>...</think><answer>...</answer> (4 distinct tags): 0.05

  • otherwise: 0.0
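
The reward scale above can be written as a small decision function. In this sketch, `reward_scale` is a hypothetical name, "full format" is taken to mean four distinct tags, and the multiplier is passed in as a callable because the behavior of `_tag_multiplier` is not documented here; the constant multiplier used in the example is a placeholder.

```python
from typing import Callable


def reward_scale(
    correct: bool,
    tag_total: int,
    tag_unique: int,
    tag_multiplier: Callable[[int, int], float],
) -> float:
    full_format = tag_unique == 4  # all four think/answer tags present
    if correct:
        if full_format:
            return 1.0
        return 0.5 * tag_multiplier(tag_total, tag_unique)
    # Wrong answers only earn the small formatting bonus.
    return 0.05 if full_format else 0.0


def placeholder_multiplier(tag_total: int, tag_unique: int) -> float:
    # Placeholder only; the real multiplier depends on the tag counts.
    return 1.0


correct_full = reward_scale(True, 4, 4, placeholder_multiplier)
correct_partial = reward_scale(True, 2, 2, placeholder_multiplier)
wrong_full = reward_scale(False, 4, 4, placeholder_multiplier)
wrong_partial = reward_scale(False, 0, 0, placeholder_multiplier)
```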

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.python_unit_test_reward(completions, answer, *, prompts=None, entry_point=None, **_kwargs)

Run local Python-unit-test rewards for MBPP/HumanEval/APPS payloads.

Supports:
  • MBPP: answer is test_list (list of assert strings)

  • HumanEval: answer is test code containing def check(...)

  • APPS: answer is input_output payload with inputs/outputs
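
For the MBPP case, scoring amounts to running the candidate code together with each assert in a child interpreter and reporting the pass fraction. The sketch below illustrates that idea; `run_isolated` and `mbpp_pass_fraction` are hypothetical names, the 5-second timeout is an arbitrary choice, and Python's `-I` isolated mode is not a security sandbox.

```python
import subprocess
import sys
from typing import List


def run_isolated(script: str, timeout_s: float = 5.0) -> bool:
    """Return True when the script exits cleanly within the timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


def mbpp_pass_fraction(code: str, tests: List[str], timeout_s: float = 5.0) -> float:
    if not tests:
        return 0.0
    # Each assert runs in its own process so one failure cannot mask another.
    passed = sum(run_isolated(f"{code}\n{test}\n", timeout_s) for test in tests)
    return passed / len(tests)


candidate = "def add(a, b):\n    return a + b"
fraction = mbpp_pass_fraction(
    candidate,
    ["assert add(1, 2) == 3", "assert add(2, 2) == 5"],
)
```

Running untrusted model output this way still executes it on the local machine, so a container or similar isolation layer is advisable in practice.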

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • prompts (List[Any] | None)

  • entry_point (List[Any] | None)

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.binary_code_reward(completions, answer, **kwargs)

Binary wrapper around python_unit_test_reward for compatibility.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.basic.truncate_after_first_boxed_answer(text)

Trim a completion immediately after the first valid boxed answer.
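
A sketch of the trimming step, assuming "valid" means a \boxed{...} whose braces balance. `truncate_after_first_boxed` is a hypothetical helper; unlike the real function it does not handle \fbox, and it returns the text unchanged when the first boxed expression is unbalanced instead of searching for a later one.

```python
def truncate_after_first_boxed(text: str) -> str:
    marker = "\\boxed{"
    start = text.find(marker)
    if start == -1:
        return text  # nothing to trim
    depth = 0
    # Begin scanning at the opening brace that ends the marker.
    for i in range(start + len(marker) - 1, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[: i + 1]
    return text  # unbalanced braces: leave the completion untouched


trimmed = truncate_after_first_boxed("so \\boxed{\\frac{1}{2}} and then more text")
```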

Parameters:

text (str)

Return type:

str

maxent_grpo.rewards.basic.uses_pure_accuracy_math_reward(reward_funcs)

Return True when any configured reward resolves to pure math reward.

Parameters:

reward_funcs (Sequence[Any])

Return type:

bool