maxent_grpo.rewards

Copyright 2025 Liv d’Aliberti

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Reward utilities shared across MaxEnt-GRPO components.

Two layers of rewards are exposed:

maxent_grpo.rewards.basic

Lightweight registry used by the baseline GRPO trainer.

maxent_grpo.rewards.maxent

Re-exports the richer reward/statistics helpers used inside the MaxEnt runner.

class maxent_grpo.rewards.RewardFunction(*args, **kwargs)

Bases: Protocol

Protocol describing batch reward functions.
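As a rough illustration of what a batch reward protocol can look like, the sketch below defines a hypothetical BatchReward protocol and a toy function that satisfies it. The name BatchReward, the plain-string completion type, and the length_reward example are assumptions made for illustration; the real RewardFunction signature and CompletionType are not spelled out on this page.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class BatchReward(Protocol):
    """Hypothetical stand-in for a batch reward protocol (not the real RewardFunction)."""

    def __call__(self, completions: List[str], **kwargs: Any) -> List[float]:
        ...


def length_reward(completions: List[str], **kwargs: Any) -> List[float]:
    """Toy reward: one float per completion, here its length scaled and capped at 1.0."""
    return [min(len(text) / 100.0, 1.0) for text in completions]


# One score is returned per completion in the batch.
scores = length_reward(["short", "x" * 200])
```

The key property is that a reward takes a whole batch and returns a list of floats of the same length, which is what every function on this page does according to its return type.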

maxent_grpo.rewards.accuracy_reward(completions, answer, **_kwargs)

Open-R1-style accuracy reward (1.0 for an exact math match, otherwise 0.0).

This keeps compatibility with Open-R1 reward names while using the same canonicalization/extraction logic as pure_accuracy_reward_math, including boxed answers and list-valued gold labels.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.binary_code_reward(completions, answer, **kwargs)

Binary wrapper around python_unit_test_reward for compatibility.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.boxed_accuracy_reward_math(completions, answer, **_kwargs)

Dr.GRPO-style binary reward based on boxed final answers.

A completion is rewarded when its final \boxed{...} (or <answer> for compatibility) matches one of the canonical gold answers. Plain unboxed answers do not receive credit.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • _kwargs (Any)

Return type:

List[float]
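The boxed-answer rule described for boxed_accuracy_reward_math can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names last_boxed and boxed_reward_sketch are hypothetical, canonicalization is reduced to whitespace stripping, and the <answer> fallback and list-valued gold labels are left out.

```python
from typing import List, Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the final \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def boxed_reward_sketch(completions: List[str], answer: List[str]) -> List[float]:
    """Binary reward: 1.0 only when the last boxed answer equals the gold string."""
    rewards = []
    for text, gold in zip(completions, answer):
        pred = last_boxed(text)
        # Assumption: plain string comparison stands in for real canonicalization.
        matched = pred is not None and pred.strip() == gold.strip()
        rewards.append(1.0 if matched else 0.0)
    return rewards
```

Note that the second case in the usage below, an unboxed but otherwise matching answer, scores 0.0, mirroring the statement that plain unboxed answers do not receive credit.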

maxent_grpo.rewards.get_missing_boxed_answer_penalty_reward(penalty=-0.05)

Return a reward function that applies a fixed penalty when no boxed answer is present in the completion.

Parameters:

penalty (float)

Return type:

RewardFunction
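A factory of this kind is typically a closure over the penalty value. The sketch below shows one possible shape; make_missing_boxed_penalty is a hypothetical name, completions are assumed to be plain strings, and returning 0.0 when a boxed answer is present is an assumption rather than documented behavior.

```python
from typing import Any, Callable, List


def make_missing_boxed_penalty(penalty: float = -0.05) -> Callable[..., List[float]]:
    """Hypothetical factory: penalize completions that contain no \\boxed{ marker."""

    def reward(completions: List[str], **_kwargs: Any) -> List[float]:
        # Assumption: completions with a boxed answer get a neutral 0.0.
        return [0.0 if "\\boxed{" in text else penalty for text in completions]

    return reward


penalty_reward = make_missing_boxed_penalty()
```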

maxent_grpo.rewards.seed_paper_answer_tag_accuracy_reward_math(completions, answer, fast=False, **_kwargs)

Official OAT/SEED answer-tag reward with the same timeout wrapper as training.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • fast (bool)

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.seed_paper_boxed_accuracy_reward_math(completions, answer, fast=False, **_kwargs)

Official OAT/SEED boxed reward with the same timeout wrapper as training.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • fast (bool)

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.format_reward(completions, **_kwargs)

Open-R1-compatible strict think/answer formatting reward.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • _kwargs (Any)

Return type:

List[float]
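A strict think/answer format check is usually a single anchored regular expression. The sketch below is modelled on the newline-delimited layout commonly used in Open-R1-style setups; the exact whitespace rules, the STRICT_FORMAT name, and the plain-string completion type are assumptions and may differ from format_reward itself.

```python
import re
from typing import List

# Assumed layout: each tag on its own line, think block first, then the answer block.
STRICT_FORMAT = re.compile(
    r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$",
    re.DOTALL,
)


def format_reward_sketch(completions: List[str]) -> List[float]:
    """Return 1.0 for completions that match the strict layout, else 0.0."""
    return [1.0 if STRICT_FORMAT.match(text) else 0.0 for text in completions]
```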

maxent_grpo.rewards.get_reward_funcs(script_args, _ref_model=None, _tokenizer=None)

Resolve reward function callables from names.

Parameters:
  • script_args (RewardConfig)

  • _ref_model (Optional['PreTrainedModel'])

  • _tokenizer (Optional['PreTrainedTokenizerBase'])

Return type:

List['RewardFunction']
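Resolving reward callables from names generally comes down to a dictionary lookup. The sketch below shows that idea with a hypothetical REGISTRY and resolve_rewards helper; it takes a plain list of names instead of a RewardConfig, whose fields are not documented on this page, and the two toy rewards exist only for the example.

```python
from typing import Any, Callable, Dict, List

Reward = Callable[..., List[float]]


def always_one(completions: List[str], **_kwargs: Any) -> List[float]:
    return [1.0 for _ in completions]


def always_zero(completions: List[str], **_kwargs: Any) -> List[float]:
    return [0.0 for _ in completions]


# Hypothetical registry mapping config names to reward callables.
REGISTRY: Dict[str, Reward] = {
    "always_one": always_one,
    "always_zero": always_zero,
}


def resolve_rewards(names: List[str]) -> List[Reward]:
    """Look up each name, failing loudly on anything unregistered."""
    missing = [name for name in names if name not in REGISTRY]
    if missing:
        raise KeyError(f"unknown reward function(s): {missing}")
    return [REGISTRY[name] for name in names]
```

Failing on unknown names up front, rather than at the first training step, keeps configuration typos cheap to diagnose.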

maxent_grpo.rewards.get_code_format_reward(language='python')

Return an Open-R1-compatible code-format reward closure.

Parameters:

language (str)

Return type:

RewardFunction
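One way to picture a code-format reward is a closure that checks for a fenced block tagged with the requested language. The sketch below does only that; make_code_format_reward is a hypothetical name, and the real closure may also require surrounding think/answer tags, which this example ignores.

```python
import re
from typing import Any, Callable, List


def make_code_format_reward(language: str = "python") -> Callable[..., List[float]]:
    """Hypothetical factory: reward completions containing a fenced block in `language`."""
    fence = "`" * 3  # built programmatically so this example stays fence-safe
    pattern = re.compile(
        re.escape(fence + language) + r"\n.*?" + re.escape(fence),
        re.DOTALL,
    )

    def reward(completions: List[str], **_kwargs: Any) -> List[float]:
        return [1.0 if pattern.search(text) else 0.0 for text in completions]

    return reward
```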

maxent_grpo.rewards.get_cosine_scaled_reward(min_value_wrong=-1.0, max_value_wrong=-0.5, min_value_correct=0.5, max_value_correct=1.0, max_len=1000)

Return a length-scaled reward closure compatible with Open-R1 configs.

Parameters:
  • min_value_wrong (float)

  • max_value_wrong (float)

  • min_value_correct (float)

  • max_value_correct (float)

  • max_len (int)

Return type:

RewardFunction
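The arithmetic behind a cosine length schedule can be sketched for a single completion as below. This assumes the schedule follows the usual Open-R1 convention, where short correct answers score near max_value_correct and short wrong answers near min_value_wrong; the function name cosine_scaled_value and the clamping of length to max_len are assumptions made for this example.

```python
import math


def cosine_scaled_value(
    is_correct: bool,
    length: int,
    *,
    min_value_wrong: float = -1.0,
    max_value_wrong: float = -0.5,
    min_value_correct: float = 0.5,
    max_value_correct: float = 1.0,
    max_len: int = 1000,
) -> float:
    """Interpolate a reward along a half cosine as the completion grows."""
    progress = min(length, max_len) / max_len  # assumption: clamp overlong outputs
    cosine = math.cos(progress * math.pi)  # 1.0 at length 0, -1.0 at max_len
    if is_correct:
        low, high = min_value_correct, max_value_correct
    else:
        # Swapped so that longer wrong answers are penalized less than short ones.
        low, high = max_value_wrong, min_value_wrong
    return low + 0.5 * (high - low) * (1.0 + cosine)
```

With the default arguments, a correct answer moves from 1.0 at length 0 down to 0.5 at max_len, while a wrong answer moves from -1.0 up to -0.5.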

maxent_grpo.rewards.get_repetition_penalty_reward(ngram_size, max_penalty)

Return an Open-R1-style repetition penalty reward closure.

Parameters:
  • ngram_size

  • max_penalty

Return type:

RewardFunction
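An n-gram repetition penalty can be sketched as the fraction of repeated n-grams scaled by max_penalty. The example below follows that common Open-R1-style recipe; the name make_repetition_penalty, whitespace tokenization, the int/float parameter types, and the requirement that max_penalty be non-positive are all assumptions, since this page does not document them.

```python
from typing import Any, Callable, List


def make_repetition_penalty(ngram_size: int, max_penalty: float) -> Callable[..., List[float]]:
    """Hypothetical factory: penalize completions in proportion to repeated n-grams."""
    if max_penalty > 0:
        raise ValueError("max_penalty should be zero or negative")

    def reward(completions: List[str], **_kwargs: Any) -> List[float]:
        rewards = []
        for text in completions:
            words = text.lower().split()
            ngrams = [
                tuple(words[i:i + ngram_size])
                for i in range(len(words) - ngram_size + 1)
            ]
            if not ngrams:
                rewards.append(0.0)  # too short to contain any n-gram
                continue
            repeated_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
            rewards.append(repeated_fraction * max_penalty)
        return rewards

    return reward
```

For "a b a b a b" with ngram_size=2 there are five bigrams and two distinct ones, so the repeated fraction is 0.6 and the reward is 0.6 * max_penalty.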

maxent_grpo.rewards.len_reward(completions, answer, **_kwargs)

Length-based reward that discourages verbose incorrect outputs.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[str])

  • _kwargs (Any)

Return type:

List[float]

maxent_grpo.rewards.pure_accuracy_math_correctness(completions, answer, *, allow_last_line_fallback=False)

Return binary correctness aligned with pure_accuracy_reward_math.

A completion is considered correct when either:

  1. <answer>...</answer> canonicalizes to the gold answer, or

  2. (optional, when allow_last_line_fallback is enabled) no extracted answer matched but the final non-empty line matches.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • allow_last_line_fallback (bool)

Return type:

List[bool]
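The two correctness conditions listed for pure_accuracy_math_correctness can be sketched as below. This is an illustrative approximation only: canon is a placeholder that just removes whitespace, the helper names are hypothetical, gold answers are assumed to be single strings, and the real function's canonicalization and tag stripping are not reproduced.

```python
import re
from typing import List, Optional

ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def canon(text: str) -> str:
    """Placeholder canonicalization: drop all whitespace."""
    return "".join(text.split())


def extract_answer(text: str) -> Optional[str]:
    """Return the contents of the last <answer>...</answer> pair, if any."""
    matches = ANSWER_TAG.findall(text)
    return matches[-1] if matches else None


def correctness_sketch(
    completions: List[str],
    answer: List[str],
    *,
    allow_last_line_fallback: bool = False,
) -> List[bool]:
    results = []
    for text, gold in zip(completions, answer):
        extracted = extract_answer(text)
        ok = extracted is not None and canon(extracted) == canon(gold)
        if not ok and allow_last_line_fallback:
            # Condition 2: compare the final non-empty line instead.
            lines = [line for line in text.splitlines() if line.strip()]
            ok = bool(lines) and canon(lines[-1]) == canon(gold)
        results.append(ok)
    return results
```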

maxent_grpo.rewards.pure_accuracy_reward_math(completions, answer, **_kwargs)

Reward exact math matches with a small formatting bonus when wrong.

Correctness is detected from <answer>...</answer> and falls back to the last non-empty line (with known format tags stripped) when needed.

Reward scale for correct outputs:

  • full <think>...</think><answer>...</answer> (4 distinct tags): 1.0

  • otherwise: 0.5 * _tag_multiplier(tag_total, tag_unique)

Reward scale for incorrect outputs:

  • full <think>...</think><answer>...</answer> (4 distinct tags): 0.05

  • otherwise: 0.0

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • _kwargs (Any)

Return type:

List[float]
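The reward scale documented for pure_accuracy_reward_math reduces to a small decision table, sketched here for one completion. The function name reward_scale_sketch is hypothetical, and because the behavior of _tag_multiplier is not documented on this page, it is passed in as a callable rather than reimplemented.

```python
from typing import Callable


def reward_scale_sketch(
    is_correct: bool,
    has_full_format: bool,
    tag_total: int,
    tag_unique: int,
    tag_multiplier: Callable[[int, int], float],
) -> float:
    """Map (correctness, formatting) to a reward following the documented scale."""
    if is_correct:
        if has_full_format:
            return 1.0
        return 0.5 * tag_multiplier(tag_total, tag_unique)
    # Wrong answers only earn the small bonus when all four tags are present.
    return 0.05 if has_full_format else 0.0
```

The asymmetry is the point of the design: a well-formatted wrong answer earns far less (0.05) than a correct one, so formatting alone cannot dominate the signal.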

maxent_grpo.rewards.python_unit_test_reward(completions, answer, *, prompts=None, entry_point=None, **_kwargs)

Run local Python-unit-test rewards for MBPP/HumanEval/APPS payloads.

Supports:

  • MBPP: answer is test_list (list of assert strings)

  • HumanEval: answer is test code containing def check(...)

  • APPS: answer is input_output payload with inputs/outputs

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • answer (List[Any])

  • prompts (List[Any] | None)

  • entry_point (List[Any] | None)

  • _kwargs (Any)

Return type:

List[float]
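The three payload shapes supported by python_unit_test_reward could be told apart roughly as in the sketch below. The payload_kind helper is hypothetical, and treating the APPS payload as an already-parsed dict with "inputs" and "outputs" keys is an assumption; the actual dispatch logic is not shown on this page.

```python
from typing import Any


def payload_kind(answer: Any) -> str:
    """Guess which benchmark a single gold payload belongs to."""
    if isinstance(answer, (list, tuple)) and all(isinstance(a, str) for a in answer):
        return "mbpp"  # test_list: list of assert strings
    if isinstance(answer, str) and "def check(" in answer:
        return "humaneval"  # test code defining check(...)
    if isinstance(answer, dict) and "inputs" in answer and "outputs" in answer:
        return "apps"  # input_output payload
    return "unknown"
```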

maxent_grpo.rewards.reasoning_steps_reward(completions, **_kwargs)

Reward explicit step-by-step structure in natural language.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • _kwargs (Any)

Return type:

List[float]
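A step-structure reward is often implemented by counting step markers and saturating after a few of them. The sketch below follows that pattern; the marker list, the target of three steps, and the name reasoning_steps_sketch are assumptions modelled on Open-R1-style rewards and may not match reasoning_steps_reward exactly.

```python
import re
from typing import List

# Assumed markers: "Step N:", numbered lines, bullets, and a few transition words.
STEP_PATTERN = re.compile(
    r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)",
    re.MULTILINE,
)


def reasoning_steps_sketch(completions: List[str], target_steps: int = 3) -> List[float]:
    """Reward grows with the number of step markers and caps at 1.0."""
    return [
        min(1.0, len(STEP_PATTERN.findall(text)) / target_steps)
        for text in completions
    ]
```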

maxent_grpo.rewards.tag_count_reward(completions, **_kwargs)

Open-R1-compatible partial credit based on expected tag counts.

Parameters:
  • completions (List[maxent_grpo.rewards.basic.CompletionType])

  • _kwargs (Any)

Return type:

List[float]
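Partial credit based on tag counts can be sketched as a quarter point for each expected tag that appears exactly once. The 0.25 weighting, the tag list, and the name tag_count_sketch are assumptions modelled on the Open-R1 convention; tag_count_reward may additionally require specific newlines around each tag.

```python
from typing import List

EXPECTED_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def tag_count_sketch(completions: List[str]) -> List[float]:
    """Give 0.25 for every expected tag that occurs exactly once."""
    return [
        sum(0.25 for tag in EXPECTED_TAGS if text.count(tag) == 1)
        for text in completions
    ]
```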

maxent_grpo.rewards.uses_pure_accuracy_math_reward(reward_funcs)

Return True when any configured reward resolves to the pure-accuracy math reward.

Parameters:

reward_funcs (Sequence[Any])

Return type:

bool

Modules

basic


maxent
