maxent_grpo.training.rewards

Reward and generation helpers extracted from the training loop.

Functions

_apply_group_scales(advantage_grouped, ...)

Scale grouped advantages by per-prompt multipliers.

_build_reward_proxy(source, reward_names)

Preserve source config attributes when instantiating reward helpers.

_call_reward_fn(reward_fn, completions, ...)

Call a reward function with backward-compatible kwargs handling.

_coerce_reward_names(raw_names)

Return a list of reward identifiers from arbitrary inputs.

_completion_was_truncated(metadata, *, ...)

Return True when completion metadata indicates a length stop.

_compute_seed_grpo_statistics(gen_batch, *, ...)

Return per-prompt semantic entropies and advantage scales for SEED-GRPO.

_extract_completion_runtime_info(entry_dict)

Return compact completion metadata needed by scoring and truncation logic.

_extract_ref_logprob_fields(meta_entry)

Return (logprob_sum, token_count) when present in metadata entries.

_group_q_distribution(grouped_comps, ...)

Return per-group q distributions derived from listwise utilities.

_has_recipe_path(obj)

Return True when the object carries a recipe path marker.

_rank_tag()

Return a best-effort rank string for logging.

_sanitize_ref_logprob_meta(flat_meta, ...)

Drop reference metadata when any entry is missing logprob information.

_seed_extract_answer(text)

Return the raw final answer string used for SEED-GRPO clustering.

_seed_logsumexp(values)

Return a numerically stable log-sum-exp over values.

_seed_logsumexp_by_id(semantic_ids, ...)

Aggregate normalized cluster log-mass by semantic id.

_seed_predictive_entropy_rao(cluster_log_probs)

Return Rao-style predictive entropy over normalized cluster mass.

_seed_semantic_ids_by_answers(answers_list)

Match the official SEED-GRPO exact-answer clustering rule.

_zero_truncated_completion_rewards(...)

Zero sequence rewards for samples that appear truncated.

compute_reward_statistics(gen_batch, ...[, ...])

Compute utilities, q-distributions, and flattened prompt/completion pairs.

compute_reward_totals(reward_spec, ...)

Evaluate reward functions and aggregate per-sequence utilities.

group_advantages(grouped_comps, total_utils, *)

Return normalized advantages per prompt group and flattened samples.

load_eval_reward_functions(script_args, ...)

Resolve eval reward functions/weights, defaulting to training rewards.

load_reward_functions(script_args, tokenizer)

Resolve reward functions/weights from script or training args.

prepare_generation_batch(batch, generator, ...)

Generate completions and retry prompts that initially returned nothing.

reward_moments(total_utils, device)

Compute reward mean/std on CPU or current accelerator device.

maxent_grpo.training.rewards.compute_reward_statistics(gen_batch, reward_spec, device, q_temperature, q_epsilon, controller_beta=None, controller_tau=None, scale_rewards=True, zero_truncated_completion_rewards=False, max_completion_len=None, seed_grpo_enabled=False, seed_grpo_alpha=0.0417, seed_grpo_alpha_normalize_by_max_entropy=True, seed_grpo_length_normalize_logprobs=True, seed_grpo_num_generations=None)

Compute utilities, q-distributions, and flattened prompt/completion pairs.

Parameters:
  • gen_batch (GenerationBatch) – Generation batch containing grouped completions and their metadata.

  • reward_spec (RewardSpec) – Reward configuration (functions + weights).

  • device (torch.device) – Torch device used for reward moment computations.

  • q_temperature (float) – Temperature used when forming q-distributions.

  • q_epsilon (float) – Epsilon floor that ensures the q-distribution has full support.

  • controller_beta (float | None) – Optional KL controller beta logged with stats.

  • controller_tau (float | None) – Optional controller tau logged alongside the q temperature.

  • scale_rewards (bool)

  • zero_truncated_completion_rewards (bool) – When True, zero sequence rewards for samples that appear truncated.

  • max_completion_len (int | None)

  • seed_grpo_enabled (bool) – When True, compute per-prompt semantic entropies and advantage scales for SEED-GRPO.

  • seed_grpo_alpha (float)

  • seed_grpo_alpha_normalize_by_max_entropy (bool)

  • seed_grpo_length_normalize_logprobs (bool)

  • seed_grpo_num_generations (int | None)

Returns:

Populated RewardComputation or None when inputs are empty.

Return type:

RewardComputation | None
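To make the roles of q_temperature and q_epsilon concrete, here is a minimal sketch of one common way to build a per-group q-distribution: a temperature-scaled softmax over utilities, mixed with a uniform distribution so every completion keeps non-zero mass. The mixing rule and the function name are assumptions made for illustration; the module may apply the epsilon floor differently.

```python
import math


def q_distribution(utilities, temperature=1.0, epsilon=1e-6):
    """Softmax(utilities / temperature) mixed with uniform mass epsilon."""
    if not utilities:
        return []
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    scaled = [u / temperature for u in utilities]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    norm = sum(weights)
    count = len(utilities)
    # Assumed floor: blend with the uniform distribution, which keeps the
    # result normalized and gives each entry at least epsilon / count.
    return [(1.0 - epsilon) * w / norm + epsilon / count for w in weights]


q = q_distribution([1.0, 0.0, 0.0], temperature=0.5, epsilon=0.01)
```

Lower temperatures concentrate q on the highest-utility completion, while a larger epsilon pulls it back toward uniform.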

maxent_grpo.training.rewards.prepare_generation_batch(batch, generator, generation_stats, expected_generations, max_retry_rounds=None)

Generate completions and retry prompts that initially returned nothing.

Parameters:
  • batch (dict[str, list[str]]) – Mini-batch containing prompt/answer lists.

  • generator (GenerationFn) – Callable that produces grouped completions and metadata.

  • generation_stats (dict[str, int]) – Mutable statistics dictionary updated in-place.

  • expected_generations (int) – Desired completions per prompt.

  • max_retry_rounds (int | None) – Optional cap overriding the default retry limit.

Returns:

Populated GenerationBatch or None if generation fails after retries.

Return type:

GenerationBatch | None
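The retry behavior described above can be sketched as a small loop that re-queries only the prompts that came back empty. This is an illustration under stated assumptions: the `generate(prompts, n)` callable returning one list of completions per prompt, the stats key, and the default of two retry rounds are all hypothetical, and the real GenerationFn also returns metadata.

```python
def generate_with_retries(prompts, generate, expected, stats, max_retry_rounds=2):
    """Return grouped completions, or None if some prompt stays empty."""
    grouped = [[] for _ in prompts]
    pending = list(range(len(prompts)))
    # Round 0 is the initial attempt; later rounds are retries.
    for round_idx in range(max_retry_rounds + 1):
        if not pending:
            break
        if round_idx > 0:
            stats["retry_rounds"] = stats.get("retry_rounds", 0) + 1
        results = generate([prompts[i] for i in pending], expected)
        for i, completions in zip(pending, results):
            grouped[i].extend(completions)
        # Keep only the prompts that still have no completions.
        pending = [i for i in pending if not grouped[i]]
    if pending:
        return None
    return grouped
```

Passing the statistics dictionary in and mutating it mirrors the in-place update described for generation_stats, so counters accumulate across training steps.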

maxent_grpo.training.rewards.load_reward_functions(script_args, tokenizer, training_args=None)

Resolve reward functions/weights from script or training args.

Parameters:
  • script_args (Any) – Script arguments carrying reward names/weights.

  • tokenizer (Any) – Tokenizer passed to reward function factory helpers.

  • training_args (Any | None) – Optional training config that can override script rewards.

Returns:

Tuple of (reward_funcs, reward_weights).

Return type:

tuple[list, list]

maxent_grpo.training.rewards.load_eval_reward_functions(script_args, tokenizer, training_args=None)

Resolve eval reward functions/weights, defaulting to training rewards.

Parameters:
  • script_args (Any) – Script arguments containing eval-specific reward settings.

  • tokenizer (Any) – Tokenizer passed to reward function factory helpers.

  • training_args (Any | None) – Optional training config with reward overrides.

Returns:

Tuple of (reward_funcs, reward_weights) for evaluation.

Return type:

tuple[list, list]
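The "defaulting to training rewards" behavior can be pictured with a minimal sketch of the fallback logic. Everything named here is hypothetical: the attribute names (`eval_reward_funcs`, `reward_funcs`, and their weight counterparts), the toy registry, and the default weight of 1.0 are assumptions for illustration, and the real loaders also take a tokenizer and an optional training config.

```python
from types import SimpleNamespace

# Toy registry standing in for the real reward function factory.
REGISTRY = {
    "accuracy": lambda completions, answers: [
        float(c.strip() == a) for c, a in zip(completions, answers)
    ],
    "length": lambda completions, answers: [float(len(c)) for c in completions],
}


def resolve_rewards(args, names_attr, weights_attr):
    """Look up reward names on args and pair them with weights."""
    names = getattr(args, names_attr, None) or []
    if isinstance(names, str):
        names = [n.strip() for n in names.split(",") if n.strip()]
    weights = getattr(args, weights_attr, None) or [1.0] * len(names)
    if len(weights) != len(names):
        raise ValueError("need exactly one weight per reward function")
    return [REGISTRY[name] for name in names], list(weights)


def resolve_eval_rewards(args):
    """Prefer eval-specific rewards; fall back to the training rewards."""
    funcs, weights = resolve_rewards(args, "eval_reward_funcs", "eval_reward_weights")
    if funcs:
        return funcs, weights
    return resolve_rewards(args, "reward_funcs", "reward_weights")


train_only = SimpleNamespace(reward_funcs="accuracy,length", reward_weights=[1.0, 0.1])
funcs, weights = resolve_eval_rewards(train_only)
```

With no eval-specific settings on `train_only`, the call returns the two training rewards and their weights unchanged.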