maxent_grpo.training.rewards

Reward and generation helpers extracted from the training loop.

Functions

`_apply_group_scales`(advantage_grouped, ...)	Scale grouped advantages by per-prompt multipliers.
`_build_reward_proxy`(source, reward_names)	Preserve source config attributes when instantiating reward helpers.
`_call_reward_fn`(reward_fn, completions, ...)	Call a reward function with backward-compatible kwargs handling.
`_coerce_reward_names`(raw_names)	Return a list of reward identifiers from arbitrary inputs.
`_completion_was_truncated`(metadata, *, ...)	Return `True` when completion metadata indicates a length stop.
`_compute_seed_grpo_statistics`(gen_batch, *, ...)	Return per-prompt semantic entropies and advantage scales for SEED-GRPO.
`_extract_completion_runtime_info`(entry_dict)	Return compact completion metadata needed by scoring and truncation logic.
`_extract_ref_logprob_fields`(meta_entry)	Return `(logprob_sum, token_count)` when present in metadata entries.
`_group_q_distribution`(grouped_comps, ...)	Return per-group q distributions derived from listwise utilities.
`_has_recipe_path`(obj)	Return `True` when the object carries a recipe path marker.
`_rank_tag`()	Return best-effort rank string for logging.
`_sanitize_ref_logprob_meta`(flat_meta, ...)	Drop reference metadata when any entry is missing logprob information.
`_seed_extract_answer`(text)	Return the raw final answer string used for SEED-GRPO clustering.
`_seed_logsumexp`(values)	Return a numerically stable log-sum-exp over `values`.
`_seed_logsumexp_by_id`(semantic_ids, ...)	Aggregate normalized cluster log-mass by semantic id.
`_seed_predictive_entropy_rao`(cluster_log_probs)	Return Rao-style predictive entropy over normalized cluster mass.
`_seed_semantic_ids_by_answers`(answers_list)	Match the official SEED-GRPO exact-answer clustering rule.
`_zero_truncated_completion_rewards`(...)	Zero sequence rewards for samples that appear truncated.
`compute_reward_statistics`(gen_batch, ...[, ...])	Compute utilities, q-distributions, and flattened prompt/completion pairs.
`compute_reward_totals`(reward_spec, ...)	Evaluate reward functions and aggregate per-sequence utilities.
`group_advantages`(grouped_comps, total_utils, *)	Return normalized advantages per prompt group and flattened samples.
`load_eval_reward_functions`(script_args, ...)	Resolve eval reward functions/weights, defaulting to training rewards.
`load_reward_functions`(script_args, tokenizer)	Resolve reward functions/weights from script or training args.
`prepare_generation_batch`(batch, generator, ...)	Generate completions and retry prompts that initially returned nothing.
`reward_moments`(total_utils, device)	Compute reward mean/std on CPU or current accelerator device.
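
Several helpers above (`_seed_logsumexp`, `_seed_logsumexp_by_id`) rely on a numerically stable log-sum-exp. The following is a minimal sketch of that general technique, not the module's actual code; the function name `stable_logsumexp` and its handling of empty or infinite inputs are assumptions made for illustration.

```python
import math
from typing import Sequence


def stable_logsumexp(values: Sequence[float]) -> float:
    """Hypothetical stand-in illustrating a stable log-sum-exp."""
    if not values:
        # Assumption: the log of an empty sum is treated as -inf.
        return float("-inf")
    peak = max(values)
    if math.isinf(peak):
        # Either every entry is -inf, or at least one entry is +inf.
        return peak
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    return peak + math.log(sum(math.exp(v - peak) for v in values))
```

Shifting by the maximum is what makes the computation safe for large-magnitude log-probabilities, where calling `math.exp` on the raw values would overflow or underflow.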

maxent_grpo.training.rewards.compute_reward_statistics(gen_batch, reward_spec, device, q_temperature, q_epsilon, controller_beta=None, controller_tau=None, scale_rewards=True, zero_truncated_completion_rewards=False, max_completion_len=None, seed_grpo_enabled=False, seed_grpo_alpha=0.0417, seed_grpo_alpha_normalize_by_max_entropy=True, seed_grpo_length_normalize_logprobs=True, seed_grpo_num_generations=None)

Compute utilities, q-distributions, and flattened prompt/completion pairs.

Parameters:

gen_batch (GenerationBatch) – Generation batch containing grouped completions/meta.
reward_spec (RewardSpec) – Reward configuration (functions + weights).
device (torch.device) – Torch device used for reward moment computations.
q_temperature (float) – Temperature used when forming q-distributions.
q_epsilon (float) – Epsilon floor ensuring full support in q-distribution.
controller_beta (float | None) – Optional KL controller beta logged with stats.
controller_tau (float | None) – Optional controller tau logged alongside the q temperature.
scale_rewards (bool)
zero_truncated_completion_rewards (bool) – Whether to zero sequence rewards for samples that appear truncated.
max_completion_len (int | None)
seed_grpo_enabled (bool) – Whether to compute SEED-GRPO per-prompt semantic entropies and advantage scales.
seed_grpo_alpha (float)
seed_grpo_alpha_normalize_by_max_entropy (bool)
seed_grpo_length_normalize_logprobs (bool)
seed_grpo_num_generations (int | None)

Returns:

Populated RewardComputation or None when inputs are empty.

Return type:

RewardComputation | None
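
To make the roles of `q_temperature` and `q_epsilon` concrete, here is a minimal sketch of one common way to turn a group's utilities into a distribution with full support: a temperature-scaled softmax mixed with a uniform floor. The mixing rule and the function name `q_from_utilities` are assumptions for illustration; the module may form its q-distributions differently.

```python
import math
from typing import List, Sequence


def q_from_utilities(
    utilities: Sequence[float], temperature: float, epsilon: float
) -> List[float]:
    """Hypothetical sketch: softmax(utilities / temperature) with an epsilon floor."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    n = len(utilities)
    if n == 0:
        return []
    scaled = [u / temperature for u in utilities]
    peak = max(scaled)
    # Shift by the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    # Assumption: mix with the uniform distribution so every entry is >= epsilon / n.
    return [(1.0 - epsilon) * (w / total) + epsilon / n for w in weights]
```

Under this assumed rule, a lower temperature concentrates mass on the highest-utility completions, while the epsilon term guarantees that no completion receives zero probability.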

maxent_grpo.training.rewards.prepare_generation_batch(batch, generator, generation_stats, expected_generations, max_retry_rounds=None)

Generate completions and retry prompts that initially returned nothing.

Parameters:

batch (dict[str, list[str]]) – Mini-batch containing prompt/answer lists.
generator (GenerationFn) – Callable that produces grouped completions and metadata.
generation_stats (dict[str, int]) – Mutable statistics dictionary updated in-place.
expected_generations (int) – Desired completions per prompt.
max_retry_rounds (int | None) – Optional cap overriding the default retry limit.

Returns:

Populated GenerationBatch or None if generation fails after retries.

Return type:

GenerationBatch | None
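
The retry behaviour described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the module's implementation: the generator here takes a list of prompts and returns one list of completions per prompt, the `"retries"` statistics key is hypothetical, and the `expected_generations` check is omitted.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical generator type: one list of completions per input prompt.
SketchGenerator = Callable[[List[str]], List[List[str]]]


def generate_with_retries(
    prompts: List[str],
    generator: SketchGenerator,
    stats: Dict[str, int],
    max_retry_rounds: int = 2,
) -> Optional[List[List[str]]]:
    """Re-run the generator only for prompts whose completion list is empty."""
    grouped = list(generator(prompts))
    for _ in range(max_retry_rounds):
        missing = [i for i, comps in enumerate(grouped) if not comps]
        if not missing:
            break
        # Statistics dictionary is updated in place ("retries" is an assumed key).
        stats["retries"] = stats.get("retries", 0) + 1
        retried = generator([prompts[i] for i in missing])
        for i, comps in zip(missing, retried):
            grouped[i] = comps
    # Give up if any prompt still has no completions after the retry budget.
    if any(not comps for comps in grouped):
        return None
    return grouped
```

Retrying only the empty prompts avoids regenerating completions that already succeeded, and returning `None` lets the caller skip the batch rather than train on partial groups.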

maxent_grpo.training.rewards.load_reward_functions(script_args, tokenizer, training_args=None)

Resolve reward functions/weights from script or training args.

Parameters:

script_args (Any) – Script arguments carrying reward names/weights.
tokenizer (Any) – Tokenizer passed to reward function factory helpers.
training_args (Any | None) – Optional training config that can override script rewards.

Returns:

Tuple of (reward_funcs, reward_weights).

Return type:

tuple[list, list]
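
As an illustration of how a `(reward_funcs, reward_weights)` pair might be consumed downstream, the sketch below combines per-function scores into one utility per completion using a weighted sum. The reward function signature (a list of completions in, a list of floats out) and the name `weighted_totals` are assumptions; the real reward functions may take additional arguments.

```python
from typing import Callable, List, Sequence

# Hypothetical reward signature: completions in, one score per completion out.
RewardFn = Callable[[List[str]], List[float]]


def weighted_totals(
    reward_funcs: Sequence[RewardFn],
    reward_weights: Sequence[float],
    completions: List[str],
) -> List[float]:
    """Sum each reward function's scores, scaled by its weight."""
    if len(reward_funcs) != len(reward_weights):
        raise ValueError("need exactly one weight per reward function")
    totals = [0.0] * len(completions)
    for fn, weight in zip(reward_funcs, reward_weights):
        scores = fn(completions)
        for i, score in enumerate(scores):
            totals[i] += weight * score
    return totals
```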

maxent_grpo.training.rewards.load_eval_reward_functions(script_args, tokenizer, training_args=None)

Resolve eval reward functions/weights, defaulting to training rewards.

Parameters:

script_args (Any) – Script arguments containing eval-specific reward settings.
tokenizer (Any) – Tokenizer passed to reward function factory helpers.
training_args (Any | None) – Optional training config with reward overrides.

Returns:

Tuple of (reward_funcs, reward_weights) for evaluation.

Return type:

tuple[list, list]
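
The "defaulting to training rewards" behaviour can be pictured as a simple fallback. The sketch below is only an illustration: the attribute names `eval_reward_funcs` and `reward_funcs` are hypothetical placeholders, and the actual function also resolves weights and consults `training_args`.

```python
from types import SimpleNamespace
from typing import Any, List


def resolve_eval_reward_names(script_args: Any) -> List[str]:
    """Prefer eval-specific reward names; otherwise reuse the training ones."""
    # Hypothetical field holding eval-only reward names.
    eval_names = getattr(script_args, "eval_reward_funcs", None)
    if eval_names:
        return list(eval_names)
    # Hypothetical field holding the training reward names.
    return list(getattr(script_args, "reward_funcs", None) or [])


# Usage with stand-in argument objects:
train_only = SimpleNamespace(reward_funcs=["accuracy"], eval_reward_funcs=None)
with_eval = SimpleNamespace(reward_funcs=["accuracy"], eval_reward_funcs=["format"])
```

With these stand-ins, `train_only` resolves to the training reward names and `with_eval` resolves to the eval-specific ones.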