maxent_grpo.training.rewards
Reward and generation helpers extracted from the training loop.
Functions
Scale grouped advantages by per-prompt multipliers.
Preserve source config attributes when instantiating reward helpers.
Call a reward fn with backward-compatible kwargs handling.
Return a list of reward identifiers from arbitrary inputs.
Return
Return per-prompt semantic entropies and advantage scales for SEED-GRPO.
Return compact completion metadata needed by scoring and truncation logic.
Return
Return per-group q distributions derived from listwise utilities.
Return
Return best-effort rank string for logging.
Drop reference metadata when any entry is missing logprob information.
Return the raw final answer string used for SEED-GRPO clustering.
Return a numerically stable log-sum-exp over
Aggregate normalized cluster log-mass by semantic id.
Return Rao-style predictive entropy over normalized cluster mass.
Match the official SEED-GRPO exact-answer clustering rule.
Zero sequence rewards for samples that appear truncated.
Compute utilities, q-distributions, and flattened prompt/completion pairs.
Evaluate reward functions and aggregate per-sequence utilities.
Return normalized advantages per prompt group and flattened samples.
Resolve eval reward functions/weights, defaulting to training rewards.
Resolve reward functions/weights from script or training args.
Generate completions and retry prompts that initially returned nothing.
Compute reward mean/std on CPU or current accelerator device.
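The log-sum-exp, cluster log-mass, and predictive-entropy helpers summarized above can be illustrated with a minimal pure-Python sketch. The function names, argument shapes, and the use of plain lists and dicts are assumptions made for illustration; they are not the module's actual signatures, and the real helpers may operate on tensors.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Hypothetical helper: stable log(sum(exp(v))) over a list of floats."""
    if not values:
        return float("-inf")
    m = max(values)
    if m == float("-inf"):
        return float("-inf")
    # Subtracting the max keeps every exponent <= 0, which avoids overflow.
    return m + math.log(sum(math.exp(v - m) for v in values))


def cluster_log_mass(seq_logprobs, cluster_ids):
    """Hypothetical helper: normalized log-mass per semantic cluster id."""
    grouped = defaultdict(list)
    for lp, cid in zip(seq_logprobs, cluster_ids):
        grouped[cid].append(lp)
    raw = {cid: logsumexp(lps) for cid, lps in grouped.items()}
    total = logsumexp(list(raw.values()))
    # Normalizing in log space makes the cluster masses sum to one.
    return {cid: lm - total for cid, lm in raw.items()}


def predictive_entropy(log_mass):
    """Assumed form: -sum(p * log p) over normalized cluster mass."""
    return -sum(
        math.exp(lm) * lm for lm in log_mass.values() if lm > float("-inf")
    )
```

Under these assumptions, completions that all land in one cluster give zero entropy, while mass split evenly across two clusters gives log 2.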
- maxent_grpo.training.rewards.compute_reward_statistics(gen_batch, reward_spec, device, q_temperature, q_epsilon, controller_beta=None, controller_tau=None, scale_rewards=True, zero_truncated_completion_rewards=False, max_completion_len=None, seed_grpo_enabled=False, seed_grpo_alpha=0.0417, seed_grpo_alpha_normalize_by_max_entropy=True, seed_grpo_length_normalize_logprobs=True, seed_grpo_num_generations=None)
Compute utilities, q-distributions, and flattened prompt/completion pairs.
- Parameters:
gen_batch (GenerationBatch) – Generation batch containing grouped completions/meta.
reward_spec (RewardSpec) – Reward configuration (functions + weights).
device (torch.device) – Torch device used for reward moment computations.
q_temperature (float) – Temperature used when forming q-distributions.
q_epsilon (float) – Epsilon floor ensuring full support in q-distribution.
controller_beta (float | None) – Optional KL controller beta logged with stats.
controller_tau (float | None) – Optional controller tau logged alongside q temp.
scale_rewards (bool)
zero_truncated_completion_rewards (bool)
max_completion_len (int | None)
seed_grpo_enabled (bool)
seed_grpo_alpha (float)
seed_grpo_alpha_normalize_by_max_entropy (bool)
seed_grpo_length_normalize_logprobs (bool)
seed_grpo_num_generations (int | None)
- Returns:
Populated RewardComputation, or None when inputs are empty.
- Return type:
RewardComputation | None
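To make the roles of q_temperature and q_epsilon concrete, here is a minimal sketch of forming a q-distribution for one prompt group. It assumes a temperature-scaled softmax over per-completion utilities followed by an additive epsilon floor and renormalization; the function name and the exact flooring rule are assumptions for illustration, and the library may combine these parameters differently.

```python
import math


def q_distribution(utilities, q_temperature=1.0, q_epsilon=1e-6):
    """Hypothetical sketch: softmax(u / T) with an additive epsilon floor."""
    if not utilities:
        return []
    scaled = [u / q_temperature for u in utilities]
    m = max(scaled)
    # Shift by the max before exponentiating for numerical stability.
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Assumed flooring rule: add epsilon to every entry, then renormalize,
    # so no completion ends up with exactly zero probability.
    floored = [p + q_epsilon for p in probs]
    total = sum(floored)
    return [f / total for f in floored]
```

In this sketch a lower temperature concentrates mass on the highest-utility completion, while the epsilon floor keeps every entry strictly positive even when the softmax underflows.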
- maxent_grpo.training.rewards.prepare_generation_batch(batch, generator, generation_stats, expected_generations, max_retry_rounds=None)
Generate completions and retry prompts that initially returned nothing.
- Parameters:
batch (dict[str, list[str]]) – Mini-batch containing prompt/answer lists.
generator (GenerationFn) – Callable that produces grouped completions and metadata.
generation_stats (dict[str, int]) – Mutable statistics dictionary updated in-place.
expected_generations (int) – Desired completions per prompt.
max_retry_rounds (int | None) – Optional cap overriding the default retry limit.
- Returns:
Populated GenerationBatch, or None if generation fails after retries.
- Return type:
GenerationBatch | None
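The retry behaviour described above can be sketched as a small loop that regenerates only the prompts whose completion list came back empty. Everything here is hypothetical: the function name, the generator signature (a list of prompts plus a count, returning one list of completions per prompt), the statistics keys, and the default of two retry rounds are illustrative assumptions, not the module's actual contract.

```python
def generate_with_retries(prompts, generator, stats, expected_generations,
                          max_retry_rounds=2):
    """Hypothetical sketch of retrying prompts that returned no completions."""
    grouped = [[] for _ in prompts]
    pending = list(range(len(prompts)))
    rounds = 0
    while pending:
        # Only the still-empty prompts are sent back to the generator.
        outputs = generator([prompts[i] for i in pending], expected_generations)
        for idx, completions in zip(pending, outputs):
            grouped[idx] = list(completions)
        pending = [i for i in pending if not grouped[i]]
        if not pending or rounds >= max_retry_rounds:
            break
        rounds += 1
        # The stats dict is mutated in place (key name is an assumption).
        stats["retry_rounds"] = stats.get("retry_rounds", 0) + 1
    if pending:
        stats["failed_prompts"] = stats.get("failed_prompts", 0) + len(pending)
        return None
    return grouped
```

Returning None when any prompt is still empty after the last round mirrors the documented "None if generation fails after retries" behaviour, while the in-place stats update mirrors the mutable generation_stats argument.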
- maxent_grpo.training.rewards.load_reward_functions(script_args, tokenizer, training_args=None)
Resolve reward functions/weights from script or training args.
- Parameters:
- Returns:
Tuple of (reward_funcs, reward_weights).
- Return type: