maxent_grpo.training.types.rewards

Per-batch dataclasses shared across the pipeline, loss, and metrics code.

Classes

`AdvantageStats`(grouped, samples)	Grouped and flattened advantages.
`BatchDiagnostics`(kl_value, clip_ratio, ...)	Additional scalar stats recorded for metrics.
`GenerationBatch`(prompts, answers, ...[, ...])	Completions grouped per prompt after filtering.
`LengthStats`(min_length, mean_length, ...)	Summary of completion lengths for metrics.
`LossOutputs`(loss, scalars, log_ratio_train, ...)	Loss terms computed for a batch.
`LossScalarBundle`(total_loss, policy_loss, ...)	Scalar contributions tracked for logging.
`PromptCacheEntry`(input_ids, attention_mask)	Cached prompt tokenization used during scoring.
`PromptCompletionBatch`(prompts, completions[, ...])	Flattened prompt/completion pairs.
`QDistribution`(grouped, samples)	Sequence-level q-distribution.
`ReferenceLogprobs`(ref_logp_sum, ...[, ...])	Reference-model log-prob summaries.
`RewardComputation`(total_utils, ...[, ...])	Utility values and statistics computed per batch.
`RewardMoments`(mean, std)	Summary statistics for sequence rewards.
`ScoreBatch`(prompt_entries, completion_ids, ...)	Prompt cache entries and completion tokens ready for scoring.
`SequenceScores`(cur_logp_sum, ...[, ...])	Bundle sequence-level log-prob statistics.
`ValidationContext`(evaluation, accelerator, ...)	Handles required for the optional evaluation loop.

class maxent_grpo.training.types.rewards.AdvantageStats(grouped, samples)

Bases: object

Grouped and flattened advantages.

Parameters:

grouped (List[List[float]])
samples (List[float])

grouped: List[List[float]]

samples: List[float]
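
The relationship between the two fields can be sketched as follows. This is a minimal illustration, not the library's code: the dataclass is a stand-in with the documented fields, `center_per_group` is a hypothetical helper, and the snippet assumes `samples` is `grouped` flattened in prompt order. Mean-centering is only one plausible way to produce advantages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdvantageStats:
    """Stand-in mirroring the documented fields; not the library class."""
    grouped: List[List[float]]
    samples: List[float]


def center_per_group(rewards: List[List[float]]) -> AdvantageStats:
    """Hypothetical helper: subtract each prompt group's mean reward."""
    grouped = []
    for group in rewards:
        mean = sum(group) / len(group)
        grouped.append([r - mean for r in group])
    # Assumption: samples is grouped flattened in prompt order.
    samples = [a for group in grouped for a in group]
    return AdvantageStats(grouped=grouped, samples=samples)


stats = center_per_group([[1.0, 0.0], [0.5, 0.5, 2.0]])
```

Keeping both views avoids re-flattening when one consumer needs per-prompt groups and another needs one value per sequence.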

class maxent_grpo.training.types.rewards.BatchDiagnostics(kl_value, clip_ratio, clip_ratio_low_mean, clip_ratio_low_min, clip_ratio_high_mean, clip_ratio_high_max, clip_ratio_region_mean, kl_per_token_by_len_bucket, kl_token_count_by_len_bucket)

Bases: object

Additional scalar stats recorded for metrics.

Parameters:

kl_value (float | None)
clip_ratio (float)
clip_ratio_low_mean (float)
clip_ratio_low_min (float)
clip_ratio_high_mean (float)
clip_ratio_high_max (float)
clip_ratio_region_mean (float)
kl_per_token_by_len_bucket (Dict[str, float])
kl_token_count_by_len_bucket (Dict[str, float])

kl_value: float | None

clip_ratio: float

clip_ratio_low_mean: float

clip_ratio_low_min: float

clip_ratio_high_mean: float

clip_ratio_high_max: float

clip_ratio_region_mean: float

kl_per_token_by_len_bucket: Dict[str, float]

kl_token_count_by_len_bucket: Dict[str, float]

class maxent_grpo.training.types.rewards.GenerationBatch(prompts, answers, grouped_completions, grouped_ref_meta, grouped_completion_info=None)

Bases: object

Completions grouped per prompt after filtering.

Parameters:

prompts (List[str])
answers (List[str])
grouped_completions (List[List[str]])
grouped_ref_meta (List[List[Any | None]] | None)
grouped_completion_info (List[List[Dict[str, Any]]] | None)

prompts: List[str]

answers: List[str]

grouped_completions: List[List[str]]

grouped_ref_meta: List[List[Any | None]] | None

grouped_completion_info: List[List[Dict[str, Any]]] | None = None

class maxent_grpo.training.types.rewards.LengthStats(min_length, mean_length, max_length, clipped_ratio, min_terminated, mean_terminated, max_terminated)

Bases: object

Summary of completion lengths for metrics.

Parameters:

min_length (float)
mean_length (float)
max_length (float)
clipped_ratio (float)
min_terminated (float)
mean_terminated (float)
max_terminated (float)

min_length: float

mean_length: float

max_length: float

clipped_ratio: float

min_terminated: float

mean_terminated: float

max_terminated: float
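
One way such a summary could be filled in is sketched below. The field semantics are assumptions, since the reference does not define them: here a completion counts as "clipped" when it reaches a hypothetical `max_len`, and the `*_terminated` fields summarize only the completions that stopped earlier. The dataclass is a stand-in and `summarize_lengths` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LengthStats:
    """Stand-in mirroring the documented fields; not the library class."""
    min_length: float
    mean_length: float
    max_length: float
    clipped_ratio: float
    min_terminated: float
    mean_terminated: float
    max_terminated: float


def _summary(values: List[int]) -> Tuple[float, float, float]:
    """Return (min, mean, max) as floats, or zeros for an empty list."""
    if not values:
        return 0.0, 0.0, 0.0
    return float(min(values)), sum(values) / len(values), float(max(values))


def summarize_lengths(lengths: List[int], max_len: int) -> LengthStats:
    """Hypothetical helper; assumes 'clipped' means length reached max_len."""
    terminated = [n for n in lengths if n < max_len]
    clipped = len(lengths) - len(terminated)
    lo, mean, hi = _summary(lengths)
    t_lo, t_mean, t_hi = _summary(terminated)
    ratio = clipped / len(lengths) if lengths else 0.0
    return LengthStats(lo, mean, hi, ratio, t_lo, t_mean, t_hi)


stats = summarize_lengths([4, 8, 16, 16], max_len=16)
```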

class maxent_grpo.training.types.rewards.LossOutputs(loss, scalars, log_ratio_train, denom_tok_tensor)

Bases: object

Loss terms computed for a batch.

Parameters:

loss (Any)
scalars (LossScalarBundle)
log_ratio_train (Any)
denom_tok_tensor (Any)

loss: Any

scalars: LossScalarBundle

log_ratio_train: Any

denom_tok_tensor: Any

property total_loss_scalar: float

Convenience accessor for the total loss scalar.

Returns: Combined loss used for optimization/logging.
Return type: float

property policy_loss_scalar: float

Convenience accessor for the policy loss scalar.

Returns: Policy loss contribution from scalars.
Return type: float

property clip_loss_scalar: float | None

Convenience accessor for the clip-objective scalar.

Returns: Optional clip objective scalar, if enabled.
Return type: float | None

property kl_loss_scalar: float

Convenience accessor for the KL scalar.

Returns: KL divergence scalar for the batch.
Return type: float

property weighted_kl_loss_scalar: float

Convenience accessor for the weighted KL scalar.

Returns: Weighted KL scalar using the configured beta.
Return type: float
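
The accessor pattern can be sketched as follows, under stated assumptions: both dataclasses are stand-ins built from the documented fields, only three of the five properties are shown, and each property is assumed to read straight from `scalars`. The numeric values, the `beta` coefficient, and the `total = policy + beta * KL` composition are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LossScalarBundle:
    """Stand-in mirroring the documented fields; not the library class."""
    total_loss: float
    policy_loss: float
    clip_loss: Optional[float]
    kl_loss: float
    weighted_kl_loss: float


@dataclass
class LossOutputs:
    """Stand-in showing three of the five documented accessors."""
    loss: Any
    scalars: LossScalarBundle
    log_ratio_train: Any
    denom_tok_tensor: Any

    @property
    def total_loss_scalar(self) -> float:
        return self.scalars.total_loss

    @property
    def clip_loss_scalar(self) -> Optional[float]:
        return self.scalars.clip_loss

    @property
    def weighted_kl_loss_scalar(self) -> float:
        return self.scalars.weighted_kl_loss


beta, policy, kl = 0.04, 1.5, 0.25  # hypothetical values
scalars = LossScalarBundle(
    total_loss=policy + beta * kl,  # assumption: total = policy + beta * KL
    policy_loss=policy,
    clip_loss=None,  # clip objective disabled in this sketch
    kl_loss=kl,
    weighted_kl_loss=beta * kl,
)
# Plain Python values stand in for the tensor-like fields typed as Any.
outputs = LossOutputs(loss=scalars.total_loss, scalars=scalars,
                      log_ratio_train=None, denom_tok_tensor=None)
```

Logging code can then read `outputs.total_loss_scalar` without reaching into the bundle, and must handle `clip_loss_scalar` being `None`.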

class maxent_grpo.training.types.rewards.LossScalarBundle(total_loss, policy_loss, clip_loss, kl_loss, weighted_kl_loss)

Bases: object

Scalar contributions tracked for logging.

Parameters:

total_loss (float)
policy_loss (float)
clip_loss (float | None)
kl_loss (float)
weighted_kl_loss (float)

total_loss: float

policy_loss: float

clip_loss: float | None

kl_loss: float

weighted_kl_loss: float

class maxent_grpo.training.types.rewards.PromptCacheEntry(input_ids, attention_mask)

Bases: object

Cached prompt tokenization used during scoring.

Parameters:

input_ids (List[int])
attention_mask (List[int])

input_ids: List[int]

attention_mask: List[int]

property length: int

Return cached prompt length.

Returns: Number of tokens in the cached prompt.
Return type: int

class maxent_grpo.training.types.rewards.PromptCompletionBatch(prompts, completions, metadata=None)

Bases: object

Flattened prompt/completion pairs.

Parameters:

prompts (List[str])
completions (List[str])
metadata (List[Dict[str, Any]] | None)

prompts: List[str]

completions: List[str]

metadata: List[Dict[str, Any]] | None = None
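
Flattening grouped completions into aligned pairs can be sketched as below. The dataclass is a stand-in with the documented fields, `flatten_groups` is a hypothetical helper, and the snippet assumes each prompt is repeated once per completion so that the two lists stay index-aligned.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PromptCompletionBatch:
    """Stand-in mirroring the documented fields; not the library class."""
    prompts: List[str]
    completions: List[str]
    metadata: Optional[List[Dict[str, Any]]] = None


def flatten_groups(
    prompts: List[str], grouped_completions: List[List[str]]
) -> PromptCompletionBatch:
    """Hypothetical helper: repeat each prompt once per completion."""
    flat_prompts: List[str] = []
    flat_completions: List[str] = []
    for prompt, group in zip(prompts, grouped_completions):
        for completion in group:
            flat_prompts.append(prompt)
            flat_completions.append(completion)
    return PromptCompletionBatch(flat_prompts, flat_completions)


pairs = flatten_groups(["2+2?", "3*3?"], [["4", "four"], ["9"]])
```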

class maxent_grpo.training.types.rewards.QDistribution(grouped, samples)

Bases: object

Sequence-level q-distribution.

Parameters:

grouped (List[List[float]])
samples (List[float])

grouped: List[List[float]]

samples: List[float]
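
The reference does not say how q is computed. As an illustration only, the sketch below assumes a per-prompt softmax over utilities with a hypothetical temperature `tau`, so that each group sums to one. The dataclass is a stand-in and `softmax_q` is a hypothetical helper.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class QDistribution:
    """Stand-in mirroring the documented fields; not the library class."""
    grouped: List[List[float]]
    samples: List[float]


def softmax_q(utilities: List[List[float]], tau: float = 1.0) -> QDistribution:
    """Hypothetical helper: softmax(utility / tau) within each prompt group."""
    grouped = []
    for group in utilities:
        peak = max(group)  # subtract the max for numerical stability
        weights = [math.exp((u - peak) / tau) for u in group]
        total = sum(weights)
        grouped.append([w / total for w in weights])
    # Assumption: samples is grouped flattened in prompt order.
    samples = [q for group in grouped for q in group]
    return QDistribution(grouped=grouped, samples=samples)


q = softmax_q([[0.0, 0.0], [1.0, 0.0, -1.0]], tau=0.5)
```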

class maxent_grpo.training.types.rewards.ReferenceLogprobs(ref_logp_sum, ref_tok_counts, ref_logp_sum_raw, ref_logp_mean, avg_completion_tokens, ref_token_logp=None, ref_token_mask=None)

Bases: object

Reference-model log-prob summaries.

Parameters:

ref_logp_sum (Any)
ref_tok_counts (Any)
ref_logp_sum_raw (Any)
ref_logp_mean (float)
avg_completion_tokens (float)
ref_token_logp (Any | None)
ref_token_mask (Any | None)

ref_logp_sum: Any

ref_tok_counts: Any

ref_logp_sum_raw: Any

ref_logp_mean: float

avg_completion_tokens: float

ref_token_logp: Any | None = None

ref_token_mask: Any | None = None

class maxent_grpo.training.types.rewards.RewardComputation(total_utils, per_reward_values, advantage, pairs, q_distribution, moments, ref_logprob_meta=None, completion_metadata=None, entropy_bonus_scale=None, seed_semantic_entropies=None, seed_advantage_scales=None, seed_alpha_effective=None, seed_max_possible_entropy=None)

Bases: object

Utility values and statistics computed per batch.

Parameters:

total_utils (List[float])
per_reward_values (Dict[str, List[float]])
advantage (AdvantageStats)
pairs (PromptCompletionBatch)
q_distribution (QDistribution)
moments (RewardMoments)
ref_logprob_meta (List[Any | None] | None)
completion_metadata (List[Dict[str, Any]] | None)
entropy_bonus_scale (float | None)
seed_semantic_entropies (List[float] | None)
seed_advantage_scales (List[float] | None)
seed_alpha_effective (float | None)
seed_max_possible_entropy (float | None)

total_utils: List[float]

per_reward_values: Dict[str, List[float]]

advantage: AdvantageStats

pairs: PromptCompletionBatch

q_distribution: QDistribution

moments: RewardMoments

ref_logprob_meta: List[Any | None] | None = None

completion_metadata: List[Dict[str, Any]] | None = None

entropy_bonus_scale: float | None = None

seed_semantic_entropies: List[float] | None = None

seed_advantage_scales: List[float] | None = None

seed_alpha_effective: float | None = None

seed_max_possible_entropy: float | None = None

property advantage_samples: List[float]

Return flattened advantage samples for logging.

Returns: Advantage samples concatenated across prompts.
Return type: list[float]

property q_grouped: List[List[float]]

Expose grouped q-values for downstream weighting.

Returns: Per-prompt q values ready for weighting.
Return type: list[list[float]]

property train_reward_mean: float

Return the cached mean reward.

Returns: Average reward value for the processed batch.
Return type: float

property train_reward_std: float

Return the cached reward standard deviation.

Returns: Standard deviation of batch reward values.
Return type: float
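
How the four properties could delegate to the nested dataclasses is sketched below. The classes are reduced stand-ins that keep only the fields these properties read, so this is not the library's definition. The snippet also assumes the cached moments are the mean and population standard deviation of `total_utils`, which the reference does not state.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev
from typing import List


@dataclass
class AdvantageStats:
    grouped: List[List[float]]
    samples: List[float]


@dataclass
class QDistribution:
    grouped: List[List[float]]
    samples: List[float]


@dataclass
class RewardMoments:
    mean: float
    std: float


@dataclass
class RewardComputation:
    """Reduced stand-in: only the fields the four properties read."""
    total_utils: List[float]
    advantage: AdvantageStats
    q_distribution: QDistribution
    moments: RewardMoments

    @property
    def advantage_samples(self) -> List[float]:
        return self.advantage.samples

    @property
    def q_grouped(self) -> List[List[float]]:
        return self.q_distribution.grouped

    @property
    def train_reward_mean(self) -> float:
        return self.moments.mean

    @property
    def train_reward_std(self) -> float:
        return self.moments.std


utils = [1.0, 3.0]
rc = RewardComputation(
    total_utils=utils,
    advantage=AdvantageStats([[-1.0, 1.0]], [-1.0, 1.0]),
    q_distribution=QDistribution([[0.25, 0.75]], [0.25, 0.75]),
    # Assumption: moments are the mean and population std of total_utils.
    moments=RewardMoments(mean=fmean(utils), std=pstdev(utils)),
)
```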

class maxent_grpo.training.types.rewards.RewardMoments(mean, std)

Bases: object

Summary statistics for sequence rewards.

Parameters:

mean (float)
std (float)

mean: float

std: float

class maxent_grpo.training.types.rewards.ScoreBatch(prompt_entries, completion_ids, completion_attention_mask, pad_token_id, max_prompt_len, slice_size, total_sequences, score_tail_tokens=None)

Bases: object

Prompt cache entries and completion tokens ready for scoring.

Parameters:

prompt_entries (List[PromptCacheEntry])
completion_ids (Any)
completion_attention_mask (Any)
pad_token_id (int)
max_prompt_len (int)
slice_size (int)
total_sequences (int)
score_tail_tokens (int | None)

prompt_entries: List[PromptCacheEntry]

completion_ids: Tensor

completion_attention_mask: Tensor

pad_token_id: int

max_prompt_len: int

slice_size: int

total_sequences: int

score_tail_tokens: int | None = None
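
The reference does not describe how `slice_size` and `total_sequences` are consumed. One plausible reading, shown here purely as an assumption, is that scoring walks the batch in fixed-size slices to bound memory use. `iter_slices` is a hypothetical helper, not part of the documented API.

```python
from typing import Iterator, List, Tuple


def iter_slices(total_sequences: int, slice_size: int) -> Iterator[Tuple[int, int]]:
    """Hypothetical helper: yield half-open [start, end) row ranges."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    for start in range(0, total_sequences, slice_size):
        # The final slice may be shorter than slice_size.
        yield start, min(start + slice_size, total_sequences)


bounds: List[Tuple[int, int]] = list(iter_slices(total_sequences=10, slice_size=4))
```

Each `(start, end)` pair could index the rows of `completion_ids` and `completion_attention_mask` that are scored together.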

class maxent_grpo.training.types.rewards.SequenceScores(cur_logp_sum, behavior_logp_sum, log_ratio_train, denom_tok_tensor, pooled_hidden=None, policy_entropy_sum=None, token_logp=None, token_mask=None, old_token_logp=None)

Bases: object

Bundle sequence-level log-prob statistics.

Parameters:

cur_logp_sum (Any)
behavior_logp_sum (Any)
log_ratio_train (Any)
denom_tok_tensor (Any)
pooled_hidden (Any | None)
policy_entropy_sum (Any | None)
token_logp (Any | None)
token_mask (Any | None)
old_token_logp (Any | None)

cur_logp_sum: Any

behavior_logp_sum: Any

log_ratio_train: Any

denom_tok_tensor: Any

pooled_hidden: Any | None = None

policy_entropy_sum: Any | None = None

token_logp: Any | None = None

token_mask: Any | None = None

old_token_logp: Any | None = None

class maxent_grpo.training.types.rewards.ValidationContext(evaluation, accelerator, model, tokenizer, reward, generator, logging, eval_reward=None, runtime=None, generation=None, scoring=None)