maxent_grpo.training.types.rewards

Per-batch dataclasses shared across the pipeline, loss, and metrics code.

Classes

AdvantageStats(grouped, samples)

Grouped and flattened advantages.

BatchDiagnostics(kl_value, clip_ratio, ...)

Additional scalar stats recorded for metrics.

GenerationBatch(prompts, answers, ...[, ...])

Completions grouped per prompt after filtering.

LengthStats(min_length, mean_length, ...)

Summary of completion lengths for metrics.

LossOutputs(loss, scalars, log_ratio_train, ...)

Loss terms computed for a batch.

LossScalarBundle(total_loss, policy_loss, ...)

Scalar contributions tracked for logging.

PromptCacheEntry(input_ids, attention_mask)

Cached prompt tokenization used during scoring.

PromptCompletionBatch(prompts, completions[, ...])

Flattened prompt/completion pairs.

QDistribution(grouped, samples)

Sequence-level q-distribution.

ReferenceLogprobs(ref_logp_sum, ...[, ...])

Reference-model log-prob summaries.

RewardComputation(total_utils, ...[, ...])

Utility values and statistics computed per batch.

RewardMoments(mean, std)

Summary statistics for sequence rewards.

ScoreBatch(prompt_entries, completion_ids, ...)

Prompt cache entries and completion tokens ready for scoring.

SequenceScores(cur_logp_sum, ...[, ...])

Bundle sequence-level log-prob statistics.

ValidationContext(evaluation, accelerator, ...)

Handles required for the optional evaluation loop.

class maxent_grpo.training.types.rewards.AdvantageStats(grouped, samples)

Bases: object

Grouped and flattened advantages.

Parameters:
grouped: List[List[float]]
samples: List[float]
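The relationship between the two fields can be sketched as follows. This is a minimal illustration using a stand-in dataclass with the documented field names, not the library class itself; the `flatten` helper is hypothetical, and the assumption that `samples` is `grouped` concatenated in prompt order is inferred from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdvantageStats:
    """Stand-in mirroring the documented fields; not the library class."""

    grouped: List[List[float]]
    samples: List[float]


def flatten(grouped: List[List[float]]) -> List[float]:
    """Hypothetical helper: concatenate per-prompt groups in prompt order."""
    return [value for group in grouped for value in group]


# Two prompts with two and three completions respectively (made-up values).
grouped = [[0.5, -0.5], [1.0, 0.0, -1.0]]
stats = AdvantageStats(grouped=grouped, samples=flatten(grouped))
```

Keeping both views avoids re-flattening when per-sample code and per-prompt code consume the same advantages.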

class maxent_grpo.training.types.rewards.BatchDiagnostics(kl_value, clip_ratio, clip_ratio_low_mean, clip_ratio_low_min, clip_ratio_high_mean, clip_ratio_high_max, clip_ratio_region_mean, kl_per_token_by_len_bucket, kl_token_count_by_len_bucket)

Bases: object

Additional scalar stats recorded for metrics.

Parameters:
kl_value: float | None
clip_ratio: float
clip_ratio_low_mean: float
clip_ratio_low_min: float
clip_ratio_high_mean: float
clip_ratio_high_max: float
clip_ratio_region_mean: float
kl_per_token_by_len_bucket: Dict[str, float]
kl_token_count_by_len_bucket: Dict[str, float]

class maxent_grpo.training.types.rewards.GenerationBatch(prompts, answers, grouped_completions, grouped_ref_meta, grouped_completion_info=None)

Bases: object

Completions grouped per prompt after filtering.

Parameters:
prompts: List[str]
answers: List[str]
grouped_completions: List[List[str]]
grouped_ref_meta: List[List[Any | None]] | None
grouped_completion_info: List[List[Dict[str, Any]]] | None = None

class maxent_grpo.training.types.rewards.LengthStats(min_length, mean_length, max_length, clipped_ratio, min_terminated, mean_terminated, max_terminated)

Bases: object

Summary of completion lengths for metrics.

Parameters:
min_length: float
mean_length: float
max_length: float
clipped_ratio: float
min_terminated: float
mean_terminated: float
max_terminated: float

class maxent_grpo.training.types.rewards.LossOutputs(loss, scalars, log_ratio_train, denom_tok_tensor)

Bases: object

Loss terms computed for a batch.

Parameters:
loss: Any
scalars: LossScalarBundle
log_ratio_train: Any
denom_tok_tensor: Any
property total_loss_scalar: float

Convenience accessor for the total loss scalar.

Returns:

Combined loss used for optimization/logging.

Return type:

float

property policy_loss_scalar: float

Convenience accessor for the policy loss scalar.

Returns:

Policy loss contribution from scalars.

Return type:

float

property clip_loss_scalar: float | None

Convenience accessor for the clip-objective scalar.

Returns:

Optional clip objective scalar, if enabled.

Return type:

float | None

property kl_loss_scalar: float

Convenience accessor for the KL scalar.

Returns:

KL divergence scalar for the batch.

Return type:

float

property weighted_kl_loss_scalar: float

Convenience accessor for the weighted KL scalar.

Returns:

Weighted KL scalar using the configured beta.

Return type:

float

class maxent_grpo.training.types.rewards.LossScalarBundle(total_loss, policy_loss, clip_loss, kl_loss, weighted_kl_loss)

Bases: object

Scalar contributions tracked for logging.

Parameters:
total_loss: float
policy_loss: float
clip_loss: float | None
kl_loss: float
weighted_kl_loss: float
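A minimal sketch of how LossOutputs and LossScalarBundle fit together. The classes below are stand-ins with the documented field and property names; the property bodies assume each accessor simply forwards to the matching field on `scalars`, which the descriptions suggest but do not confirm. All numeric values are illustrative.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LossScalarBundle:
    """Stand-in with the documented fields; not the library class."""

    total_loss: float
    policy_loss: float
    clip_loss: Optional[float]
    kl_loss: float
    weighted_kl_loss: float


@dataclass
class LossOutputs:
    """Stand-in; property bodies are assumed to forward to `scalars`."""

    loss: Any
    scalars: LossScalarBundle
    log_ratio_train: Any
    denom_tok_tensor: Any

    @property
    def total_loss_scalar(self) -> float:
        return self.scalars.total_loss

    @property
    def policy_loss_scalar(self) -> float:
        return self.scalars.policy_loss

    @property
    def clip_loss_scalar(self) -> Optional[float]:
        return self.scalars.clip_loss

    @property
    def kl_loss_scalar(self) -> float:
        return self.scalars.kl_loss

    @property
    def weighted_kl_loss_scalar(self) -> float:
        return self.scalars.weighted_kl_loss


# Illustrative values only; clip_loss=None models a disabled clip objective.
scalars = LossScalarBundle(
    total_loss=0.52,
    policy_loss=0.5,
    clip_loss=None,
    kl_loss=0.2,
    weighted_kl_loss=0.02,
)
# A plain float and None stand in for the fields typed as Any.
outputs = LossOutputs(
    loss=0.52, scalars=scalars, log_ratio_train=None, denom_tok_tensor=None
)
```

Callers that log metrics should treat `clip_loss_scalar` as optional and skip it when it is `None`.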

class maxent_grpo.training.types.rewards.PromptCacheEntry(input_ids, attention_mask)

Bases: object

Cached prompt tokenization used during scoring.

Parameters:
input_ids: List[int]
attention_mask: List[int]
property length: int

Return cached prompt length.

Returns:

Number of tokens in the cached prompt.

Return type:

int
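As a rough illustration of the `length` property, the stand-in below assumes the length is the number of entries in `input_ids`. The real property may count differently (for example, by summing `attention_mask`), and the token ids are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptCacheEntry:
    """Stand-in mirroring the documented fields; not the library class."""

    input_ids: List[int]
    attention_mask: List[int]

    @property
    def length(self) -> int:
        # Assumption: the prompt length is the number of cached token ids.
        return len(self.input_ids)


# Hypothetical token ids for a four-token prompt with no padding.
entry = PromptCacheEntry(input_ids=[101, 2054, 2003, 102], attention_mask=[1, 1, 1, 1])
```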

class maxent_grpo.training.types.rewards.PromptCompletionBatch(prompts, completions, metadata=None)

Bases: object

Flattened prompt/completion pairs.

Parameters:
prompts: List[str]
completions: List[str]
metadata: List[Dict[str, Any]] | None = None
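One way to picture the flattened layout: each prompt is repeated once per completion so that the two lists line up index by index. The `flatten_pairs` helper below is hypothetical and only sketches how grouped completions (as held by GenerationBatch) could be turned into this shape; the library's own conversion may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PromptCompletionBatch:
    """Stand-in mirroring the documented fields; not the library class."""

    prompts: List[str]
    completions: List[str]
    metadata: Optional[List[Dict[str, Any]]] = None


def flatten_pairs(
    prompts: List[str], grouped_completions: List[List[str]]
) -> PromptCompletionBatch:
    """Hypothetical helper: repeat each prompt once per completion."""
    flat_prompts: List[str] = []
    flat_completions: List[str] = []
    for prompt, group in zip(prompts, grouped_completions):
        for completion in group:
            flat_prompts.append(prompt)
            flat_completions.append(completion)
    return PromptCompletionBatch(prompts=flat_prompts, completions=flat_completions)


# Made-up prompts: the first has two completions, the second has one.
pairs = flatten_pairs(["2+2?", "3*3?"], [["4", "5"], ["9"]])
```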

class maxent_grpo.training.types.rewards.QDistribution(grouped, samples)

Bases: object

Sequence-level q-distribution.

Parameters:
grouped: List[List[float]]
samples: List[float]

class maxent_grpo.training.types.rewards.ReferenceLogprobs(ref_logp_sum, ref_tok_counts, ref_logp_sum_raw, ref_logp_mean, avg_completion_tokens, ref_token_logp=None, ref_token_mask=None)

Bases: object

Reference-model log-prob summaries.

Parameters:
ref_logp_sum: Any
ref_tok_counts: Any
ref_logp_sum_raw: Any
ref_logp_mean: float
avg_completion_tokens: float
ref_token_logp: Any | None = None
ref_token_mask: Any | None = None

class maxent_grpo.training.types.rewards.RewardComputation(total_utils, per_reward_values, advantage, pairs, q_distribution, moments, ref_logprob_meta=None, completion_metadata=None, entropy_bonus_scale=None, seed_semantic_entropies=None, seed_advantage_scales=None, seed_alpha_effective=None, seed_max_possible_entropy=None)

Bases: object

Utility values and statistics computed per batch.

Parameters:
total_utils: List[float]
per_reward_values: Dict[str, List[float]]
advantage: AdvantageStats
pairs: PromptCompletionBatch
q_distribution: QDistribution
moments: RewardMoments
ref_logprob_meta: List[Any | None] | None = None
completion_metadata: List[Dict[str, Any]] | None = None
entropy_bonus_scale: float | None = None
seed_semantic_entropies: List[float] | None = None
seed_advantage_scales: List[float] | None = None
seed_alpha_effective: float | None = None
seed_max_possible_entropy: float | None = None
property advantage_samples: List[float]

Return flattened advantage samples for logging.

Returns:

Advantage samples concatenated across prompts.

Return type:

list[float]

property q_grouped: List[List[float]]

Expose grouped q-values for downstream weighting.

Returns:

Per-prompt q values ready for weighting.

Return type:

list[list[float]]

property train_reward_mean: float

Return the cached mean reward.

Returns:

Average reward value for the processed batch.

Return type:

float

property train_reward_std: float

Return the cached reward standard deviation.

Returns:

Standard deviation of batch reward values.

Return type:

float
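The four properties above read as thin views over the nested fields. The sketch below uses stand-in dataclasses with the documented names and assumes each property forwards to `advantage.samples`, `q_distribution.grouped`, `moments.mean`, and `moments.std` respectively. The optional fields are omitted, and the `"accuracy"` reward name and all values are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AdvantageStats:
    grouped: List[List[float]]
    samples: List[float]


@dataclass
class QDistribution:
    grouped: List[List[float]]
    samples: List[float]


@dataclass
class RewardMoments:
    mean: float
    std: float


@dataclass
class PromptCompletionBatch:
    prompts: List[str]
    completions: List[str]


@dataclass
class RewardComputation:
    """Stand-in with the required fields only; not the library class."""

    total_utils: List[float]
    per_reward_values: Dict[str, List[float]]
    advantage: AdvantageStats
    pairs: PromptCompletionBatch
    q_distribution: QDistribution
    moments: RewardMoments

    # Assumption: each property forwards to the nested field named below.
    @property
    def advantage_samples(self) -> List[float]:
        return self.advantage.samples

    @property
    def q_grouped(self) -> List[List[float]]:
        return self.q_distribution.grouped

    @property
    def train_reward_mean(self) -> float:
        return self.moments.mean

    @property
    def train_reward_std(self) -> float:
        return self.moments.std


# One prompt with two completions; every number here is illustrative.
computation = RewardComputation(
    total_utils=[1.0, 0.0],
    per_reward_values={"accuracy": [1.0, 0.0]},
    advantage=AdvantageStats(grouped=[[0.5, -0.5]], samples=[0.5, -0.5]),
    pairs=PromptCompletionBatch(prompts=["2+2?", "2+2?"], completions=["4", "5"]),
    q_distribution=QDistribution(grouped=[[0.75, 0.25]], samples=[0.75, 0.25]),
    moments=RewardMoments(mean=0.5, std=0.5),
)
```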

class maxent_grpo.training.types.rewards.RewardMoments(mean, std)

Bases: object

Summary statistics for sequence rewards.

Parameters:
mean: float
std: float
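A small sketch of how such moments could be computed from a list of rewards. The `reward_moments` helper is hypothetical, and the choice of the population standard deviation is an assumption; the library may use the sample estimator or a tensor-based reduction instead.

```python
import statistics
from dataclasses import dataclass
from typing import List


@dataclass
class RewardMoments:
    """Stand-in mirroring the documented fields; not the library class."""

    mean: float
    std: float


def reward_moments(rewards: List[float]) -> RewardMoments:
    """Hypothetical helper: mean and population std of a non-empty list."""
    return RewardMoments(
        mean=statistics.fmean(rewards),
        std=statistics.pstdev(rewards),  # assumption: population, not sample
    )


# Made-up binary rewards for four sequences.
moments = reward_moments([1.0, 0.0, 1.0, 0.0])
```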

class maxent_grpo.training.types.rewards.ScoreBatch(prompt_entries, completion_ids, completion_attention_mask, pad_token_id, max_prompt_len, slice_size, total_sequences, score_tail_tokens=None)

Bases: object

Prompt cache entries and completion tokens ready for scoring.

Parameters:
prompt_entries: List[PromptCacheEntry]
completion_ids: Tensor
completion_attention_mask: Tensor
pad_token_id: int
max_prompt_len: int
slice_size: int
total_sequences: int
score_tail_tokens: int | None = None

class maxent_grpo.training.types.rewards.SequenceScores(cur_logp_sum, behavior_logp_sum, log_ratio_train, denom_tok_tensor, pooled_hidden=None, policy_entropy_sum=None, token_logp=None, token_mask=None, old_token_logp=None)

Bases: object

Bundle sequence-level log-prob statistics.

Parameters:
cur_logp_sum: Any
behavior_logp_sum: Any
log_ratio_train: Any
denom_tok_tensor: Any
pooled_hidden: Any | None = None
policy_entropy_sum: Any | None = None
token_logp: Any | None = None
token_mask: Any | None = None
old_token_logp: Any | None = None

class maxent_grpo.training.types.rewards.ValidationContext(evaluation, accelerator, model, tokenizer, reward, generator, logging, eval_reward=None, runtime=None, generation=None, scoring=None)

Bases: object

Handles required for the optional evaluation loop.

Parameters:
evaluation: EvaluationSettings
accelerator: Accelerator
model: PreTrainedModel
tokenizer: PreTrainedTokenizer
reward: RewardSpec
generator: GenerationFn[Any]
logging: LoggingHandles
eval_reward: RewardSpec | None = None
runtime: RuntimeHandles | None = None
generation: GenerationSettings | None = None
scoring: ScoringSettings | None = None