maxent_grpo.training.types.rewards
Per-batch dataclasses shared across the pipeline, loss, and metrics code.
Classes
| AdvantageStats | Grouped and flattened advantages. |
| BatchDiagnostics | Additional scalar stats recorded for metrics. |
| GenerationBatch | Completions grouped per prompt after filtering. |
| LengthStats | Summary of completion lengths for metrics. |
| LossOutputs | Loss terms computed for a batch. |
| LossScalarBundle | Scalar contributions tracked for logging. |
| PromptCacheEntry | Cached prompt tokenization used during scoring. |
| PromptCompletionBatch | Flattened prompt/completion pairs. |
| QDistribution | Sequence-level q-distribution. |
| ReferenceLogprobs | Reference-model log-prob summaries. |
| RewardComputation | Utility values and statistics computed per batch. |
| RewardMoments | Summary statistics for sequence rewards. |
| ScoreBatch | Prompt cache entries and completion tokens ready for scoring. |
| SequenceScores | Bundle sequence-level log-prob statistics. |
| ValidationContext | Handles required for the optional evaluation loop. |
- class maxent_grpo.training.types.rewards.AdvantageStats(grouped, samples)
Bases: object
Grouped and flattened advantages.
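The relationship between the two fields can be illustrated with a minimal sketch. This page does not state the field types, so the sketch assumes that `grouped` holds one list of advantages per prompt and that `samples` holds the same values flattened in prompt-major order. `AdvantageStatsSketch` and `build_advantage_stats` are hypothetical names used only for illustration; they are not part of the library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdvantageStatsSketch:
    """Hypothetical stand-in for AdvantageStats; field types are assumed."""

    grouped: List[List[float]]  # assumed: one inner list per prompt
    samples: List[float]        # assumed: the same values, flattened in order


def build_advantage_stats(grouped: List[List[float]]) -> AdvantageStatsSketch:
    """Flatten per-prompt advantages so each value lines up with one completion."""
    samples = [adv for group in grouped for adv in group]
    return AdvantageStatsSketch(grouped=grouped, samples=samples)


# Two prompts with two and three completions respectively.
stats = build_advantage_stats([[0.5, -0.5], [1.0, 0.0, -1.0]])
print(len(stats.grouped), len(stats.samples))
```

Keeping both views would let group-wise code (for example, per-prompt normalization) and per-sequence code (for example, loss weighting) read the same values without re-deriving the layout.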
- class maxent_grpo.training.types.rewards.BatchDiagnostics(kl_value, clip_ratio, clip_ratio_low_mean, clip_ratio_low_min, clip_ratio_high_mean, clip_ratio_high_max, clip_ratio_region_mean, kl_per_token_by_len_bucket, kl_token_count_by_len_bucket)
Bases: object
Additional scalar stats recorded for metrics.
- class maxent_grpo.training.types.rewards.GenerationBatch(prompts, answers, grouped_completions, grouped_ref_meta, grouped_completion_info=None)
Bases: object
Completions grouped per prompt after filtering.
- class maxent_grpo.training.types.rewards.LengthStats(min_length, mean_length, max_length, clipped_ratio, min_terminated, mean_terminated, max_terminated)
Bases: object
Summary of completion lengths for metrics.
- class maxent_grpo.training.types.rewards.LossOutputs(loss, scalars, log_ratio_train, denom_tok_tensor)
Bases: object
Loss terms computed for a batch.
- Parameters:
loss (Any)
scalars (LossScalarBundle)
log_ratio_train (Any)
denom_tok_tensor (Any)
- scalars: LossScalarBundle
- property total_loss_scalar: float
Convenience accessor for the total loss scalar.
- Returns:
Combined loss used for optimization/logging.
- Return type:
float
- property policy_loss_scalar: float
Convenience accessor for the policy loss scalar.
- Returns:
Policy loss contribution from scalars.
- Return type:
float
- property clip_loss_scalar: float | None
Convenience accessor for the clip-objective scalar.
- Returns:
Optional clip objective scalar, if enabled.
- Return type:
float | None
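The accessor pattern described above can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes both classes are plain dataclasses, that each property simply reads the like-named field on `scalars`, and that `clip_loss` is optional because `clip_loss_scalar` is typed `float | None`. The `*Sketch` class names and the scalar field types are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LossScalarBundleSketch:
    """Hypothetical stand-in for LossScalarBundle; field types are assumed."""

    total_loss: float
    policy_loss: float
    clip_loss: Optional[float]
    kl_loss: float
    weighted_kl_loss: float


@dataclass
class LossOutputsSketch:
    """Hypothetical stand-in for LossOutputs."""

    loss: Any
    scalars: LossScalarBundleSketch
    log_ratio_train: Any
    denom_tok_tensor: Any

    @property
    def total_loss_scalar(self) -> float:
        # Assumed delegation to the bundled scalar.
        return self.scalars.total_loss

    @property
    def policy_loss_scalar(self) -> float:
        return self.scalars.policy_loss

    @property
    def clip_loss_scalar(self) -> Optional[float]:
        # None models the case where the clip objective is disabled.
        return self.scalars.clip_loss


# Arbitrary illustrative values; tensors are replaced by None placeholders.
outputs = LossOutputsSketch(
    loss=None,
    scalars=LossScalarBundleSketch(
        total_loss=1.25,
        policy_loss=1.0,
        clip_loss=None,
        kl_loss=0.5,
        weighted_kl_loss=0.25,
    ),
    log_ratio_train=None,
    denom_tok_tensor=None,
)
print(outputs.total_loss_scalar, outputs.policy_loss_scalar, outputs.clip_loss_scalar)
```

Under these assumptions, logging code can read `outputs.total_loss_scalar` directly instead of reaching through `outputs.scalars`.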
- class maxent_grpo.training.types.rewards.LossScalarBundle(total_loss, policy_loss, clip_loss, kl_loss, weighted_kl_loss)
Bases: object
Scalar contributions tracked for logging.
- class maxent_grpo.training.types.rewards.PromptCacheEntry(input_ids, attention_mask)
Bases: object
Cached prompt tokenization used during scoring.
- class maxent_grpo.training.types.rewards.PromptCompletionBatch(prompts, completions, metadata=None)
Bases: object
Flattened prompt/completion pairs.
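One plausible way to obtain flattened pairs from grouped completions is sketched below. The sketch assumes that `prompts` and `completions` are parallel lists of strings, with each prompt repeated once per completion in its group. `PromptCompletionBatchSketch` and `flatten_pairs` are hypothetical names; the library's own construction path is not documented on this page.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class PromptCompletionBatchSketch:
    """Hypothetical stand-in for PromptCompletionBatch; field types are assumed."""

    prompts: List[str]
    completions: List[str]
    metadata: Optional[List[Any]] = None


def flatten_pairs(
    prompts: List[str], grouped_completions: List[List[str]]
) -> PromptCompletionBatchSketch:
    """Repeat each prompt once per completion so both lists stay aligned."""
    flat_prompts: List[str] = []
    flat_completions: List[str] = []
    for prompt, group in zip(prompts, grouped_completions):
        for completion in group:
            flat_prompts.append(prompt)
            flat_completions.append(completion)
    return PromptCompletionBatchSketch(flat_prompts, flat_completions)


batch = flatten_pairs(["p1", "p2"], [["a", "b"], ["c"]])
print(list(zip(batch.prompts, batch.completions)))
```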
- class maxent_grpo.training.types.rewards.QDistribution(grouped, samples)
Bases: object
Sequence-level q-distribution.
- class maxent_grpo.training.types.rewards.ReferenceLogprobs(ref_logp_sum, ref_tok_counts, ref_logp_sum_raw, ref_logp_mean, avg_completion_tokens, ref_token_logp=None, ref_token_mask=None)
Bases: object
Reference-model log-prob summaries.
- class maxent_grpo.training.types.rewards.RewardComputation(total_utils, per_reward_values, advantage, pairs, q_distribution, moments, ref_logprob_meta=None, completion_metadata=None, entropy_bonus_scale=None, seed_semantic_entropies=None, seed_advantage_scales=None, seed_alpha_effective=None, seed_max_possible_entropy=None)
Bases: object
Utility values and statistics computed per batch.
- Parameters:
advantage (AdvantageStats)
pairs (PromptCompletionBatch)
q_distribution (QDistribution)
moments (RewardMoments)
entropy_bonus_scale (float | None)
seed_alpha_effective (float | None)
seed_max_possible_entropy (float | None)
- advantage: AdvantageStats
- pairs: PromptCompletionBatch
- q_distribution: QDistribution
- moments: RewardMoments
- class maxent_grpo.training.types.rewards.RewardMoments(mean, std)
Bases: object
Summary statistics for sequence rewards.
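A minimal sketch of how such a summary could be produced from a list of per-sequence rewards follows. It assumes `mean` and `std` are plain floats and uses the population standard deviation; the estimator the library actually uses is not stated on this page. `RewardMomentsSketch` and `summarize_rewards` are hypothetical names.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RewardMomentsSketch:
    """Hypothetical stand-in for RewardMoments; field types are assumed."""

    mean: float
    std: float


def summarize_rewards(rewards: Sequence[float]) -> RewardMomentsSketch:
    """Return the mean and population standard deviation of the rewards."""
    if not rewards:
        # Assumed convention for an empty batch; the library may differ.
        return RewardMomentsSketch(mean=0.0, std=0.0)
    return RewardMomentsSketch(
        mean=statistics.fmean(rewards),
        std=statistics.pstdev(rewards),
    )


moments = summarize_rewards([1.0, 2.0, 3.0, 4.0])
print(moments.mean, moments.std)
```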
- class maxent_grpo.training.types.rewards.ScoreBatch(prompt_entries, completion_ids, completion_attention_mask, pad_token_id, max_prompt_len, slice_size, total_sequences, score_tail_tokens=None)
Bases: object
Prompt cache entries and completion tokens ready for scoring.
- prompt_entries: List['PromptCacheEntry']
- completion_ids: Tensor
- completion_attention_mask: Tensor
- class maxent_grpo.training.types.rewards.SequenceScores(cur_logp_sum, behavior_logp_sum, log_ratio_train, denom_tok_tensor, pooled_hidden=None, policy_entropy_sum=None, token_logp=None, token_mask=None, old_token_logp=None)
Bases: object
Bundle sequence-level log-prob statistics.
- class maxent_grpo.training.types.rewards.ValidationContext(evaluation, accelerator, model, tokenizer, reward, generator, logging, eval_reward=None, runtime=None, generation=None, scoring=None)
Bases: object
Handles required for the optional evaluation loop.
- Parameters:
evaluation (EvaluationSettings)
accelerator (Accelerator)
model (PreTrainedModel)
tokenizer (PreTrainedTokenizer)
reward (RewardSpec)
generator (GenerationFn[Any])
logging (LoggingHandles)
eval_reward (Optional[RewardSpec])
runtime (Optional['RuntimeHandles'])
generation (Optional['GenerationSettings'])
scoring (Optional['ScoringSettings'])
- evaluation: EvaluationSettings
- accelerator: Accelerator
- model: PreTrainedModel
- tokenizer: PreTrainedTokenizer
- reward: RewardSpec
- generator: GenerationFn[Any]
- logging: LoggingHandles
- eval_reward: RewardSpec | None = None