maxent_grpo.training.optim
Copyright 2025 Liv d’Aliberti
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Optimizer and gradient utilities shared across the training loop.
Functions
Drop optimizer kwargs unsupported by lightweight stubs or callables.
apply_learning_rate – Set the provided learning rate on all optimizer parameter groups.
build_optimization_handles – Construct an optimizer/scheduler bundle that mirrors GRPO defaults.
clip_grad_norm_local – Clip gradients via Accelerate when possible and return the norm.
configure_accumulation_steps – Pass gradient accumulation steps to Accelerate when supported.
detect_deepspeed_state – Return DeepSpeed usage flags derived from the accelerator state.
epoch_progress – Return floating-point epoch progress for logging.
optimizer_step – Perform an optimizer step and advance state.global_step.
require_accumulation_context – Return an accumulation context compatible with the current strategy.
scheduled_learning_rate – Return the learning rate for the given optimizer step.
sync_gradients_enabled – Return the sync_gradients flag and log it for debugging.
Classes
DeepspeedState – Describe whether the current accelerator session uses DeepSpeed.
- class maxent_grpo.training.optim.DeepspeedState(use_deepspeed, zero_stage)
Bases: object
Describe whether the current accelerator session uses DeepSpeed.
- maxent_grpo.training.optim.apply_learning_rate(handles, learning_rate)
Set the provided learning rate on all optimizer parameter groups.
- Parameters:
handles (training.types.OptimizerHandles) – Wrapper containing the primary/base optimizers.
learning_rate (float) – Learning rate to apply across all parameter groups.
- Return type:
None
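The behaviour described above can be sketched as follows, assuming each optimizer exposes a PyTorch-style `param_groups` list of dicts. The `set_learning_rate` helper and the stub optimizer are hypothetical; the real function takes an `OptimizerHandles` wrapper whose fields are not documented on this page.

```python
from types import SimpleNamespace


def set_learning_rate(optimizers, learning_rate):
    """Write ``learning_rate`` into every parameter group (hypothetical helper)."""
    for optimizer in optimizers:
        if optimizer is None:
            continue  # tolerate a missing secondary optimizer
        for group in getattr(optimizer, "param_groups", []):
            group["lr"] = float(learning_rate)


# Stand-in for a torch optimizer: only ``param_groups`` is needed here.
stub = SimpleNamespace(param_groups=[{"lr": 1e-3}, {"lr": 1e-3}])
set_learning_rate([stub, None], 5e-5)
```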
- maxent_grpo.training.optim.clip_grad_norm_local(model, accelerator, max_grad_norm)
Clip gradients via Accelerate when possible and return the norm.
- Parameters:
model (torch.nn.Module) – Model whose gradients should be clipped.
accelerator (Accelerator) – Accelerate handle providing clip_grad_norm_.
max_grad_norm (float) – Maximum norm applied during clipping.
- Returns:
Gradient norm when clipping occurs, otherwise None.
- Return type:
float | None
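One way to read "clip via Accelerate when possible" is a duck-typed call to `accelerator.clip_grad_norm_(parameters, max_norm)`, returning `None` when the method is missing. The sketch below is an illustration only: `clip_if_possible` is a hypothetical name, and treating a non-positive `max_grad_norm` as "no clipping" is an assumption not confirmed by this page.

```python
from types import SimpleNamespace


def clip_if_possible(model, accelerator, max_grad_norm):
    """Return the gradient norm if clipping ran, else None (hypothetical helper)."""
    if max_grad_norm is None or max_grad_norm <= 0:
        return None  # assumption: non-positive limits disable clipping
    clip_fn = getattr(accelerator, "clip_grad_norm_", None)
    if not callable(clip_fn):
        return None  # stub accelerator without clipping support
    norm = clip_fn(model.parameters(), max_grad_norm)
    # Some backends return None instead of a norm; pass that through.
    return None if norm is None else float(norm)


# Stubs standing in for a model and an Accelerate handle.
model = SimpleNamespace(parameters=lambda: [])
accel = SimpleNamespace(clip_grad_norm_=lambda params, max_norm: 2.5)
norm = clip_if_possible(model, accel, 1.0)
```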
- maxent_grpo.training.optim.configure_accumulation_steps(accelerator, grad_accum_steps)
Pass gradient accumulation steps to Accelerate when supported.
- Parameters:
accelerator (Accelerator) – Accelerate handle used to configure accumulation.
grad_accum_steps (int) – Desired gradient accumulation steps.
- Return type:
None
- maxent_grpo.training.optim.detect_deepspeed_state(accelerator)
Return DeepSpeed usage flags derived from the accelerator state.
- Parameters:
accelerator (Accelerator) – Accelerator instance whose state is inspected.
- Returns:
DeepspeedState describing DeepSpeed usage and ZeRO stage.
- Return type:
DeepspeedState
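A minimal sketch of such a detection step, assuming the accelerator state exposes a `deepspeed_plugin` attribute with a `zero_stage` field (as Accelerate's DeepSpeed integration does). The field types on the dataclass and the `detect_state` name are assumptions for illustration; only the two field names come from the class signature above.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class DeepspeedState:
    use_deepspeed: bool  # assumed type
    zero_stage: int      # assumed type; 0 when DeepSpeed is not active


def detect_state(accelerator):
    """Inspect ``accelerator.state`` for a DeepSpeed plugin (hypothetical helper)."""
    state = getattr(accelerator, "state", None)
    plugin = getattr(state, "deepspeed_plugin", None)
    if plugin is None:
        return DeepspeedState(use_deepspeed=False, zero_stage=0)
    stage = int(getattr(plugin, "zero_stage", 0) or 0)
    return DeepspeedState(use_deepspeed=True, zero_stage=stage)


# Stub accelerator that reports a ZeRO stage 2 plugin.
plugin = SimpleNamespace(zero_stage=2)
ds_accel = SimpleNamespace(state=SimpleNamespace(deepspeed_plugin=plugin))
ds_state = detect_state(ds_accel)
```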
- maxent_grpo.training.optim.epoch_progress(schedule, epoch, step_in_epoch)
Return floating-point epoch progress for logging.
- Parameters:
schedule (OptimizationSchedule) – Optimization schedule describing steps per epoch.
epoch (int) – Current epoch index (zero-based).
step_in_epoch (int) – Step index inside the current epoch.
- Returns:
Floating-point epoch progress suitable for logs.
- Return type:
float
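Fractional epoch progress is commonly computed as the epoch index plus the fraction of the epoch already completed. The sketch below assumes the schedule carries a `steps_per_epoch` attribute; that field name, the fallback for a missing value, and the `fractional_epoch` helper are hypothetical.

```python
from types import SimpleNamespace


def fractional_epoch(schedule, epoch, step_in_epoch):
    """Return epoch + step/steps_per_epoch (hypothetical helper)."""
    steps = getattr(schedule, "steps_per_epoch", None)  # hypothetical field name
    if not steps or steps <= 0:
        return float(epoch)  # assumption: fall back to the integer epoch
    return float(epoch) + step_in_epoch / float(steps)


schedule = SimpleNamespace(steps_per_epoch=4)
progress = fractional_epoch(schedule, epoch=2, step_in_epoch=1)
```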
- maxent_grpo.training.optim.optimizer_step(ctx, state, current_lr)
Perform an optimizer step and advance state.global_step.
- Parameters:
ctx (training.types.TrainingLoopContext) – Training context containing optimizer handles.
state (TrainingLoopState) – Mutable training state tracking global steps.
current_lr (float) – Learning rate to apply before stepping.
- Returns:
Gradient norm (if available) for metrics/logging.
- Return type:
float | None
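The sequence implied by the description (apply the learning rate, step, advance the counter) might look like the sketch below. It is a simplified illustration, not the library's implementation: `step_once`, `StubOptimizer`, and the `grad_norm` argument are hypothetical, and the real function reads its optimizer from `ctx` rather than taking it directly.

```python
from types import SimpleNamespace


class StubOptimizer:
    """Minimal stand-in exposing the torch optimizer methods used below."""

    def __init__(self):
        self.param_groups = [{"lr": 0.0}]
        self.steps = 0

    def step(self):
        self.steps += 1

    def zero_grad(self, set_to_none=True):
        pass


def step_once(optimizer, state, current_lr, grad_norm=None):
    """Apply the learning rate, step, reset grads, bump ``global_step``."""
    for group in optimizer.param_groups:
        group["lr"] = current_lr
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    state.global_step += 1
    return grad_norm  # passed through for metrics/logging


opt = StubOptimizer()
loop_state = SimpleNamespace(global_step=0)
reported = step_once(opt, loop_state, current_lr=1e-4, grad_norm=0.7)
```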
- maxent_grpo.training.optim.require_accumulation_context(accelerator, model)
Return an accumulation context compatible with the current strategy.
- Parameters:
accelerator (Accelerator) – Accelerator instance providing accumulate.
model (Any) – Model passed to accelerator.accumulate when available.
- Returns:
Context manager used to guard gradient accumulation.
- Raises:
RuntimeError – If accumulation is required but unavailable.
- Return type:
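A sketch of the pattern described above, assuming the accelerator follows Accelerate's `accumulate(model)` context-manager API. The `required` flag and the `nullcontext` fallback are hypothetical; this page does not say how the real function decides that accumulation is required.

```python
from contextlib import nullcontext
from types import SimpleNamespace


def accumulation_context(accelerator, model, required=True):
    """Return ``accelerator.accumulate(model)`` or fail loudly (hypothetical helper)."""
    accumulate = getattr(accelerator, "accumulate", None)
    if callable(accumulate):
        return accumulate(model)
    if required:
        raise RuntimeError("accelerator does not provide accumulate()")
    return nullcontext()  # assumption: no-op guard when accumulation is optional


# Stub accelerator whose accumulate() simply yields the model back.
ok_accel = SimpleNamespace(accumulate=lambda m: nullcontext(m))
with accumulation_context(ok_accel, "model") as guarded:
    seen = guarded
```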
- maxent_grpo.training.optim.scheduled_learning_rate(schedule, handles, step)
Return the learning rate for the given optimizer step.
- maxent_grpo.training.optim.sync_gradients_enabled(accelerator, global_step)
Return the sync_gradients flag and log it for debugging.
- Parameters:
accelerator (Accelerator) – Accelerator instance exposing sync_gradients.
global_step (int) – Current optimizer step used for debug logging.
- Returns:
True if gradients should be synchronized this step.
- Return type:
bool
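Reading and logging the flag can be sketched as below. Accelerate exposes `sync_gradients` on the accelerator; defaulting to `True` when the attribute is absent, the logger name, and the `gradients_synced` helper are assumptions made for this illustration.

```python
import logging
from types import SimpleNamespace

LOG = logging.getLogger(__name__)


def gradients_synced(accelerator, global_step):
    """Return the accelerator's ``sync_gradients`` flag (hypothetical helper)."""
    # Assumption: a stub without the attribute behaves as "always sync".
    flag = bool(getattr(accelerator, "sync_gradients", True))
    LOG.debug("step=%d sync_gradients=%s", global_step, flag)
    return flag


mid_accumulation = SimpleNamespace(sync_gradients=False)
should_sync = gradients_synced(mid_accumulation, global_step=10)
```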
- maxent_grpo.training.optim.build_optimization_handles(model, cfg)
Construct an optimizer/scheduler bundle that mirrors GRPO defaults.
The implementation follows the same AdamW parameter‑group semantics used by Hugging Face Trainer/TRL GRPO:
Parameters whose names contain "bias" or "LayerNorm.weight" are placed in a no‑decay group (weight_decay=0.0).
All other trainable parameters share a decay group with weight_decay=cfg.weight_decay.
Optimizer hyperparameters (learning rate, betas, epsilon) are taken from the GRPO/TrainingArguments instance so that MaxEnt runs stay aligned with the baseline GRPO trainer.
- Parameters:
- Returns:
OptimizerHandles with optimizer and metadata.
- Return type:
OptimizerHandles
- Raises:
ImportError – If torch is unavailable.
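The parameter-group rule described for this function can be sketched without torch by operating on `(name, parameter)` pairs such as those yielded by `model.named_parameters()`. The `split_param_groups` helper and `NO_DECAY_MARKERS` constant are hypothetical names, and the string parameters below are placeholders for real tensors.

```python
NO_DECAY_MARKERS = ("bias", "LayerNorm.weight")


def split_param_groups(named_params, weight_decay):
    """Split parameters into decay / no-decay groups by name (hypothetical helper)."""
    decay, no_decay = [], []
    for name, param in named_params:
        if not getattr(param, "requires_grad", True):
            continue  # frozen parameters are left out of both groups
        if any(marker in name for marker in NO_DECAY_MARKERS):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Placeholder parameters; a real call would pass model.named_parameters().
named = [
    ("encoder.weight", "w"),
    ("encoder.bias", "b"),
    ("encoder.LayerNorm.weight", "ln"),
]
groups = split_param_groups(named, weight_decay=0.01)
```

With torch installed, a list of this shape can be passed as the first argument to `torch.optim.AdamW` along with `lr`, `betas`, and `eps`.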