maxent_grpo.training
Core helpers powering the MaxEnt-GRPO trainer.
- class maxent_grpo.training.AnalyticControllerObjective
Bases: ControllerObjective
Closed-form gradients based on entropy/KL targets.
- name = 'analytic'
- compute(meta_ctx)
- Parameters:
meta_ctx (ControllerMetaContext)
- Return type:
ControllerGradients | None
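As an illustration of what a closed-form controller gradient driven by entropy/KL targets might look like, here is a minimal sketch. It does not reproduce the package's code: `MetaContextSketch`, `GradientsSketch`, their fields, and the sign convention are all assumptions made for this example. Only the `ControllerGradients | None` return shape is taken from the signature above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaContextSketch:
    """Hypothetical stand-in for ControllerMetaContext (fields are assumed)."""
    entropy: Optional[float]
    entropy_target: float
    kl: Optional[float]
    kl_target: float


@dataclass
class GradientsSketch:
    """Hypothetical stand-in for ControllerGradients (fields are assumed)."""
    tau_grad: float
    beta_grad: float


def analytic_compute(ctx: MetaContextSketch) -> Optional[GradientsSketch]:
    # Mirror the documented optional return: give up when a measurement is missing.
    if ctx.entropy is None or ctx.kl is None:
        return None
    # Assumed convention: parameters are updated by gradient descent,
    # so a negative gradient raises the parameter.
    tau_grad = ctx.entropy - ctx.entropy_target  # entropy below target -> raise tau
    beta_grad = ctx.kl_target - ctx.kl           # KL above target -> raise beta
    return GradientsSketch(tau_grad=tau_grad, beta_grad=beta_grad)


grads = analytic_compute(
    MetaContextSketch(entropy=1.0, entropy_target=1.5, kl=0.25, kl_target=0.125)
)
print(grads)
```

With these inputs both gradients are negative, so under the assumed descent update both tau and beta would increase.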
- class maxent_grpo.training.TruncatedBackpropControllerObjective(steps=1)
Bases: ControllerObjective
Truncated meta-gradient objective relying on a user-supplied callback.
- Parameters:
steps (int)
- name = 'truncated_backprop'
- compute(meta_ctx)
- Parameters:
meta_ctx (ControllerMetaContext)
- Return type:
ControllerGradients | None
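The callback-driven pattern described for this class can be sketched roughly as follows. This is an assumption-laden illustration, not the package's implementation: the `backprop_fn` field, its `(steps) -> (tau_grad, beta_grad)` signature, and the class names ending in `Sketch` are hypothetical. Only `steps`, `name`, and the optional return are taken from the entry above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class MetaContextSketch:
    """Hypothetical context; the real ControllerMetaContext fields are not shown here."""
    # Assumed: the user supplies a callable that unrolls `steps` updates
    # and returns (tau_grad, beta_grad).
    backprop_fn: Optional[Callable[[int], Tuple[float, float]]] = None


class TruncatedBackpropSketch:
    name = "truncated_backprop"

    def __init__(self, steps: int = 1) -> None:
        self.steps = steps

    def compute(self, ctx: MetaContextSketch) -> Optional[Tuple[float, float]]:
        # Without a callback there is nothing to differentiate through.
        if ctx.backprop_fn is None:
            return None
        # Delegate the truncated unroll to the user-supplied callback.
        return ctx.backprop_fn(self.steps)


objective = TruncatedBackpropSketch(steps=3)
result = objective.compute(
    MetaContextSketch(backprop_fn=lambda k: (float(k), -float(k)))
)
print(result)
```

Keeping the unroll inside the callback means the objective itself stays independent of any particular autograd framework, which is one plausible reason for such a design.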
- maxent_grpo.training.run_baseline_training(script_args, training_args, model_args)
Entrypoint that loads the data and model, builds the trainer, and runs GRPO.
If training_args.do_eval is enabled and an eval split exists, the function also evaluates on a small subsample of that split for speed.
- Parameters:
script_args (GRPOScriptArguments) – Script configuration including dataset and rewards.
training_args (GRPOConfig) – GRPO trainer arguments from TRL.
model_args (trl.ModelConfig) – Model configuration for TRL/transformers.
- Returns:
None. Side effects include training, evaluation, and checkpointing.
- Return type:
None
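The entry above mentions a small eval subsample but does not say how it is drawn. One way such a step could work is sketched below. The helper name `subsample_eval`, the `max_rows` cap, and the fixed seed are assumptions for illustration, not documented behaviour of `run_baseline_training`.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_eval(rows: Sequence[T], max_rows: int, seed: int = 0) -> List[T]:
    """Return at most `max_rows` rows, chosen reproducibly (hypothetical helper)."""
    if len(rows) <= max_rows:
        # Small splits are used as-is.
        return list(rows)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    keep = sorted(rng.sample(range(len(rows)), max_rows))
    # Sorting the kept indices preserves the original row order.
    return [rows[i] for i in keep]


eval_rows = [f"example-{i}" for i in range(100)]
subset = subsample_eval(eval_rows, max_rows=5, seed=0)
print(len(subset))
```

A seeded draw keeps evaluation metrics comparable across runs, since every run scores the same rows.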
Modules
- Minimal GRPO training entrypoint built on TRL.
- Training-specific CLI helpers (TRL argument parsing, etc.).
- Meta-controller objectives for tau/beta adaptation.
- Meta-optimizer orchestration for controller updates.
- Dataset loading helpers for the training pipeline.
- Validation helpers for the MaxEnt-GRPO training loop.
- Copyright 2025 Liv d'Aliberti
- Metrics and logging helpers for the MaxEnt-GRPO training loop.
- Training-time integration patches.
- Helpers for preparing generation/scoring artifacts used by the training loop.
- Reward and generation helpers extracted from the training loop.
- Shared helper utilities for the MaxEnt-GRPO training pipeline.
- Runtime utilities split by concern for the MaxEnt-GRPO training stack.
- Compatibility facade for scoring helpers.
- Batch construction and slice materialization helpers for scoring.
- Scoring helpers extracted from the MaxEnt-GRPO training loop.
- Model logprob computation and sequence-score assembly helpers.
- Reference-logprob and vLLM metadata scoring helpers.
- Trainer callback for official SEED paper-style eval against the live vLLM server.
- Training loop state helpers for controller and checkpoint management.
- Trainer helper hooks used by the active TRL/HF training path.
- Custom TRL GRPOTrainer wrapper used by the MaxEnt-GRPO pipelines.
- Utilities to safely integrate DeepSpeed ZeRO with optional dependencies.