maxent_grpo.training

Core helpers powering the MaxEnt-GRPO trainer.

class maxent_grpo.training.AnalyticControllerObjective

Bases: ControllerObjective

Closed-form gradients based on entropy/KL targets.

name = 'analytic'

compute(meta_ctx)

Parameters: meta_ctx (ControllerMetaContext)
Return type: ControllerGradients | None
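
This page does not list the fields of ControllerMetaContext or ControllerGradients, so the following is only a minimal sketch of what an objective with this shape (a `name` attribute and a `compute(meta_ctx)` method that returns gradients or `None`) might look like. Every class and field name here (`MetaContextSketch`, `GradientsSketch`, `entropy`, `target_entropy`, `kl`, `target_kl`, `tau_grad`, `beta_grad`) is a hypothetical stand-in rather than the library's API, and the "measured minus target" formulas are an assumed form, not the library's actual closed-form expressions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaContextSketch:
    """Hypothetical stand-in for ControllerMetaContext."""

    entropy: Optional[float]   # measured policy entropy (assumed field)
    target_entropy: float      # entropy target (assumed field)
    kl: Optional[float]        # measured KL to the reference (assumed field)
    target_kl: float           # KL target (assumed field)


@dataclass
class GradientsSketch:
    """Hypothetical stand-in for ControllerGradients."""

    tau_grad: float
    beta_grad: float


class AnalyticObjectiveSketch:
    """Illustrative objective: signed errors against entropy/KL targets."""

    name = "analytic"

    def compute(self, meta_ctx: MetaContextSketch) -> Optional[GradientsSketch]:
        # Mirror the `ControllerGradients | None` return type: give up
        # when a required measurement is missing.
        if meta_ctx.entropy is None or meta_ctx.kl is None:
            return None
        # Assumed form: gradient = measured value minus its target.
        return GradientsSketch(
            tau_grad=meta_ctx.entropy - meta_ctx.target_entropy,
            beta_grad=meta_ctx.kl - meta_ctx.target_kl,
        )
```

Returning `None` instead of raising lets a caller skip the controller update for that step, which is one plausible reading of the optional return type.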

class maxent_grpo.training.TruncatedBackpropControllerObjective(steps=1)

Bases: ControllerObjective

Truncated meta-gradient objective relying on a user-supplied callback.

Parameters: steps (int)

name = 'truncated_backprop'

compute(meta_ctx)

Parameters: meta_ctx (ControllerMetaContext)
Return type: ControllerGradients | None
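
How the user-supplied callback is attached is not documented on this page. As a rough sketch only, the pattern could resemble the code below, where the callback is assumed to live on the context object and to receive the truncation length. The names `MetaContextSketch`, `backprop_fn`, and `GradPair` are hypothetical and do not come from the library.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical gradient container: (tau_grad, beta_grad).
GradPair = Tuple[float, float]


@dataclass
class MetaContextSketch:
    """Hypothetical stand-in for ControllerMetaContext."""

    # Assumed field: callback that unrolls `steps` inner updates and
    # returns the resulting meta-gradients.
    backprop_fn: Optional[Callable[[int], GradPair]] = None


class TruncatedObjectiveSketch:
    """Illustrative objective that delegates to a user callback."""

    name = "truncated_backprop"

    def __init__(self, steps: int = 1) -> None:
        self.steps = steps

    def compute(self, meta_ctx: MetaContextSketch) -> Optional[GradPair]:
        # Without a callback there is nothing to differentiate through,
        # so return None, matching the optional return type.
        if meta_ctx.backprop_fn is None:
            return None
        return meta_ctx.backprop_fn(self.steps)
```

A caller would then supply its own unrolling logic, for example `MetaContextSketch(backprop_fn=my_unroll)` where `my_unroll` is user code.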

maxent_grpo.training.run_baseline_training(script_args, training_args, model_args)

Entry point that loads the dataset and model, builds the trainer, and runs GRPO training.

If training_args.do_eval is enabled and an eval split exists, the function also evaluates on a small subsample of the eval split to keep evaluation fast.

Parameters:

script_args (GRPOScriptArguments) – Script configuration including dataset and rewards.
training_args (GRPOConfig) – GRPO trainer arguments from TRL.
model_args (trl.ModelConfig) – Model configuration for TRL/transformers.

Returns:

None. Side effects include training, evaluation, and checkpointing.

Return type:

None
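
The subsample size and selection strategy are not stated here. The sketch below only illustrates the general idea of capping an eval split for speed. The helper `subsample_eval`, its `max_rows` and `seed` parameters, and the seeded random selection are assumptions for illustration, not the function's actual behavior.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_eval(rows: Sequence[T], max_rows: int, seed: int = 0) -> List[T]:
    """Return at most `max_rows` items from `rows`, keeping original order.

    Hypothetical helper: a seeded draw keeps the subset reproducible
    across runs.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    if len(rows) <= max_rows:
        # Split is already small enough; use it unchanged.
        return list(rows)
    rng = random.Random(seed)
    # Sample indices without replacement, then restore dataset order.
    keep = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in keep]
```

Fixing the seed means every evaluation pass scores the same examples, so metrics stay comparable between checkpoints.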

Modules

`baseline`	Minimal GRPO training entrypoint built on TRL.
`cli`	Training-specific CLI helpers (TRL argument parsing, etc.).
`controller_objective`	Meta-controller objectives for tau/beta adaptation.
`controller_optimizer`	Meta-optimizer orchestration for controller updates.
`data`	Dataset loading helpers for the training pipeline.
`eval`	Validation helpers for the MaxEnt-GRPO training loop.
`generation`	Copyright 2025 Liv d'Aliberti
`metrics`	Metrics and logging helpers for the MaxEnt-GRPO training loop.
`optim`	Copyright 2025 Liv d'Aliberti
`patches`	Training-time integration patches.
`pipeline`	Helpers for preparing generation/scoring artifacts used by the training loop.
`rewards`	Reward and generation helpers extracted from the training loop.
`rollout`	Copyright 2025 Liv d'Aliberti
`run_helpers`	Shared helper utilities for the MaxEnt-GRPO training pipeline.
`runtime`	Runtime utilities split by concern for the MaxEnt-GRPO training stack.
`scoring`	Compatibility facade for scoring helpers.
`scoring_batching`	Batch construction and slice materialization helpers for scoring.
`scoring_common`	Scoring helpers extracted from the MaxEnt-GRPO training loop.
`scoring_logprob`	Model logprob computation and sequence-score assembly helpers.
`scoring_reference`	Reference-logprob and vLLM metadata scoring helpers.
`seed_paper_eval_callback`	Trainer callback for official SEED paper-style eval against the live vLLM server.
`state`	Training loop state helpers for controller and checkpoint management.
`telemetry`	Copyright 2025 Liv d'Aliberti
`trainer_hooks`	Trainer helper hooks used by the active TRL/HF training path.
`trl_trainer`	Custom TRL GRPOTrainer wrapper used by the MaxEnt-GRPO pipelines.
`types`	Copyright 2025 Liv d'Aliberti
`weighting`	Copyright 2025 Liv d'Aliberti
`zero_utils`	Utilities to safely integrate DeepSpeed ZeRO with optional dependencies.