maxent_grpo.training

Core helpers powering the MaxEnt-GRPO trainer.

class maxent_grpo.training.AnalyticControllerObjective

Bases: ControllerObjective

Closed-form gradients based on entropy/KL targets.

name = 'analytic'

compute(meta_ctx)
Parameters:

meta_ctx (ControllerMetaContext)

Return type:

ControllerGradients | None
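
As an illustration of what a closed-form controller gradient could look like, the sketch below uses the signed gap between a measured statistic and its target as the gradient for each controller parameter. Everything in it is an assumption made for the example: the stand-in classes, the context fields (`entropy`, `target_entropy`, `kl`, `target_kl`), the gradient fields (`tau_grad`, `beta_grad`), and the sign convention are hypothetical and are not taken from the library's source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchMetaContext:
    """Hypothetical stand-in for ControllerMetaContext."""
    entropy: Optional[float] = None
    target_entropy: Optional[float] = None
    kl: Optional[float] = None
    target_kl: Optional[float] = None


@dataclass
class SketchGradients:
    """Hypothetical stand-in for ControllerGradients."""
    tau_grad: float
    beta_grad: float


class SketchAnalyticObjective:
    """Illustrative objective; not the library's AnalyticControllerObjective."""

    name = "analytic"

    def compute(self, meta_ctx: SketchMetaContext) -> Optional[SketchGradients]:
        values = (
            meta_ctx.entropy,
            meta_ctx.target_entropy,
            meta_ctx.kl,
            meta_ctx.target_kl,
        )
        # Mirror the `ControllerGradients | None` return type: give up
        # when any required statistic is missing.
        if any(v is None for v in values):
            return None
        # Assumed convention: the caller applies `param -= lr * grad`.
        # Entropy below target gives a negative tau_grad, so tau increases.
        tau_grad = meta_ctx.entropy - meta_ctx.target_entropy
        # KL above target gives a negative beta_grad, so beta increases.
        beta_grad = meta_ctx.target_kl - meta_ctx.kl
        return SketchGradients(tau_grad=tau_grad, beta_grad=beta_grad)
```

Returning `None` for an incomplete context lets a caller skip the controller update for that step instead of failing.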

class maxent_grpo.training.TruncatedBackpropControllerObjective(steps=1)

Bases: ControllerObjective

Truncated meta-gradient objective relying on a user-supplied callback.

Parameters:

steps (int)

name = 'truncated_backprop'

compute(meta_ctx)
Parameters:

meta_ctx (ControllerMetaContext)

Return type:

ControllerGradients | None
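
The callback-driven pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the reference does not say where the user-supplied callback lives or what it receives, so the `backprop_fn` field on the context, its `(steps) -> (tau_grad, beta_grad)` signature, and the clamping of `steps` to at least 1 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical callback type: takes the number of unroll steps and
# returns a (tau_grad, beta_grad) pair.
GradCallback = Callable[[int], Tuple[float, float]]


@dataclass
class SketchMetaContext:
    """Hypothetical stand-in for ControllerMetaContext."""
    backprop_fn: Optional[GradCallback] = None


class SketchTruncatedObjective:
    """Illustrative objective; not the library's implementation."""

    name = "truncated_backprop"

    def __init__(self, steps: int = 1) -> None:
        # Assumption: at least one unroll step is always performed.
        self.steps = max(1, int(steps))

    def compute(self, meta_ctx: SketchMetaContext) -> Optional[Tuple[float, float]]:
        # Without a callback there is nothing to differentiate through,
        # so signal "no gradients" with None.
        if meta_ctx.backprop_fn is None:
            return None
        return meta_ctx.backprop_fn(self.steps)


def toy_callback(steps: int) -> Tuple[float, float]:
    # Toy stand-in for a real unrolled backward pass.
    return float(steps), -float(steps)


objective = SketchTruncatedObjective(steps=2)
grads = objective.compute(SketchMetaContext(backprop_fn=toy_callback))
```

Keeping the backward pass in a callback leaves the objective itself free of any framework-specific autograd code.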

maxent_grpo.training.run_baseline_training(script_args, training_args, model_args)

Entry point that loads the data and model, builds the trainer, and runs GRPO training.

If training_args.do_eval is enabled and an eval split exists, the function also subsamples that split to a small subset so evaluation stays fast.

Parameters:
  • script_args (GRPOScriptArguments) – Script configuration including dataset and rewards.

  • training_args (GRPOConfig) – GRPO trainer arguments from TRL.

  • model_args (trl.ModelConfig) – Model configuration for TRL/transformers.

Returns:

None. Side effects include training, evaluation, and checkpointing.

Return type:

None
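
The eval subsampling behaviour mentioned above might look roughly like the sketch below. It is an assumption-laden illustration: the helper name `subsample_eval`, the `max_rows` and `seed` parameters, and their defaults are hypothetical, and the real function may select rows differently.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def subsample_eval(
    rows: Optional[Sequence[T]],
    do_eval: bool,
    max_rows: int = 64,
    seed: int = 0,
) -> Optional[List[T]]:
    """Return a small, reproducible subset of an eval split (hypothetical helper)."""
    # Skip evaluation entirely when it is disabled or no eval split exists.
    if not do_eval or rows is None:
        return None
    # Small splits are used as-is.
    if len(rows) <= max_rows:
        return list(rows)
    # A seeded RNG makes the subset identical across runs; sorting the
    # sampled indices preserves the original row order.
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in indices]
```

A fixed seed keeps eval metrics comparable between checkpoints, since every evaluation sees the same rows.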

Modules

baseline

Minimal GRPO training entrypoint built on TRL.

cli

Training-specific CLI helpers (TRL argument parsing, etc.).

controller_objective

Meta-controller objectives for tau/beta adaptation.

controller_optimizer

Meta-optimizer orchestration for controller updates.

data

Dataset loading helpers for the training pipeline.

eval

Validation helpers for the MaxEnt-GRPO training loop.

generation

Copyright 2025 Liv d'Aliberti

metrics

Metrics and logging helpers for the MaxEnt-GRPO training loop.

optim

patches

Training-time integration patches.

pipeline

Helpers for preparing generation/scoring artifacts used by the training loop.

rewards

Reward and generation helpers extracted from the training loop.

rollout

run_helpers

Shared helper utilities for the MaxEnt-GRPO training pipeline.

runtime

Runtime utilities split by concern for the MaxEnt-GRPO training stack.

scoring

Compatibility facade for scoring helpers.

scoring_batching

Batch construction and slice materialization helpers for scoring.

scoring_common

Scoring helpers extracted from the MaxEnt-GRPO training loop.

scoring_logprob

Model logprob computation and sequence-score assembly helpers.

scoring_reference

Reference-logprob and vLLM metadata scoring helpers.

seed_paper_eval_callback

Trainer callback for official SEED paper-style eval against the live vLLM server.

state

Training loop state helpers for controller and checkpoint management.

telemetry

trainer_hooks

Trainer helper hooks used by the active TRL/HF training path.

trl_trainer

Custom TRL GRPOTrainer wrapper used by the MaxEnt-GRPO pipelines.

types

weighting

zero_utils

Utilities to safely integrate DeepSpeed ZeRO with optional dependencies.