maxent_grpo.training.types.runtime

Runtime handles and configuration dataclasses for the training loop.

Classes

BatchingSettings(logprob_chunk_size, score_slice)

Batch and chunk size hints for scoring.

ClipSettings(clip_range, use_clip_objective, ...)

PPO-style clipping configuration.

ControllerPaths(state_path, resume_from[, ...])

Filesystem locations for adaptive controller state.

EvaluationSettings(enabled, rows, ...[, ...])

Optional evaluation loop configuration.

GenerationFn(*args, **kwargs)

GenerationSettings(max_prompt_len, ...[, ...])

Configuration for sampling completions with penalty passthrough helpers.

LoopSettings(generation, evaluation, ...[, ...])

Grouped training configuration shared across the loop.

OptimizationSchedule(num_epochs, ...[, ...])

Epoch/step configuration for the trainer.

OptimizationSettings(schedule, handles)

Combined optimization metadata.

OptimizerHandles(optimizer, lr_scheduler, ...)

Pointers to optimizers and schedulers.

RewardSpec(reward_funcs, reward_weights)

Reward functions and their aggregation weights.

RuntimeHandles(accelerator, model, ...[, ...])

Pointers to objects that should live for the entire training job.

ScoringSettings(weighting, clipping, batching)

Weighting, clipping, and scoring-related knobs.

StepBatchInfo(epoch, step_in_epoch, batch)

Metadata describing the in-progress batch.

StepResources(generator, validation_ctx)

Reusable handles for each train step.

TrainingLoopContext(runtime, reward, ...[, ...])

Top-level container describing the full training job.

TrainingLoopState([global_step, ...])

Mutable counters that track training progress across steps.

class maxent_grpo.training.types.runtime.Accelerator

Bases: object

class maxent_grpo.training.types.runtime.BatchingSettings(logprob_chunk_size, score_slice, prompt_length_cache_get=None, score_tail_tokens=None, slice_prefetch=0, prompt_cache_size=0)[source]

Bases: object

Batch and chunk size hints for scoring.

Parameters:
logprob_chunk_size: int
score_slice: int
prompt_length_cache_get: Callable[[str], PromptCacheEntry] | None = None
score_tail_tokens: int | None = None
slice_prefetch: int = 0
prompt_cache_size: int = 0
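
The fields above are size hints, so a short illustration of chunked processing may help. The following is a minimal sketch, not the package's implementation: the dataclass is a stand-in that mirrors the documented fields so the snippet runs without maxent_grpo installed, and `iter_chunks` is a hypothetical helper. How the package actually consumes `logprob_chunk_size` is an assumption here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator, List, Optional, Sequence


@dataclass
class BatchingSettings:
    """Stand-in mirroring the documented fields; not the package's class."""

    logprob_chunk_size: int
    score_slice: int
    prompt_length_cache_get: Optional[Callable[[str], Any]] = None
    score_tail_tokens: Optional[int] = None
    slice_prefetch: int = 0
    prompt_cache_size: int = 0


def iter_chunks(items: Sequence[Any], chunk_size: int) -> Iterator[List[Any]]:
    """Hypothetical helper: yield consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        # Assumption: a non-positive size means "process everything at once".
        yield list(items)
        return
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


batching = BatchingSettings(logprob_chunk_size=2, score_slice=4)
sequences = ["a", "b", "c", "d", "e"]
chunks = list(iter_chunks(sequences, batching.logprob_chunk_size))
```

Smaller chunks trade throughput for lower peak memory, which is the usual reason to expose such a knob.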

class maxent_grpo.training.types.runtime.ClipSettings(clip_range, use_clip_objective, clip_objective_coef, clip_adv_baseline, clip_range_high=None, clip_delta=None)[source]

Bases: object

PPO-style clipping configuration.

Parameters:
  • clip_range (float)

  • use_clip_objective (bool)

  • clip_objective_coef (float)

  • clip_adv_baseline (float | None)

  • clip_range_high (float | None)

  • clip_delta (float | None)

clip_range: float
use_clip_objective: bool
clip_objective_coef: float
clip_adv_baseline: float | None
clip_range_high: float | None = None
clip_delta: float | None = None
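
To make "PPO-style clipping" concrete, here is a minimal sketch of the standard clipped surrogate for a single probability ratio. It is a generic illustration rather than the package's objective. The dataclass is a stand-in mirroring the documented fields, `clip_bounds` and `clipped_surrogate` are hypothetical helpers, and the reading of `clip_range_high` as an optional separate upper bound is an assumption. `clip_adv_baseline`, `clip_objective_coef`, and `clip_delta` are deliberately left uninterpreted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ClipSettings:
    """Stand-in mirroring the documented fields; not the package's class."""

    clip_range: float
    use_clip_objective: bool
    clip_objective_coef: float
    clip_adv_baseline: Optional[float]
    clip_range_high: Optional[float] = None
    clip_delta: Optional[float] = None


def clip_bounds(cfg: ClipSettings) -> Tuple[float, float]:
    """Hypothetical helper: lower and upper bounds for the probability ratio."""
    low = 1.0 - cfg.clip_range
    # Assumption: clip_range_high, when set, replaces the symmetric upper bound.
    upper_range = cfg.clip_range if cfg.clip_range_high is None else cfg.clip_range_high
    return low, 1.0 + upper_range


def clipped_surrogate(ratio: float, advantage: float, cfg: ClipSettings) -> float:
    """Standard PPO surrogate: min(r * A, clip(r, low, high) * A)."""
    low, high = clip_bounds(cfg)
    clipped_ratio = min(max(ratio, low), high)
    return min(ratio * advantage, clipped_ratio * advantage)


cfg = ClipSettings(
    clip_range=0.2,
    use_clip_objective=True,
    clip_objective_coef=1.0,
    clip_adv_baseline=None,
)
gain = clipped_surrogate(ratio=1.5, advantage=1.0, cfg=cfg)   # ratio capped at the upper bound
loss = clipped_surrogate(ratio=0.5, advantage=-1.0, cfg=cfg)  # ratio floored at the lower bound
```

Taking the minimum of the clipped and unclipped terms keeps the objective pessimistic, so large ratio changes cannot increase it beyond the clipped value.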

class maxent_grpo.training.types.runtime.ControllerPaths(state_path, resume_from, overwrite_existing=False)[source]

Bases: object

Filesystem locations for adaptive controller state.

Parameters:
  • state_path (str | None)

  • resume_from (str | None)

  • overwrite_existing (bool)

state_path: str | None
resume_from: str | None
overwrite_existing: bool = False

class maxent_grpo.training.types.runtime.EvaluationSettings(enabled, rows, batch_size, every_n_steps, seed_eval=None)[source]

Bases: object

Optional evaluation loop configuration.

Parameters:
enabled: bool
rows: List[Dict[str, str]]
batch_size: int
every_n_steps: int | None
seed_eval: Dict[str, Any] | None = None
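
Because `every_n_steps` may be `None`, a caller presumably has to guard the schedule check. The following is a minimal sketch under that assumption. The dataclass is a stand-in mirroring the documented fields, and `should_evaluate` is a hypothetical helper, not part of the package.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class EvaluationSettings:
    """Stand-in mirroring the documented fields; not the package's class."""

    enabled: bool
    rows: List[Dict[str, str]]
    batch_size: int
    every_n_steps: Optional[int]
    seed_eval: Optional[Dict[str, Any]] = None


def should_evaluate(step: int, cfg: EvaluationSettings) -> bool:
    """Hypothetical helper: decide whether to run evaluation at this step."""
    if not cfg.enabled or not cfg.rows:
        return False
    # Assumption: None or a non-positive interval disables periodic evaluation.
    if cfg.every_n_steps is None or cfg.every_n_steps <= 0:
        return False
    return step > 0 and step % cfg.every_n_steps == 0


eval_cfg = EvaluationSettings(
    enabled=True,
    rows=[{"prompt": "2 + 2 = ?", "answer": "4"}],  # illustrative row; keys are assumed
    batch_size=8,
    every_n_steps=50,
)
eval_steps = [step for step in range(0, 201) if should_evaluate(step, eval_cfg)]
```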

class maxent_grpo.training.types.runtime.GenerationFn(*args, **kwargs)[source]

Bases: Protocol[LogprobT]

class maxent_grpo.training.types.runtime.GenerationSettings(max_prompt_len, max_completion_len, gen_temperature, gen_top_p, use_vllm, vllm, penalty=<factory>, generation_stats=<factory>, *, vllm_mode='server')[source]

Bases: GenerationPenaltyPassthroughMixin, GenerationSamplingConfig

Configuration for sampling completions with penalty passthrough helpers.

Parameters:
penalty: GenerationPenaltyConfig
generation_stats: Dict[str, int]

class maxent_grpo.training.types.runtime.LoopSettings(generation, evaluation, optimization, scoring, controller, controller_objective=None, controller_meta_manager=None)[source]

Bases: object

Grouped training configuration shared across the loop.

Parameters:
generation: GenerationSettings
evaluation: EvaluationSettings
optimization: OptimizationSettings
scoring: ScoringSettings
controller: ControllerPaths
controller_objective: 'ControllerObjective' | None = None
controller_meta_manager: Any | None = None

class maxent_grpo.training.types.runtime.OptimizerHandles(optimizer, lr_scheduler, base_optimizer, learning_rate)[source]

Bases: object

Pointers to optimizers and schedulers.

Parameters:
  • optimizer (torch.optim.Optimizer)

  • lr_scheduler (Any | None)

  • base_optimizer (torch.optim.Optimizer)

  • learning_rate (float)

optimizer: torch.optim.Optimizer
lr_scheduler: Any | None
base_optimizer: torch.optim.Optimizer
learning_rate: float

class maxent_grpo.training.types.runtime.OptimizationSchedule(num_epochs, num_generations, grad_accum_steps, max_grad_norm, steps_per_epoch, total_training_steps, warmup_steps, lr_scheduler_type='cosine')[source]

Bases: object

Epoch/step configuration for the trainer.

Parameters:
  • num_epochs (int)

  • num_generations (int)

  • grad_accum_steps (int)

  • max_grad_norm (float)

  • steps_per_epoch (int | None)

  • total_training_steps (int)

  • warmup_steps (int)

  • lr_scheduler_type (str)

num_epochs: int
num_generations: int
grad_accum_steps: int
max_grad_norm: float
steps_per_epoch: int | None
total_training_steps: int
warmup_steps: int
lr_scheduler_type: str = 'cosine'
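
To show how `warmup_steps`, `total_training_steps`, and the default `'cosine'` scheduler type typically interact, here is a minimal sketch of a linear-warmup, cosine-decay learning-rate multiplier. It is a generic formula, not the scheduler the package builds. The dataclass is a stand-in mirroring the documented fields and `lr_multiplier` is a hypothetical helper.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class OptimizationSchedule:
    """Stand-in mirroring the documented fields; not the package's class."""

    num_epochs: int
    num_generations: int
    grad_accum_steps: int
    max_grad_norm: float
    steps_per_epoch: Optional[int]
    total_training_steps: int
    warmup_steps: int
    lr_scheduler_type: str = "cosine"


def lr_multiplier(step: int, schedule: OptimizationSchedule) -> float:
    """Hypothetical helper: scale factor applied to the base learning rate."""
    if schedule.warmup_steps > 0 and step < schedule.warmup_steps:
        # Linear warmup from 0 to 1.
        return step / schedule.warmup_steps
    decay_steps = max(1, schedule.total_training_steps - schedule.warmup_steps)
    progress = min(1.0, (step - schedule.warmup_steps) / decay_steps)
    # Cosine decay from 1 to 0 over the remaining steps.
    return 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = OptimizationSchedule(
    num_epochs=1,
    num_generations=4,
    grad_accum_steps=1,
    max_grad_norm=1.0,
    steps_per_epoch=None,
    total_training_steps=110,
    warmup_steps=10,
)
multipliers = [lr_multiplier(step, schedule) for step in (0, 5, 10, 60, 110)]
```

With these values the multiplier rises linearly over the first 10 steps, peaks at 1.0, passes 0.5 halfway through the decay, and reaches 0.0 at the final step.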

class maxent_grpo.training.types.runtime.OptimizationSettings(schedule, handles)[source]

Bases: object

Combined optimization metadata.

Parameters:
schedule: OptimizationSchedule
handles: OptimizerHandles

class maxent_grpo.training.types.runtime.PreTrainedModel(*args, **kwargs)

Bases: _BaseStub

class maxent_grpo.training.types.runtime.RewardSpec(reward_funcs, reward_weights)[source]

Bases: object

Reward functions and their aggregation weights.

Parameters:
reward_funcs: Sequence[Any]
reward_weights: List[float]
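
The pairing of functions with weights suggests a weighted sum, sketched below. This is an illustration only. The dataclass is a stand-in mirroring the documented fields, `aggregate_rewards` is a hypothetical helper, and the reward-function signature used here (a list of completion strings in, one float per completion out) is an assumption that the package's own callables may not share.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class RewardSpec:
    """Stand-in mirroring the documented fields; not the package's class."""

    reward_funcs: Sequence[Any]
    reward_weights: List[float]


def aggregate_rewards(spec: RewardSpec, completions: List[str]) -> List[float]:
    """Hypothetical helper: weighted sum of per-completion rewards."""
    if len(spec.reward_funcs) != len(spec.reward_weights):
        raise ValueError("reward_funcs and reward_weights must have equal length")
    totals = [0.0] * len(completions)
    for func, weight in zip(spec.reward_funcs, spec.reward_weights):
        scores = func(completions)  # assumed signature: List[str] -> List[float]
        for index, score in enumerate(scores):
            totals[index] += weight * score
    return totals


def length_reward(completions: List[str]) -> List[float]:
    """Toy reward: number of characters in each completion."""
    return [float(len(text)) for text in completions]


def digit_reward(completions: List[str]) -> List[float]:
    """Toy reward: 1.0 if the completion contains a digit, else 0.0."""
    return [1.0 if any(ch.isdigit() for ch in text) else 0.0 for text in completions]


spec = RewardSpec(reward_funcs=[length_reward, digit_reward], reward_weights=[0.5, 2.0])
totals = aggregate_rewards(spec, ["ab", "a1"])
```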

class maxent_grpo.training.types.runtime.RuntimeHandles(accelerator, model, tokenizer, train_loader, train_sampler, device, get_ref_model, reference_model=None, prompt_cache_get=None)[source]

Bases: object

Pointers to objects that should live for the entire training job.

Parameters:
accelerator: Accelerator
model: PreTrainedModel
tokenizer: PreTrainedTokenizer
train_loader: DataLoader
train_sampler: Sampler | None
device: Device
get_ref_model: Callable[[], PreTrainedModel]
reference_model: PreTrainedModel | None = None
prompt_cache_get: Callable[[str], PromptCacheEntry] | None = None

class maxent_grpo.training.types.runtime.ScoringSettings(weighting, clipping, batching, reference_logprobs_source='auto', behavior_logprobs_source='model', allow_stale_reference_logprobs=False, trl_reference_scoring=False, policy_entropy_bonus_coef=0.0, policy_entropy=False, policy_entropy_mode='exact')[source]

Bases: object

Weighting, clipping, and scoring-related knobs.

Parameters:
weighting: WeightingSettings
clipping: ClipSettings
batching: BatchingSettings
reference_logprobs_source: str = 'auto'
behavior_logprobs_source: str = 'model'
allow_stale_reference_logprobs: bool = False
trl_reference_scoring: bool = False
policy_entropy_bonus_coef: float = 0.0
policy_entropy: bool = False
policy_entropy_mode: str = 'exact'

class maxent_grpo.training.types.runtime.TrainingLoopContext(runtime, reward, settings, logging, eval_reward=None, resume_checkpoint=None, resume_state=None, checkpoint_state_ref=None, training_args=None)[source]

Bases: object

Top-level container describing the full training job.

Parameters:
runtime: RuntimeHandles
reward: RewardSpec
settings: LoopSettings
logging: LoggingHandles
eval_reward: RewardSpec | None = None
resume_checkpoint: str | None = None
resume_state: Dict[str, Any] | None = None
checkpoint_state_ref: Dict[str, Any] | None = None
training_args: GRPOConfigType | None = None

property generation: GenerationSettings

Active generation settings.

Returns:

Generation configuration backing the loop.

Return type:

GenerationSettings

property evaluation: EvaluationSettings

Evaluation configuration.

Returns:

Evaluation scheduling and dataset pointers.

Return type:

EvaluationSettings

property optimization: OptimizationSettings

Optimization handles and schedule.

Returns:

Optimizer handles and schedule metadata.

Return type:

OptimizationSettings

property scoring: ScoringSettings

Scoring configuration.

Returns:

Scoring settings (weights, chunk sizes, etc.).

Return type:

ScoringSettings

property controller: ControllerPaths

Adaptive controller paths.

Returns:

Filesystem locations used by the adaptive controller.

Return type:

ControllerPaths

property controller_objective: 'ControllerObjective' | None

Return the configured controller objective, if any.

property controller_meta_manager: Any | None

Return the meta-controller manager, if configured.
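
The properties above have the same names as the fields of `LoopSettings`, which suggests they forward to `settings`. Here is a minimal sketch of that pattern, assuming plain delegation. Both dataclasses are trimmed stand-ins (field types are loosened to `Any` and the optional context fields are omitted), so this shows the access pattern, not the package's source.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LoopSettings:
    """Stand-in with the documented field names; types loosened to Any."""

    generation: Any
    evaluation: Any
    optimization: Any
    scoring: Any
    controller: Any
    controller_objective: Optional[Any] = None
    controller_meta_manager: Optional[Any] = None


@dataclass
class TrainingLoopContext:
    """Trimmed stand-in; the optional fields from the real signature are omitted."""

    runtime: Any
    reward: Any
    settings: LoopSettings
    logging: Any

    # Assumption: each property simply forwards to the matching settings field.
    @property
    def generation(self) -> Any:
        return self.settings.generation

    @property
    def scoring(self) -> Any:
        return self.settings.scoring

    @property
    def controller_objective(self) -> Optional[Any]:
        return self.settings.controller_objective


settings = LoopSettings(
    generation={"max_prompt_len": 512},  # placeholder objects for illustration
    evaluation=None,
    optimization=None,
    scoring={"policy_entropy": False},
    controller=None,
)
ctx = TrainingLoopContext(runtime=None, reward=None, settings=settings, logging=None)
same_object = ctx.generation is settings.generation
```

Forwarding properties let loop code write `ctx.generation` instead of `ctx.settings.generation` while keeping a single source of truth in `settings`.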