maxent_grpo.training.types.runtime
Runtime handles and configuration dataclasses for the training loop.
Classes
| BatchingSettings | Scoring batch/chunk hints. |
| ClipSettings | PPO-style clipping configuration. |
| ControllerPaths | Filesystem locations for adaptive controller state. |
| EvaluationSettings | Optional evaluation loop configuration. |
| GenerationFn | |
| GenerationSettings | Configuration for sampling completions with penalty passthrough helpers. |
| LoopSettings | Grouped training configuration shared across the loop. |
| OptimizationSchedule | Epoch/step configuration for the trainer. |
| OptimizationSettings | Combined optimization metadata. |
| OptimizerHandles | Pointers to optimizers and schedulers. |
| RewardSpec | Reward functions and their aggregation weights. |
| RuntimeHandles | Pointers to objects that should live for the entire training job. |
| ScoringSettings | Weights, clipping, and scoring related knobs. |
| | Metadata describing the in-progress batch. |
| | Reusable handles for each train step. |
| TrainingLoopContext | Top-level container describing the full training job. |
| | Mutable counters that track training progress across steps. |
- class maxent_grpo.training.types.runtime.BatchingSettings(logprob_chunk_size, score_slice, prompt_length_cache_get=None, score_tail_tokens=None, slice_prefetch=0, prompt_cache_size=0)
Bases: object
Scoring batch/chunk hints.
- prompt_length_cache_get: Callable[[str], PromptCacheEntry] | None = None
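As a rough illustration of how such hints might be consumed, the sketch below mirrors the documented constructor signature in a stand-in dataclass and splits a batch into chunks of `logprob_chunk_size`. The class `BatchingSettingsSketch`, the helper `iter_chunks`, the integer field types, and the reading of `logprob_chunk_size` as "items per scoring pass" are all assumptions made for illustration; none of them come from the library's source.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class BatchingSettingsSketch:
    """Hypothetical stand-in that mirrors the documented signature."""

    logprob_chunk_size: int  # assumed to be an int
    score_slice: int  # assumed to be an int
    prompt_length_cache_get: Optional[Callable[[str], Any]] = None
    score_tail_tokens: Optional[int] = None
    slice_prefetch: int = 0
    prompt_cache_size: int = 0


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most ``chunk_size`` items.

    A non-positive ``chunk_size`` is treated as "no chunking" (an assumption).
    """
    if chunk_size <= 0:
        yield items
        return
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


settings = BatchingSettingsSketch(logprob_chunk_size=2, score_slice=4)
chunks = list(iter_chunks(["a", "b", "c", "d", "e"], settings.logprob_chunk_size))
```

Chunking of this kind bounds peak memory during log-probability scoring at the cost of more forward passes.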
- class maxent_grpo.training.types.runtime.ClipSettings(clip_range, use_clip_objective, clip_objective_coef, clip_adv_baseline, clip_range_high=None, clip_delta=None)
Bases: object
PPO-style clipping configuration.
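For readers unfamiliar with PPO-style clipping, the following is a minimal sketch of the standard clipped surrogate objective for a single sample. It assumes that `clip_range` is the lower epsilon and that `clip_range_high`, when given, is a separate upper epsilon that otherwise falls back to `clip_range`. The function `clipped_surrogate` is hypothetical, and the remaining fields (`use_clip_objective`, `clip_objective_coef`, `clip_adv_baseline`, `clip_delta`) are not modeled here.

```python
from typing import Optional


def clipped_surrogate(
    ratio: float,
    advantage: float,
    clip_range: float,
    clip_range_high: Optional[float] = None,
) -> float:
    """Standard PPO clipped surrogate for one sample (a quantity to maximize)."""
    # Assumption: a missing upper bound falls back to the symmetric range.
    upper = clip_range if clip_range_high is None else clip_range_high
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + upper)
    # Keep the more pessimistic of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops growing once the probability ratio exceeds the upper bound, which removes the incentive to move the policy further in a single update.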
- class maxent_grpo.training.types.runtime.ControllerPaths(state_path, resume_from, overwrite_existing=False)
Bases: object
Filesystem locations for adaptive controller state.
- class maxent_grpo.training.types.runtime.EvaluationSettings(enabled, rows, batch_size, every_n_steps, seed_eval=None)
Bases: object
Optional evaluation loop configuration.
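One plausible way a training loop could use `enabled` and `every_n_steps` is as a periodic gate, sketched below. The stand-in `EvaluationSettingsSketch`, the helper `should_evaluate`, the field types, and the choice to skip step 0 are assumptions; the library's actual scheduling logic is not shown on this page.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class EvaluationSettingsSketch:
    """Hypothetical stand-in that mirrors the documented signature."""

    enabled: bool
    rows: Any  # evaluation examples; concrete type not documented here
    batch_size: int
    every_n_steps: Optional[int]
    seed_eval: Optional[Any] = None


def should_evaluate(settings: EvaluationSettingsSketch, global_step: int) -> bool:
    """Return True when an evaluation pass is due at ``global_step``."""
    if not settings.enabled:
        return False
    interval = settings.every_n_steps
    if interval is None or interval <= 0:
        return False
    # Assumption: never evaluate before the first optimizer step.
    return global_step > 0 and global_step % interval == 0


eval_settings = EvaluationSettingsSketch(
    enabled=True, rows=[], batch_size=8, every_n_steps=100
)
```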
- class maxent_grpo.training.types.runtime.GenerationFn(*args, **kwargs)
Bases: Protocol[LogprobT]
- class maxent_grpo.training.types.runtime.GenerationSettings(max_prompt_len, max_completion_len, gen_temperature, gen_top_p, use_vllm, vllm, penalty=<factory>, generation_stats=<factory>, *, vllm_mode='server')
Bases: GenerationPenaltyPassthroughMixin, GenerationSamplingConfig
Configuration for sampling completions with penalty passthrough helpers.
- penalty: GenerationPenaltyConfig
- class maxent_grpo.training.types.runtime.LoopSettings(generation, evaluation, optimization, scoring, controller, controller_objective=None, controller_meta_manager=None)
Bases: object
Grouped training configuration shared across the loop.
- Parameters:
generation (GenerationSettings)
evaluation (EvaluationSettings)
optimization (OptimizationSettings)
scoring (ScoringSettings)
controller (ControllerPaths)
controller_objective (Optional['ControllerObjective'])
controller_meta_manager (Optional[Any])
- generation: GenerationSettings
- evaluation: EvaluationSettings
- optimization: OptimizationSettings
- scoring: ScoringSettings
- controller: ControllerPaths
- class maxent_grpo.training.types.runtime.OptimizerHandles(optimizer, lr_scheduler, base_optimizer, learning_rate)
Bases: object
Pointers to optimizers and schedulers.
- optimizer: torch.optim.Optimizer
- base_optimizer: torch.optim.Optimizer
- class maxent_grpo.training.types.runtime.OptimizationSchedule(num_epochs, num_generations, grad_accum_steps, max_grad_norm, steps_per_epoch, total_training_steps, warmup_steps, lr_scheduler_type='cosine')
Bases: object
Epoch/step configuration for the trainer.
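The default `lr_scheduler_type='cosine'` together with `warmup_steps` and `total_training_steps` suggests the common "linear warmup, then cosine decay" shape. The sketch below computes the learning-rate multiplier for that shape. The function `cosine_lr_multiplier` is hypothetical, and whether the library builds its scheduler this way is an assumption.

```python
import math


def cosine_lr_multiplier(
    step: int, warmup_steps: int, total_training_steps: int
) -> float:
    """Multiplier in [0, 1] applied to the base learning rate at ``step``."""
    # Linear warmup from 0 to 1 over the first ``warmup_steps`` steps.
    if warmup_steps > 0 and step < warmup_steps:
        return step / warmup_steps
    # Cosine decay from 1 to 0 over the remaining steps.
    decay_steps = max(1, total_training_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with `warmup_steps=10` and `total_training_steps=110`, the multiplier rises linearly to 1.0 at step 10, falls to about 0.5 at step 60, and reaches 0.0 at step 110.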
- class maxent_grpo.training.types.runtime.OptimizationSettings(schedule, handles)
Bases: object
Combined optimization metadata.
- Parameters:
schedule (OptimizationSchedule)
handles (OptimizerHandles)
- schedule: OptimizationSchedule
- handles: OptimizerHandles
- class maxent_grpo.training.types.runtime.PreTrainedModel(*args, **kwargs)
Bases: _BaseStub
- class maxent_grpo.training.types.runtime.RewardSpec(reward_funcs, reward_weights)
Bases: object
Reward functions and their aggregation weights.
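To make "aggregation weights" concrete, the sketch below combines several reward functions into one scalar per completion as a weighted sum. The stand-in `RewardSpecSketch`, the helper `aggregate_rewards`, the one-string-in, one-float-out reward signature, and the weighted-sum rule itself are assumptions; the library may call its reward functions differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical reward signature: one completion string in, one score out.
RewardFn = Callable[[str], float]


@dataclass
class RewardSpecSketch:
    """Hypothetical stand-in that mirrors the documented signature."""

    reward_funcs: Sequence[RewardFn]
    reward_weights: Sequence[float]


def aggregate_rewards(spec: RewardSpecSketch, completions: Sequence[str]) -> List[float]:
    """Return the weighted sum of all reward functions for each completion."""
    if len(spec.reward_funcs) != len(spec.reward_weights):
        raise ValueError("reward_funcs and reward_weights must have equal length")
    return [
        sum(w * fn(text) for fn, w in zip(spec.reward_funcs, spec.reward_weights))
        for text in completions
    ]


spec = RewardSpecSketch(
    reward_funcs=[lambda t: float(len(t)), lambda t: 1.0 if "ok" in t else 0.0],
    reward_weights=[0.5, 2.0],
)
scores = aggregate_rewards(spec, ["ok", "nope"])
```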
- class maxent_grpo.training.types.runtime.RuntimeHandles(accelerator, model, tokenizer, train_loader, train_sampler, device, get_ref_model, reference_model=None, prompt_cache_get=None)
Bases: object
Pointers to objects that should live for the entire training job.
- Parameters:
accelerator (Accelerator)
model (PreTrainedModel)
tokenizer (PreTrainedTokenizer)
train_loader (maxent_grpo.training.types.runtime.DataLoader)
train_sampler (torch.utils.data.Sampler | None)
device (Any)
get_ref_model (Callable[[], PreTrainedModel])
reference_model (PreTrainedModel | None)
prompt_cache_get (Callable[[str], PromptCacheEntry] | None)
- accelerator: Accelerator
- model: PreTrainedModel
- tokenizer: PreTrainedTokenizer
- train_loader: DataLoader
- device: Device
- get_ref_model: Callable[[], PreTrainedModel]
- reference_model: PreTrainedModel | None = None
- prompt_cache_get: Callable[[str], PromptCacheEntry] | None = None
- class maxent_grpo.training.types.runtime.ScoringSettings(weighting, clipping, batching, reference_logprobs_source='auto', behavior_logprobs_source='model', allow_stale_reference_logprobs=False, trl_reference_scoring=False, policy_entropy_bonus_coef=0.0, policy_entropy=False, policy_entropy_mode='exact')
Bases: object
Weights, clipping, and scoring related knobs.
- Parameters:
weighting (WeightingSettings)
clipping (ClipSettings)
batching (BatchingSettings)
reference_logprobs_source (str)
behavior_logprobs_source (str)
allow_stale_reference_logprobs (bool)
trl_reference_scoring (bool)
policy_entropy_bonus_coef (float)
policy_entropy (bool)
policy_entropy_mode (str)
- weighting: WeightingSettings
- clipping: ClipSettings
- batching: BatchingSettings
- class maxent_grpo.training.types.runtime.TrainingLoopContext(runtime, reward, settings, logging, eval_reward=None, resume_checkpoint=None, resume_state=None, checkpoint_state_ref=None, training_args=None)
Bases: object
Top-level container describing the full training job.
- Parameters:
runtime (RuntimeHandles)
reward (RewardSpec)
settings (LoopSettings)
logging (LoggingHandles)
eval_reward (RewardSpec | None)
resume_checkpoint (str | None)
training_args (object | None)
- runtime: RuntimeHandles
- reward: RewardSpec
- settings: LoopSettings
- logging: LoggingHandles
- eval_reward: RewardSpec | None = None
- property generation: GenerationSettings
Active generation settings.
- Returns:
Generation configuration backing the loop.
- Return type:
GenerationSettings
- property evaluation: EvaluationSettings
Evaluation configuration.
- Returns:
Evaluation scheduling and dataset pointers.
- Return type:
EvaluationSettings
- property optimization: OptimizationSettings
Optimization handles and schedule.
- Returns:
Optimizer handles and schedule metadata.
- Return type:
OptimizationSettings
- property scoring: ScoringSettings
Scoring configuration.
- Returns:
Scoring settings (weights, chunk sizes, etc.).
- Return type:
ScoringSettings
- property controller: ControllerPaths
Adaptive controller paths.
- Returns:
Filesystem locations used by the adaptive controller.
- Return type:
ControllerPaths
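Because the context holds a `settings: LoopSettings` field and exposes properties named after that container's fields, these properties read like thin passthroughs. The sketch below shows that pattern with stand-in dataclasses. The classes `LoopSettingsSketch` and `TrainingLoopContextSketch`, the `Any` field types, and the delegation itself are assumptions based only on the names documented here.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class LoopSettingsSketch:
    """Hypothetical stand-in for the grouped loop settings."""

    generation: Any
    evaluation: Any
    optimization: Any
    scoring: Any
    controller: Any


@dataclass
class TrainingLoopContextSketch:
    """Hypothetical stand-in for the top-level context."""

    runtime: Any
    reward: Any
    settings: LoopSettingsSketch
    logging: Any

    # Assumption: each property simply forwards to the nested settings object.
    @property
    def generation(self) -> Any:
        return self.settings.generation

    @property
    def evaluation(self) -> Any:
        return self.settings.evaluation

    @property
    def optimization(self) -> Any:
        return self.settings.optimization

    @property
    def scoring(self) -> Any:
        return self.settings.scoring

    @property
    def controller(self) -> Any:
        return self.settings.controller


loop_settings = LoopSettingsSketch(
    generation={"gen_temperature": 0.7},
    evaluation={"every_n_steps": 100},
    optimization={"num_epochs": 1},
    scoring={"policy_entropy": False},
    controller={"state_path": "controller_state.json"},
)
ctx = TrainingLoopContextSketch(
    runtime=None, reward=None, settings=loop_settings, logging=None
)
```

Passthroughs like these let call sites write `ctx.generation` instead of `ctx.settings.generation` while keeping a single source of truth in the nested settings object.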