Training Architecture¶
The training stack now uses one shared path for both GRPO and MaxEnt runs.
The only intended objective-level divergence is inside
maxent_grpo.training.trl_trainer.
Flow Diagram¶
CLI / Slurm Entrypoint
(``src/maxent_grpo/grpo.py``)
│
▼
Shared Training Pipeline
(``maxent_grpo.training.baseline``)
│
▼
Dataset + Prompt Mapping
(shared ``prompt``/``answer`` transform)
│
▼
Reward Resolution
(``maxent_grpo.training.rewards``)
│
▼
Objective + Loss
(``maxent_grpo.training.trl_trainer``)
│
▼
TRL/HF Optimization + Checkpointing
Stage Breakdown¶
- Entrypoint
src/maxent_grpo/grpo.pyis the canonical trainer entrypoint used for both GRPO and MaxEnt variants.- Shared Pipeline
maxent_grpo.training.baseline.run_baseline_training()performs all shared setup (dataset loading, prompt mapping, tokenizer/model, trainer wiring, train/eval, save/resume).- Rewards
maxent_grpo.training.rewardsresolves reward functions and weights with identical logic across both objectives.- Objective Boundary
maxent_grpo.training.trl_trainercontains the objective-specific behavior (GRPO vs MaxEnt). Keep divergence localized here for fair comparisons.