Training Architecture

The training stack now uses a single shared path for both GRPO and MaxEnt runs. The two objectives are intended to differ only inside maxent_grpo.training.trl_trainer.

Flow Diagram

CLI / Slurm Entrypoint
(``src/maxent_grpo/grpo.py``)
          │
          ▼
Shared Training Pipeline
(``maxent_grpo.training.baseline``)
          │
          ▼
Dataset + Prompt Mapping
(shared ``prompt``/``answer`` transform)
          │
          ▼
Reward Resolution
(``maxent_grpo.training.rewards``)
          │
          ▼
Objective + Loss
(``maxent_grpo.training.trl_trainer``)
          │
          ▼
TRL/HF Optimization + Checkpointing

Stage Breakdown

Entrypoint

src/maxent_grpo/grpo.py is the canonical trainer entrypoint used for both GRPO and MaxEnt variants.

Shared Pipeline

maxent_grpo.training.baseline.run_baseline_training() performs every step that both objectives share: dataset loading, prompt mapping, tokenizer and model setup, trainer wiring, training and evaluation, and checkpoint save/resume.
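
To make the idea of a single shared path concrete, here is a minimal sketch in which every stage runs identically for both objectives and the objective name is consumed only by the final training step. All names in it (RunConfig, map_prompts, run_shared_pipeline, fake_train, and the question/solution input keys) are hypothetical; this is not the actual signature or body of run_baseline_training(), which this page does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class RunConfig:
    """Hypothetical run settings; not the project's real config class."""
    objective: str = "grpo"  # "grpo" or "maxent"
    rows: List[Dict[str, str]] = field(default_factory=list)


def map_prompts(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # Shared transform: every example is reduced to `prompt` and `answer`.
    # The input keys `question` and `solution` are assumptions.
    return [{"prompt": r["question"], "answer": r["solution"]} for r in rows]


def run_shared_pipeline(
    cfg: RunConfig,
    train_fn: Callable[[str, List[Dict[str, str]]], str],
) -> Tuple[List[str], str]:
    """Run the same stages for every objective and record their order."""
    stages: List[str] = []

    examples = map_prompts(cfg.rows)
    stages.append("prompt_mapping")

    # Reward resolution would happen here, identically for both objectives.
    stages.append("reward_resolution")

    # The objective name is consumed only by the final stage.
    result = train_fn(cfg.objective, examples)
    stages.append("train")
    return stages, result


def fake_train(objective: str, examples: List[Dict[str, str]]) -> str:
    # Stand-in for the trainer; it only reports what it received.
    return f"{objective}:{len(examples)}"


rows = [{"question": "2+2?", "solution": "4"}]
grpo_stages, grpo_result = run_shared_pipeline(RunConfig("grpo", rows), fake_train)
maxent_stages, maxent_result = run_shared_pipeline(RunConfig("maxent", rows), fake_train)
```

In this sketch the recorded stage lists for the two runs are identical, which is the property the shared pipeline is meant to guarantee: any difference in results has to come from the training step itself.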

Rewards

maxent_grpo.training.rewards resolves reward functions and weights with identical logic across both objectives.
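
One way to picture objective-independent reward resolution is a lookup from configured names to functions, paired with weights. The sketch below is an assumption-laden illustration: the registry, the two reward functions, and resolve_rewards are hypothetical and do not describe the real contents of maxent_grpo.training.rewards. The point is that the resolver takes no objective argument, so it cannot behave differently for GRPO and MaxEnt.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

RewardFn = Callable[[str, str], float]


def exact_match(completion: str, answer: str) -> float:
    # 1.0 when the completion equals the reference answer after trimming.
    return 1.0 if completion.strip() == answer.strip() else 0.0


def non_empty(completion: str, answer: str) -> float:
    # 1.0 for any non-blank completion; the answer is ignored.
    return 1.0 if completion.strip() else 0.0


# Hypothetical registry; the real module's reward names are not shown here.
REGISTRY: Dict[str, RewardFn] = {
    "exact_match": exact_match,
    "non_empty": non_empty,
}


def resolve_rewards(
    names: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[List[RewardFn], List[float]]:
    """Map reward names to functions; default every weight to 1.0."""
    if weights is None:
        weights = [1.0] * len(names)
    if len(weights) != len(names):
        raise ValueError("need exactly one weight per reward function")
    unknown = [n for n in names if n not in REGISTRY]
    if unknown:
        raise KeyError(f"unknown reward function(s): {unknown}")
    return [REGISTRY[n] for n in names], list(weights)


def combined_reward(
    fns: Sequence[RewardFn],
    weights: Sequence[float],
    completion: str,
    answer: str,
) -> float:
    # Weighted sum of the individual reward scores.
    return sum(w * fn(completion, answer) for fn, w in zip(fns, weights))


fns, weights = resolve_rewards(["exact_match", "non_empty"], [1.0, 0.25])
score = combined_reward(fns, weights, " 4 ", "4")
```

Failing on unknown names at resolution time, rather than at the first training step, is a design choice of this sketch, not a documented behavior of the project.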

Objective Boundary

maxent_grpo.training.trl_trainer contains the objective-specific behavior (GRPO vs MaxEnt). Keep any divergence between the two objectives confined to this module so that comparisons between them remain fair.
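
The boundary can be sketched as a single loss function whose only branch is the objective name. This is a deliberately simplified illustration, not the code in trl_trainer: the policy term is a plain advantage-weighted surrogate without clipping or KL terms, and the entropy bonus with coefficient tau is a generic stand-in, since this page does not define the project's MaxEnt objective. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    # Group-relative advantages: center and scale rewards within one group
    # of completions sampled for the same prompt.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def objective_loss(
    objective: str,
    logprobs: Sequence[float],
    rewards: Sequence[float],
    entropy: float,
    tau: float = 0.1,
) -> float:
    """Shared policy term plus one objective-specific branch."""
    advantages = group_advantages(rewards)
    # Shared by both objectives: advantage-weighted log-probability surrogate.
    policy_term = -statistics.fmean(
        [a * lp for a, lp in zip(advantages, logprobs)]
    )

    if objective == "grpo":
        return policy_term
    if objective == "maxent":
        # Stand-in entropy bonus; NOT the project's actual MaxEnt objective.
        return policy_term - tau * entropy
    raise ValueError(f"unknown objective: {objective!r}")


logprobs = [-1.0, -2.0, -0.5, -1.5]
rewards = [1.0, 0.0, 1.0, 0.0]
grpo_loss = objective_loss("grpo", logprobs, rewards, entropy=2.0)
maxent_loss = objective_loss("maxent", logprobs, rewards, entropy=2.0, tau=0.1)
```

Because both branches reuse the same advantages and the same policy term, the two losses in this sketch differ by exactly tau * entropy, which makes the effect of the objective easy to isolate.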