Training Architecture
=====================

The training stack now uses one shared path for both GRPO and MaxEnt runs. The
only intended objective-level divergence is inside
``maxent_grpo.training.trl_trainer``.

Flow Diagram
------------

.. code-block:: text

   CLI / Slurm Entrypoint (``src/maxent_grpo/grpo.py``)
             │
             ▼
   Shared Training Pipeline (``maxent_grpo.training.baseline``)
             │
             ▼
   Dataset + Prompt Mapping (shared ``prompt``/``answer`` transform)
             │
             ▼
   Reward Resolution (``maxent_grpo.training.rewards``)
             │
             ▼
   Objective + Loss (``maxent_grpo.training.trl_trainer``)
             │
             ▼
   TRL/HF Optimization + Checkpointing

Stage Breakdown
---------------

Entrypoint
   ``src/maxent_grpo/grpo.py`` is the canonical trainer entrypoint used for
   both GRPO and MaxEnt variants.

Shared Pipeline
   :func:`maxent_grpo.training.baseline.run_baseline_training` performs all
   shared setup (dataset loading, prompt mapping, tokenizer/model, trainer
   wiring, train/eval, save/resume).

Rewards
   :mod:`maxent_grpo.training.rewards` resolves reward functions and weights
   with identical logic across both objectives.

Objective Boundary
   :mod:`maxent_grpo.training.trl_trainer` contains the objective-specific
   behavior (GRPO vs MaxEnt). Keep divergence localized here for fair
   comparisons.
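
The stage breakdown above can be illustrated with a small, self-contained
sketch of the "shared path, single objective boundary" pattern. Every name in
it (``resolve_rewards``, ``grpo_signal``, ``maxent_signal``, ``OBJECTIVES``,
``run_training_step``, ``tau``) is hypothetical and is not taken from the
``maxent_grpo`` package. The two objective functions are placeholders: a
group-mean-centered reward for the GRPO side and a softmax over ``reward / tau``
for the MaxEnt side, both assumed purely for illustration. The real objectives
live in ``maxent_grpo.training.trl_trainer`` and are not reproduced here.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical names throughout; this is NOT the maxent_grpo API.
RewardFn = Callable[[str], float]
Objective = Callable[[Sequence[float]], List[float]]


def resolve_rewards(
    completions: Sequence[str],
    reward_fns: Sequence[RewardFn],
    weights: Sequence[float],
) -> List[float]:
    """Shared stage: weighted sum of reward functions, same for every objective."""
    if len(reward_fns) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    return [
        sum(w * fn(c) for fn, w in zip(reward_fns, weights))
        for c in completions
    ]


def grpo_signal(rewards: Sequence[float]) -> List[float]:
    """Placeholder GRPO-style signal: rewards centered on the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def maxent_signal(rewards: Sequence[float], tau: float = 0.5) -> List[float]:
    """Placeholder MaxEnt-style signal (assumption): softmax of rewards / tau."""
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / tau) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


# The objective boundary: the only place where the two variants differ.
OBJECTIVES: Dict[str, Objective] = {
    "grpo": grpo_signal,
    "maxent": maxent_signal,
}


def run_training_step(
    objective: str,
    completions: Sequence[str],
    reward_fns: Sequence[RewardFn],
    weights: Sequence[float],
) -> Dict[str, List[float]]:
    """Shared path: identical reward resolution, then one objective-specific call."""
    rewards = resolve_rewards(completions, reward_fns, weights)
    signal = OBJECTIVES[objective](rewards)
    return {"rewards": rewards, "signal": signal}


if __name__ == "__main__":
    fns = [len, lambda c: float(c.startswith("b"))]
    for name in OBJECTIVES:
        print(name, run_training_step(name, ["a", "bbb", "cc"], fns, [1.0, 2.0]))
```

Because both variants go through the same ``resolve_rewards`` call, any
difference in the returned ``signal`` can be attributed to the objective alone,
which is the property the localized boundary is meant to protect.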