Method Identity
===============

This repo now treats method selection as **two explicit axes** instead of one overloaded knob:

- **Algorithm family**: controlled by ``training.objective`` plus ``training.seed_grpo_enabled``.
- **Loss backend**: controlled by ``training.grpo_loss_type``.

That separation matters because ``objective: grpo`` does **not** tell you whether a run is plain GRPO, BNPO-style GRPO, Dr.GRPO, or SEED-GRPO on top of Dr.GRPO.

Current 1.5B Math Presets
-------------------------

.. list-table::
   :header-rows: 1

   * - Preset
     - Family
     - ``objective``
     - ``seed_grpo_enabled``
     - ``grpo_loss_type``
     - Canonical label
   * - ``configs/recipes/hydra/grpo_custom_math.yaml``
     - Baseline GRPO
     - ``grpo``
     - ``false``
     - ``dr_grpo``
     - ``Dr.GRPO``
   * - ``configs/recipes/hydra/maxent_entropy_math.yaml``
     - Entropy MaxEnt
     - ``maxent_entropy``
     - ``false``
     - ``dr_grpo``
     - ``Entropy MaxEnt (Dr.GRPO loss)``
   * - ``configs/recipes/hydra/maxent_listwise_math.yaml``
     - Listwise MaxEnt
     - ``maxent_listwise``
     - ``false``
     - ``dr_grpo``
     - ``Listwise MaxEnt (Dr.GRPO loss)``
   * - ``configs/recipes/hydra/seed_grpo_math.yaml``
     - SEED-GRPO
     - ``grpo``
     - ``true``
     - ``dr_grpo``
     - ``SEED-GRPO (Dr.GRPO loss)``

The filename ``grpo_custom_math.yaml`` is historical. In the current 1.5B math setup, it is the baseline **Dr.GRPO** preset because it pins ``grpo_loss_type: dr_grpo``.

Source of Truth
---------------

- ``src/maxent_grpo/objectives.py``: normalizes the top-level objective family.
- ``src/maxent_grpo/methods.py``: resolves the final method identity from family + backend.
- ``src/maxent_grpo/config/grpo.py``: validates and normalizes ``grpo_loss_type`` and the family-selection flags.
- ``src/maxent_grpo/training/trl_trainer.py``: logs the resolved method at trainer startup.
- ``src/maxent_grpo/training/runtime/logging.py``: writes ``run/method_name``, ``run/method_family``, ``run/method_backend``, and ``run/method_slug`` into run metadata/W&B config.

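The family + backend resolution can be illustrated with a minimal sketch. This is not the code in ``src/maxent_grpo/methods.py``: the names ``resolve_method`` and ``MethodIdentity`` are hypothetical, and the display labels for the ``grpo`` and ``bnpo`` backends are assumptions. Only the four ``dr_grpo`` labels are taken from the preset table. Combinations the table does not cover (such as ``seed_grpo_enabled: true`` with a MaxEnt objective) are rejected here rather than guessed.

```python
from dataclasses import dataclass

# Backend display names. Only the "dr_grpo" entry is confirmed by the preset
# table; the "grpo" and "bnpo" names are assumptions for illustration.
BACKEND_LABELS = {"grpo": "GRPO", "bnpo": "BNPO-style GRPO", "dr_grpo": "Dr.GRPO"}

# Family display names for every family except the baseline, whose label is
# simply the backend name (e.g. "Dr.GRPO").
FAMILY_LABELS = {
    "seed_grpo": "SEED-GRPO",
    "maxent_entropy": "Entropy MaxEnt",
    "maxent_listwise": "Listwise MaxEnt",
}


@dataclass(frozen=True)
class MethodIdentity:
    """Hypothetical container for the two axes plus a readable label."""

    family: str
    backend: str
    label: str


def resolve_method(objective: str, seed_grpo_enabled: bool, grpo_loss_type: str) -> MethodIdentity:
    """Combine the family axis and the backend axis into one identity."""
    if grpo_loss_type not in BACKEND_LABELS:
        raise ValueError(f"unknown grpo_loss_type: {grpo_loss_type!r}")

    # Family axis: objective plus the SEED-GRPO flag.
    if objective == "grpo":
        family = "seed_grpo" if seed_grpo_enabled else "grpo"
    elif objective in ("maxent_entropy", "maxent_listwise") and not seed_grpo_enabled:
        family = objective
    else:
        # Not covered by the preset table, so this sketch refuses to guess.
        raise ValueError(
            f"unsupported combination: objective={objective!r}, "
            f"seed_grpo_enabled={seed_grpo_enabled!r}"
        )

    # Backend axis: the loss type decides the suffix (or the whole label
    # for the baseline family).
    backend_label = BACKEND_LABELS[grpo_loss_type]
    if family == "grpo":
        label = backend_label
    else:
        label = f"{FAMILY_LABELS[family]} ({backend_label} loss)"
    return MethodIdentity(family=family, backend=grpo_loss_type, label=label)
```

With the values from ``configs/recipes/hydra/seed_grpo_math.yaml``, ``resolve_method("grpo", True, "dr_grpo").label`` evaluates to ``SEED-GRPO (Dr.GRPO loss)``, matching the canonical label in the table.
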
Family-Specific Code
--------------------

- **Baseline GRPO / Dr.GRPO backend**: ``src/maxent_grpo/training/trl_trainer.py``
- **SEED-GRPO advantage scaling**: ``src/maxent_grpo/training/rewards.py``
- **Entropy MaxEnt objective**: ``src/maxent_grpo/training/trl_trainer.py``
- **Listwise MaxEnt objective + tau/q/beta weighting**: ``src/maxent_grpo/training/trl_trainer.py`` and ``src/maxent_grpo/training/weighting/logic.py``

Recommended Convention
----------------------

For reproducibility, always record both axes together:

- family: ``grpo`` / ``seed_grpo`` / ``maxent_entropy`` / ``maxent_listwise``
- backend: ``grpo`` / ``bnpo`` / ``dr_grpo``

In practice, use the runtime metadata fields rather than inferring from filenames alone:

- ``run/method_name``
- ``run/method_family``
- ``run/method_backend``
- ``run/method_slug``
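
As a sketch of this convention, a small helper can pull the four method fields out of a run's metadata and fail loudly when any of them is missing, so that nothing is inferred from a filename. The helper name ``read_method_identity`` and the example values (in particular the slug format) are hypothetical. How the metadata dict is obtained, for example from a W&B run config, is left to the caller.

```python
from typing import Any, Dict, Mapping

# The four metadata fields named in this document.
METHOD_KEYS = (
    "run/method_name",
    "run/method_family",
    "run/method_backend",
    "run/method_slug",
)


def read_method_identity(run_config: Mapping[str, Any]) -> Dict[str, str]:
    """Return only the method-identity fields, raising if any is absent."""
    missing = [key for key in METHOD_KEYS if key not in run_config]
    if missing:
        raise KeyError(f"run metadata is missing method fields: {missing}")
    return {key: str(run_config[key]) for key in METHOD_KEYS}


if __name__ == "__main__":
    # Hypothetical metadata for illustration; the slug format is an assumption.
    example = {
        "run/method_name": "SEED-GRPO (Dr.GRPO loss)",
        "run/method_family": "seed_grpo",
        "run/method_backend": "dr_grpo",
        "run/method_slug": "seed_grpo-dr_grpo",
        "learning_rate": 1e-6,
    }
    for key, value in read_method_identity(example).items():
        print(f"{key}: {value}")
```

Raising on a missing field, instead of falling back to the preset filename, keeps a run with an incomplete record from being silently mislabeled.
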