Method Identity
This repo now treats method selection as two explicit axes instead of one overloaded knob:
- Algorithm family: controlled by training.objective plus training.seed_grpo_enabled.
- Loss backend: controlled by training.grpo_loss_type.
That separation matters because objective: grpo does not tell you
whether a run is plain GRPO, BNPO-style GRPO, Dr.GRPO, or SEED-GRPO on top of
Dr.GRPO.
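As a minimal sketch of why both axes are needed, the snippet below resolves a family and a backend from the three config keys. The names `MethodIdentity` and `resolve_identity` are hypothetical, and the rule that non-GRPO families are selected by an objective value equal to the family name is an assumption; none of this is taken from src/maxent_grpo/methods.py.

```python
from typing import NamedTuple


class MethodIdentity(NamedTuple):
    family: str   # grpo / seed_grpo / maxent_entropy / maxent_listwise
    backend: str  # grpo / bnpo / dr_grpo


def resolve_identity(objective: str, seed_grpo_enabled: bool,
                     grpo_loss_type: str) -> MethodIdentity:
    """Hypothetical resolver: family from objective + seed flag, backend from loss type."""
    # Assumption: the seed flag only changes the family when the objective is plain GRPO.
    if objective == "grpo" and seed_grpo_enabled:
        family = "seed_grpo"
    else:
        family = objective
    return MethodIdentity(family=family, backend=grpo_loss_type)


# The same objective value ("grpo") maps to several distinct methods.
for seed, backend in [(False, "grpo"), (False, "bnpo"), (False, "dr_grpo"), (True, "dr_grpo")]:
    print(resolve_identity("grpo", seed, backend))
```

All four calls pass `objective="grpo"`, yet each yields a different (family, backend) pair, which is the ambiguity the two-axis convention is meant to remove.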
Current 1.5B Math Presets
| Preset | Family | | | | Canonical label |
|---|---|---|---|---|---|
| | Baseline GRPO | | | | |
| | Entropy MaxEnt | | | | |
| | Listwise MaxEnt | | | | |
| | SEED-GRPO | | | | |
The filename grpo_custom_math.yaml is historical. In the current 1.5B math
setup, it is the baseline Dr.GRPO preset because it pins
grpo_loss_type: dr_grpo.
Source of Truth
- src/maxent_grpo/objectives.py: normalizes the top-level objective family.
- src/maxent_grpo/methods.py: resolves the final method identity from family + backend.
- src/maxent_grpo/config/grpo.py: validates and normalizes grpo_loss_type and the family-selection flags.
- src/maxent_grpo/training/trl_trainer.py: logs the resolved method at trainer startup.
- src/maxent_grpo/training/runtime/logging.py: writes run/method_name, run/method_family, run/method_backend, and run/method_slug into run metadata/W&B config.
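To illustrate the kind of validation and normalization described for grpo_loss_type, here is a small sketch. The function name `normalize_loss_type` and the specific normalization rules (trimming, lowercasing, treating `-` and `.` as `_`) are assumptions for illustration; the actual logic in src/maxent_grpo/config/grpo.py may differ.

```python
ALLOWED_BACKENDS = ("grpo", "bnpo", "dr_grpo")


def normalize_loss_type(raw: str) -> str:
    """Hypothetical normalizer: trim, lowercase, unify separators, then validate."""
    # Assumption: spellings such as "Dr.GRPO" or "dr-grpo" should map to "dr_grpo".
    value = raw.strip().lower().replace("-", "_").replace(".", "_")
    if value not in ALLOWED_BACKENDS:
        raise ValueError(
            f"unknown grpo_loss_type {raw!r}; expected one of {ALLOWED_BACKENDS}"
        )
    return value


print(normalize_loss_type(" Dr.GRPO "))
print(normalize_loss_type("BNPO"))
```

Rejecting unknown values at config time, rather than falling back to a default, keeps a typo from silently changing which backend a run uses.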
Family-Specific Code
- Baseline GRPO / Dr.GRPO backend: src/maxent_grpo/training/trl_trainer.py
- SEED-GRPO advantage scaling: src/maxent_grpo/training/rewards.py
- Entropy MaxEnt objective: src/maxent_grpo/training/trl_trainer.py
- Listwise MaxEnt objective + tau/q/beta weighting: src/maxent_grpo/training/trl_trainer.py and src/maxent_grpo/training/weighting/logic.py
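To make the backend axis concrete, the sketch below shows how the three loss types differ in reducing masked per-token losses to a scalar. It follows the reductions used by the `loss_type` option in TRL's GRPOTrainer; that this repo's grpo_loss_type has exactly the same semantics is an assumption, and the function `aggregate_loss` is a hypothetical pure-Python stand-in, not code from trl_trainer.py.

```python
from typing import List


def aggregate_loss(per_token_loss: List[List[float]], mask: List[List[int]],
                   loss_type: str, max_completion_length: int) -> float:
    """Reduce masked per-token losses to a scalar under one of three backends."""
    # Sum of unmasked token losses and count of unmasked tokens, per sequence.
    seq_sums = [sum(l * m for l, m in zip(row, mrow))
                for row, mrow in zip(per_token_loss, mask)]
    seq_lens = [sum(mrow) for mrow in mask]
    if loss_type == "grpo":
        # Mean over tokens within each sequence, then mean over sequences.
        return sum(s / max(n, 1) for s, n in zip(seq_sums, seq_lens)) / len(seq_sums)
    if loss_type == "bnpo":
        # A single mean over every unmasked token in the batch.
        return sum(seq_sums) / max(sum(seq_lens), 1)
    if loss_type == "dr_grpo":
        # Divide by a constant so completion length does not rescale the loss.
        return sum(seq_sums) / (len(seq_sums) * max_completion_length)
    raise ValueError(f"unknown loss_type: {loss_type!r}")


losses = [[1.0, 1.0, 1.0, 1.0], [4.0, 0.0, 0.0, 0.0]]
mask = [[1, 1, 1, 1], [1, 0, 0, 0]]
for name in ("grpo", "bnpo", "dr_grpo"):
    print(name, aggregate_loss(losses, mask, name, max_completion_length=4))
```

On this toy batch the three backends give 2.5, 1.6, and 1.0 respectively for identical per-token losses, which is why the backend has to be recorded alongside the family.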
Recommended Convention
For reproducibility, always record both axes together:
- family: grpo / seed_grpo / maxent_entropy / maxent_listwise
- backend: grpo / bnpo / dr_grpo
In practice, use the runtime metadata fields rather than inferring from filenames alone:
- run/method_name
- run/method_family
- run/method_backend
- run/method_slug
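A sketch of recording both axes together under the four metadata keys might look like the following. The helper `method_metadata` is hypothetical, and the name and slug formats shown here are assumptions; the values actually written by src/maxent_grpo/training/runtime/logging.py may be formatted differently.

```python
from typing import Dict


def method_metadata(family: str, backend: str) -> Dict[str, str]:
    """Hypothetical helper: build the four run/method_* fields from both axes."""
    # Assumption: the slug simply joins family and backend; the real format may differ.
    slug = f"{family}-{backend}"
    return {
        "run/method_name": f"{family} ({backend} backend)",
        "run/method_family": family,
        "run/method_backend": backend,
        "run/method_slug": slug,
    }


meta = method_metadata("seed_grpo", "dr_grpo")
for key, value in meta.items():
    print(f"{key}: {value}")
```

Because the family and backend are stored as separate fields, runs can be filtered on either axis without parsing a preset filename.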