Training
The canonical training path in this repository is now the upstream OAT
README-flash stack. The only active training launchers under ops/ are the
baseline DR.GRPO path and the listwise maxent-explorer overlay on top of that
same stack.
Canonical Launchers
Baseline DR.GRPO:
ops/run_oat_zero_exact_1p5b_upstream.sh
ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm
Listwise maxent-explorer:
ops/run_oat_zero_explorer_1p5b_upstream.sh
ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm
Use the baseline wrapper for the exact README-flash OAT setup:
sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm
Use the explorer wrapper for the listwise overlay on the same runtime:
sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm
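The choice between the two wrappers can be captured in a small helper. This is a sketch only: the `slurm_script_for` and `print_submit_command` functions and the `baseline`/`explorer` mode names are hypothetical and not part of the repository; the script paths are the ones listed above. The helper prints the submit command rather than running it, so it can be tried on a machine without Slurm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a mode name to one of the two Slurm wrappers above.
slurm_script_for() {
  case "$1" in
    baseline)
      echo "ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm" ;;
    explorer)
      echo "ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm" ;;
    *)
      echo "unknown mode: $1 (expected baseline or explorer)" >&2
      return 1 ;;
  esac
}

# Print the sbatch command for a mode instead of submitting it.
print_submit_command() {
  local script
  script="$(slurm_script_for "$1")" || return 1
  echo "sbatch ${script}"
}
```

For example, `print_submit_command explorer` prints the explorer `sbatch` line, and an unrecognized mode returns a non-zero status.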
Objective Routing
The baseline path stays on native OAT DR.GRPO.
objective=grpo
critic_type=drgrpo
prompt_template=r1
The explorer path is opt-in and only changes the learner objective:
objective=maxent_listwise
beta=0.0 by default; raise it only if you want the reference-weight term active
maxent_tau=0.5
maxent_q_temperature=2.0
train_batch_size_per_device=8 so prompt groups stay intact
The shared launcher validates listwise-only constraints before training starts, and the learner validates them again inside the OAT process. See OAT Upstream DR.GRPO for the full set of guardrails.
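As an illustration of the kind of pre-flight check such a launcher might run, here is a minimal sketch. It is not the repository's actual validation: the `check_listwise_batch` function and its `group_size` argument are hypothetical, and the rule it enforces (the per-device batch size must be a positive multiple of the number of sampled responses per prompt) is an assumption about what "prompt groups stay intact" means.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight check; NOT the repository's actual validation.
# Assumption: a prompt group stays intact when the per-device batch size
# is a positive multiple of the number of sampled responses per prompt.
check_listwise_batch() {
  local objective="$1" batch_per_device="$2" group_size="$3"

  # In this sketch only the listwise objective carries the constraint.
  if [ "$objective" != "maxent_listwise" ]; then
    return 0
  fi

  if [ "$group_size" -le 0 ] || [ "$batch_per_device" -le 0 ]; then
    echo "batch size and group size must be positive" >&2
    return 1
  fi

  if [ $(( batch_per_device % group_size )) -ne 0 ]; then
    echo "train_batch_size_per_device=${batch_per_device} would split groups of ${group_size}" >&2
    return 1
  fi

  return 0
}
```

Under these assumptions, `check_listwise_batch maxent_listwise 8 8` succeeds, while a group size of 3 fails before any training job is submitted.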
What Changed
The active training surface under ops/ has been reduced to the working OAT
stack only. Older experiment orchestration has been retired from the active
surface and moved under archive/trl/.
Archived material includes:
older TRL/Hydra orchestration wrappers
retired Slurm launchers for those wrappers
pre-canonical pure OAT launchers that used older layouts or assumptions
Archive Location
archive/trl/ops/
archive/trl/ops/slurm/
Those files are preserved for reference, but they are no longer the canonical way to launch training from this repository.