Training

The canonical training path in this repository is now the upstream OAT README-flash stack. The only active training launchers under ops/ are the baseline DR.GRPO path and the listwise maxent-explorer overlay on top of that same stack.

Canonical Launchers

Baseline DR.GRPO:

  • ops/run_oat_zero_exact_1p5b_upstream.sh

  • ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm

Listwise maxent-explorer:

  • ops/run_oat_zero_explorer_1p5b_upstream.sh

  • ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm

Use the baseline wrapper for the exact README-flash OAT setup:

sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm

Use the explorer wrapper for the listwise overlay on the same runtime:

sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm

Shared Runtime

Both launchers share the same runtime and bootstrap path:

  • upstream checkout: understand-r1-zero/

  • local python env: var/seed_paper_eval/paper310

  • launch-time flash-attn overlay

  • local extension module: oat_zero_ext/listwise.py

  • patched learner: understand-r1-zero/train_zero_math.py

Before launching, validate the runtime:

python tools/audit_oat_setup.py

Objective Routing

The baseline path stays on native OAT DR.GRPO.

  • objective=grpo

  • critic_type=drgrpo

  • prompt_template=r1

The explorer path is opt-in and only changes the learner objective:

  • objective=maxent_listwise

  • beta=0.0 by default; raise it only if you want the reference-weight term active

  • maxent_tau=0.5

  • maxent_q_temperature=2.0

  • train_batch_size_per_device=8 so prompt groups stay intact

The shared launcher validates listwise-only constraints before training starts, and the learner validates them again inside the OAT process. See OAT Upstream DR.GRPO for the full set of guardrails.

What Changed

The active training surface under ops/ has been reduced to the working OAT stack only. Older experiment orchestration has been retired from the active surface and moved under archive/trl/.

Archived material includes:

  • older TRL/Hydra orchestration wrappers

  • retired Slurm launchers for those wrappers

  • pre-canonical pure OAT launchers that used older layouts or assumptions

Archive Location

  • archive/trl/ops/

  • archive/trl/ops/slurm/

Those files are preserved for reference, but they are no longer the canonical way to launch training from this repository.