Training

The canonical training path in this repository is now the upstream OAT README-flash stack. The only active training launchers under ops/ are the baseline DR.GRPO path and the listwise maxent-explorer overlay on top of that same stack.

Canonical Launchers

Baseline DR.GRPO:

ops/run_oat_zero_exact_1p5b_upstream.sh
ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm

Listwise maxent-explorer:

ops/run_oat_zero_explorer_1p5b_upstream.sh
ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm

Use the baseline wrapper for the exact README-flash OAT setup:

sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm

Use the explorer wrapper for the listwise overlay on the same runtime:

sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm

Shared Runtime

Both launchers share the same runtime and bootstrap path:

upstream checkout: understand-r1-zero/
local python env: var/seed_paper_eval/paper310
launch-time flash-attn overlay
local extension module: oat_zero_ext/listwise.py
patched learner: understand-r1-zero/train_zero_math.py

Before launching, validate the runtime:

python tools/audit_oat_setup.py
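This page does not describe what the audit script checks. As a rough illustration only, the file and directory entries from the shared-runtime list above can be checked for existence with a few lines of Python. The helper name `missing_runtime_paths` is hypothetical and is not part of the repository; the flash-attn overlay is skipped because it is applied at launch time and is not a path.

```python
from pathlib import Path

# Paths copied from the shared-runtime list; the check itself is an
# illustrative assumption, not the logic of tools/audit_oat_setup.py.
REQUIRED_PATHS = [
    "understand-r1-zero",
    "var/seed_paper_eval/paper310",
    "oat_zero_ext/listwise.py",
    "understand-r1-zero/train_zero_math.py",
]


def missing_runtime_paths(repo_root):
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]
```

Calling `missing_runtime_paths(".")` from the repository root returns an empty list when every listed path is present. It does not replace the audit script.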

Objective Routing

The baseline path stays on native OAT DR.GRPO.

objective=grpo
critic_type=drgrpo
prompt_template=r1

The explorer path is opt-in and only changes the learner objective:

objective=maxent_listwise
beta=0.0 by default; raise it only if you want the reference-weight term active
maxent_tau=0.5
maxent_q_temperature=2.0
train_batch_size_per_device=8 so prompt groups stay intact

The shared launcher validates listwise-only constraints before training starts, and the learner validates them again inside the OAT process. See OAT Upstream DR.GRPO for the full set of guardrails.
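The routing above can be pictured as the baseline settings plus a small set of explorer overrides. The sketch below shows that layering together with the kind of pre-flight check a launcher might run. The specific rules are assumptions made for illustration: positive temperatures, non-negative `beta`, and a per-device batch size that is a multiple of a hypothetical `group_size` (the number of sampled responses per prompt). The real guardrails live in the launcher and the learner and may differ.

```python
# Settings copied from this page; how they are layered is an assumption.
BASELINE = {
    "objective": "grpo",
    "critic_type": "drgrpo",
    "prompt_template": "r1",
}

EXPLORER_OVERRIDES = {
    "objective": "maxent_listwise",
    "beta": 0.0,
    "maxent_tau": 0.5,
    "maxent_q_temperature": 2.0,
    "train_batch_size_per_device": 8,
}


def explorer_config():
    """Baseline settings with the explorer overrides applied on top."""
    return {**BASELINE, **EXPLORER_OVERRIDES}


def check_listwise(cfg, group_size):
    """Return a list of problems; empty means the config passed.

    The rules here are hypothetical stand-ins for the real guardrails.
    """
    if cfg.get("objective") != "maxent_listwise":
        return []  # listwise-only checks do not apply to the baseline
    errors = []
    if cfg["maxent_tau"] <= 0:
        errors.append("maxent_tau must be positive")
    if cfg["maxent_q_temperature"] <= 0:
        errors.append("maxent_q_temperature must be positive")
    if cfg["beta"] < 0:
        errors.append("beta must be non-negative")
    if cfg["train_batch_size_per_device"] % group_size != 0:
        errors.append("per-device batch would split a prompt group")
    return errors
```

With these assumed rules, `check_listwise(explorer_config(), group_size=8)` reports no problems, while a `group_size` that does not divide the per-device batch size is flagged.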

What Changed

The active training surface under ops/ now contains only the working OAT stack. Older experiment orchestration has been retired and moved under archive/trl/.

Archived material includes:

older TRL/Hydra orchestration wrappers
retired Slurm launchers for those wrappers
pre-canonical pure OAT launchers that used older layouts or assumptions

Archive Location

archive/trl/ops/
archive/trl/ops/slurm/

Those files are preserved for reference, but they are no longer the canonical way to launch training from this repository.