Training
========

The canonical training path in this repository is now the upstream OAT
README-flash stack. Only two training launchers remain active under ``ops/``:
the baseline DR.GRPO path and the listwise maxent-explorer overlay, which runs
on top of that same stack.

Canonical Launchers
-------------------

Baseline DR.GRPO:

- ``ops/run_oat_zero_exact_1p5b_upstream.sh``
- ``ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm``

Listwise maxent-explorer:

- ``ops/run_oat_zero_explorer_1p5b_upstream.sh``
- ``ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm``

Use the baseline wrapper for the exact README-flash OAT setup:

.. code-block:: bash

   sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm

Use the explorer wrapper for the listwise overlay on the same runtime:

.. code-block:: bash

   sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm
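
The two Slurm wrappers differ only by the ``_explorer`` infix, so a script that
switches between them can derive the path from a variant name. The helper below
is a minimal sketch: ``launcher_for`` and the variant labels ``baseline`` and
``explorer`` are hypothetical names used for illustration, not part of the
repository.

.. code-block:: bash

   # Hypothetical helper: map a variant label to its canonical Slurm wrapper.
   launcher_for() {
       local base="ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash"
       case "$1" in
           baseline) echo "${base}_node302.slurm" ;;
           explorer) echo "${base}_explorer_node302.slurm" ;;
           *)
               # Anything else is rejected rather than guessed.
               echo "unknown variant: $1" >&2
               return 1
               ;;
       esac
   }

With that in place, ``sbatch "$(launcher_for explorer)"`` would submit the
explorer job.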

Shared Runtime
--------------

Both launchers share the same runtime and bootstrap path:

- upstream checkout: ``understand-r1-zero/``
- local python env: ``var/seed_paper_eval/paper310``
- launch-time flash-attn overlay
- local extension module: ``oat_zero_ext/listwise.py``
- patched learner: ``understand-r1-zero/train_zero_math.py``
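
For a quick manual look at whether these pieces are present in a checkout, a
presence check along the following lines may help. It is only a sketch:
``missing_runtime_paths`` is a hypothetical helper, it checks existence only,
and it skips the flash-attn overlay because that is applied at launch time and
has no path listed here.

.. code-block:: bash

   # Hypothetical helper: print each shared-runtime path that is missing
   # under the given repository root (defaults to the current directory).
   missing_runtime_paths() {
       local root="${1:-.}" path
       for path in \
           understand-r1-zero \
           var/seed_paper_eval/paper310 \
           oat_zero_ext/listwise.py \
           understand-r1-zero/train_zero_math.py
       do
           # Existence only; this does not inspect the contents.
           [ -e "${root}/${path}" ] || echo "$path"
       done
   }

Running ``missing_runtime_paths`` from the repository root prints nothing when
every listed path exists.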

Before launching, validate the runtime:

.. code-block:: bash

   python tools/audit_oat_setup.py
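
To avoid queueing a job against a broken runtime, the audit can gate the
submission. The sketch below assumes that ``tools/audit_oat_setup.py`` exits
with a non-zero status when validation fails, which this page does not confirm.
``submit_if_audited``, ``AUDIT_CMD``, and ``SUBMIT_CMD`` are hypothetical names
introduced here so the commands can be swapped for a dry run.

.. code-block:: bash

   # Assumption: the audit script signals failure through its exit status.
   # AUDIT_CMD and SUBMIT_CMD are hypothetical override hooks.
   AUDIT_CMD="${AUDIT_CMD:-python tools/audit_oat_setup.py}"
   SUBMIT_CMD="${SUBMIT_CMD:-sbatch}"

   submit_if_audited() {
       local job_script="$1"
       # Left unquoted on purpose so the command string splits into words.
       if ! $AUDIT_CMD; then
           echo "runtime audit failed; not submitting ${job_script}" >&2
           return 1
       fi
       $SUBMIT_CMD "$job_script"
   }

Setting ``SUBMIT_CMD=echo`` before sourcing the snippet turns the submission
into a dry run that only prints the job script path.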

Objective Routing
-----------------

The baseline path stays on native OAT DR.GRPO:

- ``objective=grpo``
- ``critic_type=drgrpo``
- ``prompt_template=r1``

The explorer path is opt-in and only changes the learner objective:

- ``objective=maxent_listwise``
- ``beta=0.0`` by default; raise it only if you want the reference-weight term
  active
- ``maxent_tau=0.5``
- ``maxent_q_temperature=2.0``
- ``train_batch_size_per_device=8`` so prompt groups stay intact

The shared launcher validates listwise-only constraints before training starts,
and the learner validates them again inside the OAT process. See
:doc:`oat-upstream-drgrpo` for the full set of guardrails.
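
As an illustration of what such a pre-flight check could look like, the sketch
below rejects unknown objectives and, for the listwise objective, a per-device
batch size that would split prompt groups. It is not the launcher's actual
logic: ``validate_objective`` and its group-size argument are hypothetical, and
reading "prompt groups stay intact" as "batch size is a multiple of the group
size" is an assumption.

.. code-block:: bash

   # Hypothetical pre-flight check; the real guardrails live in the shared
   # launcher and the learner.
   # Args: objective, per-device train batch size, completions per prompt.
   validate_objective() {
       local objective="$1" batch="$2" group="$3"
       case "$objective" in
           grpo)
               # Baseline path: no listwise-only constraints apply.
               return 0
               ;;
           maxent_listwise)
               # Assumed constraint: whole prompt groups fit in each batch.
               if (( group <= 0 || batch % group != 0 )); then
                   echo "batch ${batch} would split prompt groups of ${group}" >&2
                   return 1
               fi
               ;;
           *)
               echo "unknown objective: ${objective}" >&2
               return 1
               ;;
       esac
   }

For example, ``validate_objective maxent_listwise 8 8`` succeeds, while
``validate_objective maxent_listwise 6 4`` fails with a message on stderr.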

What Changed
------------

The active training surface under ``ops/`` has been reduced to the working OAT
stack. Older experiment orchestration has been retired and moved under
``archive/trl/``.

Archived material includes:

- older TRL/Hydra orchestration wrappers
- retired Slurm launchers for those wrappers
- pre-canonical pure OAT launchers that used older layouts or assumptions

Archive Location
----------------

Archived launchers and wrappers live under:

- ``archive/trl/ops/``
- ``archive/trl/ops/slurm/``

Those files are preserved for reference, but they are no longer the canonical
way to launch training from this repository.