OAT Upstream DR.GRPO
====================

This guide documents the exact upstream OAT/``understand-r1-zero`` stack that
reproduced the public README-style 1.5B DR.GRPO run in this repository, plus
the opt-in listwise MaxEnt explorer overlay built directly on top of it.

Working Baseline
----------------

The proven baseline path is:

- launcher: ``ops/run_oat_zero_exact_1p5b_upstream.sh``
- node302 Slurm wrapper:
  ``ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm``
- upstream checkout: ``understand-r1-zero/``
- Python env: ``var/seed_paper_eval/paper310``
- flash-attn: installed on the node at launch time into a local overlay

The README-flash runtime that worked is:

- ``python==3.10.20``
- ``torch==2.6.0``
- ``transformers==4.51.3``
- ``vllm==0.8.4``
- ``oat-llm==0.1.3.post1``
- ``deepspeed==0.16.8``
- ``flash-attn==2.7.4.post1`` via the runtime overlay

The baseline training geometry is intentionally left unchanged:

- ``objective=grpo``
- ``critic_type=drgrpo``
- ``prompt_template=r1``
- ``num_samples=8``
- ``train_batch_size=128``
- ``train_batch_size_per_device=1``
- ``rollout_batch_size=128``
- ``rollout_batch_size_per_device=16``
- ``pi_buffer_maxlen_per_device=128``
- ``beta=0.0``
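
As a rough check on what these numbers imply, the sketch below derives a few
per-step quantities from the values above. It assumes an 8-GPU node (the GPU
count is not stated in this guide), that ``rollout_batch_size`` counts prompts,
and that the learner accumulates micro-batches until ``train_batch_size`` rows
are consumed. ``derive_geometry``, ``Geometry``, and ``world_size`` are
hypothetical names for illustration, not OAT APIs.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class Geometry:
        responses_per_rollout: int
        rollout_devices: int
        grad_accum_steps: int


    def derive_geometry(
        num_samples: int = 8,
        train_batch_size: int = 128,
        train_batch_size_per_device: int = 1,
        rollout_batch_size: int = 128,
        rollout_batch_size_per_device: int = 16,
        world_size: int = 8,  # assumption: GPU count is not stated in this guide
    ) -> Geometry:
        # Assumption: rollout_batch_size counts prompts, each sampled num_samples times.
        responses = rollout_batch_size * num_samples
        # Devices needed to cover one rollout batch at the per-device size.
        rollout_devices = rollout_batch_size // rollout_batch_size_per_device
        # Assumption: micro-batches are accumulated up to train_batch_size rows.
        accum = train_batch_size // (train_batch_size_per_device * world_size)
        return Geometry(responses, rollout_devices, accum)

Under these assumptions the baseline accumulates 16 micro-batches per device
per optimizer step. Raising ``train_batch_size_per_device`` to 8 would reduce
that to 2 while leaving the global batch at 128.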

Explorer Overlay
----------------

The listwise explorer path reuses that exact runtime and launcher, but switches
the learner onto the listwise MaxEnt objective through environment variables.

- launcher: ``ops/run_oat_zero_explorer_1p5b_upstream.sh``
- node302 Slurm wrapper:
  ``ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm``

The explorer defaults mirror the existing Dr.GRPO-Explorer settings already
used elsewhere in this repository:

- ``objective=maxent_listwise``
- ``beta=0.0`` by default; set it above zero only when you explicitly want the
  reference-weight term active
- ``maxent_tau=0.5``
- ``maxent_q_temperature=2.0``
- ``maxent_q_epsilon=1e-6``
- ``maxent_length_normalize_ref=true``
- ``maxent_length_normalize_policy=true``
- ``maxent_listwise_skip_zero_variance_groups=true``
- ``maxent_use_clip_objective=true``
- ``maxent_clip_objective_coef=1.0``
- ``maxent_reference_logprobs_source=model`` so a nonzero ``beta`` can reuse the
  model-reference path without changing the launcher
- ``maxent_logprob_chunk_size=2``
- ``train_batch_size_per_device=8`` so each microbatch contains one whole
  prompt group when ``num_samples=8``

Safety Guardrails
-----------------

The baseline DR.GRPO path stays the default. Nothing switches to listwise
unless the launch explicitly sets ``OAT_ZERO_OBJECTIVE=maxent_listwise``.

Additional guardrails are enforced in three places:

- shell launcher validation in ``ops/run_oat_zero_exact_1p5b_upstream.sh``
- learner validation in ``understand-r1-zero/train_zero_math.py``
- repo audit in ``tools/audit_oat_setup.py``

The listwise path now fails fast when:

- ``objective`` is unsupported
- ``critic_type`` is not ``drgrpo``
- ``num_samples <= 1``
- ``train_batch_size_per_device`` is not divisible by ``num_samples``
- ``maxent_tau <= 0``
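
The checks listed above can be sketched as a single validation function. This
is an illustrative stand-in, not the code in the launcher or in
``train_zero_math.py``: the function name ``validate_listwise_config``, the
``SUPPORTED_OBJECTIVES`` set, and the error messages are assumptions.

.. code-block:: python

    # Assumption: only the two objectives named in this guide are supported.
    SUPPORTED_OBJECTIVES = {"grpo", "maxent_listwise"}


    def validate_listwise_config(
        objective: str,
        critic_type: str,
        num_samples: int,
        train_batch_size_per_device: int,
        maxent_tau: float,
    ) -> None:
        """Raise ValueError on the first violated precondition."""
        if objective not in SUPPORTED_OBJECTIVES:
            raise ValueError(f"unsupported objective: {objective!r}")
        if objective != "maxent_listwise":
            return  # baseline path: the listwise-only checks do not apply
        if critic_type != "drgrpo":
            raise ValueError("listwise path requires critic_type=drgrpo")
        if num_samples <= 1:
            raise ValueError("listwise path requires num_samples > 1")
        if train_batch_size_per_device % num_samples != 0:
            raise ValueError(
                "train_batch_size_per_device must be divisible by num_samples"
            )
        if maxent_tau <= 0:
            raise ValueError("maxent_tau must be positive")

Returning early for the baseline objective keeps the baseline geometry
(``train_batch_size_per_device=1``) valid, since the divisibility rule only
matters when whole prompt groups must fit in a microbatch.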

Implementation Notes
--------------------

The local listwise extension lives in ``oat_zero_ext/listwise.py`` and is used
only by the patched ``understand-r1-zero/train_zero_math.py`` learner.

Key design choice: the baseline OAT PPO/Dr.GRPO update path is untouched. The
listwise branch activates only inside ``ZeroMathLearner.learning_step`` when
``objective=maxent_listwise`` and keeps whole prompt groups together during
minibatching.

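That prompt-group preservation matters because stock OAT shuffles flat rollout
rows. Row-level shuffling is correct for GRPO, where each row carries its own
precomputed advantage, but wrong for a listwise loss, which normalizes over all
samples of one prompt and therefore needs the whole group in the same
minibatch.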
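
A minimal sketch of group-preserving minibatching follows. It is not the
patched learner's code: ``iter_group_minibatches`` is a hypothetical helper,
and it assumes rollout rows are laid out contiguously so that rows
``g * num_samples`` through ``(g + 1) * num_samples - 1`` belong to prompt
group ``g``.

.. code-block:: python

    import random
    from typing import Iterator, List, Optional


    def iter_group_minibatches(
        num_rows: int,
        num_samples: int,
        micro_batch_size: int,
        rng: Optional[random.Random] = None,
    ) -> Iterator[List[int]]:
        """Yield row-index minibatches that never split a prompt group."""
        if num_rows % num_samples or micro_batch_size % num_samples:
            raise ValueError("sizes must be multiples of num_samples")
        rng = rng or random.Random()
        groups = list(range(num_rows // num_samples))
        rng.shuffle(groups)  # shuffle whole groups, not flat rows
        groups_per_batch = micro_batch_size // num_samples
        for start in range(0, len(groups), groups_per_batch):
            batch: List[int] = []
            for g in groups[start:start + groups_per_batch]:
                # Keep all samples of prompt group g adjacent in the batch.
                batch.extend(range(g * num_samples, (g + 1) * num_samples))
            yield batch

With ``num_samples=8`` and a microbatch size of 8, every yielded batch is
exactly one prompt group, so a per-group normalization sees all eight samples.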

Guarantees And Limits
---------------------

This setup is engineered to avoid breaking the proven baseline:

- one shared runtime/bootstrap path
- separate baseline and explorer wrappers
- separate default run names and local scratch roots
- opt-in objective routing
- focused unit tests for the prompt-group and q/weight math

No setup can guarantee against every future cluster, dependency, or upstream
behavior change. What this repository does provide is an isolated baseline
path, an isolated explorer overlay, explicit validation, and targeted tests for
the failure modes that would silently corrupt the objective.