OAT Upstream DR.GRPO
This guide documents the exact upstream OAT/understand-r1-zero stack that
reproduced the public README-style 1.5B DR.GRPO run in this repository, plus
the opt-in listwise MaxEnt explorer overlay built directly on top of it.
Working Baseline
The proven baseline path is:
launcher: ops/run_oat_zero_exact_1p5b_upstream.sh
node302 Slurm wrapper: ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm
upstream checkout: understand-r1-zero/
Python env: var/seed_paper_eval/paper310
flash-attn: installed on the node at launch time into a local overlay
The README-flash runtime that worked is:
python==3.10.20
torch==2.6.0
transformers==4.51.3
vllm==0.8.4
oat-llm==0.1.3.post1
deepspeed==0.16.8
flash-attn==2.7.4.post1 via the runtime overlay
The baseline training geometry is intentionally left unchanged:
objective=grpo
critic_type=drgrpo
prompt_template=r1
num_samples=8
train_batch_size=128
train_batch_size_per_device=1
rollout_batch_size=128
rollout_batch_size_per_device=16
pi_buffer_maxlen_per_device=128
beta=0.0
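The rollout numbers above can be cross-checked with a little arithmetic. The sketch below assumes that the rollout batch sizes count prompts and that each prompt yields num_samples responses; neither assumption is stated in this guide, and the helper name derived_geometry is hypothetical.

```python
# Baseline geometry values copied from the list above.
geometry = {
    "num_samples": 8,
    "rollout_batch_size": 128,
    "rollout_batch_size_per_device": 16,
    "pi_buffer_maxlen_per_device": 128,
}


def derived_geometry(cfg):
    """Derive rollout counts, ASSUMING batch sizes are measured in prompts."""
    responses_per_device = cfg["rollout_batch_size_per_device"] * cfg["num_samples"]
    return {
        # Ratio of global to per-device prompts; an inference, not a documented value.
        "implied_devices": cfg["rollout_batch_size"] // cfg["rollout_batch_size_per_device"],
        "responses_per_device": responses_per_device,
        "responses_per_rollout": cfg["rollout_batch_size"] * cfg["num_samples"],
        # True when one rollout's responses fit in the per-device buffer.
        "fits_buffer": responses_per_device <= cfg["pi_buffer_maxlen_per_device"],
    }


summary = derived_geometry(geometry)
print(summary)
```

Under those assumptions, 16 prompts per device times 8 samples gives 128 responses per device, which matches pi_buffer_maxlen_per_device exactly.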
Explorer Overlay
The listwise explorer path reuses that exact runtime and launcher, but switches the learner onto the listwise MaxEnt objective through environment variables.
launcher: ops/run_oat_zero_explorer_1p5b_upstream.sh
node302 Slurm wrapper: ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm
The explorer defaults mirror the existing Dr.GRPO-Explorer settings already used elsewhere in this repository:
objective=maxent_listwise
beta=0.0 by default; set it above zero only when you explicitly want the reference-weight term active
maxent_tau=0.5
maxent_q_temperature=2.0
maxent_q_epsilon=1e-6
maxent_length_normalize_ref=true
maxent_length_normalize_policy=true
maxent_listwise_skip_zero_variance_groups=true
maxent_use_clip_objective=true
maxent_clip_objective_coef=1.0
maxent_reference_logprobs_source=model so a nonzero beta can reuse the model-reference path without changing the launcher
maxent_logprob_chunk_size=2
train_batch_size_per_device=8 so each microbatch contains one whole prompt group when num_samples=8
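To give a feel for what a temperature and an epsilon can do in a listwise target, here is a minimal sketch of a temperature-scaled softmax over one prompt group's rewards with an epsilon floor. This is only one plausible reading of maxent_q_temperature and maxent_q_epsilon; the actual math lives in oat_zero_ext/listwise.py and may differ. The function name listwise_targets is hypothetical.

```python
import math


def listwise_targets(rewards, q_temperature=2.0, q_epsilon=1e-6):
    """Softmax(rewards / T) with an epsilon floor, renormalized to sum to 1.

    ASSUMPTION: illustrative only; not taken from oat_zero_ext/listwise.py.
    """
    if q_temperature <= 0:
        raise ValueError("q_temperature must be positive")
    scaled = [r / q_temperature for r in rewards]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    # The floor keeps every entry strictly positive so log(q) stays finite.
    floored = [e / total + q_epsilon for e in exps]
    norm = sum(floored)
    return [v / norm for v in floored]


q = listwise_targets([1.0, 0.0, 0.0, 1.0])
uniform = listwise_targets([1.0] * 8)
```

In this sketch a group whose rewards are all equal produces a uniform target, which is one way to see why a setting such as maxent_listwise_skip_zero_variance_groups can be useful: such groups carry no preference signal.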
Safety Guardrails
The baseline DR.GRPO path stays the default. Nothing switches to listwise
unless the launch explicitly sets OAT_ZERO_OBJECTIVE=maxent_listwise.
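The opt-in routing can be sketched in a few lines of Python. Only the variable name OAT_ZERO_OBJECTIVE and the two objective values come from this guide; the helper resolve_objective is hypothetical and is not the launcher's actual code.

```python
import os

# Objective names taken from this guide; the baseline is the default.
SUPPORTED_OBJECTIVES = ("grpo", "maxent_listwise")


def resolve_objective(env=None):
    """Return the objective, falling back to the baseline when the variable is unset."""
    env = os.environ if env is None else env
    objective = env.get("OAT_ZERO_OBJECTIVE", "grpo")
    if objective not in SUPPORTED_OBJECTIVES:
        raise ValueError(f"unsupported objective: {objective!r}")
    return objective


baseline = resolve_objective({})
explorer = resolve_objective({"OAT_ZERO_OBJECTIVE": "maxent_listwise"})
```

Passing a plain dict instead of os.environ keeps the helper easy to test without touching the real environment.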
Additional guardrails are enforced in three places:
shell launcher validation in ops/run_oat_zero_exact_1p5b_upstream.sh
learner validation in understand-r1-zero/train_zero_math.py
repo audit in tools/audit_oat_setup.py
The listwise path now fails fast when:
objective is unsupported
critic_type is not drgrpo
num_samples <= 1
train_batch_size_per_device is not divisible by num_samples
maxent_tau <= 0
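The fail-fast conditions above translate directly into a small validation routine. This is a minimal sketch, not the code in train_zero_math.py: the function name validate_listwise_config is hypothetical, and it assumes the listwise-specific checks are skipped when the baseline objective is selected.

```python
def validate_listwise_config(objective, critic_type, num_samples,
                             train_batch_size_per_device, maxent_tau):
    """Raise ValueError on the first violated guardrail."""
    if objective not in ("grpo", "maxent_listwise"):
        raise ValueError(f"unsupported objective: {objective!r}")
    if objective != "maxent_listwise":
        return  # ASSUMPTION: remaining checks apply only to the listwise path
    if critic_type != "drgrpo":
        raise ValueError("listwise requires critic_type=drgrpo")
    if num_samples <= 1:
        raise ValueError("listwise requires num_samples > 1")
    if train_batch_size_per_device % num_samples != 0:
        raise ValueError(
            "train_batch_size_per_device must be divisible by num_samples"
        )
    if maxent_tau <= 0:
        raise ValueError("maxent_tau must be positive")


# Explorer defaults from this guide pass without raising.
validate_listwise_config("maxent_listwise", "drgrpo", 8, 8, 0.5)
```

Note that the baseline value train_batch_size_per_device=1 would be rejected on the listwise path, since 1 is not divisible by num_samples=8; this is why the explorer defaults raise it to 8.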
Implementation Notes
The local listwise extension lives in oat_zero_ext/listwise.py and is used
only by the patched understand-r1-zero/train_zero_math.py learner.
Key design choice: the baseline OAT PPO/Dr.GRPO update path is untouched. The
listwise branch activates only inside ZeroMathLearner.learning_step when
objective=maxent_listwise and keeps whole prompt groups together during
minibatching.
That prompt-group preservation matters because stock OAT shuffles flat rollout rows, which is correct for GRPO but wrong for a listwise prompt-group loss.
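One way to keep prompt groups intact is to shuffle group indices rather than individual rows. The sketch below assumes rollout rows are stored contiguously by prompt, so that row i belongs to prompt i // num_samples; that layout and the helper name grouped_minibatches are assumptions made for illustration, not a description of the patched learner.

```python
import random


def grouped_minibatches(num_rows, num_samples, groups_per_batch=1, seed=0):
    """Yield lists of row indices, each containing only whole prompt groups."""
    if num_rows % num_samples != 0:
        raise ValueError("num_rows must be a multiple of num_samples")
    group_order = list(range(num_rows // num_samples))
    # Shuffle the order of groups, never the rows inside a group.
    random.Random(seed).shuffle(group_order)
    for start in range(0, len(group_order), groups_per_batch):
        batch = []
        for group in group_order[start:start + groups_per_batch]:
            first_row = group * num_samples
            batch.extend(range(first_row, first_row + num_samples))
        yield batch


batches = list(grouped_minibatches(num_rows=32, num_samples=8))
```

With num_samples=8 and one group per batch, every minibatch has exactly eight rows from a single prompt, which matches the train_batch_size_per_device=8 explorer default.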
Guarantees And Limits
This setup is engineered to avoid breaking the proven baseline:
one shared runtime/bootstrap path
separate baseline and explorer wrappers
separate default run names and local scratch roots
opt-in objective routing
focused unit tests for the prompt-group and q/weight math
No setup can promise an absolute guarantee against every future cluster, dependency, or upstream behavior change. What this repository does provide is an isolated baseline path, an isolated explorer overlay, explicit validation, and targeted tests for the failure modes that would silently corrupt the objective.