OAT Upstream DR.GRPO

This guide documents the exact upstream OAT/understand-r1-zero stack that reproduced the public README-style 1.5B DR.GRPO run in this repository, plus the opt-in listwise MaxEnt explorer overlay built directly on top of it.

Working Baseline

The proven baseline path is:

launcher: ops/run_oat_zero_exact_1p5b_upstream.sh
node302 Slurm wrapper: ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm
upstream checkout: understand-r1-zero/
Python env: var/seed_paper_eval/paper310
flash-attn: installed on the node at launch time into a local overlay

The README-flash runtime that worked is:

python==3.10.20
torch==2.6.0
transformers==4.51.3
vllm==0.8.4
oat-llm==0.1.3.post1
deepspeed==0.16.8
flash-attn==2.7.4.post1 via the runtime overlay
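
The pins above can be checked from inside the environment before a run is launched. The following is a minimal sketch, not part of the repository: the helper name `find_mismatches` and the `EXPECTED` table are illustrative, and the check assumes it runs with the flash-attn overlay already on the import path, since flash-attn is installed at launch time.

```python
from importlib import metadata

# Pins copied from the runtime list above.
EXPECTED = {
    "torch": "2.6.0",
    "transformers": "4.51.3",
    "vllm": "0.8.4",
    "oat-llm": "0.1.3.post1",
    "deepspeed": "0.16.8",
    "flash-attn": "2.7.4.post1",
}


def find_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match.

    `lookup` is injectable so the comparison can be exercised without the
    packages being installed (hypothetical helper, for illustration only).
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is missing from this environment
        # A local build tag such as "2.6.0+cu124" is treated as matching "2.6.0".
        if found is None or found.split("+")[0] != want:
            mismatches[name] = (want, found)
    return mismatches
```

Calling `find_mismatches(EXPECTED)` in the target environment returns an empty dict when every pin matches; any entry names a package whose installed version differs or is missing.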

The baseline training geometry is intentionally left unchanged:

objective=grpo
critic_type=drgrpo
prompt_template=r1
num_samples=8
train_batch_size=128
train_batch_size_per_device=1
rollout_batch_size=128
rollout_batch_size_per_device=16
pi_buffer_maxlen_per_device=128
beta=0.0

Explorer Overlay

The listwise explorer path reuses that exact runtime and launcher, but switches the learner onto the listwise MaxEnt objective through environment variables.

launcher: ops/run_oat_zero_explorer_1p5b_upstream.sh
node302 Slurm wrapper: ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm

The explorer defaults mirror the existing Dr.GRPO-Explorer settings already used elsewhere in this repository:

objective=maxent_listwise
beta=0.0 by default; set it above zero only when you explicitly want the reference-weight term active
maxent_tau=0.5
maxent_q_temperature=2.0
maxent_q_epsilon=1e-6
maxent_length_normalize_ref=true
maxent_length_normalize_policy=true
maxent_listwise_skip_zero_variance_groups=true
maxent_use_clip_objective=true
maxent_clip_objective_coef=1.0
maxent_reference_logprobs_source=model so a nonzero beta can reuse the model-reference path without changing the launcher
maxent_logprob_chunk_size=2
train_batch_size_per_device=8 so each microbatch contains one whole prompt group when num_samples=8
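
To make the `maxent_listwise_skip_zero_variance_groups` default concrete, here is a rough sketch of what skipping zero-variance groups could look like. It assumes the flag refers to prompt groups whose `num_samples` rewards are all identical, which carry no ranking signal for a listwise loss. The helper `split_zero_variance_groups` is hypothetical; the actual learner code is not shown in this guide and may differ.

```python
def split_zero_variance_groups(group_rewards, tol=0.0):
    """Partition prompt groups into (kept, skipped) indices by reward spread.

    Assumption: a group is "zero variance" when all of its rewards are equal.
    The spread (max - min) is zero exactly when the variance is zero.
    """
    kept, skipped = [], []
    for idx, rewards in enumerate(group_rewards):
        spread = max(rewards) - min(rewards)
        (kept if spread > tol else skipped).append(idx)
    return kept, skipped


# Three prompt groups with num_samples=8 each (illustrative rewards).
rewards = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],  # mixed outcomes
    [0.0] * 8,                                  # every sample wrong
    [1.0] * 8,                                  # every sample right
]
kept, skipped = split_zero_variance_groups(rewards)
```

Only the first group has any spread in its rewards, so it is the only one that would contribute to the listwise update under this reading of the flag.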

Safety Guardrails

The baseline DR.GRPO path stays the default. Nothing switches to listwise unless the launch explicitly sets OAT_ZERO_OBJECTIVE=maxent_listwise.
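
The opt-in behavior can be pictured as a small resolver that falls back to the baseline whenever the variable is absent. This is a sketch under assumptions: only `OAT_ZERO_OBJECTIVE` and the two objective names come from this guide, while `resolve_objective` and the treatment of an empty value as unset are illustrative choices, not the launcher's actual code.

```python
import os

# The two objectives named in this guide.
SUPPORTED_OBJECTIVES = ("grpo", "maxent_listwise")


def resolve_objective(env=None):
    """Pick the training objective, defaulting to the baseline.

    Hypothetical helper: unset or empty OAT_ZERO_OBJECTIVE means "grpo".
    """
    env = os.environ if env is None else env
    objective = env.get("OAT_ZERO_OBJECTIVE") or "grpo"
    if objective not in SUPPORTED_OBJECTIVES:
        raise ValueError(
            f"unsupported objective {objective!r}; "
            f"expected one of {SUPPORTED_OBJECTIVES}"
        )
    return objective
```

With this shape, a launch that never mentions the variable cannot drift onto the listwise path, and a misspelled value fails immediately instead of silently training the baseline.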

Additional guardrails are enforced in three places:

shell launcher validation in ops/run_oat_zero_exact_1p5b_upstream.sh
learner validation in understand-r1-zero/train_zero_math.py
repo audit in tools/audit_oat_setup.py

The listwise path now fails fast when:

objective is unsupported
critic_type is not drgrpo
num_samples <= 1
train_batch_size_per_device is not divisible by num_samples
maxent_tau <= 0
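
The fail-fast conditions above translate directly into a validation function. The sketch below mirrors the listed checks one for one; the function name `validate_listwise_config` and the error messages are illustrative and are not taken from `train_zero_math.py` or the launcher.

```python
def validate_listwise_config(
    objective,
    critic_type,
    num_samples,
    train_batch_size_per_device,
    maxent_tau,
):
    """Raise ValueError on the first violated listwise precondition.

    Hypothetical helper: the checks follow the list in this guide, but the
    real validation code may order or word them differently.
    """
    if objective not in ("grpo", "maxent_listwise"):
        raise ValueError(f"unsupported objective: {objective!r}")
    if objective != "maxent_listwise":
        return  # baseline path: the listwise checks do not apply
    if critic_type != "drgrpo":
        raise ValueError("listwise requires critic_type=drgrpo")
    if num_samples <= 1:
        raise ValueError("listwise requires num_samples > 1")
    if train_batch_size_per_device % num_samples != 0:
        raise ValueError(
            "train_batch_size_per_device must be divisible by num_samples "
            "so that microbatches hold whole prompt groups"
        )
    if maxent_tau <= 0:
        raise ValueError("maxent_tau must be positive")
```

For example, the explorer defaults (`num_samples=8`, `train_batch_size_per_device=8`, `maxent_tau=0.5`) pass, while combining the listwise objective with the baseline `train_batch_size_per_device=1` is rejected because a microbatch could not hold a whole prompt group.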

Implementation Notes

The local listwise extension lives in oat_zero_ext/listwise.py and is used only by the patched understand-r1-zero/train_zero_math.py learner.

Key design choice: the baseline OAT PPO/Dr.GRPO update path is untouched. The listwise branch activates only inside ZeroMathLearner.learning_step when objective=maxent_listwise and keeps whole prompt groups together during minibatching.

That prompt-group preservation matters because stock OAT shuffles flat rollout rows, which is correct for GRPO but wrong for a listwise prompt-group loss.
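
The difference between shuffling rows and shuffling groups can be sketched as follows. This is an illustration only, not the code in `oat_zero_ext/listwise.py`: it assumes the `num_samples` rows belonging to one prompt are stored contiguously in the flat rollout buffer, and the function name `group_preserving_minibatches` is hypothetical.

```python
import random


def group_preserving_minibatches(num_rows, num_samples, rows_per_minibatch, seed=0):
    """Yield lists of row indices where every prompt group stays intact.

    Assumption: rows [g * num_samples, (g + 1) * num_samples) belong to
    prompt group g. Groups are shuffled; rows inside a group are not split.
    """
    if num_rows % num_samples != 0:
        raise ValueError("num_rows must be a multiple of num_samples")
    if rows_per_minibatch % num_samples != 0:
        raise ValueError("rows_per_minibatch must be a multiple of num_samples")

    rng = random.Random(seed)
    group_ids = list(range(num_rows // num_samples))
    rng.shuffle(group_ids)  # shuffle whole groups, not individual rows

    groups_per_minibatch = rows_per_minibatch // num_samples
    batches = []
    for start in range(0, len(group_ids), groups_per_minibatch):
        batch = []
        for g in group_ids[start:start + groups_per_minibatch]:
            batch.extend(range(g * num_samples, (g + 1) * num_samples))
        batches.append(batch)
    return batches


# 4 prompts x 8 samples, one whole group per microbatch as in the explorer defaults.
batches = group_preserving_minibatches(num_rows=32, num_samples=8, rows_per_minibatch=8)
```

A flat `random.shuffle` over all 32 row indices would scatter a prompt's eight responses across microbatches, so a loss that normalizes over the responses to one prompt would see incomplete lists. Shuffling at the group level keeps the randomization while preserving the unit the listwise loss is defined on.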

Guarantees And Limits

This setup is engineered to avoid breaking the proven baseline:

one shared runtime/bootstrap path
separate baseline and explorer wrappers
separate default run names and local scratch roots
opt-in objective routing
focused unit tests for the prompt-group and q/weight math

No setup can guarantee against every future cluster, dependency, or upstream behavior change. What this repository does provide is an isolated baseline path, an isolated explorer overlay, explicit validation, and targeted tests for the failure modes that would silently corrupt the objective.