OAT Upstream DR.GRPO
====================

This guide documents the exact upstream OAT/``understand-r1-zero`` stack that
reproduced the public README-style 1.5B DR.GRPO run in this repository, plus
the opt-in listwise MaxEnt explorer overlay built directly on top of it.

Working Baseline
----------------

The proven baseline path is:

- launcher: ``ops/run_oat_zero_exact_1p5b_upstream.sh``
- node302 Slurm wrapper:
  ``ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm``
- upstream checkout: ``understand-r1-zero/``
- Python env: ``var/seed_paper_eval/paper310``
- flash-attn: installed on the node at launch time into a local overlay

The README-flash runtime that worked is:

- ``python==3.10.20``
- ``torch==2.6.0``
- ``transformers==4.51.3``
- ``vllm==0.8.4``
- ``oat-llm==0.1.3.post1``
- ``deepspeed==0.16.8``
- ``flash-attn==2.7.4.post1`` via the runtime overlay

The baseline training geometry is intentionally left unchanged:

- ``objective=grpo``
- ``critic_type=drgrpo``
- ``prompt_template=r1``
- ``num_samples=8``
- ``train_batch_size=128``
- ``train_batch_size_per_device=1``
- ``rollout_batch_size=128``
- ``rollout_batch_size_per_device=16``
- ``pi_buffer_maxlen_per_device=128``
- ``beta=0.0``

Explorer Overlay
----------------

The listwise explorer path reuses that exact runtime and launcher, but
switches the learner onto the listwise MaxEnt objective through environment
variables.


- launcher: ``ops/run_oat_zero_explorer_1p5b_upstream.sh``
- node302 Slurm wrapper:
  ``ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm``

The explorer defaults mirror the existing Dr.GRPO-Explorer settings already
used elsewhere in this repository:

- ``objective=maxent_listwise``
- ``beta=0.0`` by default; set it above zero only when you explicitly want the
  reference-weight term active
- ``maxent_tau=0.5``
- ``maxent_q_temperature=2.0``
- ``maxent_q_epsilon=1e-6``
- ``maxent_length_normalize_ref=true``
- ``maxent_length_normalize_policy=true``
- ``maxent_listwise_skip_zero_variance_groups=true``
- ``maxent_use_clip_objective=true``
- ``maxent_clip_objective_coef=1.0``
- ``maxent_reference_logprobs_source=model`` so a nonzero ``beta`` can reuse
  the model-reference path without changing the launcher
- ``maxent_logprob_chunk_size=2``
- ``train_batch_size_per_device=8`` so each microbatch contains one whole
  prompt group when ``num_samples=8``

Safety Guardrails
-----------------

The baseline DR.GRPO path stays the default. Nothing switches to listwise
unless the launch explicitly sets ``OAT_ZERO_OBJECTIVE=maxent_listwise``.

Additional guardrails are enforced in three places:

- shell launcher validation in ``ops/run_oat_zero_exact_1p5b_upstream.sh``
- learner validation in ``understand-r1-zero/train_zero_math.py``
- repo audit in ``tools/audit_oat_setup.py``

The listwise path now fails fast when:

- ``objective`` is unsupported
- ``critic_type`` is not ``drgrpo``
- ``num_samples <= 1``
- ``train_batch_size_per_device`` is not divisible by ``num_samples``
- ``maxent_tau <= 0``

Implementation Notes
--------------------

The local listwise extension lives in ``oat_zero_ext/listwise.py`` and is used
only by the patched ``understand-r1-zero/train_zero_math.py`` learner.

Key design choice: the baseline OAT PPO/Dr.GRPO update path is untouched.
The listwise branch activates only inside ``ZeroMathLearner.learning_step``
when ``objective=maxent_listwise`` and keeps whole prompt groups together
during minibatching. That prompt-group preservation matters because stock OAT
shuffles flat rollout rows, which is correct for GRPO but wrong for a listwise
prompt-group loss.

Guarantees And Limits
---------------------

This setup is engineered to avoid breaking the proven baseline:

- one shared runtime/bootstrap path
- separate baseline and explorer wrappers
- separate default run names and local scratch roots
- opt-in objective routing
- focused unit tests for the prompt-group and q/weight math

It is still not possible to promise an absolute guarantee against every future
cluster, dependency, or upstream behavior change. What we do guarantee here is
that the repository now has an isolated baseline path, an isolated explorer
overlay, explicit validation, and targeted tests for the failure modes that
would silently corrupt the objective.
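
As an illustration of the two failure modes named above (a misconfigured
listwise launch and prompt groups split across minibatches), the following is
a minimal sketch and not the code in ``oat_zero_ext/listwise.py`` or
``train_zero_math.py``. The function names ``validate_listwise_config`` and
``group_preserving_minibatches`` are hypothetical, the supported-objective set
is assumed to be just the two values this guide mentions, and rollout rows are
assumed to be laid out so that each run of ``num_samples`` consecutive rows
belongs to one prompt.

```python
import random
from typing import List

# Assumption: only the two objectives named in this guide are supported.
SUPPORTED_OBJECTIVES = ("grpo", "maxent_listwise")


def validate_listwise_config(
    objective: str,
    critic_type: str,
    num_samples: int,
    train_batch_size_per_device: int,
    maxent_tau: float,
) -> None:
    """Raise ValueError on the fail-fast conditions listed under Safety Guardrails."""
    if objective not in SUPPORTED_OBJECTIVES:
        raise ValueError(f"unsupported objective: {objective!r}")
    if objective != "maxent_listwise":
        return  # the baseline path is left alone
    if critic_type != "drgrpo":
        raise ValueError("listwise path requires critic_type=drgrpo")
    if num_samples <= 1:
        raise ValueError("listwise path requires num_samples > 1")
    if train_batch_size_per_device % num_samples != 0:
        raise ValueError(
            "train_batch_size_per_device must be divisible by num_samples"
        )
    if maxent_tau <= 0:
        raise ValueError("maxent_tau must be positive")


def group_preserving_minibatches(
    num_rows: int, num_samples: int, batch_size: int, seed: int = 0
) -> List[List[int]]:
    """Shuffle whole prompt groups, never individual rows.

    Returns lists of row indices; every list holds complete groups only.
    """
    if num_rows % num_samples != 0 or batch_size % num_samples != 0:
        raise ValueError("row count and batch size must be multiples of num_samples")
    group_ids = list(range(num_rows // num_samples))
    random.Random(seed).shuffle(group_ids)  # permute groups, not flat rows
    groups_per_batch = batch_size // num_samples
    batches: List[List[int]] = []
    for start in range(0, len(group_ids), groups_per_batch):
        batch: List[int] = []
        for g in group_ids[start:start + groups_per_batch]:
            # Keep the num_samples rows of group g contiguous and together.
            batch.extend(range(g * num_samples, (g + 1) * num_samples))
        batches.append(batch)
    return batches
```

With the explorer defaults (``num_samples=8``,
``train_batch_size_per_device=8``) each minibatch in this sketch holds exactly
one prompt group, whereas the baseline value ``train_batch_size_per_device=1``
would be rejected by the divisibility check if combined with
``objective=maxent_listwise``.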