Overview

The canonical training path in this repository is now the upstream OAT README-flash stack plus a local listwise maxent-explorer overlay.

Active baseline launcher:

  • ops/run_oat_zero_exact_1p5b_upstream.sh

  • ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm

Active explorer launcher:

  • ops/run_oat_zero_explorer_1p5b_upstream.sh

  • ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm

Retired TRL/Hydra orchestration and older noncanonical launchers are archived under archive/trl/.

Canonical Runtime

The working runtime is the repo-local paper310 environment:

  • python==3.10.20

  • torch==2.6.0

  • transformers==4.51.3

  • vllm==0.8.4

  • oat-llm==0.1.3.post1

  • deepspeed==0.16.8

  • flash-attn==2.7.4.post1 via the launch-time overlay

Validate it before training:

python tools/audit_oat_setup.py

Quickstart

  1. Launch the canonical baseline:

sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_node302.slurm
  1. Launch the listwise maxent-explorer variant on the same stack:

sbatch ops/slurm/train_understand_r1_zero_qwen2p5_math_1p5b_r1_readme_flash_explorer_node302.slurm
  1. For local shell launches instead of Slurm:

bash ops/run_oat_zero_exact_1p5b_upstream.sh
bash ops/run_oat_zero_explorer_1p5b_upstream.sh

What Is Archived

  • archive/trl/ops/ keeps retired orchestration wrappers and experiment launchers.

  • archive/trl/ops/slurm/ keeps retired Slurm entrypoints.

  • src/maxent_grpo/ remains in the repo for reference and historical work, but it is not the canonical training front door anymore.