Recipes

Use YAML recipes to keep runs reproducible and readable. The configs mirror the dataclasses in maxent_grpo.config and TRL’s GRPO settings.

Recipe Layout

Recipes are flat YAML mappings. Fields are routed automatically into three config objects based on their dataclass field names:

  • GRPOScriptArguments (dataset, evaluation, reward-related script knobs)

  • GRPOConfig (training, MaxEnt, vLLM, logging)

  • TRL ModelConfig (model name, dtype, revision, etc.)

Keys that do not match script or training fields are forwarded to the TRL ModelConfig when possible. Any remaining keys are ignored.
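The routing rule above can be sketched with dataclasses.fields. This is only an illustration: ScriptArgs, TrainingArgs, ModelArgs, and route_recipe are hypothetical stand-ins, not the package's classes, and checking script fields before training fields is an assumption.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


# Hypothetical stand-ins for the three config objects; the real classes
# carry many more fields than shown here.
@dataclass
class ScriptArgs:
    dataset_name: Optional[str] = None


@dataclass
class TrainingArgs:
    output_dir: Optional[str] = None
    objective: str = "grpo"
    maxent_tau: float = 0.0


@dataclass
class ModelArgs:
    model_name_or_path: Optional[str] = None


def route_recipe(recipe: dict[str, Any]):
    """Split a flat mapping into (script, training, model, ignored_keys)."""
    script_names = {f.name for f in fields(ScriptArgs)}
    training_names = {f.name for f in fields(TrainingArgs)}
    model_names = {f.name for f in fields(ModelArgs)}

    script, training, model, ignored = {}, {}, {}, []
    for key, value in recipe.items():
        if key in script_names:          # assumed to be checked first
            script[key] = value
        elif key in training_names:
            training[key] = value
        elif key in model_names:         # fallback target for leftover keys
            model[key] = value
        else:
            ignored.append(key)          # no matching field anywhere
    return ScriptArgs(**script), TrainingArgs(**training), ModelArgs(**model), ignored
```

Calling route_recipe on a flat mapping returns the three populated objects plus the list of keys that matched nothing, which is handy for spotting typos in a recipe.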

Minimal example:

model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
dataset_name: open-r1/OpenR1-Math-220k
output_dir: var/data/out
objective: maxent_listwise
maxent_tau: 0.2

Loading and Validation

maxent_grpo.config.load_grpo_recipe loads YAML with OmegaConf (or PyYAML), sets GRPO_RECIPE_USED to the resolved path, and applies a few convenience rules:

  • When use_vllm: true and vllm_mode: server, missing vllm_server_base_url / host / port are inferred from vllm_url.

  • MAXENT_LOG_LEVEL overrides log_level in the training config.
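As a rough illustration of the vllm_url inference, the sketch below fills the three server fields with the standard library's urlsplit. The function name infer_vllm_server_fields is hypothetical, and reading "host / port" as vllm_server_host and vllm_server_port is an assumption; the actual loader may derive the values differently.

```python
from typing import Any
from urllib.parse import urlsplit


def infer_vllm_server_fields(recipe: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the recipe with missing server fields filled in."""
    out = dict(recipe)
    # Only applies to server mode with vLLM enabled.
    if not (out.get("use_vllm") and out.get("vllm_mode") == "server"):
        return out
    url = out.get("vllm_url")
    if not url:
        return out

    parts = urlsplit(url)
    if parts.hostname is None:
        return out  # nothing usable to infer from

    inferred = {
        "vllm_server_base_url": f"{parts.scheme}://{parts.netloc}",
        "vllm_server_host": parts.hostname,   # assumed field name
        "vllm_server_port": parts.port,       # assumed field name
    }
    for key, value in inferred.items():
        # Fill only fields that are absent or None; explicit values win.
        if out.get(key) is None and value is not None:
            out[key] = value
    return out
```

Explicitly set fields are left untouched, so a recipe can still override any single value while inheriting the rest from vllm_url.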

For flat recipes (no top-level script / training / model keys), schema validation enforces:

  • Baseline recipes must set beta.

  • MaxEnt recipes must set objective: maxent_entropy or objective: maxent_listwise. The one exception is the GRPO + entropy bonus path under train-maxent, which requires objective: grpo_entropy_bonus together with policy_entropy_bonus_coef > 0.

Validation is skipped during tests and only applies to flat recipe files (not Hydra configs).
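The two schema rules can be expressed as a small checker. This is a sketch under stated assumptions: validate_flat_recipe and its kind argument ("baseline" or "maxent") are hypothetical, since the text does not say how the loader tells the two recipe kinds apart.

```python
from typing import Any

NESTED_KEYS = {"script", "training", "model"}
MAXENT_OBJECTIVES = {"maxent_entropy", "maxent_listwise"}


def validate_flat_recipe(recipe: dict[str, Any], kind: str) -> list[str]:
    """Return a list of rule violations; an empty list means the recipe passes."""
    # Recipes with top-level script/training/model keys are not flat: skip.
    if NESTED_KEYS.intersection(recipe):
        return []

    errors: list[str] = []
    if kind == "baseline":
        if recipe.get("beta") is None:
            errors.append("baseline recipes must set beta")
    elif kind == "maxent":
        objective = recipe.get("objective")
        coef = float(recipe.get("policy_entropy_bonus_coef") or 0.0)
        # Opt-in path: GRPO plus an entropy bonus with a positive coefficient.
        bonus_path = objective == "grpo_entropy_bonus" and coef > 0.0
        if objective not in MAXENT_OBJECTIVES and not bonus_path:
            errors.append(
                "MaxEnt recipes must use maxent_entropy or maxent_listwise, "
                "or grpo_entropy_bonus with policy_entropy_bonus_coef > 0"
            )
    else:
        raise ValueError(f"unknown recipe kind: {kind!r}")
    return errors
```

Returning messages rather than raising on the first problem lets a caller report every violation at once.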

Dataset Mixtures

To blend multiple datasets, set dataset_mixture instead of dataset_name:

dataset_mixture:
  seed: 42
  test_split_size: 0.02
  datasets:
    - id: open-r1/OpenR1-Math-220k
      split: train
      columns: [problem, answer]
      weight: 1.0
    - id: some/other-dataset
      split: train
      weight: 0.5

Each dataset can define a split, optional columns, and a sampling weight. The mixture loader validates column consistency and can carve out a test split on the combined dataset.
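To make the shape of a mixture entry concrete, here is a minimal parsing sketch. MixtureEntry and parse_mixture are hypothetical names, the defaults for split and weight are assumptions, and "column consistency" is interpreted narrowly as "entries that declare columns must declare the same set"; the real loader may check more. How weight is applied during sampling is not modeled.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MixtureEntry:
    id: str
    split: str = "train"                  # assumed default
    columns: Optional[list[str]] = None   # None means "keep all columns"
    weight: float = 1.0                   # assumed default


def parse_mixture(cfg: dict[str, Any]) -> list[MixtureEntry]:
    """Parse the datasets list of a dataset_mixture mapping."""
    entries = [MixtureEntry(**item) for item in cfg.get("datasets", [])]
    if not entries:
        raise ValueError("dataset_mixture needs at least one dataset")

    for entry in entries:
        if entry.weight <= 0:
            raise ValueError(f"{entry.id}: weight must be positive")

    # Entries that list columns must agree on the set of column names.
    declared = {tuple(sorted(e.columns)) for e in entries if e.columns is not None}
    if len(declared) > 1:
        raise ValueError(f"inconsistent columns across datasets: {sorted(declared)}")
    return entries
```

With the example mixture above, the second entry simply inherits columns=None, so only the first entry's column list takes part in the consistency check.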

Math GRPO (Qwen 1.5B)

# Copyright 2025 Liv d'Aliberti
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Model arguments
model_name_or_path: Qwen/Qwen2.5-Math-1.5B
reference_model_name_or_path: Qwen/Qwen2.5-Math-1.5B
model_revision: main
resume_from_checkpoint: false
torch_dtype: float16
attn_implementation: sdpa

# Data training arguments
dataset_name: axon-rl/MATH-lvl3to5-8k
dataset_config: default
dataset_prompt_column: problem
dataset_solution_column: answer
eval_dataset_name: aime24,amc,math_500,minerva,olympiad_bench
eval_dataset_split: test
eval_dataset_prompt_column: problem
eval_dataset_solution_column: answer
disable_distributed_sampler: true
prompt_template: "no"
system_prompt: null
chat_template: null

# Dr.GRPO-style training/eval settings
bf16: true
fp16: false
use_vllm: true
vllm_mode: colocate
torch_compile: false
vllm_return_logprobs: true
vllm_request_logprobs: true
maxent_reference_logprobs_source: model
maxent_trl_reference_scoring: true
behavior_logprobs_source: model
# Score rollout batches in smaller slices to reduce peak logits memory without
# changing the exact loss computation.
maxent_logprob_chunk_size: 2
policy_entropy_bonus_coef: 0.0
maxent_length_normalize_ref: false
maxent_length_normalize_policy: false
maxent_policy_entropy: false
vllm_sync_weights: true
vllm_sync_interval_steps: 1
vllm_gpu_memory_utilization: 0.35
do_eval: false
gradient_accumulation_steps: 16
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
objective: grpo
grpo_loss_type: dr_grpo
num_iterations: 1
gen_temperature: 1.0
gen_top_p: 1.0
gen_top_k: -1
vllm_top_k: -1
gen_best_of: 1
vllm_request_timeout: 600
hub_model_id: null
hub_strategy: "end"
learning_rate: 1e-6
lr_scheduler_type: constant
optim: adamw_torch
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 1e-08
weight_decay: 0.0
log_completions: true
log_level: 20
logging_first_step: true
logging_steps: 1
logging_strategy: steps
log_like_grpo: true
max_prompt_length: 1024
max_completion_length: 3000
max_steps: -1
num_generations: 8
num_train_epochs: 20
output_dir: var/data/Qwen2.5-1.5B-Open-R1-GRPO-BASELINE-math-v1
overwrite_output_dir: true
# Local batches are flattened over rollouts; 8 keeps one prompt-group per device.
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
dataloader_num_workers: 4
dataloader_pin_memory: true
dataloader_prefetch_factor: 2
dataloader_persistent_workers: true
push_to_hub: false
report_to:
- wandb
wandb_project: huggingface
wandb_entity: ogd3-princeton-university
reward_funcs:
- seed_paper_boxed_accuracy_math
reward_weights:
- 1
eval_reward_funcs:
- seed_paper_boxed_accuracy_math
eval_reward_weights:
- 1
scale_rewards: false
beta: 0.0
kl_target: 0.0
kl_horizon: 0
kl_ctl_step_size: 0.0
grpo_beta_controller_enabled: false
maxent_beta_controller_enabled: false
max_grad_norm: 1.0
ppo_clip_range: 0.2
clip_range: 0.2
clip_range_high: 0.2
save_strategy: "no"
save_steps: 1
# Built-in trainer eval is disabled for these math runs; official paper eval is
# scheduled directly by SeedPaperEvalCallback on start and every eval_steps.
eval_strategy: "no"
eval_steps: 10
seed_paper_eval_template: "no"
save_total_limit: 1000
seed: 42
warmup_ratio: 0.0
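The batch-size comment in this recipe implies some simple arithmetic: with per_device_train_batch_size: 8 and num_generations: 8, each device holds one prompt group per micro-batch. The sketch below works that out, assuming the usual convention that an optimizer step accumulates gradient_accumulation_steps micro-batches on each of world_size devices. The helper name and the world_size parameter are illustrative and not part of the recipe.

```python
def prompt_groups_per_optimizer_step(
    per_device_train_batch_size: int,
    num_generations: int,
    gradient_accumulation_steps: int,
    world_size: int = 1,
) -> int:
    """Count distinct prompt groups contributing to one optimizer step."""
    if per_device_train_batch_size % num_generations != 0:
        raise ValueError(
            "per_device_train_batch_size must be a multiple of num_generations "
            "so that rollouts of one prompt stay together"
        )
    # Local batches are flattened over rollouts, so divide by the group size.
    groups_per_micro_batch = per_device_train_batch_size // num_generations
    return groups_per_micro_batch * gradient_accumulation_steps * world_size


# Values from the recipe above, on a single device:
# 8 // 8 = 1 group per micro-batch, times 16 accumulation steps.
single_device = prompt_groups_per_optimizer_step(8, 8, 16)
```

Under these assumptions a single device sees 16 prompt groups (128 completions) per optimizer step, and the count scales linearly with the number of devices.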

Paired Recipes (GRPO vs MaxEnt)

For reproducible comparisons, each model family ships paired GRPO and MaxEnt recipes that keep sampling, optimizer, and evaluation settings aligned. Use the GRPO recipe under grpo/ and the MaxEnt counterpart under maxent-grpo/:

  • GRPO: configs/recipes/<model>/grpo/config_math.yaml

  • MaxEnt: configs/recipes/<model>/maxent-grpo/config_math.yaml

  • Coding GRPO: configs/recipes/<model>/grpo/config_code_mbpp.yaml

  • Coding MaxEnt: configs/recipes/<model>/maxent-grpo/config_code_mbpp.yaml

Paired GRPO recipes pin maxent_reference_logprobs_source: model so both objectives use a frozen reference anchor for KL, and they keep optimizer and sampling settings aligned with the MaxEnt counterparts.
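One way to check that a pair really is aligned is to diff the two loaded mappings and ignore only the keys that are supposed to differ. The sketch below assumes both recipes are already loaded as flat dicts; the helper name and the allow-list in EXPECTED_TO_DIFFER are hypothetical and should be adapted to the pair being compared.

```python
from typing import Any

# Hypothetical allow-list of keys a GRPO/MaxEnt pair may legitimately change.
EXPECTED_TO_DIFFER = frozenset({"objective", "output_dir", "maxent_alpha", "maxent_tau"})


def unexpected_differences(
    grpo: dict[str, Any],
    maxent: dict[str, Any],
    allowed: frozenset = EXPECTED_TO_DIFFER,
) -> dict[str, tuple[Any, Any]]:
    """Map each misaligned key to its (grpo_value, maxent_value) pair."""
    diffs: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(grpo) | set(maxent)):
        if key in allowed:
            continue
        # A missing key is treated the same as an explicit None.
        left, right = grpo.get(key), maxent.get(key)
        if left != right:
            diffs[key] = (left, right)
    return diffs
```

An empty result means the pair differs only in the allow-listed keys; note that maxent_reference_logprobs_source is deliberately not on the list, so a mismatch there is reported.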

The paired Qwen 0.5B/1.5B maxent-grpo recipes default to trainer-level entropy MaxEnt (objective: maxent_entropy) and keep maxent_policy_entropy_mode: exact so the entropy term uses the correct backpropagated gradient. The frozen reference-model anchor remains matched to GRPO through maxent_reference_logprobs_source: model and the same beta-weighted KL term.

The older Qwen 7B maxent-grpo math recipe still uses the GRPO + entropy-bonus path. For trainer-level MaxEnt variants:

  • objective: maxent_entropy enables token-entropy regularization via maxent_alpha.

  • objective: maxent_listwise enables the tau/q/beta listwise weighting objective.

Tips

  • Adjust num_generations and max_completion_length to trade off speed against rollout diversity and length.

  • Set hub_model_id to a repository under your own Hub namespace.

  • Toggle use_vllm depending on whether vLLM generation is available in your setup.

Hydra Recipes

  • Baseline: configs/recipes/hydra/baseline_math.yaml

  • GRPO (paired parity): configs/recipes/hydra/grpo_custom_math.yaml

  • MaxEnt-GRPO: configs/recipes/hydra/maxent_math.yaml

  • Coding baseline (MBPP): configs/recipes/hydra/baseline_code_mbpp.yaml

  • Coding GRPO parity (MBPP): configs/recipes/hydra/grpo_custom_code_mbpp.yaml

  • Coding MaxEnt (MBPP): configs/recipes/hydra/maxent_code_mbpp.yaml

Hydra configs bundle command=... with a recipe path and optional overrides under baseline / maxent. They are a convenient way to share fully-specified CLI runs without long command lines.

For a clean four-way math comparison on the shared 1.5B base recipe, use:

  • GRPO parity: configs/recipes/hydra/grpo_custom_math.yaml

  • Entropy MaxEnt: configs/recipes/hydra/maxent_entropy_math.yaml

  • Listwise MaxEnt: configs/recipes/hydra/maxent_listwise_math.yaml

  • SEED-GRPO: configs/recipes/hydra/seed_grpo_math.yaml

See Method Identity for the exact mapping from these presets to algorithm family vs loss backend.