Recipes
Use YAML recipes to keep runs reproducible and readable. The configs mirror the dataclasses in maxent_grpo.config and TRL’s GRPO settings.
Recipe Layout
Recipes are flat YAML mappings. Fields are routed automatically into three config objects based on their dataclass field names:
- GRPOScriptArguments (dataset, evaluation, reward-related script knobs)
- GRPOConfig (training, MaxEnt, vLLM, logging)
- TRL ModelConfig (model name, dtype, revision, etc.)
Keys that do not match script or training fields are forwarded to the TRL
ModelConfig when possible. Any remaining keys are ignored.
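The routing rule above can be sketched in a few lines of Python. This is only an illustration: the three dataclasses below are hypothetical stand-ins with a handful of fields each, not the real GRPOScriptArguments, GRPOConfig, or ModelConfig, and the precedence order (script, then training, then model) is an assumption based on the description above.

```python
from dataclasses import dataclass, fields


# Hypothetical stand-ins for the three real config classes. The fields are a
# tiny illustrative subset, not the actual definitions.
@dataclass
class ScriptArgs:
    dataset_name: str = ""


@dataclass
class TrainingArgs:
    output_dir: str = ""
    objective: str = "grpo"
    maxent_tau: float = 0.0


@dataclass
class ModelArgs:
    model_name_or_path: str = ""


def route_recipe(recipe: dict) -> tuple:
    """Split a flat recipe into (script, training, model, ignored)."""
    script_names = {f.name for f in fields(ScriptArgs)}
    training_names = {f.name for f in fields(TrainingArgs)}
    model_names = {f.name for f in fields(ModelArgs)}
    script, training, model, ignored = {}, {}, {}, []
    for key, value in recipe.items():
        if key in script_names:
            script[key] = value
        elif key in training_names:
            training[key] = value
        elif key in model_names:
            # Keys unknown to script/training fall through to the model config.
            model[key] = value
        else:
            # Anything left over is dropped.
            ignored.append(key)
    return script, training, model, ignored


recipe = {
    "model_name_or_path": "Qwen/Qwen2.5-1.5B-Instruct",
    "dataset_name": "open-r1/OpenR1-Math-220k",
    "output_dir": "var/data/out",
    "objective": "maxent_listwise",
    "maxent_tau": 0.2,
    "not_a_real_field": 1,
}
script, training, model, ignored = route_recipe(recipe)
```

Because unmatched keys are ignored rather than rejected, a misspelled field name may pass unnoticed; checking the ignored list is a cheap way to catch typos.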
Minimal example:
model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
dataset_name: open-r1/OpenR1-Math-220k
output_dir: var/data/out
objective: maxent_listwise
maxent_tau: 0.2
Loading and Validation
maxent_grpo.config.load_grpo_recipe loads YAML with OmegaConf (or PyYAML),
sets GRPO_RECIPE_USED to the resolved path, and applies a few convenience
rules:
- When use_vllm: true and vllm_mode: server, missing vllm_server_base_url / host / port are inferred from vllm_url.
- MAXENT_LOG_LEVEL overrides log_level in the training config.
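A minimal sketch of these two rules, working on a plain dict rather than the real config objects. The key names vllm_server_host and vllm_server_port, and the way the base URL is derived from vllm_url (scheme plus host and port, path dropped), are assumptions made for illustration; the actual loader may differ.

```python
import os
from urllib.parse import urlsplit


def _fill_missing(cfg: dict, key: str, value) -> None:
    """Set cfg[key] only when it is absent or None and a value is available."""
    if cfg.get(key) is None and value is not None:
        cfg[key] = value


def apply_convenience_rules(cfg: dict, env=None) -> dict:
    """Return a copy of cfg with the two convenience rules applied."""
    env = os.environ if env is None else env
    out = dict(cfg)
    server_mode = out.get("use_vllm") and out.get("vllm_mode") == "server"
    if server_mode and out.get("vllm_url"):
        parts = urlsplit(out["vllm_url"])
        # Assumed derivation: keep scheme and netloc, drop the path.
        _fill_missing(out, "vllm_server_base_url", f"{parts.scheme}://{parts.netloc}")
        _fill_missing(out, "vllm_server_host", parts.hostname)  # assumed key name
        _fill_missing(out, "vllm_server_port", parts.port)      # assumed key name
    # The environment variable wins over the recipe value.
    if env.get("MAXENT_LOG_LEVEL"):
        out["log_level"] = env["MAXENT_LOG_LEVEL"]
    return out


cfg = {
    "use_vllm": True,
    "vllm_mode": "server",
    "vllm_url": "http://localhost:8000/generate",
    "log_level": "info",
}
resolved = apply_convenience_rules(cfg, env={"MAXENT_LOG_LEVEL": "debug"})
```

Values that the recipe already sets explicitly are left untouched; only missing fields are filled in.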
For flat recipes (no top-level script / training / model keys),
schema validation enforces:
- Baseline recipes must set beta.
- MaxEnt recipes must use objective: maxent_entropy or objective: maxent_listwise unless they opt into GRPO + entropy bonus via objective: grpo_entropy_bonus with policy_entropy_bonus_coef > 0 under train-maxent.
Validation is skipped during tests and only applies to flat recipe files (not Hydra configs).
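The validation rules above could look roughly like the following. This is a sketch under stated assumptions, not the package's validator: the function name, the command argument, and the "train-baseline" command name are hypothetical (only train-maxent appears in the text), and the recipe is a plain dict.

```python
MAXENT_OBJECTIVES = {"maxent_entropy", "maxent_listwise"}


def validate_flat_recipe(recipe: dict, command: str) -> None:
    """Raise ValueError if a flat recipe breaks the rules described above."""
    # Nested recipes (top-level script / training / model keys) are not checked.
    if any(key in recipe for key in ("script", "training", "model")):
        return
    if command == "train-baseline":  # hypothetical command name
        if recipe.get("beta") is None:
            raise ValueError("baseline recipes must set beta")
    elif command == "train-maxent":
        objective = recipe.get("objective")
        if objective in MAXENT_OBJECTIVES:
            return
        coef = recipe.get("policy_entropy_bonus_coef") or 0.0
        # Opt-in path: GRPO plus an entropy bonus with a positive coefficient.
        if objective == "grpo_entropy_bonus" and coef > 0:
            return
        raise ValueError(
            "MaxEnt recipes need objective maxent_entropy or maxent_listwise, "
            "or grpo_entropy_bonus with policy_entropy_bonus_coef > 0"
        )


def is_valid(recipe: dict, command: str) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        validate_flat_recipe(recipe, command)
    except ValueError:
        return False
    return True
```

Note that beta: 0.0 counts as set here; the rule is about the key being present, not about its value.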
Dataset Mixtures
To blend multiple datasets, set dataset_mixture instead of dataset_name:
dataset_mixture:
  seed: 42
  test_split_size: 0.02
  datasets:
    - id: open-r1/OpenR1-Math-220k
      split: train
      columns: [problem, answer]
      weight: 1.0
    - id: some/other-dataset
      split: train
      weight: 0.5
Each dataset can define a split, optional columns, and a sampling weight. The mixture loader validates column consistency and can carve out a test split on the combined dataset.
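As an illustration of the column-consistency check, the sketch below validates a mixture given as a plain dict. It is not the package's loader: the function name is hypothetical, and treating a missing weight as 1.0 and requiring positive weights are assumptions.

```python
def check_mixture(mixture: dict) -> list:
    """Return the shared column list, or raise ValueError on inconsistency."""
    datasets = mixture.get("datasets") or []
    if not datasets:
        raise ValueError("dataset_mixture.datasets must not be empty")
    shared = None
    for entry in datasets:
        if "id" not in entry:
            raise ValueError("each dataset entry needs an id")
        # Assumption: weight defaults to 1.0 and must be positive.
        if float(entry.get("weight", 1.0)) <= 0:
            raise ValueError(f"{entry['id']}: weight must be positive")
        columns = entry.get("columns")
        if columns is None:
            continue  # columns are optional per dataset
        if shared is None:
            shared = list(columns)
        elif set(columns) != set(shared):
            raise ValueError(
                f"{entry['id']}: columns {columns} do not match {shared}"
            )
    return shared or []


mixture = {
    "seed": 42,
    "test_split_size": 0.02,
    "datasets": [
        {
            "id": "open-r1/OpenR1-Math-220k",
            "split": "train",
            "columns": ["problem", "answer"],
            "weight": 1.0,
        },
        {"id": "some/other-dataset", "split": "train", "weight": 0.5},
    ],
}
shared_columns = check_mixture(mixture)
```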
Math GRPO (Qwen 1.5B)
# Copyright 2025 Liv d'Aliberti
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Model arguments
model_name_or_path: Qwen/Qwen2.5-Math-1.5B
reference_model_name_or_path: Qwen/Qwen2.5-Math-1.5B
model_revision: main
resume_from_checkpoint: false
torch_dtype: float16
attn_implementation: sdpa

# Data training arguments
dataset_name: axon-rl/MATH-lvl3to5-8k
dataset_config: default
dataset_prompt_column: problem
dataset_solution_column: answer
eval_dataset_name: aime24,amc,math_500,minerva,olympiad_bench
eval_dataset_split: test
eval_dataset_prompt_column: problem
eval_dataset_solution_column: answer
disable_distributed_sampler: true
prompt_template: "no"
system_prompt: null
chat_template: null

# Dr.GRPO-style training/eval settings
bf16: true
fp16: false
use_vllm: true
vllm_mode: colocate
torch_compile: false
vllm_return_logprobs: true
vllm_request_logprobs: true
maxent_reference_logprobs_source: model
maxent_trl_reference_scoring: true
behavior_logprobs_source: model
# Score rollout batches in smaller slices to reduce peak logits memory without
# changing the exact loss computation.
maxent_logprob_chunk_size: 2
policy_entropy_bonus_coef: 0.0
maxent_length_normalize_ref: false
maxent_length_normalize_policy: false
maxent_policy_entropy: false
vllm_sync_weights: true
vllm_sync_interval_steps: 1
vllm_gpu_memory_utilization: 0.35
do_eval: false
gradient_accumulation_steps: 16
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
objective: grpo
grpo_loss_type: dr_grpo
num_iterations: 1
gen_temperature: 1.0
gen_top_p: 1.0
gen_top_k: -1
vllm_top_k: -1
gen_best_of: 1
vllm_request_timeout: 600
hub_model_id: null
hub_strategy: "end"
learning_rate: 1e-6
lr_scheduler_type: constant
optim: adamw_torch
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 1e-08
weight_decay: 0.0
log_completions: true
log_level: 20
logging_first_step: true
logging_steps: 1
logging_strategy: steps
log_like_grpo: true
max_prompt_length: 1024
max_completion_length: 3000
max_steps: -1
num_generations: 8
num_train_epochs: 20
output_dir: var/data/Qwen2.5-1.5B-Open-R1-GRPO-BASELINE-math-v1
overwrite_output_dir: true
# Local batches are flattened over rollouts; 8 keeps one prompt-group per device.
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
dataloader_num_workers: 4
dataloader_pin_memory: true
dataloader_prefetch_factor: 2
dataloader_persistent_workers: true
push_to_hub: false
report_to:
- wandb
wandb_project: huggingface
wandb_entity: ogd3-princeton-university
reward_funcs:
- seed_paper_boxed_accuracy_math
reward_weights:
- 1
eval_reward_funcs:
- seed_paper_boxed_accuracy_math
eval_reward_weights:
- 1
scale_rewards: false
beta: 0.0
kl_target: 0.0
kl_horizon: 0
kl_ctl_step_size: 0.0
grpo_beta_controller_enabled: false
maxent_beta_controller_enabled: false
max_grad_norm: 1.0
ppo_clip_range: 0.2
clip_range: 0.2
clip_range_high: 0.2
save_strategy: "no"
save_steps: 1
# Built-in trainer eval is disabled for these math runs; official paper eval is
# scheduled directly by SeedPaperEvalCallback on start and every eval_steps.
eval_strategy: "no"
eval_steps: 10
seed_paper_eval_template: "no"
save_total_limit: 1000
seed: 42
warmup_ratio: 0.0
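The recipe's comment on per_device_train_batch_size says local batches are flattened over rollouts. The arithmetic behind that can be sketched as follows. The helper is hypothetical, the device count is not part of the recipe and is passed in as an assumption, and the divisibility check assumes each device holds whole prompt groups.

```python
def rollout_batch_shape(
    per_device_train_batch_size: int,
    num_generations: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,  # not in the recipe; supplied by the launcher
) -> dict:
    """Relate the flattened rollout batch to prompt groups per optimizer step."""
    if per_device_train_batch_size % num_generations != 0:
        raise ValueError(
            "per_device_train_batch_size should be a multiple of num_generations"
        )
    # Each prompt contributes num_generations completions to the local batch.
    prompts_per_device_batch = per_device_train_batch_size // num_generations
    completions_per_step = (
        per_device_train_batch_size * gradient_accumulation_steps * num_devices
    )
    return {
        "prompts_per_device_batch": prompts_per_device_batch,
        "completions_per_optimizer_step": completions_per_step,
        "prompts_per_optimizer_step": completions_per_step // num_generations,
    }


# Values taken from the recipe above, on a single device.
shape = rollout_batch_shape(
    per_device_train_batch_size=8,
    num_generations=8,
    gradient_accumulation_steps=16,
)
```

With the recipe's values on one device, this gives one prompt group per local batch, 128 completions per optimizer step, and 16 prompts per optimizer step.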
Paired Recipes (GRPO vs MaxEnt)
For reproducible comparisons, each model family ships paired GRPO and
MaxEnt recipes that keep sampling, optimizer, and evaluation settings aligned.
Use the GRPO recipe under grpo/ and the MaxEnt counterpart under
maxent-grpo/:
- GRPO: configs/recipes/<model>/grpo/config_math.yaml
- MaxEnt: configs/recipes/<model>/maxent-grpo/config_math.yaml
- Coding GRPO: configs/recipes/<model>/grpo/config_code_mbpp.yaml
- Coding MaxEnt: configs/recipes/<model>/maxent-grpo/config_code_mbpp.yaml
Paired GRPO recipes pin maxent_reference_logprobs_source: model so both
objectives use a frozen reference anchor for KL, and they keep optimizer and
sampling settings aligned with the MaxEnt counterparts.
The paired Qwen 0.5B/1.5B maxent-grpo recipes default to trainer-level
entropy MaxEnt (objective: maxent_entropy) and keep
maxent_policy_entropy_mode: exact so the entropy term uses the correct
backpropagated gradient. The frozen reference-model anchor remains matched to
GRPO through maxent_reference_logprobs_source: model and the same
beta-weighted KL term.
The older Qwen 7B maxent-grpo math recipe still uses the GRPO +
entropy-bonus path. For trainer-level MaxEnt variants:
- objective: maxent_entropy enables token-entropy regularization via maxent_alpha.
- objective: maxent_listwise enables the tau/q/beta listwise weighting objective.
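Since paired recipes are meant to differ only in the objective, one way to confirm that a GRPO recipe and its MaxEnt counterpart are still aligned is to diff the settings that should match. The sketch below works on already-loaded dicts. The helper name and the list of keys are illustrative assumptions (the key names come from the Math GRPO recipe shown earlier); the project may use a different or longer list.

```python
# Illustrative selection of keys a paired comparison would want identical.
ALIGNED_KEYS = (
    "learning_rate",
    "optim",
    "num_generations",
    "gen_temperature",
    "gen_top_p",
    "max_completion_length",
    "eval_steps",
    "seed",
    "beta",
    "maxent_reference_logprobs_source",
)


def misaligned_keys(grpo: dict, maxent: dict, keys=ALIGNED_KEYS) -> dict:
    """Map each differing key to its (grpo_value, maxent_value) pair."""
    return {
        key: (grpo.get(key), maxent.get(key))
        for key in keys
        if grpo.get(key) != maxent.get(key)
    }


grpo_recipe = {
    "objective": "grpo",
    "learning_rate": 1e-6,
    "num_generations": 8,
    "beta": 0.0,
    "maxent_reference_logprobs_source": "model",
}
# Same settings, different objective, plus one accidental drift.
maxent_recipe = dict(grpo_recipe, objective="maxent_entropy", num_generations=4)

drift = misaligned_keys(grpo_recipe, maxent_recipe)
```

The objective key is deliberately left out of the comparison, since it is the one setting the pair is expected to change.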
Tips
- Adjust num_generations and max_completion_length to trade off speed vs. diversity.
- Set hub_model_id to point at your namespace.
- Toggle use_vllm depending on your setup.
Hydra recipes
- Baseline: configs/recipes/hydra/baseline_math.yaml
- GRPO (paired parity): configs/recipes/hydra/grpo_custom_math.yaml
- MaxEnt-GRPO: configs/recipes/hydra/maxent_math.yaml
- Coding baseline (MBPP): configs/recipes/hydra/baseline_code_mbpp.yaml
- Coding GRPO parity (MBPP): configs/recipes/hydra/grpo_custom_code_mbpp.yaml
- Coding MaxEnt (MBPP): configs/recipes/hydra/maxent_code_mbpp.yaml
Hydra configs bundle command=... with a recipe path and optional overrides
under baseline / maxent. They are a convenient way to
share fully-specified CLI runs without long command lines.
For a clean four-way math comparison on the shared 1.5B base recipe, use:
- GRPO parity: configs/recipes/hydra/grpo_custom_math.yaml
- Entropy MaxEnt: configs/recipes/hydra/maxent_entropy_math.yaml
- Listwise MaxEnt: configs/recipes/hydra/maxent_listwise_math.yaml
- SEED-GRPO: configs/recipes/hydra/seed_grpo_math.yaml
See Method Identity for the exact mapping from these presets to algorithm family vs loss backend.