Recipes¶

Use YAML recipes to keep runs reproducible and readable. The configs mirror the dataclasses in maxent_grpo.config and TRL’s GRPO settings.

Recipe Layout¶

Recipes are flat YAML mappings. Fields are routed automatically into three config objects based on their dataclass field names:

GRPOScriptArguments (dataset, evaluation, reward-related script knobs)
GRPOConfig (training, MaxEnt, vLLM, logging)
TRL ModelConfig (model name, dtype, revision, etc.)

Keys that do not match script or training fields are forwarded to the TRL ModelConfig when possible. Any remaining keys are ignored.
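
The routing rule above can be sketched with plain dataclasses. This is an illustration only: ScriptArgs, TrainingArgs, and ModelArgs are hypothetical stand-ins for GRPOScriptArguments, GRPOConfig, and the TRL ModelConfig, each reduced to a few fields, and the order in which the real loader checks the three classes is an assumption.

```python
from dataclasses import dataclass, fields


# Hypothetical stand-ins; the real config classes have many more fields.
@dataclass
class ScriptArgs:
    dataset_name: str = ""


@dataclass
class TrainingArgs:
    output_dir: str = ""
    objective: str = "grpo"
    maxent_tau: float = 0.0


@dataclass
class ModelArgs:
    model_name_or_path: str = ""


def route_recipe(recipe: dict) -> tuple[dict, dict, dict, list]:
    """Split a flat recipe into script/training/model kwargs by field name."""
    script_names = {f.name for f in fields(ScriptArgs)}
    training_names = {f.name for f in fields(TrainingArgs)}
    model_names = {f.name for f in fields(ModelArgs)}

    script, training, model, ignored = {}, {}, {}, []
    for key, value in recipe.items():
        if key in script_names:
            script[key] = value
        elif key in training_names:
            training[key] = value
        elif key in model_names:
            # Leftover keys go to the model config when it has such a field.
            model[key] = value
        else:
            # Anything else is dropped, mirroring "remaining keys are ignored".
            ignored.append(key)
    return script, training, model, ignored


recipe = {
    "model_name_or_path": "Qwen/Qwen2.5-1.5B-Instruct",
    "dataset_name": "open-r1/OpenR1-Math-220k",
    "output_dir": "var/data/out",
    "objective": "maxent_listwise",
    "maxent_tau": 0.2,
    "not_a_real_key": 1,
}
script, training, model, ignored = route_recipe(recipe)
```

Because unmatched keys are dropped rather than rejected, returning them as a list (as this sketch does) is a cheap way to spot misspelled field names.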

Minimal example:

model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
dataset_name: open-r1/OpenR1-Math-220k
output_dir: var/data/out
objective: maxent_listwise
maxent_tau: 0.2

Loading and Validation¶

maxent_grpo.config.load_grpo_recipe loads YAML with OmegaConf (or PyYAML), sets GRPO_RECIPE_USED to the resolved path, and applies a few convenience rules:

When use_vllm: true and vllm_mode: server, missing vllm_server_base_url / host / port are inferred from vllm_url.
MAXENT_LOG_LEVEL overrides log_level in the training config.
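
A minimal sketch of these two convenience rules follows, using only the standard library. It is not the code in load_grpo_recipe: the function name apply_convenience_rules is hypothetical, the key names vllm_server_host and vllm_server_port are assumed expansions of "host / port", and the exact shape of the inferred base URL is a guess.

```python
import os
from urllib.parse import urlparse


def apply_convenience_rules(training, env=None):
    """Return a copy of `training` with the two convenience rules applied."""
    env = os.environ if env is None else env
    out = dict(training)

    def fill(key, value):
        # Only fill settings that the recipe left unset.
        if out.get(key) is None and value is not None:
            out[key] = value

    url = out.get("vllm_url")
    if out.get("use_vllm") and out.get("vllm_mode") == "server" and url:
        parsed = urlparse(url)
        fill("vllm_server_host", parsed.hostname)  # assumed key name
        fill("vllm_server_port", parsed.port)      # assumed key name
        if parsed.scheme and parsed.netloc:
            fill("vllm_server_base_url", f"{parsed.scheme}://{parsed.netloc}")

    # The environment variable takes precedence over the recipe's log_level.
    level = env.get("MAXENT_LOG_LEVEL")
    if level:
        out["log_level"] = level
    return out
```

Passing env explicitly keeps the sketch easy to exercise without touching the process environment.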

For flat recipes (no top-level script / training / model keys), schema validation enforces:

Baseline recipes must set beta.
MaxEnt recipes must set objective: maxent_entropy or objective: maxent_listwise. The only exception is the GRPO + entropy-bonus path under train-maxent, which requires objective: grpo_entropy_bonus with policy_entropy_bonus_coef > 0.

Validation is skipped during tests and only applies to flat recipe files (not Hydra configs).
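
The schema rules for flat recipes can be expressed as a small check. The sketch below is an assumption-laden illustration, not the project's validator: validate_flat_recipe is a hypothetical name, the maxent flag stands in for however the real code distinguishes baseline from MaxEnt runs, and returning a list of messages (rather than raising) is a choice made here for readability.

```python
MAXENT_OBJECTIVES = {"maxent_entropy", "maxent_listwise"}
SECTION_KEYS = {"script", "training", "model"}


def validate_flat_recipe(recipe, maxent):
    """Return a list of rule violations for a flat recipe mapping."""
    errors = []

    # Recipes with top-level script/training/model keys are not flat: skip.
    if SECTION_KEYS & set(recipe):
        return errors

    if not maxent:
        if recipe.get("beta") is None:
            errors.append("baseline recipes must set beta")
        return errors

    objective = recipe.get("objective")
    if objective in MAXENT_OBJECTIVES:
        return errors

    # Exception: GRPO + entropy bonus with a strictly positive coefficient.
    bonus = float(recipe.get("policy_entropy_bonus_coef") or 0.0)
    if objective == "grpo_entropy_bonus" and bonus > 0:
        return errors

    errors.append(
        "MaxEnt recipes must use objective maxent_entropy or maxent_listwise, "
        "or grpo_entropy_bonus with policy_entropy_bonus_coef > 0"
    )
    return errors
```

Note that beta: 0.0 counts as set in this sketch; only a missing or null beta is reported.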

Dataset Mixtures¶

To blend multiple datasets, set dataset_mixture instead of dataset_name:

dataset_mixture:
  seed: 42
  test_split_size: 0.02
  datasets:
    - id: open-r1/OpenR1-Math-220k
      split: train
      columns: [problem, answer]
      weight: 1.0
    - id: some/other-dataset
      split: train
      weight: 0.5

Each dataset can define a split, optional columns, and a sampling weight. The mixture loader validates column consistency and can carve out a test split on the combined dataset.
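
As a rough illustration of the per-dataset fields and the column-consistency check, the sketch below parses the datasets list into small records. It does not load any data and is not the project's mixture loader: MixtureEntry and parse_mixture are hypothetical names, the defaults (split "train", weight 1.0) are assumptions, and treating entries without columns as compatible with any other entry is a guess about the real rule. How weight is applied during sampling is not spelled out here, so the sketch only checks that it is positive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MixtureEntry:
    id: str
    split: str = "train"             # assumed default
    columns: Optional[list] = None   # optional per the text above
    weight: float = 1.0              # assumed default


def parse_mixture(mixture: dict) -> list:
    """Parse dataset_mixture['datasets'] and check declared columns agree."""
    entries = [MixtureEntry(**item) for item in mixture["datasets"]]

    # Compare only entries that declare columns explicitly.
    declared = {tuple(e.columns) for e in entries if e.columns is not None}
    if len(declared) > 1:
        raise ValueError(f"inconsistent columns across datasets: {sorted(declared)}")

    for entry in entries:
        if entry.weight <= 0:
            raise ValueError(f"weight must be positive for {entry.id}")
    return entries


mixture = {
    "seed": 42,
    "test_split_size": 0.02,
    "datasets": [
        {"id": "open-r1/OpenR1-Math-220k", "split": "train",
         "columns": ["problem", "answer"], "weight": 1.0},
        {"id": "some/other-dataset", "split": "train", "weight": 0.5},
    ],
}
entries = parse_mixture(mixture)
```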

Math GRPO (Qwen 1.5B)¶

# Copyright 2025 Liv d'Aliberti
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Model arguments
model_name_or_path: Qwen/Qwen2.5-Math-1.5B
reference_model_name_or_path: Qwen/Qwen2.5-Math-1.5B
model_revision: main
resume_from_checkpoint: false
torch_dtype: float16
attn_implementation: sdpa

# Data training arguments
dataset_name: axon-rl/MATH-lvl3to5-8k
dataset_config: default
dataset_prompt_column: problem
dataset_solution_column: answer
eval_dataset_name: aime24,amc,math_500,minerva,olympiad_bench
eval_dataset_split: test
eval_dataset_prompt_column: problem
eval_dataset_solution_column: answer
disable_distributed_sampler: true
prompt_template: "no"
system_prompt: null
chat_template: null

# Dr.GRPO-style training/eval settings
bf16: true
fp16: false
use_vllm: true
vllm_mode: colocate
torch_compile: false
vllm_return_logprobs: true
vllm_request_logprobs: true
maxent_reference_logprobs_source: model
maxent_trl_reference_scoring: true
behavior_logprobs_source: model
# Score rollout batches in smaller slices to reduce peak logits memory without
# changing the exact loss computation.
maxent_logprob_chunk_size: 2
policy_entropy_bonus_coef: 0.0
maxent_length_normalize_ref: false
maxent_length_normalize_policy: false
maxent_policy_entropy: false
vllm_sync_weights: true
vllm_sync_interval_steps: 1
vllm_gpu_memory_utilization: 0.35
do_eval: false
gradient_accumulation_steps: 16
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
objective: grpo
grpo_loss_type: dr_grpo
num_iterations: 1
gen_temperature: 1.0
gen_top_p: 1.0
gen_top_k: -1
vllm_top_k: -1
gen_best_of: 1
vllm_request_timeout: 600
hub_model_id: null
hub_strategy: "end"
learning_rate: 1e-6
lr_scheduler_type: constant
optim: adamw_torch
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 1e-08
weight_decay: 0.0
log_completions: true
log_level: 20
logging_first_step: true
logging_steps: 1
logging_strategy: steps
log_like_grpo: true
max_prompt_length: 1024
max_completion_length: 3000
max_steps: -1
num_generations: 8
num_train_epochs: 20
output_dir: var/data/Qwen2.5-1.5B-Open-R1-GRPO-BASELINE-math-v1
overwrite_output_dir: true
# Local batches are flattened over rollouts; 8 keeps one prompt-group per device.
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
dataloader_num_workers: 4
dataloader_pin_memory: true
dataloader_prefetch_factor: 2
dataloader_persistent_workers: true
push_to_hub: false
report_to:
- wandb
wandb_project: huggingface
wandb_entity: ogd3-princeton-university
reward_funcs:
- seed_paper_boxed_accuracy_math
reward_weights:
- 1
eval_reward_funcs:
- seed_paper_boxed_accuracy_math
eval_reward_weights:
- 1
scale_rewards: false
beta: 0.0
kl_target: 0.0
kl_horizon: 0
kl_ctl_step_size: 0.0
grpo_beta_controller_enabled: false
maxent_beta_controller_enabled: false
max_grad_norm: 1.0
ppo_clip_range: 0.2
clip_range: 0.2
clip_range_high: 0.2
save_strategy: "no"
save_steps: 1
# Built-in trainer eval is disabled for these math runs; official paper eval is
# scheduled directly by SeedPaperEvalCallback on start and every eval_steps.
eval_strategy: "no"
eval_steps: 10
seed_paper_eval_template: "no"
save_total_limit: 1000
seed: 42
warmup_ratio: 0.0
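
The batch-size comment in the recipe ("8 keeps one prompt-group per device") implies some simple arithmetic, worked through below. This is a sketch under stated assumptions: it takes the comment at face value (local batches count completions, not prompts), it assumes the local batch must be a multiple of num_generations, and world_size is a hypothetical parameter because the recipe does not state how many devices are used.

```python
def rollout_budget(per_device_train_batch_size, num_generations,
                   gradient_accumulation_steps, world_size=1):
    """Count prompt groups and completions per optimizer update."""
    if per_device_train_batch_size % num_generations:
        raise ValueError("local batch should be a multiple of num_generations")

    # Each prompt group holds num_generations completions of one prompt.
    groups_per_micro_batch = per_device_train_batch_size // num_generations
    groups_per_update = (
        groups_per_micro_batch * gradient_accumulation_steps * world_size
    )
    return {
        "prompt_groups_per_micro_batch": groups_per_micro_batch,
        "prompt_groups_per_update": groups_per_update,
        "completions_per_update": groups_per_update * num_generations,
    }


# Values from the recipe above, on a single device.
budget = rollout_budget(
    per_device_train_batch_size=8,
    num_generations=8,
    gradient_accumulation_steps=16,
)
print(budget)
```

With the recipe's values and one device this gives 1 prompt group per micro-batch, 16 prompt groups per optimizer update, and 128 completions per update; multiply the last two by the device count for multi-GPU runs.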

Paired Recipes (GRPO vs MaxEnt)¶

For reproducible comparisons, each model family ships paired GRPO and MaxEnt recipes that keep sampling, optimizer, and evaluation settings aligned. Use the GRPO recipe under grpo/ and the MaxEnt counterpart under maxent-grpo/:

GRPO: configs/recipes/<model>/grpo/config_math.yaml
MaxEnt: configs/recipes/<model>/maxent-grpo/config_math.yaml
Coding GRPO: configs/recipes/<model>/grpo/config_code_mbpp.yaml
Coding MaxEnt: configs/recipes/<model>/maxent-grpo/config_code_mbpp.yaml

Paired GRPO recipes pin maxent_reference_logprobs_source: model so that both objectives use a frozen reference model as the KL anchor.

The paired Qwen 0.5B/1.5B maxent-grpo recipes default to trainer-level entropy MaxEnt (objective: maxent_entropy) and keep maxent_policy_entropy_mode: exact so the entropy term uses the correct backpropagated gradient. The frozen reference-model anchor remains matched to GRPO through maxent_reference_logprobs_source: model and the same beta-weighted KL term.
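
One way to confirm that a pair is actually aligned is to diff the two recipes after loading them into mappings. The helper below is a hypothetical sketch, not a tool shipped with the project, and the two inline dictionaries are abbreviated illustrations rather than the contents of the real files.

```python
_MISSING = "<missing>"


def recipe_diff(grpo: dict, maxent: dict) -> dict:
    """Map each key whose value differs to its (grpo, maxent) pair."""
    diff = {}
    for key in sorted(set(grpo) | set(maxent)):
        left = grpo.get(key, _MISSING)
        right = maxent.get(key, _MISSING)
        if left != right:
            diff[key] = (left, right)
    return diff


# Abbreviated, illustrative recipe contents.
grpo_recipe = {
    "objective": "grpo",
    "learning_rate": 1e-6,
    "num_generations": 8,
    "maxent_reference_logprobs_source": "model",
}
maxent_recipe = {
    "objective": "maxent_entropy",
    "maxent_policy_entropy_mode": "exact",
    "learning_rate": 1e-6,
    "num_generations": 8,
    "maxent_reference_logprobs_source": "model",
}
diff = recipe_diff(grpo_recipe, maxent_recipe)
```

For a well-matched pair, the diff should contain only objective-specific keys; any optimizer or sampling key that shows up is worth a second look.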

The older Qwen 7B maxent-grpo math recipe still uses the GRPO + entropy-bonus path. The trainer-level MaxEnt variants are selected through objective:

objective: maxent_entropy enables token-entropy regularization via maxent_alpha.
objective: maxent_listwise enables the tau/q/beta listwise weighting objective.

Tips¶

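Adjust num_generations and max_completion_length to trade off speed against rollout diversity: smaller values run faster, larger values sample more (and longer) completions per prompt.
Set hub_model_id to a repository in your own Hub namespace.
Toggle use_vllm depending on whether vLLM is available for generation in your setup.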

Hydra Recipes¶

Baseline: configs/recipes/hydra/baseline_math.yaml
GRPO (paired parity): configs/recipes/hydra/grpo_custom_math.yaml
MaxEnt-GRPO: configs/recipes/hydra/maxent_math.yaml
Coding baseline (MBPP): configs/recipes/hydra/baseline_code_mbpp.yaml
Coding GRPO parity (MBPP): configs/recipes/hydra/grpo_custom_code_mbpp.yaml
Coding MaxEnt (MBPP): configs/recipes/hydra/maxent_code_mbpp.yaml

Hydra configs bundle command=... with a recipe path and optional overrides under baseline / maxent. They are a convenient way to share fully specified CLI runs without long command lines.

For a clean four-way math comparison on the shared 1.5B base recipe, use:

GRPO parity: configs/recipes/hydra/grpo_custom_math.yaml
Entropy MaxEnt: configs/recipes/hydra/maxent_entropy_math.yaml
Listwise MaxEnt: configs/recipes/hydra/maxent_listwise_math.yaml
SEED-GRPO: configs/recipes/hydra/seed_grpo_math.yaml

See Method Identity for the exact mapping from these presets to algorithm family and loss backend.