CLI Usage¶

The project exposes a single Hydra CLI surface focused on training:

maxent-grpo: top-level CLI (set command=... explicitly).
maxent-grpo-baseline: convenience wrapper for command=train-baseline.

Command Routing¶

Supported commands:

train-baseline: baseline GRPO training.
train-maxent: MaxEnt-GRPO training.

Recipes and Overrides¶

Training commands can load YAML recipes via:

$GRPO_RECIPE (environment variable), or
baseline.recipe=... / maxent.recipe=... command fields.

After loading a recipe, overrides are applied from command-specific script/training/model sections.

GRPO_RECIPE=configs/recipes/Qwen2.5-1.5B-Instruct/grpo/config_math.yaml \
  maxent-grpo-baseline baseline.training.output_dir=var/data/out

maxent-grpo command=train-maxent \
  maxent.recipe=configs/recipes/Qwen2.5-1.5B-Instruct/maxent-grpo/config_math.yaml \
  maxent.training.maxent_tau=0.2

Coding pipeline example (MBPP + test-based reward):

maxent-grpo-baseline \
  baseline.recipe=configs/recipes/Qwen2.5-0.5B-Instruct/grpo/config_code_mbpp.yaml

Validation¶

Before launch, maxent_grpo.cli.config_validation ensures MaxEnt overrides are only used with objective=maxent_entropy or objective=maxent_listwise except for GRPO + entropy-bonus runs where objective=grpo_entropy_bonus and policy_entropy_bonus_coef>0.

Examples¶

Hydra recipe presets live under configs/recipes/hydra/. For custom-loop GRPO parity runs, use configs/recipes/hydra/grpo_custom_math.yaml. For explicit trainer-level MaxEnt variants, use configs/recipes/hydra/maxent_entropy_math.yaml or configs/recipes/hydra/maxent_listwise_math.yaml.