CLI Usage

The project exposes a single Hydra-based CLI surface, focused on training, through two entry points:

  • maxent-grpo: top-level CLI (set command=... explicitly).

  • maxent-grpo-baseline: convenience wrapper for command=train-baseline.

Command Routing

Supported commands:

  • train-baseline: baseline GRPO training.

  • train-maxent: MaxEnt-GRPO training.
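
As a rough illustration of how a command= value could be routed to one of the two supported commands, here is a minimal sketch. Only the command names come from the list above. The handler functions, the COMMANDS table, and route() are hypothetical and are not the project's actual implementation.

```python
from typing import Callable, Dict


# Hypothetical handlers: stand-ins for the real training entry points,
# which this document does not show.
def run_baseline() -> str:
    return "baseline GRPO training"


def run_maxent() -> str:
    return "MaxEnt-GRPO training"


# The keys are the supported command names; the mapping itself is illustrative.
COMMANDS: Dict[str, Callable[[], str]] = {
    "train-baseline": run_baseline,
    "train-maxent": run_maxent,
}


def route(command: str) -> str:
    """Dispatch a command name to its handler and reject unknown names."""
    handler = COMMANDS.get(command)
    if handler is None:
        supported = ", ".join(sorted(COMMANDS))
        raise ValueError(
            f"unsupported command {command!r}; expected one of: {supported}"
        )
    return handler()
```

Under this sketch, maxent-grpo-baseline would behave like a call to route("train-baseline") with the command already filled in.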

Recipes and Overrides

Training commands can load YAML recipes via:

  • $GRPO_RECIPE (environment variable), or

  • baseline.recipe=... / maxent.recipe=... command fields.

After a recipe is loaded, overrides from the command-specific script, training, and model sections are applied on top of it, for example baseline.training.output_dir or maxent.training.maxent_tau:

GRPO_RECIPE=configs/recipes/Qwen2.5-1.5B-Instruct/grpo/config_math.yaml \
  maxent-grpo-baseline baseline.training.output_dir=var/data/out

maxent-grpo command=train-maxent \
  maxent.recipe=configs/recipes/Qwen2.5-1.5B-Instruct/maxent-grpo/config_math.yaml \
  maxent.training.maxent_tau=0.2
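
The ordering described above (recipe first, then section overrides) can be sketched as a simple merge. This is an illustration only: it assumes the recipe behaves like a flat dictionary, and apply_overrides and the sample values are hypothetical, not the project's loader.

```python
from typing import Any, Dict


def apply_overrides(
    recipe: Dict[str, Any],
    sections: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Return a copy of the recipe with section overrides applied on top.

    Assumption: recipe values are flat key/value pairs, and the three
    sections named in the text (script, training, model) may each override them.
    """
    merged = dict(recipe)  # copy so the loaded recipe is left untouched
    for name in ("script", "training", "model"):
        merged.update(sections.get(name, {}))
    return merged


# Made-up recipe contents, standing in for a loaded YAML file.
recipe = {"maxent_tau": 0.1, "output_dir": "var/data/default"}

# Equivalent in spirit to passing maxent.training.maxent_tau=0.2 on the command line.
overrides = {"training": {"maxent_tau": 0.2}}

result = apply_overrides(recipe, overrides)
```

Here result carries maxent_tau=0.2 from the override and keeps output_dir from the recipe, which matches the "recipe, then overrides" order the text describes.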

Coding pipeline example (MBPP + test-based reward):

maxent-grpo-baseline \
  baseline.recipe=configs/recipes/Qwen2.5-0.5B-Instruct/grpo/config_code_mbpp.yaml

Validation

Before launch, maxent_grpo.cli.config_validation ensures that MaxEnt overrides are used only with objective=maxent_entropy or objective=maxent_listwise. The one exception is GRPO + entropy-bonus runs, where objective=grpo_entropy_bonus and policy_entropy_bonus_coef>0.
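
The rule can be read as a small predicate over the training config. The sketch below is one possible reading, not the code in maxent_grpo.cli.config_validation. The function names, the dictionary-shaped config, and the idea that the caller passes in the names of the MaxEnt overrides it found are all assumptions made for illustration.

```python
from typing import Any, Mapping, Sequence

# Objective names taken from the text above.
MAXENT_OBJECTIVES = {"maxent_entropy", "maxent_listwise"}


def maxent_overrides_allowed(cfg: Mapping[str, Any]) -> bool:
    """Return True if this config is allowed to carry MaxEnt overrides."""
    objective = cfg.get("objective")
    if objective in MAXENT_OBJECTIVES:
        return True
    if objective == "grpo_entropy_bonus":
        # Exception: GRPO + entropy bonus, but only with a positive coefficient.
        coef = cfg.get("policy_entropy_bonus_coef") or 0.0
        return float(coef) > 0.0
    return False


def validate(cfg: Mapping[str, Any], maxent_overrides: Sequence[str]) -> None:
    """Raise ValueError if MaxEnt overrides are present but not permitted.

    Hypothetical helper: `maxent_overrides` is whatever list of override
    names the caller has classified as MaxEnt-specific.
    """
    if maxent_overrides and not maxent_overrides_allowed(cfg):
        raise ValueError(
            f"MaxEnt overrides {list(maxent_overrides)} require a MaxEnt "
            f"objective, got objective={cfg.get('objective')!r}"
        )
```

In this reading, a config with no MaxEnt overrides always passes, regardless of its objective.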

Examples

Hydra recipe presets live under configs/recipes/hydra/. For custom-loop GRPO parity runs, use configs/recipes/hydra/grpo_custom_math.yaml. For explicit trainer-level MaxEnt variants, use configs/recipes/hydra/maxent_entropy_math.yaml or configs/recipes/hydra/maxent_listwise_math.yaml.