CLI Usage
=========

The project exposes a single Hydra-based CLI for training, with two entry points:

- ``maxent-grpo``: top-level CLI; select the command explicitly with ``command=...``.
- ``maxent-grpo-baseline``: convenience wrapper that runs ``command=train-baseline``.

Command Routing
---------------

Supported commands:

- ``train-baseline``: baseline GRPO training.
- ``train-maxent``: MaxEnt-GRPO training.
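As a rough illustration of how the two entry points relate, the shell function below mimics ``maxent-grpo-baseline``. It assumes the wrapper does nothing beyond pinning ``command=train-baseline`` and forwarding every other Hydra override unchanged. The function name ``maxent_grpo_baseline`` is hypothetical and is not shipped by the project.

.. code-block:: bash

   # Hypothetical stand-in for the maxent-grpo-baseline entry point.
   # Assumption: the wrapper pins command=train-baseline and passes all
   # other Hydra overrides through to maxent-grpo unchanged.
   maxent_grpo_baseline() {
     maxent-grpo command=train-baseline "$@"
   }

   # Under that assumption, these two invocations are equivalent:
   #   maxent-grpo-baseline baseline.training.output_dir=var/data/out
   #   maxent-grpo command=train-baseline baseline.training.output_dir=var/data/out

Running training through the top-level CLI with an explicit ``command=`` keeps scripts uniform across both training modes; the wrapper only saves typing for baseline runs.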

Recipes and Overrides
---------------------

Training commands can load YAML recipes via:

- ``$GRPO_RECIPE`` (environment variable), or
- ``baseline.recipe=...`` / ``maxent.recipe=...`` command fields.

After the recipe is loaded, overrides from the command-specific ``script``,
``training``, and ``model`` sections (for example
``baseline.training.output_dir=...``) are applied on top of it.

.. code-block:: bash

   GRPO_RECIPE=configs/recipes/Qwen2.5-1.5B-Instruct/grpo/config_math.yaml \
     maxent-grpo-baseline baseline.training.output_dir=var/data/out

.. code-block:: bash

   maxent-grpo command=train-maxent \
     maxent.recipe=configs/recipes/Qwen2.5-1.5B-Instruct/maxent-grpo/config_math.yaml \
     maxent.training.maxent_tau=0.2

Coding pipeline example (MBPP with a test-based reward):

.. code-block:: bash

   maxent-grpo-baseline \
     baseline.recipe=configs/recipes/Qwen2.5-0.5B-Instruct/grpo/config_code_mbpp.yaml

Validation
----------

Before launch, ``maxent_grpo.cli.config_validation`` checks that MaxEnt-specific
overrides are used only with ``objective=maxent_entropy`` or
``objective=maxent_listwise``. The one exception is GRPO with an entropy bonus:
MaxEnt overrides are also accepted when ``objective=grpo_entropy_bonus`` and
``policy_entropy_bonus_coef>0``.
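The rule can be restated as a small shell predicate. This is a hypothetical sketch for readability, not the code in ``maxent_grpo.cli.config_validation``; the function name ``maxent_overrides_allowed``, its argument order, and the ``yes``/``no`` flag meaning "MaxEnt-specific overrides were supplied" are all assumptions.

.. code-block:: bash

   # Hypothetical restatement of the launch-time validation rule.
   # $1: objective, $2: policy_entropy_bonus_coef,
   # $3: "yes" if any MaxEnt-specific overrides were supplied, else "no".
   # Returns 0 if the combination is allowed, 1 otherwise.
   maxent_overrides_allowed() {
     local objective="$1"
     local bonus_coef="$2"
     local has_maxent_overrides="$3"

     # Without MaxEnt overrides, this rule has nothing to reject.
     if [ "$has_maxent_overrides" != "yes" ]; then
       return 0
     fi

     case "$objective" in
       maxent_entropy|maxent_listwise)
         return 0
         ;;
       grpo_entropy_bonus)
         # Allowed only when the entropy bonus is enabled (coef > 0).
         # awk performs the floating-point comparison that test(1) cannot.
         awk -v c="$bonus_coef" 'BEGIN { exit !(c + 0 > 0) }'
         return $?
         ;;
       *)
         return 1
         ;;
     esac
   }

For instance, ``maxent_overrides_allowed grpo_entropy_bonus 0 yes`` returns 1, because the entropy bonus is disabled, while the same call with a coefficient of ``0.01`` returns 0.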

Examples
--------

Hydra recipe presets live under ``configs/recipes/hydra/``:

- Custom-loop GRPO parity runs: ``configs/recipes/hydra/grpo_custom_math.yaml``.
- Explicit trainer-level MaxEnt variants:
  ``configs/recipes/hydra/maxent_entropy_math.yaml`` or
  ``configs/recipes/hydra/maxent_listwise_math.yaml``.