maxent_grpo.config.grpo
GRPO-specific configuration dataclasses with MaxEnt extensions.
These classes layer additional benchmarking, telemetry, weighting, and vLLM generation controls on top of TRL’s GRPO configuration so training recipes can be expressed declaratively. The TRL dependency is optional during imports to keep documentation builds and tests lightweight.
License
Copyright 2025 Liv d’Aliberti
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Functions
|
Collapse aliases/custom objects into a string accepted by upstream base config. |
|
Convert config aliases into a value accepted by upstream TrainingArguments. |
|
Resolve a logging level specified as a name or numeric value. |
|
Best-effort resolve of |
Classes
|
GRPO configuration extended for MaxEnt-GRPO experiments. |
|
Script arguments for the GRPO training script. |
- class maxent_grpo.config.grpo.GRPOConfig(benchmarks=<factory>, eval_before_train=False, seed_paper_eval_enabled=False, seed_paper_eval_template=None, seed_paper_eval_tasks=None, seed_paper_eval_workspace_dir=None, seed_paper_eval_results_dir=None, seed_paper_eval_python=None, seed_paper_eval_max_test=999999, seed_paper_eval_vllm_batch_size=32, seed_paper_eval_timeout_s=14400, seed_paper_eval_pass_at_8_enabled=False, seed_paper_eval_pass_at_8_samples=8, seed_paper_eval_pass_at_8_temperature=1.0, seed_paper_eval_pass_at_8_top_p=1.0, seed_paper_eval_fail_on_error=False, final_model_save_enabled=True, callbacks=<factory>, eval_strategy=None, num_generations=8, reward_funcs=<factory>, reward_weights=<factory>, seed_paper_reward_fast=False, chat_template=None, prompt_template='no', hub_model_revision='main', reference_model_name_or_path=None, reference_model_revision=None, maxent_share_reference_model=False, maxent_reference_ema_enabled=True, maxent_reference_ema_beta=0.995, maxent_reference_ema_update_interval=10, maxent_reference_ema_warmup_steps=100, num_completions_to_print=0, overwrite_hub_revision=False, push_to_hub_revision=False, system_prompt=None, wandb_log_unique_prompts=True, wandb_entity=None, wandb_project=None, wandb_run_group=None, log_like_grpo=False, torch_compile=False, init_from_checkpoint=None, resume_from_checkpoint=None, objective='maxent_entropy', maxent_alpha=0.0, maxent_tau=0.0, maxent_q_temperature=1.0, maxent_q_epsilon=1e-06, maxent_length_normalize_ref=True, maxent_length_normalize_policy=False, maxent_logprob_chunk_size=0, maxent_reference_logprobs_source='auto', maxent_trl_reference_scoring=True, behavior_logprobs_source='model', maxent_allow_stale_reference_logprobs=False, maxent_score_tail_tokens=None, maxent_policy_entropy=False, maxent_policy_entropy_mode='exact', policy_entropy_bonus_coef=0.0, maxent_alpha_raise_on_low_kl=False, maxent_alpha_lower_on_high_kl=False, maxent_alpha_kl_threshold=0.04, maxent_alpha_kl_gain=1.0, 
maxent_alpha_kl_max_multiplier=2.0, maxent_alpha_kl_min_multiplier=0.5, maxent_alpha_disable_outside_trust_zone=False, maxent_score_slice_prefetch=0, maxent_prompt_cache_size=10000, maxent_target_weight_entropy=None, maxent_target_weight_entropy_start=None, maxent_target_weight_entropy_final=None, maxent_target_weight_entropy_horizon=0, maxent_tau_lr=0.0, maxent_tau_min=0.0, maxent_tau_max=0.0, maxent_tau_warmup_steps=-1, controller_meta_enabled=False, controller_meta_method='analytic', controller_meta_lr=0.0, controller_meta_tau_lr=0.0, controller_meta_beta_lr=0.0, controller_meta_beta_grad_clip=0.0, controller_meta_update_interval=1, maxent_use_clip_objective=False, maxent_clip_objective_coef=1.0, maxent_clip_adv_baseline=None, controller_overwrite_from_config=False, scale_rewards=True, seed_grpo_enabled=False, seed_grpo_alpha=0.0417, seed_grpo_alpha_normalize_by_max_entropy=True, seed_grpo_length_normalize_logprobs=True, maxent_allow_empty_weight_fallback=False, maxent_listwise_skip_zero_variance_groups=False, maxent_clip_range=None, clip_range=0.0, clip_range_high=None, clip_delta=None, grpo_loss_type='bnpo', dr_grpo_denominator_mode='fixed_max', missing_boxed_answer_penalty=-0.05, beta=None, kl_target=0.0, kl_horizon=0, kl_ctl_step_size=0.0, grpo_beta_controller_enabled=False, maxent_beta_controller_enabled=False, gen_temperature=0.8, gen_top_p=0.9, greedy_eval_enabled=False, eval_greedy_only_enabled=False, truncate_completions_at_first_boxed_answer=False, vllm_mode='server', vllm_url='http://localhost:8000/generate', vllm_max_completion_rounds=0, vllm_retry_sleep=0.5, vllm_backfill_with_model=True, vllm_return_logprobs=True, vllm_force_logprobs=False, vllm_logprob_fail_after=None, vllm_logprob_fallback=None, vllm_client_tag_fail_fast=None, vllm_sync_interval_steps=1, vllm_sync_weights=False, vllm_best_of=None, vllm_frequency_penalty=0.0, vllm_presence_penalty=0.0, vllm_top_k=None, vllm_stop_sequences=None, vllm_include_stop_str_in_output=None, 
zero_truncated_completion_rewards=False, dataloader_num_workers=0, dataloader_pin_memory=None, dataloader_prefetch_factor=None, dataloader_persistent_workers=None, disable_distributed_sampler=False, log_level='warning', log_completions=False, rich_log_completions=False, rich_log_completions_key='rich_completions', rich_log_completions_to_wandb=False, rich_log_completions_synchronize_ranks=True)
Bases: GRPOConfig
GRPO configuration extended for MaxEnt-GRPO experiments.
Adds logging hooks, benchmark orchestration, weighting controls, and vLLM integration options on top of trl.GRPOConfig.
- Variables:
benchmarks – Benchmarks to run after training completes.
callbacks – Callback identifiers executed during training.
chat_template – Optional legacy chat template string used to render prompts.
prompt_template – Training-side prompt template selector. Supports the official SEED-GRPO templates qwen_math, no, and r1.
hub_model_revision – Hub model branch to push artifacts to.
num_completions_to_print – Number of completions to print for inspection.
overwrite_hub_revision – Whether to overwrite the destination Hub branch.
push_to_hub_revision – Whether to push training outputs to the Hub.
system_prompt – System prompt injected before training prompts.
wandb_log_unique_prompts – Log unique prompts to W&B runs.
wandb_entity – W&B entity/organization for run tracking.
wandb_project – W&B project name.
wandb_run_group – W&B group used to cluster related runs.
objective – Canonical objective selector. grpo keeps the native GRPO loss, grpo_entropy_bonus adds a reward-side entropy bonus, maxent_entropy applies entropy regularization in the loss, and maxent_listwise uses tau/q/beta listwise weighting.
maxent_alpha – Coefficient for the entropy-regularized MaxEnt loss.
maxent_tau – Sequence-level entropy weight used by MaxEnt-GRPO.
maxent_q_temperature – Temperature applied when forming listwise q values.
maxent_q_epsilon – Minimum support added to q before normalization.
maxent_length_normalize_ref – Length-normalize reference log-probs.
maxent_length_normalize_policy – Length-normalize policy/behavior sequence log-probs inside the listwise loss.
maxent_logprob_chunk_size – Mini-batch size when computing log-probs.
maxent_policy_entropy – Whether to compute policy entropy during scoring.
maxent_policy_entropy_mode – Which entropy estimator to use (“exact” or “sample”). sample is only valid for logging or GRPO reward-side entropy bonuses; entropy-regularized MaxEnt loss requires exact.
policy_entropy_bonus_coef – Coefficient applied to per-token policy entropy when adding an entropy bonus to rewards (GRPO + entropy bonus).
maxent_alpha_raise_on_low_kl – When true, allow adaptive MaxEnt alpha increases while measured train KL remains below maxent_alpha_kl_threshold.
maxent_alpha_lower_on_high_kl – When true, allow adaptive MaxEnt alpha decreases while measured train KL remains above maxent_alpha_kl_threshold.
maxent_alpha_kl_threshold – KL boundary used by adaptive alpha control.
maxent_alpha_kl_gain – Gain for KL-based alpha scaling in both directions.
maxent_alpha_kl_max_multiplier – Upper bound on the adaptive alpha multiplier.
maxent_alpha_kl_min_multiplier – Lower bound on the adaptive alpha multiplier.
maxent_alpha_disable_outside_trust_zone – When true, disable the entropy bonus on batches whose measured KL is non-finite or exceeds maxent_alpha_kl_threshold.
maxent_reference_ema_enabled – When true, softly update the frozen TRL reference model from policy weights during training.
maxent_reference_ema_beta – EMA momentum for reference updates. Higher values move the reference model more slowly.
maxent_reference_ema_update_interval – Apply the EMA update every N train steps.
maxent_reference_ema_warmup_steps – Number of train steps to wait before EMA reference updates begin.
behavior_logprobs_source – Source for behavior-policy log-probs used in PPO ratios.
maxent_target_weight_entropy – Target weight entropy for automatic tau tuning.
maxent_target_weight_entropy_start – Optional starting entropy target for annealing.
maxent_target_weight_entropy_final – Optional final entropy target for annealing.
maxent_target_weight_entropy_horizon – Steps to interpolate between start/final targets.
maxent_tau_lr – Learning rate applied during tau adaptation.
maxent_tau_min – Lower bound enforced on tau during tuning.
maxent_tau_max – Upper bound enforced on tau during tuning.
maxent_tau_warmup_steps – Warmup steps before enabling tau adaptation.
maxent_use_clip_objective – Blend a PPO-style clipped objective into the loss.
maxent_clip_objective_coef – Scale for the clipped objective component.
maxent_clip_adv_baseline – Baseline subtracted before clipping.
scale_rewards – Whether to scale GRPO advantages by group std (TRL default).
seed_grpo_enabled – Enable SEED-GRPO semantic-entropy advantage scaling.
seed_grpo_alpha – Base alpha used by the official SEED-GRPO implementation.
seed_grpo_alpha_normalize_by_max_entropy – Divide seed_grpo_alpha by log(num_generations) to match the official SEED-GRPO code path.
seed_grpo_length_normalize_logprobs – Use mean token log-probabilities when computing SEED-GRPO semantic entropy.
maxent_clip_range – Override PPO clip range for the MaxEnt objective.
beta – Optional alias for the KL coefficient used by legacy GRPO/Open-R1 configs. When provided, it is mirrored to init_kl_coeff.
kl_target – Target KL value for automatic beta adjustment.
kl_horizon – Horizon in optimizer steps for the beta controller.
kl_ctl_step_size – Maximum fractional beta change per controller step.
maxent_beta_controller_enabled – Enable adaptive beta updates for MaxEnt objectives. Keep false for a fixed-beta comparison.
clip_range – PPO clip range used for clipping ratios in training loss.
clip_range_high – Upper PPO clip range (epsilon_high) for asymmetric clipping.
clip_delta – Optional additional slack for two-sided clipping.
grpo_loss_type – GRPO loss aggregation (“grpo”, “bnpo”, or “dr_grpo”).
dr_grpo_denominator_mode – Denominator used by Dr.GRPO. fixed_max matches the original batch_size * max_completion_length normalization, while active_tokens normalizes by the realized number of completion tokens.
missing_boxed_answer_penalty – Auxiliary penalty applied by missing_boxed_answer_penalty_math when a completion never emits a boxed final answer. Keep non-positive.
gen_temperature – Temperature used for candidate generation.
gen_top_p – Top-p nucleus sampling used for generation.
greedy_eval_enabled – Run an additional deterministic greedy eval pass alongside the existing sampled eval metrics.
eval_greedy_only_enabled – Replace the sampled in-training eval rollout path with a single greedy completion per prompt. This is much lighter than pass@k eval and is intended for frequent training-time checks.
truncate_completions_at_first_boxed_answer – When true, trim sampled or greedy completions at the first syntactically valid \boxed{...} or \fbox{...} answer before rewards, logging, and loss consume them.
vllm_mode – vLLM backend mode (“server” or “colocate”).
vllm_url – Base URL for the vLLM /generate endpoint.
vllm_max_completion_rounds – Maximum number of retries to top off completions.
vllm_retry_sleep – Seconds to sleep between vLLM retries.
vllm_backfill_with_model – Fall back to local model.generate when vLLM misses completions.
vllm_return_logprobs – Request per-token logprobs from vLLM.
vllm_logprob_fail_after – Consecutive steps with missing vLLM logprobs before aborting (0 disables).
vllm_logprob_fallback – When true, switch reference logprobs to the model after missing vLLM logprobs.
vllm_client_tag_fail_fast – When true, abort vLLM retries immediately on client_tag mismatch.
vllm_sync_interval_steps – Only sync weights every N optimizer steps when using vLLM sync.
vllm_best_of – vLLM best_of parameter forwarded from TRL.
vllm_frequency_penalty – Frequency penalty applied during sampling.
vllm_presence_penalty – Presence penalty applied during sampling.
vllm_top_k – Top-k sampling parameter forwarded to vLLM.
vllm_stop_sequences – Stop sequences for vLLM (JSON list or '||'-delimited string).
vllm_include_stop_str_in_output – Preserve matched stop strings in the returned vLLM text. OAT enables this for the r1 math template so answer-tag grading still sees </answer>.
seed_paper_reward_fast – Use the OAT/SEED “fast” math verifier path for online binary math rewards. When false, the grader also runs the slower math_verify-style fallback.
zero_truncated_completion_rewards – Zero sequence rewards for completions that hit the generation length cap, matching OAT’s math RL actor behavior.
eval_before_train – Run evaluation once before training begins (step 0).
disable_distributed_sampler – Disable the DistributedSampler to avoid double sharding.
dataloader_num_workers – Number of worker processes for the training dataloader.
dataloader_pin_memory – Whether to pin memory in the training dataloader.
dataloader_prefetch_factor – Prefetch factor per worker (only when num_workers > 0).
dataloader_persistent_workers – Keep DataLoader workers alive between epochs.
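The interaction between the maxent_reference_ema_* fields listed above can be pictured with a small sketch. It assumes the conventional update ref <- beta * ref + (1 - beta) * policy, which is consistent with "higher values move the reference model more slowly", and it assumes updates start once the step count reaches the warmup value. Both helper functions are hypothetical names, and the exact boundary conditions in the library may differ.

```python
def ema_should_update(step, *, enabled=True, update_interval=10, warmup_steps=100):
    """Return True when an EMA reference update is due at this train step (assumed rule)."""
    if not enabled or update_interval <= 0:
        return False
    if step < warmup_steps:
        return False  # still inside the warmup window
    return step % update_interval == 0


def ema_update(reference, policy, beta=0.995):
    """Blend policy weights into the reference: ref <- beta * ref + (1 - beta) * policy."""
    return [beta * r + (1.0 - beta) * p for r, p in zip(reference, policy)]


# Toy run with two scalar "weights" and the default field values.
reference = [0.0, 1.0]
policy = [1.0, 1.0]
for step in range(1, 121):
    if ema_should_update(step):
        reference = ema_update(reference, policy)
# Under these assumptions, updates fire at steps 100, 110, and 120.
```

With beta=0.995 each update closes only 0.5% of the gap between reference and policy, so the reference drifts slowly even after warmup ends.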
- Raises:
ValueError – If validation detects negative or inconsistent hyperparameters.
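The kind of validation that leads to this ValueError might resemble the sketch below. `MiniConfig` is a hypothetical four-field subset of the configuration, and the specific checks are assumptions chosen to illustrate "negative or inconsistent" values; the page does not list which fields the real class validates.

```python
from dataclasses import dataclass


@dataclass
class MiniConfig:
    """Hypothetical subset of the configuration, for illustration only."""

    maxent_tau_min: float = 0.0
    maxent_tau_max: float = 0.0
    clip_range: float = 0.0
    missing_boxed_answer_penalty: float = -0.05

    def __post_init__(self):
        # Negative hyperparameters (assumed checks).
        if self.clip_range < 0:
            raise ValueError("clip_range must be non-negative")
        if self.maxent_tau_min < 0:
            raise ValueError("maxent_tau_min must be non-negative")
        # Inconsistent hyperparameters (assumed checks).
        if self.maxent_tau_max < self.maxent_tau_min:
            raise ValueError("maxent_tau_max must be >= maxent_tau_min")
        if self.missing_boxed_answer_penalty > 0:
            raise ValueError("missing_boxed_answer_penalty must be non-positive")


def is_valid(**overrides):
    """Return True when MiniConfig accepts the given overrides."""
    try:
        MiniConfig(**overrides)
    except ValueError:
        return False
    return True
```

Running the checks in `__post_init__` means an invalid recipe fails at construction time rather than partway through a training run.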
- Parameters:
eval_before_train (bool)
seed_paper_eval_enabled (bool)
seed_paper_eval_template (str | None)
seed_paper_eval_tasks (str | None)
seed_paper_eval_workspace_dir (str | None)
seed_paper_eval_results_dir (str | None)
seed_paper_eval_python (str | None)
seed_paper_eval_max_test (int)
seed_paper_eval_vllm_batch_size (int)
seed_paper_eval_timeout_s (int)
seed_paper_eval_pass_at_8_enabled (bool)
seed_paper_eval_pass_at_8_samples (int)
seed_paper_eval_pass_at_8_temperature (float)
seed_paper_eval_pass_at_8_top_p (float)
seed_paper_eval_fail_on_error (bool)
final_model_save_enabled (bool)
eval_strategy (str | None)
num_generations (int)
seed_paper_reward_fast (bool)
chat_template (str | None)
prompt_template (str | None)
hub_model_revision (str | None)
reference_model_name_or_path (str | None)
reference_model_revision (str | None)
maxent_share_reference_model (bool)
maxent_reference_ema_enabled (bool)
maxent_reference_ema_beta (float)
maxent_reference_ema_update_interval (int)
maxent_reference_ema_warmup_steps (int)
num_completions_to_print (int)
overwrite_hub_revision (bool)
push_to_hub_revision (bool)
system_prompt (str | None)
wandb_log_unique_prompts (bool)
wandb_entity (str | None)
wandb_project (str | None)
wandb_run_group (str | None)
log_like_grpo (bool)
torch_compile (bool)
init_from_checkpoint (str | None)
resume_from_checkpoint (str | None)
objective (str)
maxent_alpha (float)
maxent_tau (float)
maxent_q_temperature (float)
maxent_q_epsilon (float)
maxent_length_normalize_ref (bool)
maxent_length_normalize_policy (bool)
maxent_logprob_chunk_size (int)
maxent_reference_logprobs_source (str)
maxent_trl_reference_scoring (bool)
behavior_logprobs_source (str)
maxent_allow_stale_reference_logprobs (bool)
maxent_score_tail_tokens (int | None)
maxent_policy_entropy (bool)
maxent_policy_entropy_mode (str)
policy_entropy_bonus_coef (float)
maxent_alpha_raise_on_low_kl (bool)
maxent_alpha_lower_on_high_kl (bool)
maxent_alpha_kl_threshold (float)
maxent_alpha_kl_gain (float)
maxent_alpha_kl_max_multiplier (float)
maxent_alpha_kl_min_multiplier (float)
maxent_alpha_disable_outside_trust_zone (bool)
maxent_score_slice_prefetch (int)
maxent_prompt_cache_size (int)
maxent_target_weight_entropy (float | None)
maxent_target_weight_entropy_start (float | None)
maxent_target_weight_entropy_final (float | None)
maxent_target_weight_entropy_horizon (int)
maxent_tau_lr (float)
maxent_tau_min (float)
maxent_tau_max (float)
maxent_tau_warmup_steps (int)
controller_meta_enabled (bool)
controller_meta_method (str)
controller_meta_lr (float)
controller_meta_tau_lr (float)
controller_meta_beta_lr (float)
controller_meta_beta_grad_clip (float)
controller_meta_update_interval (int)
maxent_use_clip_objective (bool)
maxent_clip_objective_coef (float)
maxent_clip_adv_baseline (float | None)
controller_overwrite_from_config (bool)
scale_rewards (bool)
seed_grpo_enabled (bool)
seed_grpo_alpha (float)
seed_grpo_alpha_normalize_by_max_entropy (bool)
seed_grpo_length_normalize_logprobs (bool)
maxent_allow_empty_weight_fallback (bool)
maxent_listwise_skip_zero_variance_groups (bool)
maxent_clip_range (float | None)
clip_range (float)
clip_range_high (float | None)
clip_delta (float | None)
grpo_loss_type (str)
dr_grpo_denominator_mode (str)
missing_boxed_answer_penalty (float)
beta (float | None)
kl_target (float)
kl_horizon (int)
kl_ctl_step_size (float)
grpo_beta_controller_enabled (bool)
maxent_beta_controller_enabled (bool)
gen_temperature (float)
gen_top_p (float)
greedy_eval_enabled (bool)
eval_greedy_only_enabled (bool)
truncate_completions_at_first_boxed_answer (bool)
vllm_mode (str)
vllm_url (str | None)
vllm_max_completion_rounds (int)
vllm_retry_sleep (float)
vllm_backfill_with_model (bool)
vllm_return_logprobs (bool)
vllm_force_logprobs (bool)
vllm_logprob_fail_after (int | None)
vllm_logprob_fallback (bool | None)
vllm_client_tag_fail_fast (bool | None)
vllm_sync_interval_steps (int | None)
vllm_sync_weights (bool)
vllm_best_of (int | None)
vllm_frequency_penalty (float)
vllm_presence_penalty (float)
vllm_top_k (int | None)
vllm_stop_sequences (str | None)
vllm_include_stop_str_in_output (bool | None)
zero_truncated_completion_rewards (bool)
dataloader_num_workers (int)
dataloader_pin_memory (bool | None)
dataloader_prefetch_factor (int | None)
dataloader_persistent_workers (bool | None)
disable_distributed_sampler (bool)
log_completions (bool)
rich_log_completions (bool)
rich_log_completions_key (str)
rich_log_completions_to_wandb (bool)
rich_log_completions_synchronize_ranks (bool)
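The two dr_grpo_denominator_mode values described above differ only in what the summed token loss is divided by. The sketch below illustrates that difference on plain Python lists. The function names are hypothetical, "batch size" is taken to be the number of rows in the mask, and the guard against a zero denominator is an assumption rather than documented behaviour.

```python
def dr_grpo_denominator(completion_mask, max_completion_length, mode="fixed_max"):
    """completion_mask: rows of 1 (real completion token) or 0 (padding)."""
    if mode == "fixed_max":
        # batch_size * max_completion_length, independent of realized lengths
        return len(completion_mask) * max_completion_length
    if mode == "active_tokens":
        # realized number of completion tokens; floor of 1 is an assumed guard
        return max(sum(sum(row) for row in completion_mask), 1)
    raise ValueError(f"unknown dr_grpo_denominator_mode: {mode!r}")


def dr_grpo_loss(per_token_loss, completion_mask, max_completion_length, mode="fixed_max"):
    """Sum the masked per-token losses and divide by the selected denominator."""
    total = sum(
        loss * keep
        for loss_row, mask_row in zip(per_token_loss, completion_mask)
        for loss, keep in zip(loss_row, mask_row)
    )
    return total / dr_grpo_denominator(completion_mask, max_completion_length, mode)


# Two completions padded to length 4, with 2 and 3 real tokens respectively.
mask = [[1, 1, 0, 0], [1, 1, 1, 0]]
loss = [[1.0] * 4, [1.0] * 4]
fixed = dr_grpo_loss(loss, mask, 4, "fixed_max")       # 5 / 8
active = dr_grpo_loss(loss, mask, 4, "active_tokens")  # 5 / 5
```

Because fixed_max divides by a constant, batches of short completions contribute a smaller loss magnitude, whereas active_tokens rescales every batch to a per-token average.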
- class maxent_grpo.config.grpo.GRPOScriptArguments(dataset_name=None, dataset_mixture=None, dataset_config=None, eval_reward_funcs=<factory>, eval_reward_weights=<factory>, cosine_min_value_wrong=0.0, cosine_max_value_wrong=-0.5, cosine_min_value_correct=0.5, cosine_max_value_correct=1.0, cosine_max_len=1000, repetition_n_grams=3, repetition_max_penalty=-1.0, dataset_prompt_column='problem', dataset_solution_column='answer', eval_dataset_name=None, eval_dataset_config=None, eval_dataset_split='validation', eval_dataset_prompt_column=None, eval_dataset_solution_column=None, max_completion_len=16384, soft_punish_cache=4096, span_kl_target=0.05, span_kl_beta0=0.12, span_kl_horizon=10000)
Bases: ScriptArguments
Script arguments for the GRPO training script.
Extends ScriptArguments with reward, dataset, and evaluation knobs used by MaxEnt-GRPO training pipelines.
- Variables:
cosine_min_value_wrong – Minimum reward when the answer is wrong.
cosine_max_value_wrong – Maximum reward when the answer is wrong.
cosine_min_value_correct – Minimum reward for correct answers.
cosine_max_value_correct – Maximum reward for correct answers.
cosine_max_len – Maximum length considered when scaling cosine reward.
repetition_n_grams – N-gram size for repetition penalty rewards.
repetition_max_penalty – Maximum negative penalty for repetition rewards.
dataset_prompt_column – Column used as prompts during training.
dataset_solution_column – Column containing the reference solution.
eval_dataset_name – Dataset to use for evaluation when different from training.
eval_dataset_config – Config name for the evaluation dataset.
eval_dataset_split – Split to read from the evaluation dataset.
eval_dataset_prompt_column – Prompt column for the evaluation dataset.
eval_dataset_solution_column – Solution column for the evaluation dataset.
max_completion_len – Maximum completion length in characters.
soft_punish_cache – Minimum completion length before applying a soft penalty.
span_kl_target – Per-token KL target used by the span KL controller.
span_kl_beta0 – Initial KL coefficient for span KL regularization.
span_kl_horizon – Horizon (steps) for the span KL controller.
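The cosine_* fields above parameterize a length-dependent reward. One common shape for such a reward is a half-cosine interpolation between two endpoint values, sketched below. The function name and the `value_short`/`value_long` arguments are hypothetical: this page does not state how the min/max fields map onto short and long completions, so the mapping used in the example calls is an assumption.

```python
import math


def cosine_length_reward(length, max_len, value_short, value_long):
    """Interpolate from value_short (length 0) to value_long (length >= max_len)."""
    progress = min(max(length, 0), max_len) / max_len  # clamp to [0, 1]
    cosine = math.cos(progress * math.pi)              # 1 at length 0, -1 at max_len
    return value_long + 0.5 * (value_short - value_long) * (1.0 + cosine)


# Assumed mapping for a correct answer: shortest completions earn
# cosine_max_value_correct (1.0), longest earn cosine_min_value_correct (0.5).
short_correct = cosine_length_reward(0, 1000, value_short=1.0, value_long=0.5)
mid_correct = cosine_length_reward(500, 1000, value_short=1.0, value_long=0.5)
long_correct = cosine_length_reward(1000, 1000, value_short=1.0, value_long=0.5)
```

The half-cosine keeps the reward nearly flat close to either endpoint and changes fastest around half of cosine_max_len.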
- Parameters:
dataset_name (str | None)
dataset_mixture (DatasetMixtureConfig | None)
dataset_config (str | None)
cosine_min_value_wrong (float)
cosine_max_value_wrong (float)
cosine_min_value_correct (float)
cosine_max_value_correct (float)
cosine_max_len (int)
repetition_n_grams (int)
repetition_max_penalty (float)
dataset_prompt_column (str)
dataset_solution_column (str)
eval_dataset_name (str | None)
eval_dataset_config (str | None)
eval_dataset_split (str)
eval_dataset_prompt_column (str | None)
eval_dataset_solution_column (str | None)
max_completion_len (int)
soft_punish_cache (int)
span_kl_target (float)
span_kl_beta0 (float)
span_kl_horizon (int)
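To make repetition_n_grams and repetition_max_penalty concrete, the sketch below scores a completion by the fraction of its n-grams that are repeats and scales that fraction by the maximum penalty. This is one widely used formulation, shown under assumptions: the function name, whitespace tokenization, and lowercasing are illustrative choices, not confirmed details of the library's reward function.

```python
def repetition_penalty(text, n=3, max_penalty=-1.0):
    """Return a value between max_penalty and 0 that grows in magnitude with n-gram repetition."""
    tokens = text.lower().split()  # assumed tokenization: whitespace words
    if len(tokens) < n:
        return 0.0  # too short to form a single n-gram
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    repeated_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
    return repeated_fraction * max_penalty


no_repeats = repetition_penalty("a b c d e")      # every trigram is unique
some_repeats = repetition_penalty("a b c a b c")  # one of four trigrams repeats
```

A completion with no repeated n-grams receives no penalty, and the penalty approaches repetition_max_penalty as the text degenerates into a loop.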