maxent_grpo.config.grpo

GRPO-specific configuration dataclasses with MaxEnt extensions.

These classes layer additional benchmarking, telemetry, weighting, and vLLM generation controls on top of TRL’s GRPO configuration so training recipes can be expressed declaratively. The TRL dependency is optional during imports to keep documentation builds and tests lightweight.

License

Copyright 2025 Liv d’Aliberti

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Functions

_coerce_eval_strategy_for_base(value)

Collapse aliases and custom objects into a string accepted by the upstream base config.

_normalize_eval_strategy(value)

Convert config aliases into a value accepted by upstream TrainingArguments.

_parse_log_level(value)

Resolve a logging level specified as a name or numeric value.

_resolve_interval_strategy_cls()

Best-effort resolution of transformers.training_args.IntervalStrategy.
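
The alias handling described for _normalize_eval_strategy can be pictured with a minimal sketch. It assumes the upstream targets are the three values that transformers' IntervalStrategy defines (no, steps, epoch). The alias table and the function name normalize_eval_strategy are hypothetical and are not taken from this module.

```python
from enum import Enum
from typing import Any

# The three string values defined by transformers' IntervalStrategy.
_UPSTREAM_VALUES = {"no", "steps", "epoch"}

# Hypothetical alias table; the real module may accept a different set.
_ALIASES = {"none": "no", "off": "no", "false": "no", "step": "steps", "epochs": "epoch"}


def normalize_eval_strategy(value: Any) -> str:
    """Map an alias, enum member, or None onto 'no', 'steps', or 'epoch'."""
    if value is None or value is False:
        return "no"
    if isinstance(value, Enum):
        value = value.value  # enum members carry the upstream string
    text = str(value).strip().lower()
    text = _ALIASES.get(text, text)
    if text not in _UPSTREAM_VALUES:
        raise ValueError(f"Unsupported eval strategy: {value!r}")
    return text
```

Under these assumptions, normalize_eval_strategy("Epochs") returns 'epoch' and normalize_eval_strategy(None) returns 'no'.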

Classes

GRPOConfig([benchmarks, eval_before_train, ...])

GRPO configuration extended for MaxEnt-GRPO experiments.

GRPOScriptArguments([dataset_name, ...])

Script arguments for the GRPO training script.

class maxent_grpo.config.grpo.GRPOConfig(benchmarks=<factory>, eval_before_train=False, seed_paper_eval_enabled=False, seed_paper_eval_template=None, seed_paper_eval_tasks=None, seed_paper_eval_workspace_dir=None, seed_paper_eval_results_dir=None, seed_paper_eval_python=None, seed_paper_eval_max_test=999999, seed_paper_eval_vllm_batch_size=32, seed_paper_eval_timeout_s=14400, seed_paper_eval_pass_at_8_enabled=False, seed_paper_eval_pass_at_8_samples=8, seed_paper_eval_pass_at_8_temperature=1.0, seed_paper_eval_pass_at_8_top_p=1.0, seed_paper_eval_fail_on_error=False, final_model_save_enabled=True, callbacks=<factory>, eval_strategy=None, num_generations=8, reward_funcs=<factory>, reward_weights=<factory>, seed_paper_reward_fast=False, chat_template=None, prompt_template='no', hub_model_revision='main', reference_model_name_or_path=None, reference_model_revision=None, maxent_share_reference_model=False, maxent_reference_ema_enabled=True, maxent_reference_ema_beta=0.995, maxent_reference_ema_update_interval=10, maxent_reference_ema_warmup_steps=100, num_completions_to_print=0, overwrite_hub_revision=False, push_to_hub_revision=False, system_prompt=None, wandb_log_unique_prompts=True, wandb_entity=None, wandb_project=None, wandb_run_group=None, log_like_grpo=False, torch_compile=False, init_from_checkpoint=None, resume_from_checkpoint=None, objective='maxent_entropy', maxent_alpha=0.0, maxent_tau=0.0, maxent_q_temperature=1.0, maxent_q_epsilon=1e-06, maxent_length_normalize_ref=True, maxent_length_normalize_policy=False, maxent_logprob_chunk_size=0, maxent_reference_logprobs_source='auto', maxent_trl_reference_scoring=True, behavior_logprobs_source='model', maxent_allow_stale_reference_logprobs=False, maxent_score_tail_tokens=None, maxent_policy_entropy=False, maxent_policy_entropy_mode='exact', policy_entropy_bonus_coef=0.0, maxent_alpha_raise_on_low_kl=False, maxent_alpha_lower_on_high_kl=False, maxent_alpha_kl_threshold=0.04, maxent_alpha_kl_gain=1.0, 
maxent_alpha_kl_max_multiplier=2.0, maxent_alpha_kl_min_multiplier=0.5, maxent_alpha_disable_outside_trust_zone=False, maxent_score_slice_prefetch=0, maxent_prompt_cache_size=10000, maxent_target_weight_entropy=None, maxent_target_weight_entropy_start=None, maxent_target_weight_entropy_final=None, maxent_target_weight_entropy_horizon=0, maxent_tau_lr=0.0, maxent_tau_min=0.0, maxent_tau_max=0.0, maxent_tau_warmup_steps=-1, controller_meta_enabled=False, controller_meta_method='analytic', controller_meta_lr=0.0, controller_meta_tau_lr=0.0, controller_meta_beta_lr=0.0, controller_meta_beta_grad_clip=0.0, controller_meta_update_interval=1, maxent_use_clip_objective=False, maxent_clip_objective_coef=1.0, maxent_clip_adv_baseline=None, controller_overwrite_from_config=False, scale_rewards=True, seed_grpo_enabled=False, seed_grpo_alpha=0.0417, seed_grpo_alpha_normalize_by_max_entropy=True, seed_grpo_length_normalize_logprobs=True, maxent_allow_empty_weight_fallback=False, maxent_listwise_skip_zero_variance_groups=False, maxent_clip_range=None, clip_range=0.0, clip_range_high=None, clip_delta=None, grpo_loss_type='bnpo', dr_grpo_denominator_mode='fixed_max', missing_boxed_answer_penalty=-0.05, beta=None, kl_target=0.0, kl_horizon=0, kl_ctl_step_size=0.0, grpo_beta_controller_enabled=False, maxent_beta_controller_enabled=False, gen_temperature=0.8, gen_top_p=0.9, greedy_eval_enabled=False, eval_greedy_only_enabled=False, truncate_completions_at_first_boxed_answer=False, vllm_mode='server', vllm_url='http://localhost:8000/generate', vllm_max_completion_rounds=0, vllm_retry_sleep=0.5, vllm_backfill_with_model=True, vllm_return_logprobs=True, vllm_force_logprobs=False, vllm_logprob_fail_after=None, vllm_logprob_fallback=None, vllm_client_tag_fail_fast=None, vllm_sync_interval_steps=1, vllm_sync_weights=False, vllm_best_of=None, vllm_frequency_penalty=0.0, vllm_presence_penalty=0.0, vllm_top_k=None, vllm_stop_sequences=None, vllm_include_stop_str_in_output=None, 
zero_truncated_completion_rewards=False, dataloader_num_workers=0, dataloader_pin_memory=None, dataloader_prefetch_factor=None, dataloader_persistent_workers=None, disable_distributed_sampler=False, log_level='warning', log_completions=False, rich_log_completions=False, rich_log_completions_key='rich_completions', rich_log_completions_to_wandb=False, rich_log_completions_synchronize_ranks=True)

Bases: GRPOConfig

GRPO configuration extended for MaxEnt-GRPO experiments.

Adds logging hooks, benchmark orchestration, weighting controls, and vLLM integration options on top of trl.GRPOConfig.

Variables:
  • benchmarks – Benchmarks to run after training completes.

  • callbacks – Callback identifiers executed during training.

  • chat_template – Optional legacy chat template string used to render prompts.

  • prompt_template – Training-side prompt template selector. Supports the official SEED-GRPO templates qwen_math, no, and r1.

  • hub_model_revision – Hub model branch to push artifacts to.

  • num_completions_to_print – Number of completions to print for inspection.

  • overwrite_hub_revision – Whether to overwrite the destination Hub branch.

  • push_to_hub_revision – Whether to push training outputs to the Hub.

  • system_prompt – System prompt injected before training prompts.

  • wandb_log_unique_prompts – Log unique prompts to W&B runs.

  • wandb_entity – W&B entity/organization for run tracking.

  • wandb_project – W&B project name.

  • wandb_run_group – W&B group used to cluster related runs.

  • objective – Canonical objective selector. grpo keeps the native GRPO loss, grpo_entropy_bonus adds a reward-side entropy bonus, maxent_entropy applies entropy regularization in the loss, and maxent_listwise uses tau/q/beta listwise weighting.

  • maxent_alpha – Coefficient for the entropy-regularized MaxEnt loss.

  • maxent_tau – Sequence-level entropy weight used by MaxEnt-GRPO.

  • maxent_q_temperature – Temperature applied when forming listwise q values.

  • maxent_q_epsilon – Minimum support added to q before normalization.

  • maxent_length_normalize_ref – Length-normalize reference log-probs.

  • maxent_length_normalize_policy – Length-normalize policy/behavior sequence log-probs inside the listwise loss.

  • maxent_logprob_chunk_size – Mini-batch size when computing log-probs.

  • maxent_policy_entropy – Whether to compute policy entropy during scoring.

  • maxent_policy_entropy_mode – Which entropy estimator to use (“exact” or “sample”). sample is only valid for logging or GRPO reward-side entropy bonuses; entropy-regularized MaxEnt loss requires exact.

  • policy_entropy_bonus_coef – Coefficient applied to per-token policy entropy when adding an entropy bonus to rewards (GRPO + entropy bonus).

  • maxent_alpha_raise_on_low_kl – When true, allow adaptive MaxEnt alpha increases while measured train KL remains below maxent_alpha_kl_threshold.

  • maxent_alpha_lower_on_high_kl – When true, allow adaptive MaxEnt alpha decreases while measured train KL remains above maxent_alpha_kl_threshold.

  • maxent_alpha_kl_threshold – KL boundary used by adaptive alpha control.

  • maxent_alpha_kl_gain – Gain for KL-based alpha scaling in both directions.

  • maxent_alpha_kl_max_multiplier – Upper bound on the adaptive alpha multiplier.

  • maxent_alpha_kl_min_multiplier – Lower bound on the adaptive alpha multiplier.

  • maxent_alpha_disable_outside_trust_zone – When true, disable the entropy bonus on batches whose measured KL is non-finite or exceeds maxent_alpha_kl_threshold.

  • maxent_reference_ema_enabled – When true, softly update the frozen TRL reference model from policy weights during training.

  • maxent_reference_ema_beta – EMA momentum for reference updates. Higher values move the reference model more slowly.

  • maxent_reference_ema_update_interval – Apply the EMA update every N train steps.

  • maxent_reference_ema_warmup_steps – Number of train steps to wait before EMA reference updates begin.

  • behavior_logprobs_source – Source for behavior-policy log-probs used in PPO ratios.

  • maxent_target_weight_entropy – Target weight entropy for automatic tau tuning.

  • maxent_target_weight_entropy_start – Optional starting entropy target for annealing.

  • maxent_target_weight_entropy_final – Optional final entropy target for annealing.

  • maxent_target_weight_entropy_horizon – Steps to interpolate between start/final targets.

  • maxent_tau_lr – Learning rate applied during tau adaptation.

  • maxent_tau_min – Lower bound enforced on tau during tuning.

  • maxent_tau_max – Upper bound enforced on tau during tuning.

  • maxent_tau_warmup_steps – Warmup steps before enabling tau adaptation.

  • maxent_use_clip_objective – Blend a PPO-style clipped objective into the loss.

  • maxent_clip_objective_coef – Scale for the clipped objective component.

  • maxent_clip_adv_baseline – Baseline subtracted before clipping.

  • scale_rewards – Whether to scale GRPO advantages by group std (TRL default).

  • seed_grpo_enabled – Enable SEED-GRPO semantic-entropy advantage scaling.

  • seed_grpo_alpha – Base alpha used by the official SEED-GRPO implementation.

  • seed_grpo_alpha_normalize_by_max_entropy – Divide seed_grpo_alpha by log(num_generations) to match the official SEED-GRPO code path.

  • seed_grpo_length_normalize_logprobs – Use mean token log-probabilities when computing SEED-GRPO semantic entropy.

  • maxent_clip_range – Override PPO clip range for the MaxEnt objective.

  • beta – Optional alias for the KL coefficient used by legacy GRPO/Open-R1 configs. When provided, it is mirrored to init_kl_coeff.

  • kl_target – Target KL value for automatic beta adjustment.

  • kl_horizon – Horizon in optimizer steps for the beta controller.

  • kl_ctl_step_size – Maximum fractional beta change per controller step.

  • maxent_beta_controller_enabled – Enable adaptive beta updates for MaxEnt objectives. Keep false for a fixed-beta comparison.

  • clip_range – PPO clip range used for clipping ratios in training loss.

  • clip_range_high – Upper PPO clip range (epsilon_high) for asymmetric clipping.

  • clip_delta – Optional additional slack for two-sided clipping.

  • grpo_loss_type – GRPO loss aggregation (“grpo”, “bnpo”, or “dr_grpo”).

  • dr_grpo_denominator_mode – Denominator used by Dr.GRPO. fixed_max matches the original batch_size * max_completion_length normalization, while active_tokens normalizes by the realized number of completion tokens.

  • missing_boxed_answer_penalty – Auxiliary penalty applied by missing_boxed_answer_penalty_math when a completion never emits a boxed final answer. Keep non-positive.

  • gen_temperature – Temperature used for candidate generation.

  • gen_top_p – Top-p (nucleus sampling) threshold used for candidate generation.

  • greedy_eval_enabled – Run an additional deterministic greedy eval pass alongside the existing sampled eval metrics.

  • eval_greedy_only_enabled – Replace the sampled in-training eval rollout path with a single greedy completion per prompt. This is much lighter than pass@k eval and is intended for frequent training-time checks.

  • truncate_completions_at_first_boxed_answer – When true, trim sampled or greedy completions at the first syntactically valid \boxed{...} or \fbox{...} answer before rewards, logging, and loss consume them.

  • vllm_mode – vLLM backend mode (“server” or “colocate”).

  • vllm_url – Base URL for the vLLM /generate endpoint.

  • vllm_max_completion_rounds – Maximum number of retries to top off completions.

  • vllm_retry_sleep – Seconds to sleep between vLLM retries.

  • vllm_backfill_with_model – Fall back to local model.generate when vLLM misses completions.

  • vllm_return_logprobs – Request per-token logprobs from vLLM.

  • vllm_logprob_fail_after – Consecutive steps with missing vLLM logprobs before aborting (0 disables).

  • vllm_logprob_fallback – When true, switch reference logprobs to the model after missing vLLM logprobs.

  • vllm_client_tag_fail_fast – When true, abort vLLM retries immediately on client_tag mismatch.

  • vllm_sync_interval_steps – Only sync weights every N optimizer steps when using vLLM sync.

  • vllm_best_of – vLLM best_of parameter forwarded from TRL.

  • vllm_frequency_penalty – Frequency penalty applied during sampling.

  • vllm_presence_penalty – Presence penalty applied during sampling.

  • vllm_top_k – Top-k sampling parameter forwarded to vLLM.

  • vllm_stop_sequences – Stop sequences for vLLM (JSON list or '||'-delimited string).

  • vllm_include_stop_str_in_output – Preserve matched stop strings in the returned vLLM text. OAT enables this for the r1 math template so answer-tag grading still sees </answer>.

  • seed_paper_reward_fast – Use the OAT/SEED “fast” math verifier path for online binary math rewards. When false, the grader also runs the slower math_verify-style fallback.

  • zero_truncated_completion_rewards – Zero sequence rewards for completions that hit the generation length cap, matching OAT’s math RL actor behavior.

  • eval_before_train – Run evaluation once before training begins (step 0).

  • disable_distributed_sampler – Disable the DistributedSampler to avoid double sharding.

  • dataloader_num_workers – Number of worker processes for the training dataloader.

  • dataloader_pin_memory – Whether to pin memory in the training dataloader.

  • dataloader_prefetch_factor – Prefetch factor per worker (applies only when dataloader_num_workers > 0).

  • dataloader_persistent_workers – Keep DataLoader workers alive between epochs.
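
vllm_stop_sequences is documented above as accepting either a JSON list or a '||'-delimited string. A minimal sketch of such a parser follows. The function name parse_stop_sequences and the fallback to '||' splitting when JSON decoding fails are assumptions, not the library's code.

```python
import json
from typing import List, Optional


def parse_stop_sequences(raw: Optional[str]) -> Optional[List[str]]:
    """Parse a JSON list or a '||'-delimited string into stop strings."""
    if not raw:
        return None
    if raw.lstrip().startswith("["):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None  # not JSON after all; fall through to '||' splitting
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
    # Treat the value as '||'-delimited and drop empty pieces.
    return [piece for piece in raw.split("||") if piece]
```

The delimited branch does not strip whitespace, so a stop string such as a blank line survives intact.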

Raises:

ValueError – If validation detects negative or inconsistent hyperparameters.
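
The kind of check that could raise this ValueError can be sketched with a small stand-in dataclass. TinyConfig and its three rules are hypothetical. Only the non-positive constraint on missing_boxed_answer_penalty is stated in the field descriptions above. The tau ordering rule is an illustrative guess at an "inconsistent" setting.

```python
from dataclasses import dataclass


@dataclass
class TinyConfig:
    """Hypothetical three-field stand-in used to illustrate validation."""

    maxent_tau_min: float = 0.0
    maxent_tau_max: float = 0.0
    missing_boxed_answer_penalty: float = -0.05

    def __post_init__(self) -> None:
        # Negative hyperparameter: rejected outright.
        if self.maxent_tau_min < 0.0:
            raise ValueError("maxent_tau_min must be non-negative")
        # Inconsistent pair: upper bound below lower bound (assumed rule).
        if self.maxent_tau_max < self.maxent_tau_min:
            raise ValueError("maxent_tau_max must be >= maxent_tau_min")
        # Documented above: the penalty should stay non-positive.
        if self.missing_boxed_answer_penalty > 0.0:
            raise ValueError("missing_boxed_answer_penalty must be non-positive")
```

Running the checks in __post_init__ means a bad recipe fails at construction time rather than partway through training.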

Parameters:
  • benchmarks (list[str])

  • eval_before_train (bool)

  • seed_paper_eval_enabled (bool)

  • seed_paper_eval_template (str | None)

  • seed_paper_eval_tasks (str | None)

  • seed_paper_eval_workspace_dir (str | None)

  • seed_paper_eval_results_dir (str | None)

  • seed_paper_eval_python (str | None)

  • seed_paper_eval_max_test (int)

  • seed_paper_eval_vllm_batch_size (int)

  • seed_paper_eval_timeout_s (int)

  • seed_paper_eval_pass_at_8_enabled (bool)

  • seed_paper_eval_pass_at_8_samples (int)

  • seed_paper_eval_pass_at_8_temperature (float)

  • seed_paper_eval_pass_at_8_top_p (float)

  • seed_paper_eval_fail_on_error (bool)

  • final_model_save_enabled (bool)

  • callbacks (list[str])

  • eval_strategy (str | None)

  • num_generations (int)

  • reward_funcs (list[str])

  • reward_weights (list[float])

  • seed_paper_reward_fast (bool)

  • chat_template (str | None)

  • prompt_template (str | None)

  • hub_model_revision (str | None)

  • reference_model_name_or_path (str | None)

  • reference_model_revision (str | None)

  • maxent_share_reference_model (bool)

  • maxent_reference_ema_enabled (bool)

  • maxent_reference_ema_beta (float)

  • maxent_reference_ema_update_interval (int)

  • maxent_reference_ema_warmup_steps (int)

  • num_completions_to_print (int)

  • overwrite_hub_revision (bool)

  • push_to_hub_revision (bool)

  • system_prompt (str | None)

  • wandb_log_unique_prompts (bool)

  • wandb_entity (str | None)

  • wandb_project (str | None)

  • wandb_run_group (str | None)

  • log_like_grpo (bool)

  • torch_compile (bool)

  • init_from_checkpoint (str | None)

  • resume_from_checkpoint (str | None)

  • objective (str)

  • maxent_alpha (float)

  • maxent_tau (float)

  • maxent_q_temperature (float)

  • maxent_q_epsilon (float)

  • maxent_length_normalize_ref (bool)

  • maxent_length_normalize_policy (bool)

  • maxent_logprob_chunk_size (int)

  • maxent_reference_logprobs_source (str)

  • maxent_trl_reference_scoring (bool)

  • behavior_logprobs_source (str)

  • maxent_allow_stale_reference_logprobs (bool)

  • maxent_score_tail_tokens (int | None)

  • maxent_policy_entropy (bool)

  • maxent_policy_entropy_mode (str)

  • policy_entropy_bonus_coef (float)

  • maxent_alpha_raise_on_low_kl (bool)

  • maxent_alpha_lower_on_high_kl (bool)

  • maxent_alpha_kl_threshold (float)

  • maxent_alpha_kl_gain (float)

  • maxent_alpha_kl_max_multiplier (float)

  • maxent_alpha_kl_min_multiplier (float)

  • maxent_alpha_disable_outside_trust_zone (bool)

  • maxent_score_slice_prefetch (int)

  • maxent_prompt_cache_size (int)

  • maxent_target_weight_entropy (float | None)

  • maxent_target_weight_entropy_start (float | None)

  • maxent_target_weight_entropy_final (float | None)

  • maxent_target_weight_entropy_horizon (int)

  • maxent_tau_lr (float)

  • maxent_tau_min (float)

  • maxent_tau_max (float)

  • maxent_tau_warmup_steps (int)

  • controller_meta_enabled (bool)

  • controller_meta_method (str)

  • controller_meta_lr (float)

  • controller_meta_tau_lr (float)

  • controller_meta_beta_lr (float)

  • controller_meta_beta_grad_clip (float)

  • controller_meta_update_interval (int)

  • maxent_use_clip_objective (bool)

  • maxent_clip_objective_coef (float)

  • maxent_clip_adv_baseline (float | None)

  • controller_overwrite_from_config (bool)

  • scale_rewards (bool)

  • seed_grpo_enabled (bool)

  • seed_grpo_alpha (float)

  • seed_grpo_alpha_normalize_by_max_entropy (bool)

  • seed_grpo_length_normalize_logprobs (bool)

  • maxent_allow_empty_weight_fallback (bool)

  • maxent_listwise_skip_zero_variance_groups (bool)

  • maxent_clip_range (float | None)

  • clip_range (float)

  • clip_range_high (float | None)

  • clip_delta (float | None)

  • grpo_loss_type (str)

  • dr_grpo_denominator_mode (str)

  • missing_boxed_answer_penalty (float)

  • beta (float | None)

  • kl_target (float)

  • kl_horizon (int)

  • kl_ctl_step_size (float)

  • grpo_beta_controller_enabled (bool)

  • maxent_beta_controller_enabled (bool)

  • gen_temperature (float)

  • gen_top_p (float)

  • greedy_eval_enabled (bool)

  • eval_greedy_only_enabled (bool)

  • truncate_completions_at_first_boxed_answer (bool)

  • vllm_mode (str)

  • vllm_url (str | None)

  • vllm_max_completion_rounds (int)

  • vllm_retry_sleep (float)

  • vllm_backfill_with_model (bool)

  • vllm_return_logprobs (bool)

  • vllm_force_logprobs (bool)

  • vllm_logprob_fail_after (int | None)

  • vllm_logprob_fallback (bool | None)

  • vllm_client_tag_fail_fast (bool | None)

  • vllm_sync_interval_steps (int | None)

  • vllm_sync_weights (bool)

  • vllm_best_of (int | None)

  • vllm_frequency_penalty (float)

  • vllm_presence_penalty (float)

  • vllm_top_k (int | None)

  • vllm_stop_sequences (str | None)

  • vllm_include_stop_str_in_output (bool | None)

  • zero_truncated_completion_rewards (bool)

  • dataloader_num_workers (int)

  • dataloader_pin_memory (bool | None)

  • dataloader_prefetch_factor (int | None)

  • dataloader_persistent_workers (bool | None)

  • disable_distributed_sampler (bool)

  • log_level (str | int)

  • log_completions (bool)

  • rich_log_completions (bool)

  • rich_log_completions_key (str)

  • rich_log_completions_to_wandb (bool)

  • rich_log_completions_synchronize_ranks (bool)
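
The maxent_reference_ema_* fields describe a soft update of the reference model from the policy weights. A minimal sketch of that schedule follows, using plain lists of floats in place of model tensors. The helper names and the choice to start updating exactly at the warmup step are assumptions. The defaults mirror the documented values (beta 0.995, interval 10, warmup 100).

```python
from typing import List


def should_update(step: int, warmup_steps: int = 100, interval: int = 10) -> bool:
    """True on steps where this sketch would apply an EMA update."""
    return interval > 0 and step >= warmup_steps and step % interval == 0


def ema_update(reference: List[float], policy: List[float], beta: float = 0.995) -> List[float]:
    """Blend policy weights into the reference: ref <- beta * ref + (1 - beta) * policy."""
    return [beta * r + (1.0 - beta) * p for r, p in zip(reference, policy)]
```

With beta close to 1 each update moves the reference only slightly toward the policy, which matches the note that higher values move the reference model more slowly.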

benchmarks: list[str]
eval_before_train: bool = False
seed_paper_eval_enabled: bool = False
seed_paper_eval_template: str | None = None
seed_paper_eval_tasks: str | None = None
seed_paper_eval_workspace_dir: str | None = None
seed_paper_eval_results_dir: str | None = None
seed_paper_eval_python: str | None = None
seed_paper_eval_max_test: int = 999999
seed_paper_eval_vllm_batch_size: int = 32
seed_paper_eval_timeout_s: int = 14400
seed_paper_eval_pass_at_8_enabled: bool = False
seed_paper_eval_pass_at_8_samples: int = 8
seed_paper_eval_pass_at_8_temperature: float = 1.0
seed_paper_eval_pass_at_8_top_p: float = 1.0
seed_paper_eval_fail_on_error: bool = False
final_model_save_enabled: bool = True
callbacks: list[str]
eval_strategy: str | None = None
num_generations: int = 8
reward_funcs: list[str]
reward_weights: list[float]
seed_paper_reward_fast: bool = False
chat_template: str | None = None
prompt_template: str | None = 'no'
hub_model_revision: str | None = 'main'
reference_model_name_or_path: str | None = None
reference_model_revision: str | None = None
maxent_share_reference_model: bool = False
maxent_reference_ema_enabled: bool = True
maxent_reference_ema_beta: float = 0.995
maxent_reference_ema_update_interval: int = 10
maxent_reference_ema_warmup_steps: int = 100
num_completions_to_print: int = 0
overwrite_hub_revision: bool = False
push_to_hub_revision: bool = False
system_prompt: str | None = None
wandb_log_unique_prompts: bool = True
wandb_entity: str | None = None
wandb_project: str | None = None
wandb_run_group: str | None = None
log_like_grpo: bool = False
torch_compile: bool = False
init_from_checkpoint: str | None = None
resume_from_checkpoint: str | None = None
objective: str = 'maxent_entropy'
maxent_alpha: float = 0.0
maxent_objective_variant: str = 'entropy'
maxent_tau: float = 0.0
maxent_q_temperature: float = 1.0
maxent_q_epsilon: float = 1e-06
maxent_length_normalize_ref: bool = True
maxent_length_normalize_policy: bool = False
maxent_logprob_chunk_size: int = 0
maxent_reference_logprobs_source: str = 'auto'
maxent_trl_reference_scoring: bool = True
behavior_logprobs_source: str = 'model'
maxent_allow_stale_reference_logprobs: bool = False
maxent_score_tail_tokens: int | None = None
maxent_policy_entropy: bool = False
maxent_policy_entropy_mode: str = 'exact'
policy_entropy_bonus_coef: float = 0.0
maxent_alpha_raise_on_low_kl: bool = False
maxent_alpha_lower_on_high_kl: bool = False
maxent_alpha_kl_threshold: float = 0.04
maxent_alpha_kl_gain: float = 1.0
maxent_alpha_kl_max_multiplier: float = 2.0
maxent_alpha_kl_min_multiplier: float = 0.5
maxent_alpha_disable_outside_trust_zone: bool = False
maxent_score_slice_prefetch: int = 0
maxent_prompt_cache_size: int = 10000
maxent_target_weight_entropy: float | None = None
maxent_target_weight_entropy_start: float | None = None
maxent_target_weight_entropy_final: float | None = None
maxent_target_weight_entropy_horizon: int = 0
maxent_tau_lr: float = 0.0
maxent_tau_min: float = 0.0
maxent_tau_max: float = 0.0
maxent_tau_warmup_steps: int = -1
controller_meta_enabled: bool = False
controller_meta_method: str = 'analytic'
controller_meta_lr: float = 0.0
controller_meta_tau_lr: float = 0.0
controller_meta_beta_lr: float = 0.0
controller_meta_beta_grad_clip: float = 0.0
controller_meta_update_interval: int = 1
controller_meta_objective: str = 'potential'
controller_meta_analytic_steps: int = 1
controller_meta_optimizer: str = 'sgd'
controller_meta_truncation_steps: int = 1
controller_meta_use_hessian: bool = False
maxent_use_clip_objective: bool = False
maxent_clip_objective_coef: float = 1.0
maxent_clip_adv_baseline: float | None = None
controller_overwrite_from_config: bool = False
train_grpo_objective: bool = False
scale_rewards: bool = True
seed_grpo_enabled: bool = False
seed_grpo_alpha: float = 0.0417
seed_grpo_alpha_normalize_by_max_entropy: bool = True
seed_grpo_length_normalize_logprobs: bool = True
maxent_allow_empty_weight_fallback: bool = False
maxent_listwise_skip_zero_variance_groups: bool = False
maxent_clip_range: float | None = None
clip_range: float = 0.0
clip_range_high: float | None = None
clip_delta: float | None = None
grpo_loss_type: str = 'bnpo'
dr_grpo_denominator_mode: str = 'fixed_max'
missing_boxed_answer_penalty: float = -0.05
beta: float | None = None
kl_target: float = 0.0
kl_horizon: int = 0
kl_ctl_step_size: float = 0.0
grpo_beta_controller_enabled: bool = False
maxent_beta_controller_enabled: bool = False
gen_temperature: float = 0.8
gen_top_p: float = 0.9
greedy_eval_enabled: bool = False
eval_greedy_only_enabled: bool = False
truncate_completions_at_first_boxed_answer: bool = False
vllm_mode: str = 'server'
vllm_url: str | None = 'http://localhost:8000/generate'
vllm_max_completion_rounds: int = 0
vllm_retry_sleep: float = 0.5
vllm_backfill_with_model: bool = True
vllm_return_logprobs: bool = True
vllm_force_logprobs: bool = False
vllm_logprob_fail_after: int | None = None
vllm_logprob_fallback: bool | None = None
vllm_client_tag_fail_fast: bool | None = None
vllm_sync_interval_steps: int | None = 1
vllm_sync_weights: bool = False
vllm_best_of: int | None = None
vllm_frequency_penalty: float = 0.0
vllm_presence_penalty: float = 0.0
vllm_top_k: int | None = None
vllm_stop_sequences: str | None = None
vllm_include_stop_str_in_output: bool | None = None
zero_truncated_completion_rewards: bool = False
dataloader_num_workers: int = 0
dataloader_pin_memory: bool | None = None
dataloader_prefetch_factor: int | None = None
dataloader_persistent_workers: bool | None = None
disable_distributed_sampler: bool = False
log_level: str | int = 'warning'
log_completions: bool = False
rich_log_completions: bool = False
rich_log_completions_key: str = 'rich_completions'
rich_log_completions_to_wandb: bool = False
rich_log_completions_synchronize_ranks: bool = True
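
The two dr_grpo_denominator_mode settings differ only in what the summed token loss is divided by. The sketch below computes both denominators from a 0/1 completion mask. The function name and the mask representation are assumptions made for illustration.

```python
from typing import List


def dr_grpo_denominator(
    completion_mask: List[List[int]],
    max_completion_length: int,
    mode: str = "fixed_max",
) -> int:
    """Return the loss normalizer for the given denominator mode."""
    batch_size = len(completion_mask)
    if mode == "fixed_max":
        # Constant normalizer: batch_size * max_completion_length.
        return batch_size * max_completion_length
    if mode == "active_tokens":
        # Realized number of completion tokens; guard against division by zero.
        return max(1, sum(sum(row) for row in completion_mask))
    raise ValueError(f"Unknown dr_grpo_denominator_mode: {mode!r}")
```

For a mask of [[1, 1, 0, 0], [1, 1, 1, 0]] with max_completion_length=4, fixed_max gives 8 and active_tokens gives 5.
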

get_process_log_level()

Return the numeric log level honoring the configured overrides.

Return type:

int
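
Since log_level accepts either a name or a number, resolving it to an integer can be sketched with the standard logging module. The function name parse_log_level is hypothetical, and any extra overrides the method honors are not modeled here.

```python
import logging
from typing import Union


def parse_log_level(value: Union[str, int]) -> int:
    """Resolve 'warning', 'WARNING', '30', or 30 to the numeric level 30."""
    if isinstance(value, int):
        return value
    text = str(value).strip()
    try:
        return int(text)  # numeric strings such as '20'
    except ValueError:
        pass
    # getLevelName returns an int for known names and a string otherwise.
    level = logging.getLevelName(text.upper())
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {value!r}")
    return level
```

With the documented default, parse_log_level('warning') returns 30.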

class maxent_grpo.config.grpo.GRPOScriptArguments(dataset_name=None, dataset_mixture=None, dataset_config=None, eval_reward_funcs=<factory>, eval_reward_weights=<factory>, cosine_min_value_wrong=0.0, cosine_max_value_wrong=-0.5, cosine_min_value_correct=0.5, cosine_max_value_correct=1.0, cosine_max_len=1000, repetition_n_grams=3, repetition_max_penalty=-1.0, dataset_prompt_column='problem', dataset_solution_column='answer', eval_dataset_name=None, eval_dataset_config=None, eval_dataset_split='validation', eval_dataset_prompt_column=None, eval_dataset_solution_column=None, max_completion_len=16384, soft_punish_cache=4096, span_kl_target=0.05, span_kl_beta0=0.12, span_kl_horizon=10000)

Bases: ScriptArguments

Script arguments for the GRPO training script.

Extends ScriptArguments with reward, dataset, and evaluation knobs used by MaxEnt-GRPO training pipelines.

Variables:
  • cosine_min_value_wrong – Minimum reward when the answer is wrong.

  • cosine_max_value_wrong – Maximum reward when the answer is wrong.

  • cosine_min_value_correct – Minimum reward for correct answers.

  • cosine_max_value_correct – Maximum reward for correct answers.

  • cosine_max_len – Maximum length considered when scaling cosine reward.

  • repetition_n_grams – N-gram size for repetition penalty rewards.

  • repetition_max_penalty – Maximum negative penalty for repetition rewards.

  • dataset_prompt_column – Column used as prompts during training.

  • dataset_solution_column – Column containing the reference solution.

  • eval_dataset_name – Dataset to use for evaluation when different from training.

  • eval_dataset_config – Config name for the evaluation dataset.

  • eval_dataset_split – Split to read from the evaluation dataset.

  • eval_dataset_prompt_column – Prompt column for the evaluation dataset.

  • eval_dataset_solution_column – Solution column for the evaluation dataset.

  • max_completion_len – Maximum completion length in characters.

  • soft_punish_cache – Minimum completion length before applying a soft penalty.

  • span_kl_target – Per-token KL target used by the span KL controller.

  • span_kl_beta0 – Initial KL coefficient for span KL regularization.

  • span_kl_horizon – Horizon (steps) for the span KL controller.
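
To show how repetition_n_grams and repetition_max_penalty could interact, here is a minimal sketch of an n-gram repetition penalty. It assumes the penalty scales with the fraction of repeated word n-grams, a common formulation. The function name and the whitespace tokenization are illustrative, and the library's reward function may differ.

```python
def repetition_penalty(completion: str, n: int = 3, max_penalty: float = -1.0) -> float:
    """Scale max_penalty by the fraction of word n-grams that are repeats."""
    if max_penalty > 0:
        raise ValueError("max_penalty should be zero or negative")
    words = completion.lower().split()
    if len(words) < n:
        return 0.0  # too short to contain a single n-gram
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    repeated_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
    return repeated_fraction * max_penalty
```

A completion with no repeated trigrams scores zero, while one that repeats a single trigram throughout approaches max_penalty as it grows longer.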

Parameters:
  • dataset_name (str | None)

  • dataset_mixture (DatasetMixtureConfig | None)

  • dataset_config (str | None)

  • eval_reward_funcs (list[str])

  • eval_reward_weights (list[float])

  • cosine_min_value_wrong (float)

  • cosine_max_value_wrong (float)

  • cosine_min_value_correct (float)

  • cosine_max_value_correct (float)

  • cosine_max_len (int)

  • repetition_n_grams (int)

  • repetition_max_penalty (float)

  • dataset_prompt_column (str)

  • dataset_solution_column (str)

  • eval_dataset_name (str | None)

  • eval_dataset_config (str | None)

  • eval_dataset_split (str)

  • eval_dataset_prompt_column (str | None)

  • eval_dataset_solution_column (str | None)

  • max_completion_len (int)

  • soft_punish_cache (int)

  • span_kl_target (float)

  • span_kl_beta0 (float)

  • span_kl_horizon (int)

eval_reward_funcs: list[str]
eval_reward_weights: list[float]
cosine_min_value_wrong: float = 0.0
cosine_max_value_wrong: float = -0.5
cosine_min_value_correct: float = 0.5
cosine_max_value_correct: float = 1.0
cosine_max_len: int = 1000
repetition_n_grams: int = 3
repetition_max_penalty: float = -1.0
dataset_prompt_column: str = 'problem'
dataset_solution_column: str = 'answer'
eval_dataset_name: str | None = None
eval_dataset_config: str | None = None
eval_dataset_split: str = 'validation'
eval_dataset_prompt_column: str | None = None
eval_dataset_solution_column: str | None = None
max_completion_len: int = 16384
soft_punish_cache: int = 4096
span_kl_target: float = 0.05
span_kl_beta0: float = 0.12
span_kl_horizon: int = 10000
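
The span_kl_target, span_kl_beta0, and span_kl_horizon fields suggest a proportional controller on the KL coefficient. The sketch below uses the classic adaptive KL controller form as an assumption. The source does not show the library's actual update rule, and the function name and the max_error clip of 0.2 are hypothetical.

```python
def update_kl_coef(
    beta: float,
    observed_kl: float,
    target: float = 0.05,
    horizon: int = 10000,
    n_steps: int = 1,
    max_error: float = 0.2,
) -> float:
    """Nudge beta up when KL exceeds the target and down when it falls below."""
    if target <= 0 or horizon <= 0:
        return beta  # controller disabled in this sketch
    # Clipped proportional error keeps any single update small.
    error = max(-max_error, min(max_error, observed_kl / target - 1.0))
    return beta * (1.0 + error * n_steps / horizon)
```

Starting from span_kl_beta0=0.12, an observed KL equal to the target leaves beta unchanged, and a larger horizon makes each adjustment proportionally smaller.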