maxent_grpo.training.weighting.types

Weighting-related dataclasses shared across the MaxEnt training loop.

Classes

ControllerMetaSettings([enabled, method, ...])

Meta-controller knobs governing tau/beta adaptation.

ControllerStateSnapshot(beta, tau, tau_log, ...)

Serializable controller state describing tau/beta parameters.

KlControllerSettings(target, horizon, step_size)

Controller settings for KL regularization.

QDistributionSettings(temperature, epsilon)

Softmax temperature and smoothing for weighting.

TauSchedule(target_entropy, learning_rate, ...)

Hyperparameters controlling tau adaptation.

TorchControllerState(torch_mod, tau_init, ...)

Torch-backed parameters for tau/beta with sync helpers.

WeightLoggingView([entropy, entropy_norm, ...])

Aggregated entropy statistics for logging.

WeightNormalizationSettings(denom, len_norm_ref)

Length-normalization flag and denominator scaling.

WeightStats(weights_grouped, flat_weights, ...)

Weights per completion and entropy diagnostics.

WeightingConfigLike(*args, **kwargs)

Protocol for objects that carry controller weighting scalars.

WeightingSettings(tau, beta, normalization, ...)

Sequence weighting hyperparameters with convenience accessors.

class maxent_grpo.training.weighting.types.TorchControllerState(torch_mod, tau_init, beta_init, *, requires_grad=False)

Bases: object

Torch-backed parameters for tau/beta with sync helpers.

Parameters:
  • torch_mod (Any)

  • tau_init (float)

  • beta_init (float)

  • requires_grad (bool)

enable_grad()

Return type:

None

disable_grad()

Return type:

None

sync_from_scalars(tau, beta)

Return type:

None

tau_tensor(detach=False)

Parameters:

detach (bool)

Return type:

Any

beta_tensor(detach=False)

Parameters:

detach (bool)

Return type:

Any

parameters()

Return type:

List[Any]

zero_grad()

Return type:

None

class maxent_grpo.training.weighting.types.WeightingConfigLike(*args, **kwargs)

Bases: Protocol

Protocol for objects that carry controller weighting scalars.

beta and tau are required. Optional attributes such as denom or train_grpo_objective may be present and are accessed via getattr.

beta: float
tau: float
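The structural-typing idea described above can be sketched as follows. This is a minimal illustration, not the library's code: the protocol is redefined locally as a stand-in, MinimalConfig, RichConfig, and describe are hypothetical names, and the getattr fallback values are chosen only for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Protocol


class WeightingConfigLike(Protocol):
    """Local stand-in mirroring the documented protocol (tau and beta required)."""

    tau: float
    beta: float


@dataclass
class MinimalConfig:
    """Hypothetical config carrying only the two required scalars."""

    tau: float
    beta: float


@dataclass
class RichConfig:
    """Hypothetical config that also exposes the optional attributes."""

    tau: float
    beta: float
    denom: float = 2.0
    train_grpo_objective: bool = True


def describe(cfg: WeightingConfigLike) -> Dict[str, Any]:
    """Read required scalars directly and optional ones through getattr."""
    return {
        "tau": cfg.tau,
        "beta": cfg.beta,
        # Fallback values are illustrative, not the library's defaults.
        "denom": getattr(cfg, "denom", 1.0),
        "train_grpo_objective": getattr(cfg, "train_grpo_objective", False),
    }


minimal = describe(MinimalConfig(tau=0.5, beta=0.04))
rich = describe(RichConfig(tau=0.5, beta=0.04))
```

Because the protocol is structural, neither config class needs to inherit from it; any object exposing tau and beta satisfies the type.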

class maxent_grpo.training.weighting.types.ControllerMetaSettings(enabled=False, method='analytic', learning_rate=0.0, tau_learning_rate=0.0, beta_learning_rate=0.0, beta_grad_clip=0.0, update_interval=1, objective='potential', analytic_steps=1, optimizer='sgd', truncation_steps=1, use_hessian=False, last_tau_grad=0.0, last_beta_grad=0.0)

Bases: object

Meta-controller knobs governing tau/beta adaptation.

Parameters:
  • enabled (bool)

  • method (str)

  • learning_rate (float)

  • tau_learning_rate (float)

  • beta_learning_rate (float)

  • beta_grad_clip (float)

  • update_interval (int)

  • objective (str)

  • analytic_steps (int)

  • optimizer (str)

  • truncation_steps (int)

  • use_hessian (bool)

  • last_tau_grad (float)

  • last_beta_grad (float)

enabled: bool = False
method: str = 'analytic'
learning_rate: float = 0.0
tau_learning_rate: float = 0.0
beta_learning_rate: float = 0.0
beta_grad_clip: float = 0.0
update_interval: int = 1
objective: str = 'potential'
analytic_steps: int = 1
optimizer: str = 'sgd'
truncation_steps: int = 1
use_hessian: bool = False
last_tau_grad: float = 0.0
last_beta_grad: float = 0.0

to_state()

Return a serializable snapshot of the meta-controller settings.

Return type:

Dict[str, Any]

apply_state(payload)

Update the meta-controller settings from a serialized payload.

Parameters:

payload (Mapping[str, Any])

Return type:

None
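A to_state / apply_state pair of this kind is often a thin wrapper around the dataclass fields. The sketch below assumes that behavior and uses a hypothetical stand-in class, MetaSettingsSketch, with only a subset of the documented fields; skipping unknown payload keys is also an assumption, not confirmed behavior of the library.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict, Mapping


@dataclass
class MetaSettingsSketch:
    """Hypothetical stand-in with a subset of the documented fields."""

    enabled: bool = False
    method: str = "analytic"
    learning_rate: float = 0.0
    update_interval: int = 1

    def to_state(self) -> Dict[str, Any]:
        # asdict returns a plain dict of the field values.
        return asdict(self)

    def apply_state(self, payload: Mapping[str, Any]) -> None:
        # Assumption: only known fields are updated; unknown keys are skipped.
        known = {f.name for f in fields(self)}
        for key, value in payload.items():
            if key in known:
                setattr(self, key, value)


saved = MetaSettingsSketch(enabled=True, learning_rate=0.01).to_state()
restored = MetaSettingsSketch()
restored.apply_state({**saved, "unknown_key": 123})
```

Filtering on the known field names keeps older checkpoints loadable when fields are added or removed, at the cost of silently dropping misspelled keys.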

class maxent_grpo.training.weighting.types.ControllerStateSnapshot(beta, tau, tau_log, tau_entropy_ema, meta=<factory>)

Bases: object

Serializable controller state describing tau/beta parameters.

Parameters:
  • beta (float)

  • tau (float)

  • tau_log (float)

  • tau_entropy_ema (float)

  • meta (Dict[str, Any])

beta: float
tau: float
tau_log: float
tau_entropy_ema: float
meta: Dict[str, Any]
STATE_VERSION: ClassVar[int] = 1

to_dict()

Serialize the snapshot to a JSON-friendly mapping.

Return type:

Dict[str, Any]

classmethod from_weighting(weighting_cfg)

Build a controller snapshot from the active weighting settings.

Parameters:

weighting_cfg (WeightingConfigLike)

Return type:

ControllerStateSnapshot

classmethod from_dict(payload)

Instantiate a snapshot from a serialized payload.

Parameters:

payload (Mapping[str, Any])

Return type:

ControllerStateSnapshot

apply_to_weighting(weighting_cfg)

Apply the snapshot contents to a weighting configuration.

Parameters:

weighting_cfg (WeightingConfigLike)

Return type:

None
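The snapshot round trip (serialize, store as JSON, restore, apply) might look like the sketch below. SnapshotSketch is a hypothetical stand-in for the documented fields; the "version" key in the payload and the choice to write back only tau and beta are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, ClassVar, Dict, Mapping


@dataclass
class SnapshotSketch:
    """Hypothetical stand-in for the documented snapshot fields."""

    beta: float
    tau: float
    tau_log: float
    tau_entropy_ema: float
    meta: Dict[str, Any] = field(default_factory=dict)
    STATE_VERSION: ClassVar[int] = 1

    def to_dict(self) -> Dict[str, Any]:
        # Assumption: the version is stored next to the fields under "version".
        return {
            "version": self.STATE_VERSION,
            "beta": self.beta,
            "tau": self.tau,
            "tau_log": self.tau_log,
            "tau_entropy_ema": self.tau_entropy_ema,
            "meta": dict(self.meta),
        }

    @classmethod
    def from_dict(cls, payload: Mapping[str, Any]) -> "SnapshotSketch":
        return cls(
            beta=float(payload["beta"]),
            tau=float(payload["tau"]),
            tau_log=float(payload["tau_log"]),
            tau_entropy_ema=float(payload["tau_entropy_ema"]),
            meta=dict(payload.get("meta", {})),
        )

    def apply_to_weighting(self, weighting_cfg: Any) -> None:
        # Only the two scalars required by the protocol are written back here.
        weighting_cfg.tau = self.tau
        weighting_cfg.beta = self.beta


snap = SnapshotSketch(beta=0.04, tau=0.5, tau_log=-0.693, tau_entropy_ema=1.2)
restored = SnapshotSketch.from_dict(json.loads(json.dumps(snap.to_dict())))

cfg = SimpleNamespace(tau=0.0, beta=0.0)  # any object exposing tau and beta
restored.apply_to_weighting(cfg)
```

Storing a version number with the payload lets a later from_dict detect and migrate older checkpoints instead of failing on a missing key.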

class maxent_grpo.training.weighting.types.KlControllerSettings(target, horizon, step_size)

Bases: object

Controller settings for KL regularization.

Parameters:
  • target (float)

  • horizon (int)

  • step_size (float)

target: float
horizon: int
step_size: float
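For orientation, one common way to use a target, a horizon, and a step size is a proportional controller on the KL coefficient beta. The sketch below shows that general pattern only; the update rule, the clipping bound of 0.2, and the names KlSettingsSketch and update_beta are assumptions and may not match how this library combines the three fields.

```python
from dataclasses import dataclass


@dataclass
class KlSettingsSketch:
    """Hypothetical stand-in with the three documented fields."""

    target: float
    horizon: int
    step_size: float


def update_beta(
    beta: float, observed_kl: float, cfg: KlSettingsSketch, n_steps: int = 1
) -> float:
    """One proportional update of the KL coefficient (assumed rule)."""
    # Relative error, clipped so a single noisy batch cannot move beta too far.
    error = max(-0.2, min(0.2, observed_kl / cfg.target - 1.0))
    # Assumption: step_size scales the update and horizon spreads it over steps.
    return beta * (1.0 + cfg.step_size * error * n_steps / cfg.horizon)


cfg = KlSettingsSketch(target=0.05, horizon=100, step_size=1.0)
beta_up = update_beta(0.04, observed_kl=0.10, cfg=cfg)    # KL above target
beta_down = update_beta(0.04, observed_kl=0.01, cfg=cfg)  # KL below target
```

With this rule, beta grows when the observed KL exceeds the target and shrinks when it falls below, and a larger horizon makes each individual step smaller.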

class maxent_grpo.training.weighting.types.QDistributionSettings(temperature, epsilon)

Bases: object

Softmax temperature and smoothing for weighting.

Parameters:
  • temperature (float)

  • epsilon (float)

temperature: float
epsilon: float
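To make the two knobs concrete, the sketch below computes a temperature-scaled softmax and then mixes it with a uniform distribution. The function name q_distribution and the uniform-mixture form of the epsilon smoothing are assumptions for illustration; the library may apply epsilon differently (for example as an additive floor followed by renormalization).

```python
import math
from typing import List


def q_distribution(
    rewards: List[float], temperature: float, epsilon: float
) -> List[float]:
    """Temperature softmax mixed with a uniform floor (assumed smoothing form)."""
    scaled = [r / temperature for r in rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    k = len(rewards)
    # Assumption: epsilon blends the softmax with the uniform distribution.
    return [(1.0 - epsilon) * e / total + epsilon / k for e in exps]


sharp = q_distribution([1.0, 0.0, 0.0], temperature=0.1, epsilon=0.0)
smooth = q_distribution([1.0, 0.0, 0.0], temperature=0.1, epsilon=0.3)
```

A low temperature concentrates mass on the highest-reward completion, while a positive epsilon guarantees every completion at least epsilon / K of the mass.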

class maxent_grpo.training.weighting.types.TauSchedule(target_entropy, learning_rate, minimum_value, maximum_value, warmup_steps, target_entropy_start=None, target_entropy_final=None, target_entropy_horizon=0)

Bases: object

Hyperparameters controlling tau adaptation.

Parameters:
  • target_entropy (float | None)

  • learning_rate (float)

  • minimum_value (float)

  • maximum_value (float)

  • warmup_steps (int)

  • target_entropy_start (float | None)

  • target_entropy_final (float | None)

  • target_entropy_horizon (int)

target_entropy: float | None
learning_rate: float
minimum_value: float
maximum_value: float
warmup_steps: int
target_entropy_start: float | None = None
target_entropy_final: float | None = None
target_entropy_horizon: int = 0
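One plausible reading of these fields is an annealed entropy target combined with a clamped update of tau. The sketch below illustrates that reading only: the linear anneal, the log-space update, the rule that updates are skipped during warmup, and the names TauScheduleSketch, current_target, and update_tau are all assumptions rather than the library's confirmed behavior.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TauScheduleSketch:
    """Hypothetical stand-in mirroring the documented fields."""

    target_entropy: Optional[float]
    learning_rate: float
    minimum_value: float
    maximum_value: float
    warmup_steps: int
    target_entropy_start: Optional[float] = None
    target_entropy_final: Optional[float] = None
    target_entropy_horizon: int = 0


def current_target(s: TauScheduleSketch, step: int) -> Optional[float]:
    """Assumed linear anneal from start to final over the horizon."""
    if (
        s.target_entropy_start is None
        or s.target_entropy_final is None
        or s.target_entropy_horizon <= 0
    ):
        return s.target_entropy
    frac = min(1.0, step / s.target_entropy_horizon)
    return s.target_entropy_start + frac * (
        s.target_entropy_final - s.target_entropy_start
    )


def update_tau(tau: float, measured_entropy: float, s: TauScheduleSketch, step: int) -> float:
    """Move tau toward the entropy target, then clamp to [minimum, maximum]."""
    target = current_target(s, step)
    if target is None or step < s.warmup_steps:
        return tau  # adaptation disabled or still warming up
    # Assumption: multiplicative (log-space) step; low entropy raises tau.
    log_tau = math.log(tau) + s.learning_rate * (target - measured_entropy)
    return min(s.maximum_value, max(s.minimum_value, math.exp(log_tau)))


sched = TauScheduleSketch(
    target_entropy=1.0, learning_rate=0.1, minimum_value=0.05, maximum_value=1.0,
    warmup_steps=10, target_entropy_start=1.5, target_entropy_final=0.5,
    target_entropy_horizon=100,
)
mid_target = current_target(sched, 50)
tau_warm = update_tau(0.5, 0.2, sched, step=5)
tau_up = update_tau(0.5, 0.2, sched, step=50)
tau_capped = update_tau(0.99, 0.0, sched, step=100)
```

Updating in log space keeps tau positive without extra checks, and the final clamp enforces the documented minimum and maximum bounds.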

class maxent_grpo.training.weighting.types.WeightLoggingView(entropy=0.0, entropy_norm=0.0, entropy_min=0.0, entropy_max=0.0, advantage_entropy_mean=0.0, advantage_entropy_std=0.0)

Bases: object

Aggregated entropy statistics for logging.

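Parameters:
  • entropy (float)

  • entropy_norm (float)

  • entropy_min (float)

  • entropy_max (float)

  • advantage_entropy_mean (float)

  • advantage_entropy_std (float)

entropy: float = 0.0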
entropy_norm: float = 0.0
entropy_min: float = 0.0
entropy_max: float = 0.0
advantage_entropy_mean: float = 0.0
advantage_entropy_std: float = 0.0

class maxent_grpo.training.weighting.types.WeightNormalizationSettings(denom, len_norm_ref)

Bases: object

Length-normalization flag and denominator scaling.

Parameters:
  • denom (float)

  • len_norm_ref (bool)

denom: float
len_norm_ref: bool

class maxent_grpo.training.weighting.types.WeightStats(weights_grouped, flat_weights, weight_entropy, weight_entropy_min, weight_entropy_max, advantage_entropy)

Bases: object

Weights per completion and entropy diagnostics.

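Parameters:
  • weights_grouped (List[List[float]])

  • flat_weights (List[float])

  • weight_entropy (float)

  • weight_entropy_min (float)

  • weight_entropy_max (float)

  • advantage_entropy (List[float])

weights_grouped: List[List[float]]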
flat_weights: List[float]
weight_entropy: float
weight_entropy_min: float
weight_entropy_max: float
advantage_entropy: List[float]
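As an illustration of how entropy diagnostics like these could be derived from grouped weights, the sketch below computes the Shannon entropy of each prompt group and summarizes it. Taking the mean over groups for weight_entropy and dividing by log(K) for a normalized entropy are assumptions; the helper entropy and the sample weights are hypothetical.

```python
import math
from typing import List


def entropy(weights: List[float]) -> float:
    """Shannon entropy in nats; zero weights contribute nothing."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# Two prompt groups with four completions each; each group sums to 1.
weights_grouped = [[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]]
flat_weights = [w for group in weights_grouped for w in group]

per_group = [entropy(group) for group in weights_grouped]
weight_entropy = sum(per_group) / len(per_group)  # assumed: mean over groups
weight_entropy_min = min(per_group)
weight_entropy_max = max(per_group)
# Assumed normalization: divide by log(K), the entropy of uniform weights.
entropy_norm = [h / math.log(len(g)) for h, g in zip(per_group, weights_grouped)]
```

A normalized entropy near 1 means the weights in a group are close to uniform, while values near 0 mean a single completion dominates.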

class maxent_grpo.training.weighting.types.WeightingSettings(tau, beta, normalization, q_distribution, tau_schedule, kl_controller, train_grpo_objective, scale_rewards=True, controller_meta=<factory>, controller_state=None, allow_empty_weight_fallback=False)

Bases: object

Sequence weighting hyperparameters with convenience accessors.

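Parameters:
  • tau (float)

  • beta (float)

  • normalization (WeightNormalizationSettings)

  • q_distribution (QDistributionSettings)

  • tau_schedule (TauSchedule)

  • kl_controller (KlControllerSettings)

  • train_grpo_objective (bool)

  • scale_rewards (bool)

  • controller_meta (ControllerMetaSettings)

  • controller_state (TorchControllerState | None)

  • allow_empty_weight_fallback (bool)

tau: float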
beta: float
normalization: WeightNormalizationSettings
q_distribution: QDistributionSettings
tau_schedule: TauSchedule
kl_controller: KlControllerSettings
train_grpo_objective: bool
scale_rewards: bool = True
controller_meta: ControllerMetaSettings
controller_state: TorchControllerState | None = None
allow_empty_weight_fallback: bool = False

property denom: float

Return the denominator used for weight normalization.

Returns:

Normalization denominator applied to weights.

Return type:

float

property len_norm_ref: bool

Return whether reference log-probs are length-normalized.

Returns:

True when reference stats are length-normalized.

Return type:

bool

property q_temperature: float

Return the q-distribution temperature.

Returns:

Temperature applied to the q-distribution softmax.

Return type:

float

property q_epsilon: float

Return the epsilon smoothing factor.

Returns:

Epsilon smoothing applied to the q-distribution.

Return type:

float

property tau_target_entropy: float | None

Return the target weight entropy.

Returns:

Desired entropy target (None to disable adaptation).

Return type:

float | None

property tau_lr: float

Return the learning rate for tau adaptation.

Returns:

Scalar learning rate for tau updates.

Return type:

float

property tau_min: float

Return the minimum tau value.

Returns:

Lower bound applied to tau.

Return type:

float

property tau_max: float

Return the maximum tau value.

Returns:

Upper bound applied to tau.

Return type:

float

property tau_warmup_steps: int

Return the tau warmup horizon.

Returns:

Number of steps used to warm up tau updates.

Return type:

int

property kl_target: float

Return the KL target.

Returns:

Desired KL divergence target.

Return type:

float

property kl_horizon: int

Return the KL controller horizon.

Returns:

Number of steps used for the KL controller horizon.

Return type:

int

property kl_ctl_step_size: float

Return the KL controller step size.

Returns:

Step size multiplier used by the KL controller.

Return type:

float
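The convenience accessors listed above read as thin properties that forward to the nested settings objects. The sketch below shows that pattern with two of them; NormalizationSketch, QSketch, and SettingsSketch are hypothetical stand-ins, and the forwarding itself is inferred from the property names rather than confirmed from the source.

```python
from dataclasses import dataclass


@dataclass
class NormalizationSketch:
    """Hypothetical stand-in for the normalization settings."""

    denom: float
    len_norm_ref: bool


@dataclass
class QSketch:
    """Hypothetical stand-in for the q-distribution settings."""

    temperature: float
    epsilon: float


@dataclass
class SettingsSketch:
    """Hypothetical stand-in with two forwarding properties."""

    tau: float
    beta: float
    normalization: NormalizationSketch
    q_distribution: QSketch

    @property
    def denom(self) -> float:
        # Forward to the nested normalization settings.
        return self.normalization.denom

    @property
    def q_temperature(self) -> float:
        # Forward to the nested q-distribution settings.
        return self.q_distribution.temperature


settings = SettingsSketch(
    tau=0.5,
    beta=0.04,
    normalization=NormalizationSketch(denom=1.0, len_norm_ref=True),
    q_distribution=QSketch(temperature=2.0, epsilon=1e-6),
)
denom_before = settings.denom
settings.normalization.denom = 4.0  # the property reflects nested changes
denom_after = settings.denom
```

Because the properties hold no state of their own, callers can use a flat name such as settings.denom while the nested objects remain the single source of truth.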