maxent_grpo.training.runtime¶

Runtime utilities split by concern for the MaxEnt-GRPO training stack.

This package separates setup/dependency loading, logging, and prompt handling so callers can import only what they need without pulling the full helper module.

maxent_grpo.training.runtime.log_run_header(training_args=None)[source]¶

Log a consistent run header with recipe and resolved method identity.

Parameters:: training_args (Any | None) – Optional training config used to resolve metadata.
Returns:: Metadata dictionary emitted to the logs.
Return type:: dict[str, str]

maxent_grpo.training.runtime.resolve_run_metadata(training_args=None)[source]¶

Return run-level metadata for logging consistency.

Parameters:: training_args (Any | None) – Optional training config used to read recipe/method fields.
Returns:: Mapping with git SHA, recipe path, and resolved method identity.
Return type:: dict[str, str]

class maxent_grpo.training.runtime.ChatTokenizer(*args, **kwargs)[source]¶

Bases: Protocol

Protocol for tokenizers with chat template capabilities.

apply_chat_template(conversation, tokenize=True, add_generation_prompt=True)[source]¶

Render a conversation into a model-ready prompt.

Parameters:

conversation (List[Dict[str, str]])
tokenize (bool)
add_generation_prompt (bool)

Return type:

str | List[int]

property eos_token_id: int | None¶: Expose the EOS token id used by the tokenizer.

class maxent_grpo.training.runtime.GenerationPenaltyPassthroughMixin[source]¶

Bases: object

Expose penalty overrides via legacy gen_* accessors.

penalty: GenerationPenaltyConfig¶

property gen_top_k: int | None¶: Backward-compatible alias for the top-k sampling limit.

property gen_best_of: int | None¶: Backward-compatible alias for the best-of sampling count.

property gen_frequency_penalty: float¶: Backward-compatible alias for the frequency penalty strength.

property gen_presence_penalty: float¶: Backward-compatible alias for the presence penalty strength.

property gen_stop_sequences: List[str] | None¶: Backward-compatible alias for stop sequences.

class maxent_grpo.training.runtime.GenerationSamplingConfig(max_prompt_len, max_completion_len, gen_temperature, gen_top_p, use_vllm, vllm, *, vllm_mode='server')[source]¶

Bases: object

Shared completion sampling knobs (HF + vLLM).

Parameters:

max_prompt_len (int)
max_completion_len (int)
gen_temperature (float)
gen_top_p (float)
use_vllm (bool)
vllm (VLLMClientConfig)
vllm_mode (str)

max_prompt_len: int¶

max_completion_len: int¶

gen_temperature: float¶

gen_top_p: float¶

use_vllm: bool¶

vllm: VLLMClientConfig¶

vllm_mode: str = 'server'¶

property vllm_url: str¶: Backward-compatible accessor for the vLLM endpoint URL.

property vllm_rounds_cfg: int¶: Backward-compatible accessor for the maximum vLLM retry rounds.

property vllm_retry_sleep: float¶: Backward-compatible accessor for the per-round retry sleep.

property vllm_backfill_local: bool¶: Backward-compatible accessor for local fallback behavior.

property vllm_request_logprobs: bool¶: Backward-compatible accessor for whether to request logprobs.

property vllm_best_of: int | None¶: Backward-compatible accessor for the best-of sampling count.

property vllm_frequency_penalty: float¶: Backward-compatible accessor for the frequency penalty value.

property vllm_presence_penalty: float¶: Backward-compatible accessor for the presence penalty value.

property vllm_top_k: int | None¶: Backward-compatible accessor for the top-k sampling limit.

property vllm_stop_sequences: List[str] | None¶: Backward-compatible accessor for stop sequences.

property vllm_include_stop_str_in_output: bool¶: Whether vLLM should preserve matched stop strings in output text.

property vllm_timeout: float¶: Backward-compatible accessor for request timeout.

property vllm_max_retries: int¶: Backward-compatible accessor for maximum request retries.

property vllm_backoff: float¶: Backward-compatible accessor for exponential backoff factor.

property vllm_backoff_multiplier: float¶: Multiplier applied to the backoff delay after each attempt.

property vllm_guided_json: str | None¶: Backward-compatible accessor for JSON schema-guided decoding.

property vllm_guided_regex: str | None¶: Backward-compatible accessor for regex-guided decoding.

property vllm_logit_bias: Dict[str, float] | None¶: Backward-compatible accessor for logit bias configuration.

property vllm_request_id_prefix: str | None¶: Backward-compatible accessor for request-id prefixes.

property vllm_sync_weights: bool¶: Whether to push model weights to the vLLM server before generation.

class maxent_grpo.training.runtime.MaxEntOptions(tau=<factory>, q_temperature=<factory>, q_epsilon=<factory>, length_normalize_ref=<factory>)[source]¶

Bases: object

Lightweight knobs specific to MaxEnt sequence-level updates.

Parameters:

tau (float)
q_temperature (float)
q_epsilon (float)
length_normalize_ref (bool)

tau: float¶

q_temperature: float¶

q_epsilon: float¶

length_normalize_ref: bool¶

maxent_grpo.training.runtime.classify_vllm_startup_log(log_text, stall_threshold=3)[source]¶

Classify startup progress using marker patterns in log_text.

Parameters:

log_text (str)
stall_threshold (int)

Return type:

StartupStatus

maxent_grpo.training.runtime.should_trigger_v0_fallback(log_text, attempt, min_attempts=20, stall_threshold=3)[source]¶

Return True when vLLM startup appears stuck and should be relaunched in V0 mode.

Parameters:

log_text (str)
attempt (int)
min_attempts (int)
stall_threshold (int)

Return type:

bool

maxent_grpo.training.runtime.get_trl_prepare_deepspeed()[source]¶

Return TRL’s prepare_deepspeed helper when available.

Return type:: Any | None

maxent_grpo.training.runtime.require_accelerator(context)[source]¶

Return accelerate.Accelerator or raise a helpful RuntimeError.

Parameters:: context (str)
Return type:: Any

maxent_grpo.training.runtime.require_dataloader(context)[source]¶

Return torch.utils.data.DataLoader with a descriptive error on failure.

Parameters:: context (str)
Return type:: Any

maxent_grpo.training.runtime.require_deepspeed(context, module='deepspeed')[source]¶

Return a DeepSpeed module import or raise a contextual RuntimeError.

Parameters:

context (str)
module (str)

Return type:

Any

maxent_grpo.training.runtime.require_torch(context)[source]¶

Return the torch module or raise a helpful RuntimeError.

Parameters:: context (str)
Return type:: Any

maxent_grpo.training.runtime.require_transformer_base_classes(context)[source]¶

Return (PreTrainedModel, PreTrainedTokenizer) with clear failure messages.

Parameters:: context (str)
Return type:: Tuple[Any, Any]

maxent_grpo.training.runtime.truncate_prompt(prompt, char_limit=None, *, tokenizer=None, max_tokens=None)[source]¶

Clamp prompt strings to a safe token length when possible.

Parameters:

prompt (str) – Prompt string to clamp.
char_limit (int | None) – Optional character limit fallback. When None the module-level PROMPT_CHAR_LIMIT is used.
tokenizer (Any | None) – Optional tokenizer used to enforce token limits.
max_tokens (int | None) – Optional token limit override (preferred when tokenizer is available).

Returns:

The original prompt when under the limit, otherwise a truncated prefix.

Return type:

str

Modules

`config`	Configuration dataclasses for the training runtime.
`deepspeed`	DeepSpeed and Accelerate integration helpers.
`deps`	Dependency loading utilities used by the training runtime.
`logging`	Logging utilities (primarily W&B) for the training stack.
`ops`	Runtime operational helpers.
`prompts`	Prompt-related helpers and sampling penalties.
`setup`	Setup utilities for loading runtime dependencies and accelerator plugins.