maxent_grpo.training.runtime

Runtime utilities split by concern for the MaxEnt-GRPO training stack.

This package separates setup and dependency loading, logging, and prompt handling so that callers can import only what they need without pulling in the full helper module.

maxent_grpo.training.runtime.log_run_header(training_args=None)[source]

Log a consistent run header with recipe and resolved method identity.

Parameters:

training_args (Any | None) – Optional training config used to resolve metadata.

Returns:

Metadata dictionary emitted to the logs.

Return type:

dict[str, str]

maxent_grpo.training.runtime.resolve_run_metadata(training_args=None)[source]

Return run-level metadata for logging consistency.

Parameters:

training_args (Any | None) – Optional training config used to read recipe/method fields.

Returns:

Mapping with git SHA, recipe path, and resolved method identity.

Return type:

dict[str, str]
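As an illustration of the kind of mapping described above, here is a minimal sketch of a metadata resolver. It is not the package's implementation: the function name sketch_run_metadata, the key names (git_sha, recipe, method), and the attribute names read from training_args are all assumptions made for the example.

```python
import subprocess
from typing import Any, Dict, Optional


def sketch_run_metadata(training_args: Optional[Any] = None) -> Dict[str, str]:
    """Hypothetical stand-in for a run-metadata resolver (key names are assumed)."""
    try:
        # Ask git for the current commit; any failure falls back to "unknown".
        sha = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
            timeout=5,
        ).stdout.strip()
    except (OSError, subprocess.SubprocessError):
        sha = "unknown"
    # getattr with a default tolerates training_args=None and missing fields.
    return {
        "git_sha": sha or "unknown",
        "recipe": str(getattr(training_args, "recipe", "unknown")),
        "method": str(getattr(training_args, "method", "unknown")),
    }
```

Returning only strings keeps the result compatible with the documented dict[str, str] return type, whatever the config object holds.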

class maxent_grpo.training.runtime.ChatTokenizer(*args, **kwargs)[source]

Bases: Protocol

Protocol for tokenizers with chat template capabilities.

apply_chat_template(conversation, tokenize=True, add_generation_prompt=True)[source]

Render a conversation into a model-ready prompt.

Parameters:
  • conversation

  • tokenize (bool)

  • add_generation_prompt (bool)

Return type:

str | List[int]

property eos_token_id: int | None

Expose the EOS token id used by the tokenizer.
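The protocol's shape can be sketched with typing.Protocol. The names ChatTokenizerSketch, ToyTokenizer, and render_prompt below are hypothetical and exist only for this example; the toy template format is likewise an assumption and does not reflect any real model's chat template.

```python
from typing import Dict, List, Optional, Protocol, Union


class ChatTokenizerSketch(Protocol):
    """Structural type mirroring the two members documented above."""

    def apply_chat_template(
        self,
        conversation: List[Dict[str, str]],
        tokenize: bool = True,
        add_generation_prompt: bool = True,
    ) -> Union[str, List[int]]: ...

    @property
    def eos_token_id(self) -> Optional[int]: ...


class ToyTokenizer:
    """Toy implementation; satisfies the protocol without inheriting from it."""

    eos_token_id = 0  # a plain attribute is enough to satisfy the property

    def apply_chat_template(self, conversation, tokenize=True, add_generation_prompt=True):
        text = "".join(f"<{m['role']}>{m['content']}\n" for m in conversation)
        if add_generation_prompt:
            text += "<assistant>"
        # Token ids here are just code points, purely for illustration.
        return [ord(ch) for ch in text] if tokenize else text


def render_prompt(tokenizer: ChatTokenizerSketch, conversation: List[Dict[str, str]]) -> str:
    """Request the untokenized string form of the prompt."""
    rendered = tokenizer.apply_chat_template(conversation, tokenize=False)
    return rendered if isinstance(rendered, str) else ""
```

Because the check is structural, any tokenizer exposing these two members can be passed to code typed against the protocol.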

class maxent_grpo.training.runtime.GenerationPenaltyPassthroughMixin[source]

Bases: object

Expose penalty overrides via legacy gen_* accessors.

penalty: GenerationPenaltyConfig

property gen_top_k: int | None

Backward-compatible alias for the top-k sampling limit.

property gen_best_of: int | None

Backward-compatible alias for the best-of sampling count.

property gen_frequency_penalty: float

Backward-compatible alias for the frequency penalty strength.

property gen_presence_penalty: float

Backward-compatible alias for the presence penalty strength.

property gen_stop_sequences: List[str] | None

Backward-compatible alias for stop sequences.
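The passthrough pattern behind these aliases can be sketched as follows. This is an illustration only: PenaltySketch, PenaltyPassthroughSketch, and SamplerSketch are hypothetical names, and the field names on the nested config are assumptions rather than the real GenerationPenaltyConfig fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PenaltySketch:
    # Field names are assumed for the example.
    top_k: Optional[int] = None
    frequency_penalty: float = 0.0
    stop_sequences: Optional[List[str]] = None


class PenaltyPassthroughSketch:
    """Mixin exposing nested penalty fields under legacy gen_* names."""

    penalty: PenaltySketch

    @property
    def gen_top_k(self) -> Optional[int]:
        return self.penalty.top_k

    @property
    def gen_frequency_penalty(self) -> float:
        return self.penalty.frequency_penalty

    @property
    def gen_stop_sequences(self) -> Optional[List[str]]:
        return self.penalty.stop_sequences


@dataclass
class SamplerSketch(PenaltyPassthroughSketch):
    """Hypothetical config class that gains the aliases by inheriting the mixin."""

    penalty: PenaltySketch = field(default_factory=PenaltySketch)
```

Read-only properties keep the nested config as the single source of truth, so older call sites that read gen_top_k keep working without a second copy of the value.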

class maxent_grpo.training.runtime.GenerationSamplingConfig(max_prompt_len, max_completion_len, gen_temperature, gen_top_p, use_vllm, vllm, *, vllm_mode='server')[source]

Bases: object

Shared completion sampling knobs (HF + vLLM).

Parameters:
  • max_prompt_len (int)

  • max_completion_len (int)

  • gen_temperature (float)

  • gen_top_p (float)

  • use_vllm (bool)

  • vllm (VLLMClientConfig)

  • vllm_mode (str)

property vllm_url: str

Backward-compatible accessor for the vLLM endpoint URL.

property vllm_rounds_cfg: int

Backward-compatible accessor for the maximum vLLM retry rounds.

property vllm_retry_sleep: float

Backward-compatible accessor for the per-round retry sleep.

property vllm_backfill_local: bool

Backward-compatible accessor for local fallback behavior.

property vllm_request_logprobs: bool

Backward-compatible accessor for whether to request logprobs.

property vllm_best_of: int | None

Backward-compatible accessor for the best-of sampling count.

property vllm_frequency_penalty: float

Backward-compatible accessor for the frequency penalty value.

property vllm_presence_penalty: float

Backward-compatible accessor for the presence penalty value.

property vllm_top_k: int | None

Backward-compatible accessor for the top-k sampling limit.

property vllm_stop_sequences: List[str] | None

Backward-compatible accessor for stop sequences.

property vllm_include_stop_str_in_output: bool

Whether vLLM should preserve matched stop strings in output text.

property vllm_timeout: float

Backward-compatible accessor for request timeout.

property vllm_max_retries: int

Backward-compatible accessor for maximum request retries.

property vllm_backoff: float

Backward-compatible accessor for exponential backoff factor.

property vllm_backoff_multiplier: float

Multiplier applied to the backoff delay after each attempt.

property vllm_guided_json: str | None

Backward-compatible accessor for JSON schema-guided decoding.

property vllm_guided_regex: str | None

Backward-compatible accessor for regex-guided decoding.

property vllm_logit_bias: Dict[str, float] | None

Backward-compatible accessor for logit bias configuration.

property vllm_request_id_prefix: str | None

Backward-compatible accessor for request-id prefixes.

property vllm_sync_weights: bool

Whether to push model weights to the vLLM server before generation.
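To make the retry-related accessors concrete, the sketch below computes a delay schedule from a base backoff, a multiplier, and a retry count. How the real client combines vllm_backoff, vllm_backoff_multiplier, and vllm_max_retries is not stated here, so the formula (start at the base delay and multiply after each attempt) and the function name backoff_delays are assumptions.

```python
from typing import List


def backoff_delays(backoff: float, multiplier: float, max_retries: int) -> List[float]:
    """Return the assumed sleep before each retry: backoff * multiplier**i."""
    delays: List[float] = []
    delay = backoff
    for _ in range(max_retries):
        delays.append(delay)
        delay *= multiplier  # grow the delay after each attempt
    return delays
```

Under this assumption, a base of 1.0 with a multiplier of 2.0 and three retries yields delays of 1.0, 2.0, and 4.0 seconds, while a multiplier of 1.0 degenerates to a fixed delay.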

class maxent_grpo.training.runtime.MaxEntOptions(tau=<factory>, q_temperature=<factory>, q_epsilon=<factory>, length_normalize_ref=<factory>)[source]

Bases: object

Lightweight knobs specific to MaxEnt sequence-level updates.

Parameters:
  • tau (float)

  • q_temperature (float)

  • q_epsilon (float)

  • length_normalize_ref (bool)

maxent_grpo.training.runtime.classify_vllm_startup_log(log_text, stall_threshold=3)[source]

Classify startup progress using marker patterns in log_text.

Parameters:
  • log_text (str)

  • stall_threshold (int)

Return type:

StartupStatus

maxent_grpo.training.runtime.should_trigger_v0_fallback(log_text, attempt, min_attempts=20, stall_threshold=3)[source]

Return True when vLLM startup appears stuck and should be relaunched in V0 mode.

Parameters:
  • log_text (str)

  • attempt (int)

  • min_attempts (int)

  • stall_threshold (int)

Return type:

bool

maxent_grpo.training.runtime.get_trl_prepare_deepspeed()[source]

Return TRL’s prepare_deepspeed helper when available.

Return type:

Any | None

maxent_grpo.training.runtime.require_accelerator(context)[source]

Return accelerate.Accelerator or raise a helpful RuntimeError.

Parameters:

context (str)

Return type:

Any

maxent_grpo.training.runtime.require_dataloader(context)[source]

Return torch.utils.data.DataLoader with a descriptive error on failure.

Parameters:

context (str)

Return type:

Any

maxent_grpo.training.runtime.require_deepspeed(context, module='deepspeed')[source]

Return a DeepSpeed module import or raise a contextual RuntimeError.

Parameters:
  • context (str)

  • module (str)

Return type:

Any

maxent_grpo.training.runtime.require_torch(context)[source]

Return the torch module or raise a helpful RuntimeError.

Parameters:

context (str)

Return type:

Any

maxent_grpo.training.runtime.require_transformer_base_classes(context)[source]

Return (PreTrainedModel, PreTrainedTokenizer) with clear failure messages.

Parameters:

context (str)

Return type:

Tuple[Any, Any]
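The require_* helpers above share one pattern: attempt an import and convert a failure into a RuntimeError that names the calling context. A minimal sketch of that pattern, assuming a generic helper named require_module_sketch (hypothetical, not part of this package) and an invented error message:

```python
import importlib
from types import ModuleType


def require_module_sketch(name: str, context: str) -> ModuleType:
    """Import a module by name or raise a RuntimeError naming the caller's context."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Chain the original error so the underlying import failure stays visible.
        raise RuntimeError(
            f"{name} is required for {context}; install it and retry."
        ) from exc
```

Deferring the import to call time lets a module that only needs logging or prompt helpers load without torch, accelerate, or DeepSpeed installed.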

maxent_grpo.training.runtime.truncate_prompt(prompt, char_limit=None, *, tokenizer=None, max_tokens=None)[source]

Clamp prompt strings to a safe token length when possible.

Parameters:
  • prompt (str) – Prompt string to clamp.

  • char_limit (int | None) – Optional character limit fallback. When None the module-level PROMPT_CHAR_LIMIT is used.

  • tokenizer (Any | None) – Optional tokenizer used to enforce token limits.

  • max_tokens (int | None) – Optional token limit override (preferred when tokenizer is available).

Returns:

The original prompt when under the limit, otherwise a truncated prefix.

Return type:

str
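The documented behavior (prefer a token limit when a tokenizer is available, otherwise fall back to a character limit) can be sketched as below. This is not the package's code: truncate_prompt_sketch, ASSUMED_CHAR_LIMIT and its value, and the encode/decode tokenizer interface are assumptions for the example.

```python
from typing import Any, List, Optional

# Placeholder value; the real PROMPT_CHAR_LIMIT is not stated in this reference.
ASSUMED_CHAR_LIMIT = 2048


def truncate_prompt_sketch(
    prompt: str,
    char_limit: Optional[int] = None,
    *,
    tokenizer: Optional[Any] = None,
    max_tokens: Optional[int] = None,
) -> str:
    """Token-aware truncation with a character-limit fallback (illustrative)."""
    if tokenizer is not None and max_tokens is not None:
        ids = tokenizer.encode(prompt)
        if len(ids) <= max_tokens:
            return prompt
        # Keep the leading tokens and turn them back into text.
        return tokenizer.decode(ids[:max_tokens])
    limit = ASSUMED_CHAR_LIMIT if char_limit is None else char_limit
    return prompt if len(prompt) <= limit else prompt[:limit]


class WhitespaceTokenizer:
    """Toy tokenizer used only to exercise the token path."""

    def encode(self, text: str) -> List[str]:
        return text.split()

    def decode(self, ids: List[str]) -> str:
        return " ".join(ids)
```

With the toy tokenizer, truncating "a b c d" to two tokens keeps the prefix "a b"; without a tokenizer, the character limit applies instead.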

Modules

config – Configuration dataclasses for the training runtime.

deepspeed – DeepSpeed and Accelerate integration helpers.

deps – Dependency loading utilities used by the training runtime.

logging – Logging utilities (primarily W&B) for the training stack.

ops – Runtime operational helpers.

prompts – Prompt-related helpers and sampling penalties.

setup – Setup utilities for loading runtime dependencies and accelerator plugins.