maxent_grpo.core.hub

Helpers for working with the Hugging Face Hub.

This module provides:

  • Upload utilities to push a training output directory to a dedicated branch (revision) with basic safety checks.

  • Small metadata helpers such as parameter count inference from a repo ID (via naming conventions or safetensors metadata) and choosing a valid GPU count for vLLM tensor parallelism.

Functions

check_hub_revision_exists(training_args)

Validate whether a target Hub revision exists and is safe to write.

ensure_hf_repo_ready(training_args)

Verify Hub credentials and provision the target repo/branch upfront.

get_gpu_count_for_vllm(model_name[, ...])

Choose a valid GPU count for vLLM tensor parallelism.

get_param_count_from_repo_id(repo_id)

Infer parameter count from naming conventions or Hub metadata.

push_to_hub_revision(training_args[, ...])

Push a checkpoint directory to a branch on the Hub.

maxent_grpo.core.hub.push_to_hub_revision(training_args, extra_ignore_patterns=None, *, include_checkpoints=False)[source]

Push a checkpoint directory to a branch on the Hub.

The helper will create the repository if missing, ensure the target branch exists (forked from the latest commit when possible), and upload the output_dir contents while ignoring common checkpoint artefacts. Uploads are executed asynchronously via run_as_future=True to avoid blocking training scripts.

Parameters:
  • training_args (GRPOConfig) – Training config with Hub identifiers (hub_model_id and hub_model_revision) and the local output_dir to upload.

  • extra_ignore_patterns (list[str] | None) – Additional filename patterns to ignore during upload; appended to the default checkpoint-* and *.pth filters.

  • include_checkpoints (bool) – When True, checkpoint-* folders are uploaded instead of being ignored.

Returns:

Future that completes when the upload finishes, resolving to the Hub commit metadata.

Return type:

Future[huggingface_hub.CommitInfo]

Raises:

ValueError – If hub_model_id is not set in training_args.
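The interaction between include_checkpoints and extra_ignore_patterns can be sketched as a small pure function. This is an illustration only: build_ignore_patterns and DEFAULT_IGNORE_PATTERNS are hypothetical names, and the module's actual implementation may assemble the list differently.

```python
from typing import List, Optional

# Defaults taken from the parameter description above; the constant name is
# hypothetical and not part of maxent_grpo.core.hub.
DEFAULT_IGNORE_PATTERNS = ["checkpoint-*", "*.pth"]


def build_ignore_patterns(
    extra_ignore_patterns: Optional[List[str]] = None,
    *,
    include_checkpoints: bool = False,
) -> List[str]:
    """Return the list of filename patterns to skip during upload."""
    # Copy so the module-level default is never mutated.
    patterns = list(DEFAULT_IGNORE_PATTERNS)
    if include_checkpoints:
        # Stop filtering checkpoint folders so they are uploaded too.
        patterns = [p for p in patterns if p != "checkpoint-*"]
    if extra_ignore_patterns:
        # Caller-supplied patterns are appended after the defaults.
        patterns.extend(extra_ignore_patterns)
    return patterns


print(build_ignore_patterns(["*.tmp"], include_checkpoints=True))
```

A list like this is the kind of value typically passed as the ignore_patterns argument of an upload call.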

maxent_grpo.core.hub.ensure_hf_repo_ready(training_args)[source]

Verify Hub credentials and provision the target repo/branch upfront.

The helper is a best-effort preflight. When Hub access is not configured (or push is disabled), it returns early. Errors in network calls are surfaced as RuntimeError to avoid silent misconfiguration.

Parameters:

training_args (GRPOConfig) – Training config with Hub identifiers and push flags.

Returns:

None. The function exits early when Hub pushes are disabled.

Return type:

None

Raises:

RuntimeError – If the Hub preflight fails due to network or auth errors.
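The early-return and error-wrapping behaviour described for this helper can be sketched as follows. The names preflight, push_enabled and provision are hypothetical stand-ins (the real function reads its settings from training_args and calls the Hub directly), and the set of exceptions caught here is an assumption.

```python
from typing import Callable, Optional


def preflight(
    push_enabled: bool,
    repo_id: Optional[str],
    provision: Callable[[str], None],
) -> None:
    """Best-effort preflight: skip when pushes are off, fail loudly otherwise."""
    if not push_enabled or not repo_id:
        # Hub access is not configured, so there is nothing to verify.
        return
    try:
        # Hypothetical callback that would create the repo and branch.
        provision(repo_id)
    except (OSError, ValueError) as exc:
        # Surface the failure instead of letting training continue
        # with a misconfigured Hub target.
        raise RuntimeError(f"Hub preflight failed for {repo_id!r}: {exc}") from exc


def failing_provision(repo_id: str) -> None:
    # Simulates a network failure for demonstration purposes.
    raise ConnectionError("network unreachable")


preflight(False, "org/model", failing_provision)  # returns without calling provision
```

Chaining with "from exc" keeps the original network or auth error visible in the traceback.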

maxent_grpo.core.hub.check_hub_revision_exists(training_args)[source]

Validate whether a target Hub revision exists and is safe to write.

The check avoids clobbering populated branches unless explicitly permitted via overwrite_hub_revision. A README in the branch is treated as a signal that the branch has content.

Parameters:

training_args (GRPOConfig) – Training config with Hub identifiers and safety flags such as push_to_hub_revision and overwrite_hub_revision.

Returns:

None. Raises if the target revision appears non-empty and overwriting is disallowed.

Return type:

None

Raises:

ValueError – If the revision exists and appears non-empty while overwrite_hub_revision is not set.
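The safety rule can be expressed as a small decision function over the files already present in the branch. This is a sketch under assumptions: assert_revision_writable is a hypothetical name, the README is assumed to be named README.md, and the real helper obtains the branch and file listing from the Hub rather than taking them as arguments.

```python
from typing import Iterable


def assert_revision_writable(
    revision_exists: bool,
    files_in_revision: Iterable[str],
    overwrite: bool,
) -> None:
    """Raise ValueError when writing would clobber a populated branch."""
    if not revision_exists:
        # A brand-new branch cannot be clobbered.
        return
    # Assumption: the README that signals existing content is README.md.
    has_readme = "README.md" in set(files_in_revision)
    if has_readme and not overwrite:
        raise ValueError(
            "Target revision already has content; set overwrite_hub_revision "
            "to allow overwriting it."
        )


# An existing branch without a README is treated as empty and safe to write.
assert_revision_writable(True, ["config.json"], overwrite=False)
```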

maxent_grpo.core.hub.get_param_count_from_repo_id(repo_id)[source]

Infer parameter count from naming conventions or Hub metadata.

Prefers parsing strings like 42m, 1.5b or products like 8x7b from the repo ID. Falls back to safetensors metadata when no pattern is found.

Parameters:

repo_id (str | None) – Hub repository ID.

Returns:

Best guess of the total parameter count, or -1 if repo_id is missing or if neither pattern extraction nor the safetensors metadata lookup yields a value.

Return type:

int
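The name-based step can be illustrated with a regular expression that recognises sizes such as 42m, 1.5b and products such as 8x7b. This is a minimal sketch, not the module's code: param_count_from_name is a hypothetical name, the safetensors fallback is omitted, and returning the largest match when several appear is an assumption.

```python
import re
from typing import Optional

# A number, an optional "x<number>" multiplier, then a unit: b (billion) or m (million).
_SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)(?:x(\d+(?:\.\d+)?))?([bm])")
_UNIT_SCALE = {"b": 1_000_000_000, "m": 1_000_000}


def param_count_from_name(repo_id: Optional[str]) -> int:
    """Guess a parameter count from a repo ID, or return -1 if none is found."""
    if not repo_id:
        return -1
    counts = []
    for base, multiplier, unit in _SIZE_PATTERN.findall(repo_id.lower()):
        value = float(base)
        if multiplier:
            # Products such as 8x7b are read as 8 * 7 billion.
            value *= float(multiplier)
        counts.append(int(round(value * _UNIT_SCALE[unit])))
    # Assumption: when several sizes appear, the largest one is the model size.
    return max(counts) if counts else -1


print(param_count_from_name("mistralai/Mixtral-8x7B-v0.1"))
```

Note that version fragments such as the 2.5 in Qwen2.5 are ignored because no unit letter follows them.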

maxent_grpo.core.hub.get_gpu_count_for_vllm(model_name, revision='main', num_gpus=8)[source]

Choose a valid GPU count for vLLM tensor parallelism.

vLLM requires that both the number of attention heads and 64 be divisible by the tensor parallel size. This function decrements num_gpus until both constraints are satisfied.

Parameters:
  • model_name (str | None) – Model repository ID used to fetch the AutoConfig.

  • revision (str | None) – Repo revision/branch to inspect.

  • num_gpus (int) – Starting number of GPUs available; decremented until the constraints are satisfied.

Returns:

A compatible number of GPUs for vLLM tensor parallelism.

Return type:

int
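The decrement loop can be sketched as a pure function over the head count. In this illustration pick_tensor_parallel_size is a hypothetical name and num_heads is passed in directly, whereas the real helper reads the number of attention heads from the model's AutoConfig.

```python
def pick_tensor_parallel_size(num_heads: int, num_gpus: int = 8) -> int:
    """Largest GPU count <= num_gpus that divides both num_heads and 64."""
    if num_heads < 1 or num_gpus < 1:
        raise ValueError("num_heads and num_gpus must be positive")
    # Step down until both divisibility constraints hold. The loop always
    # terminates because a size of 1 divides every integer.
    while num_heads % num_gpus != 0 or 64 % num_gpus != 0:
        num_gpus -= 1
    return num_gpus


# With 28 heads and 8 GPUs: 8, 6 and 5 do not divide 28, 7 does not divide 64,
# so the first valid size is 4.
print(pick_tensor_parallel_size(28, 8))  # → 4
```

Because 64 must also be divisible, the result is always a power of two no larger than the starting num_gpus.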