maxent_grpo.core.evaluation

LightEval task registration and Slurm launch utilities.

This module provides helpers to:

  • Define a compact string specification per benchmark task and register common tasks in a dictionary consumable by launchers.

  • Build the vLLM Slurm submission command and submit evaluations as jobs using subprocess.run.

It also exposes SUPPORTED_BENCHMARKS and convenience functions to list registered tasks. Launching vLLM on Slurm requires a specific environment bootstrap (see VLLM_SLURM_PREFIX) that sources system profiles and sets $HOME.

Functions

_build_slurm_gpu_flag(num_gpus)

Return sbatch GPU flag(s) based on environment policy.

get_lighteval_tasks()

Return the list of registered LightEval task names.

register_lighteval_task(configs, eval_suite, ...)

Register a LightEval task configuration in configs.

run_benchmark_jobs(training_args, model_args)

Launch one or more benchmarks as Slurm jobs.

run_lighteval_job(benchmark, training_args, ...)

Launch a LightEval job under Slurm with vLLM decoding.

maxent_grpo.core.evaluation.register_lighteval_task(configs, eval_suite, task_name, task_list, num_fewshot=0)

Register a LightEval task configuration in configs.

Parameters:
  • configs (dict[str, str]) – Mapping where the serialized task spec is stored; mutated in place with the new task_name entry.

  • eval_suite (str) – Suite prefix, e.g. "lighteval" or "extended". This is prepended to every task in task_list.

  • task_name (str) – Key under which the composed task specification is stored.

  • task_list (str) – Comma-separated list of task identifiers without suite prefix. Each entry is expanded into {eval_suite}|{task}|{num_fewshot}|0.

  • num_fewshot (int) – Number of few-shot examples per task. Defaults to 0.

Returns:

None. configs is updated directly.

Return type:

None
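The expansion rule described above can be illustrated with a minimal sketch. The function name register_task_sketch and the task names are hypothetical stand-ins, not the module's code, and two details are assumptions: that the expanded entries are joined with commas into one string, and that whitespace around each task is stripped.

```python
def register_task_sketch(configs, eval_suite, task_name, task_list, num_fewshot=0):
    """Illustrative stand-in for register_lighteval_task (not the library code)."""
    # Expand each comma-separated task into "{eval_suite}|{task}|{num_fewshot}|0".
    specs = [
        f"{eval_suite}|{task.strip()}|{num_fewshot}|0"
        for task in task_list.split(",")
    ]
    # Assumption: the expanded entries are joined with commas into one string.
    configs[task_name] = ",".join(specs)


# Hypothetical task names, used only to show the shape of the stored spec.
configs = {}
register_task_sketch(configs, "lighteval", "single", "task_a")
register_task_sketch(configs, "extended", "pair", "task_b,task_c", num_fewshot=2)

print(configs["single"])  # lighteval|task_a|0|0
print(configs["pair"])    # extended|task_b|2|0,extended|task_c|2|0
```

Because configs is mutated in place, the caller keeps a single dictionary and registers every benchmark against it.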

maxent_grpo.core.evaluation.get_lighteval_tasks()

Return the list of registered LightEval task names.

Returns:

Available benchmark keys currently registered in LIGHTEVAL_TASKS.

Return type:

list[str]

maxent_grpo.core.evaluation.run_lighteval_job(benchmark, training_args, model_args)

Launch a LightEval job under Slurm with vLLM decoding.

The job command is composed from VLLM_SLURM_PREFIX and a generated task list. For models with at least 30B parameters, the function enables tensor parallelism; for smaller models it requests two GPUs to reduce cluster pressure. If system_prompt is provided, it is base64-encoded to avoid quoting issues in the Slurm script.
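The two decisions in this description, GPU selection and prompt encoding, might look roughly like the sketch below. The helper names, the gpus_per_node parameter, and the choice to use every GPU on the node for large models are assumptions made for illustration; only the 30B threshold, the two-GPU default, and the base64 step come from the text above.

```python
import base64

# ">=30B parameters" threshold taken from the description above.
TENSOR_PARALLEL_THRESHOLD = 30_000_000_000


def plan_gpus_sketch(param_count, gpus_per_node):
    """Return (num_gpus, tensor_parallel) for a model of the given size."""
    if param_count >= TENSOR_PARALLEL_THRESHOLD:
        # Assumption: large models are sharded across all GPUs on the node.
        return gpus_per_node, True
    # Smaller models fall back to two GPUs without tensor parallelism.
    return 2, False


def encode_system_prompt_sketch(system_prompt):
    """Base64-encode a prompt so quotes and newlines survive the shell."""
    return base64.b64encode(system_prompt.encode("utf-8")).decode("ascii")


prompt = "You're a careful \"math\" tutor.\nShow your work."
encoded = encode_system_prompt_sketch(prompt)
# The receiving side can recover the exact prompt by decoding.
decoded = base64.b64decode(encoded).decode("utf-8")
```

The encoded string contains only letters, digits, and the characters +, /, and =, so it can be passed as a single shell argument without escaping.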

Parameters:
  • benchmark (str) – Registered benchmark key to execute.

  • training_args (GRPOConfig) – Training configuration providing Hub identifiers and the optional system_prompt for evaluation.

  • model_args (ModelConfig) – Model configuration controlling trust flags for remote code and general model loading options.

Returns:

None. A subprocess is spawned for the Slurm submission.

Return type:

None

Raises:

maxent_grpo.core.evaluation.run_benchmark_jobs(training_args, model_args)

Launch one or more benchmarks as Slurm jobs.

When the CLI requests benchmarks=["all"], the function expands it into every registered LightEval task. Each benchmark is then delegated to run_lighteval_job with the provided arguments.

Parameters:
  • training_args (GRPOConfig) – Training configuration whose benchmarks field enumerates the tasks to run (or the sentinel "all").

  • model_args (ModelConfig) – Model configuration forwarded to run_lighteval_job.

Returns:

None. Jobs are submitted sequentially.

Return type:

None

Raises:

ValueError – If an unknown benchmark name is supplied.
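The dispatch behavior documented for run_benchmark_jobs can be sketched as follows. Everything here is a hypothetical stand-in: demo_registry plays the role of the registered task mapping, submit replaces the call to run_lighteval_job, and validating all names before submitting anything is a design choice of this sketch, not something the documentation confirms.

```python
def expand_benchmarks_sketch(requested, registry):
    """Resolve the requested benchmark names against a registry."""
    if requested == ["all"]:
        # The sentinel expands to every registered task, in registry order.
        return list(registry)
    unknown = [name for name in requested if name not in registry]
    if unknown:
        raise ValueError(f"Unknown benchmark(s): {unknown}")
    return list(requested)


def run_benchmark_jobs_sketch(requested, registry, submit):
    """Submit each resolved benchmark sequentially via the submit callback."""
    for name in expand_benchmarks_sketch(requested, registry):
        submit(name)


# Hypothetical registry contents, used only for demonstration.
demo_registry = {
    "task_a": "lighteval|task_a|0|0",
    "task_b": "extended|task_b|0|0",
}

submitted = []
run_benchmark_jobs_sketch(["all"], demo_registry, submitted.append)
print(submitted)  # ['task_a', 'task_b']
```

Passing the submission step as a callback keeps the sketch runnable without a Slurm cluster.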