maxent_grpo.core.evaluation
LightEval task registration and Slurm launch utilities.
This module provides helpers to:
- Define a compact string specification per benchmark task and register common tasks in a dictionary consumable by launchers.
- Compute the proper vLLM Slurm submission command and spawn evaluations as jobs using subprocess.run.
It also exposes SUPPORTED_BENCHMARKS and convenience functions to list registered tasks. Launching vLLM on Slurm requires a specific environment bootstrap (see VLLM_SLURM_PREFIX) that sources system profiles and sets $HOME.
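As a rough illustration of spawning a job with subprocess.run, the sketch below shows how a non-zero exit status surfaces as CalledProcessError when check=True is passed. It is not the module's code: the helper name submit is hypothetical, the use of check=True is an assumption, and a child Python process stands in for the real sbatch call so the example runs without a Slurm cluster.

```python
import subprocess
import sys


def submit(cmd):
    """Hypothetical helper: run a command and return its exit status."""
    # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
    return subprocess.run(cmd, check=True).returncode


# Stand-in for a successful submission: a child process that exits with status 0.
ok_status = submit([sys.executable, "-c", "pass"])

# Stand-in for a failing sbatch call: a child process that exits with status 3.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
try:
    submit(failing_cmd)
    failed_status = 0
except subprocess.CalledProcessError as err:
    # The exception carries the exit status of the failed command.
    failed_status = err.returncode
```

Passing the command as a list avoids shell interpretation of its arguments; a real launcher that builds a single shell string would need to quote its pieces accordingly.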
Functions
- Return sbatch GPU flag(s) based on environment policy.
- get_lighteval_tasks: Return the list of registered LightEval task names.
- register_lighteval_task: Register a LightEval task configuration in configs.
- run_benchmark_jobs: Launch one or more benchmarks as Slurm jobs.
- run_lighteval_job: Launch a LightEval job under Slurm with vLLM decoding.
- maxent_grpo.core.evaluation.register_lighteval_task(configs, eval_suite, task_name, task_list, num_fewshot=0)
Register a LightEval task configuration in configs.
Core tasks table: https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/tasks_table.jsonl
Custom tasks should live under your project (e.g., tasks/orops/).
- Parameters:
configs (dict[str, str]) – Mapping where the serialized task spec is stored; mutated in place with the new task_name entry.
eval_suite (str) – Suite prefix, e.g. "lighteval" or "extended". This is prepended to every task in task_list.
task_name (str) – Key under which the composed task specification is stored.
task_list (str) – Comma-separated list of task identifiers without suite prefix. Each entry is expanded into {eval_suite}|{task}|{num_fewshot}|0.
num_fewshot (int) – Number of few-shot examples per task (defaults to zero).
- Returns:
None. configs is updated directly.
- Return type:
None
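The expansion described for task_list can be sketched as follows. This is a minimal illustration rather than the module's actual source: the helper name register_task_sketch is hypothetical, and joining the expanded entries with commas (as well as stripping whitespace around each task) is an assumption. Only the per-entry format {eval_suite}|{task}|{num_fewshot}|0 is taken from the parameter description.

```python
def register_task_sketch(configs, eval_suite, task_name, task_list, num_fewshot=0):
    """Hypothetical stand-in that mirrors the documented behaviour."""
    # Expand each comma-separated task into "{eval_suite}|{task}|{num_fewshot}|0".
    # Assumption: surrounding whitespace in each entry is not significant.
    expanded = [
        f"{eval_suite}|{task.strip()}|{num_fewshot}|0"
        for task in task_list.split(",")
    ]
    # Assumption: the expanded entries are joined back with commas.
    configs[task_name] = ",".join(expanded)


configs = {}
register_task_sketch(configs, "lighteval", "demo", "task_a,task_b")
print(configs["demo"])  # lighteval|task_a|0|0,lighteval|task_b|0|0
```

Here "demo", "task_a", and "task_b" are placeholder names chosen for the example, not registered benchmarks.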
- maxent_grpo.core.evaluation.get_lighteval_tasks()
Return the list of registered LightEval task names.
- maxent_grpo.core.evaluation.run_lighteval_job(benchmark, training_args, model_args)
Launch a LightEval job under Slurm with vLLM decoding.
The job command is composed from VLLM_SLURM_PREFIX and a generated task list. For models with >=30B parameters the function enables tensor parallelism; otherwise it defaults to two GPUs to reduce cluster pressure. If system_prompt is provided, it is base64-encoded to avoid quoting issues in the Slurm script.
- Parameters:
benchmark (str) – Registered benchmark key to execute.
training_args (GRPOConfig) – Training configuration providing Hub identifiers and the optional system_prompt for evaluation.
model_args (ModelConfig) – Model configuration controlling trust flags for remote code and general model loading options.
- Returns:
None. A subprocess is spawned for the Slurm submission.
- Return type:
None
- Raises:
KeyError – If benchmark is not present in LIGHTEVAL_TASKS.
CalledProcessError – If sbatch returns a non-zero exit status.
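Two of the behaviours described above, the base64 encoding of system_prompt and the 30B-parameter GPU policy, can be sketched in isolation. The function names encode_system_prompt and choose_gpu_layout are hypothetical, and two details are assumptions rather than documented facts: that large models use every available GPU, and that smaller models run with tensor parallelism disabled.

```python
import base64


def encode_system_prompt(system_prompt):
    """Base64-encode a prompt so quotes and shell metacharacters pass through intact."""
    return base64.b64encode(system_prompt.encode("utf-8")).decode("ascii")


def choose_gpu_layout(param_count, available_gpus):
    """Return (num_gpus, tensor_parallel) following the documented 30B threshold."""
    if param_count >= 30_000_000_000:
        # Large models: enable tensor parallelism.
        # Assumption: all available GPUs are used in this case.
        return available_gpus, True
    # Smaller models: default to two GPUs.
    # Assumption: tensor parallelism stays off on this path.
    return 2, False


# A prompt containing characters that are awkward to quote in a shell script.
prompt = 'Answer "briefly"; don\'t expand $VARS.'
encoded = encode_system_prompt(prompt)

# The receiving side can recover the original text by reversing the encoding.
decoded = base64.b64decode(encoded).decode("utf-8")
```

Because the standard base64 alphabet contains only letters, digits, "+", "/", and "=", the encoded string can be embedded in a Slurm script without further escaping.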
- maxent_grpo.core.evaluation.run_benchmark_jobs(training_args, model_args)
Launch one or more benchmarks as Slurm jobs.
When the CLI requests benchmarks=["all"], the function expands this into every registered LightEval task. Each benchmark is delegated to run_lighteval_job with the provided arguments.
- Parameters:
training_args (GRPOConfig) – Training configuration whose benchmarks field enumerates the tasks to run (or the sentinel "all").
model_args (ModelConfig) – Model configuration forwarded to run_lighteval_job.
- Returns:
None. Jobs are submitted sequentially.
- Return type:
None
- Raises:
ValueError – If an unknown benchmark name is supplied.
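The handling of the "all" sentinel and of unknown names can be sketched as a small pure function. The name expand_benchmarks is hypothetical, and two points are assumptions: that "all" is recognised only when it is the sole entry, and that names are validated before any job is launched (the real function may instead validate each name as it goes).

```python
def expand_benchmarks(requested, registered):
    """Hypothetical helper: resolve a benchmarks list against registered task names."""
    # Assumption: the sentinel applies only when "all" is the single entry.
    if len(requested) == 1 and requested[0] == "all":
        return list(registered)
    for name in requested:
        if name not in registered:
            # Mirrors the documented ValueError for unknown benchmark names.
            raise ValueError(f"Unknown benchmark {name}")
    return list(requested)


# Placeholder task names used only for this example.
registered = ["task_a", "task_b"]
everything = expand_benchmarks(["all"], registered)
subset = expand_benchmarks(["task_b"], registered)
```

In a launcher, each name in the returned list would then be passed to run_lighteval_job in turn.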