maxent_grpo.core.evaluation

LightEval task registration and Slurm launch utilities.

This module provides helpers to:

Define a compact string specification per benchmark task and register common tasks in a dictionary consumable by launchers.
Compute the proper vLLM Slurm submission command and spawn evaluations as jobs using subprocess.run.

It also exposes SUPPORTED_BENCHMARKS and convenience functions to list registered tasks. Launching vLLM on Slurm requires a specific environment bootstrap (see VLLM_SLURM_PREFIX) that sources system profiles and sets $HOME.

Functions

`_build_slurm_gpu_flag`(num_gpus)	Return sbatch GPU flag(s) based on environment policy.
`get_lighteval_tasks`()	Return the list of registered LightEval task names.
`register_lighteval_task`(configs, eval_suite, ...)	Register a LightEval task configuration in `configs`.
`run_benchmark_jobs`(training_args, model_args)	Launch one or more benchmarks as Slurm jobs.
`run_lighteval_job`(benchmark, training_args, ...)	Launch a LightEval job under Slurm with vLLM decoding.

maxent_grpo.core.evaluation.register_lighteval_task(configs, eval_suite, task_name, task_list, num_fewshot=0)

Register a LightEval task configuration in configs.

Core tasks are listed in the LightEval tasks table: https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/tasks_table.jsonl
Custom tasks should live under your project (e.g., tasks/ or ops/).

Parameters:

configs (dict[str, str]) – Mapping where the serialized task spec is stored; mutated in place with the new task_name entry.
eval_suite (str) – Suite prefix, e.g. "lighteval" or "extended". This is prepended to every task in task_list.
task_name (str) – Key under which the composed task specification is stored.
task_list (str) – Comma-separated list of task identifiers without suite prefix. Each entry is expanded into {eval_suite}|{task}|{num_fewshot}|0.
num_fewshot (int) – Number of few-shot examples per task (defaults to zero).

Returns:

None. configs is updated directly.

Return type:

None
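The expansion described for task_list can be sketched as follows. This is a minimal illustration, not the module's actual source: register_task_sketch is a hypothetical stand-in, the task names task_a and task_b are placeholders, and joining the expanded entries with commas is an assumption.

```python
def register_task_sketch(configs, eval_suite, task_name, task_list, num_fewshot=0):
    """Hypothetical stand-in that mirrors the documented expansion."""
    # Expand each bare task identifier into {eval_suite}|{task}|{num_fewshot}|0.
    expanded = [
        f"{eval_suite}|{task.strip()}|{num_fewshot}|0"
        for task in task_list.split(",")
    ]
    # Assumption: the expanded entries are re-joined with commas.
    configs[task_name] = ",".join(expanded)


tasks = {}
register_task_sketch(tasks, "lighteval", "single", "task_a")
register_task_sketch(tasks, "lighteval", "pair", "task_a,task_b", num_fewshot=2)
print(tasks["pair"])
```

Because configs is mutated in place, the caller keeps a single dictionary and registers tasks one after another; nothing is returned.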

maxent_grpo.core.evaluation.get_lighteval_tasks()

Return the list of registered LightEval task names.

Returns:

Available benchmark keys currently registered in LIGHTEVAL_TASKS.

Return type:

list[str]

maxent_grpo.core.evaluation.run_lighteval_job(benchmark, training_args, model_args)

Launch a LightEval job under Slurm with vLLM decoding.

The job command is composed from VLLM_SLURM_PREFIX and a generated task list. For models with >=30B parameters the function enables tensor parallelism; otherwise it defaults to two GPUs to reduce cluster pressure. If system_prompt is provided it is base64-encoded to avoid quoting issues in the Slurm script.
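The GPU policy and prompt encoding described above might look roughly like the sketch below. Everything named here is hypothetical: plan_job_sketch, its param_count and available_gpus arguments, and the choice to use all available GPUs for large models are assumptions made for illustration, not the function's real internals.

```python
import base64

PARAM_THRESHOLD = 30_000_000_000  # the ">=30B parameters" cutoff from the description


def plan_job_sketch(param_count, available_gpus, system_prompt=None):
    """Hypothetical helper: choose GPU settings and encode the prompt."""
    if param_count >= PARAM_THRESHOLD:
        # Assumption: large models use every available GPU with tensor parallelism.
        num_gpus, tensor_parallel = available_gpus, True
    else:
        # Smaller models fall back to two GPUs, as the description states.
        num_gpus, tensor_parallel = 2, False

    encoded_prompt = None
    if system_prompt is not None:
        # The base64 alphabet contains no quotes or whitespace, so the value
        # can be passed through a shell script without extra escaping.
        encoded_prompt = base64.b64encode(system_prompt.encode("utf-8")).decode("ascii")

    return {
        "num_gpus": num_gpus,
        "tensor_parallel": tensor_parallel,
        "system_prompt_b64": encoded_prompt,
    }


prompt = 'Answer briefly. Use "quotes" and $VARS freely.'
small = plan_job_sketch(7_000_000_000, available_gpus=8, system_prompt=prompt)
large = plan_job_sketch(70_000_000_000, available_gpus=8)
decoded = base64.b64decode(small["system_prompt_b64"]).decode("utf-8")
```

Decoding on the receiving side with base64.b64decode recovers the original prompt exactly, which is why the encoding sidesteps shell quoting problems.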

Parameters:

benchmark (str) – Registered benchmark key to execute.
training_args (GRPOConfig) – Training configuration providing Hub identifiers and the optional system_prompt for evaluation.
model_args (ModelConfig) – Model configuration controlling trust flags for remote code and general model loading options.

Returns:

None. A subprocess is spawned for the Slurm submission.

Return type:

None

Raises:

KeyError – If benchmark is not present in LIGHTEVAL_TASKS.
CalledProcessError – If sbatch returns a non-zero exit status.
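A caller that wants to survive a failed submission can catch the error. The sketch below assumes the submission goes through subprocess.run with check=True, which is what makes a non-zero exit status surface as subprocess.CalledProcessError. submit_sketch is a hypothetical wrapper, and a short Python child process stands in for sbatch so the example runs without a Slurm cluster.

```python
import subprocess
import sys


def submit_sketch(cmd):
    """Hypothetical wrapper: run a command and return (ok, exit_code)."""
    try:
        # check=True makes subprocess.run raise on a non-zero exit status.
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as err:
        return False, err.returncode
    return True, 0


# Stand-ins for an sbatch invocation: one exits with status 3, one succeeds.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
passing = [sys.executable, "-c", "pass"]

failed_result = submit_sketch(failing)
passed_result = submit_sketch(passing)
```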

maxent_grpo.core.evaluation.run_benchmark_jobs(training_args, model_args)

Launch one or more benchmarks as Slurm jobs.

When the CLI requests benchmarks=["all"] the function expands this into every registered LightEval task. Each benchmark is delegated to run_lighteval_job with the provided arguments.

Parameters:

training_args (GRPOConfig) – Training configuration whose benchmarks field enumerates the tasks to run (or the sentinel "all").
model_args (ModelConfig) – Model configuration forwarded to run_lighteval_job.

Returns:

None. Jobs are submitted sequentially.

Return type:

None

Raises:

ValueError – If an unknown benchmark name is supplied.
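The "all" expansion and the unknown-name check can be illustrated with a small sketch. expand_benchmarks_sketch and the registry contents are hypothetical, and validating every name before launching anything is an assumption; the real function may raise only when it reaches the unknown entry.

```python
def expand_benchmarks_sketch(requested, registered):
    """Hypothetical helper: resolve the benchmark names to launch."""
    if requested == ["all"]:
        # The sentinel expands to every registered task name.
        return list(registered)
    unknown = [name for name in requested if name not in registered]
    if unknown:
        raise ValueError(f"Unknown benchmark(s): {unknown}")
    return list(requested)


registry = {"task_a": "lighteval|task_a|0|0", "task_b": "lighteval|task_b|0|0"}

everything = expand_benchmarks_sketch(["all"], registry)
subset = expand_benchmarks_sketch(["task_b"], registry)

try:
    expand_benchmarks_sketch(["task_c"], registry)
    rejected = False
except ValueError:
    rejected = True
```

Each resolved name would then be passed to run_lighteval_job in turn, matching the sequential submission described above.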