Evaluation
This project supports LightEval benchmarks with vLLM decoding and optional Slurm scheduling.
Benchmarks
Built‑in tasks are registered in core.evaluation.LIGHTEVAL_TASKS:
math_500, aime24, aime25, gpqa:diamond
LCB code generation variants (extended suite)
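As a rough illustration of what a registry like this might look like, the sketch below maps short benchmark names to LightEval-style task strings. The `register_task` helper, the `EXAMPLE_TASKS` name, the `lighteval` suite label, and the `suite|task|few_shot|truncate` string layout are all assumptions made for illustration; the actual entries live in core.evaluation.LIGHTEVAL_TASKS.

```python
from typing import Dict

# Hypothetical stand-in for the project's registry; not the real LIGHTEVAL_TASKS.
EXAMPLE_TASKS: Dict[str, str] = {}


def register_task(
    registry: Dict[str, str],
    suite: str,
    name: str,
    task: str,
    num_fewshot: int = 0,
) -> None:
    """Store a task string under a short benchmark name.

    The "suite|task|few_shot|0" layout is an assumption for this sketch.
    """
    registry[name] = f"{suite}|{task}|{num_fewshot}|0"


# Register the four built-in names listed above under an assumed suite label.
for benchmark in ("math_500", "aime24", "aime25", "gpqa:diamond"):
    register_task(EXAMPLE_TASKS, "lighteval", benchmark, benchmark)

print(sorted(EXAMPLE_TASKS))
```

Keeping the registry as a plain dictionary means the short names stay stable for callers even if the underlying task strings change.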
Launching Jobs
The helper core.evaluation.run_benchmark_jobs(training_args, model_args) resolves the requested benchmark names (or all registered benchmarks) and submits the evaluation jobs via Slurm, using the vLLM OpenAI-compatible server for decoding.
Typical flow:
Train your model (or pick a Hub model id)
Ensure vLLM is available on your cluster image
From Python (or from within your own pipeline), call:
from core.evaluation import run_benchmark_jobs
# training_args.hub_model_id / hub_model_revision drive evaluation targets
run_benchmark_jobs(training_args, model_args)
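The name-resolution step mentioned above can be sketched as follows. This is a minimal illustration, not the project's implementation: `resolve_benchmarks`, the `["all"]` convention, and the demo registry are hypothetical.

```python
from typing import Dict, List


def resolve_benchmarks(requested: List[str], registry: Dict[str, str]) -> List[str]:
    """Expand ["all"] to every registered name; otherwise validate each name."""
    if requested == ["all"]:
        return list(registry)
    unknown = [name for name in requested if name not in registry]
    if unknown:
        raise ValueError(f"Unknown benchmarks: {unknown}")
    return list(requested)


# Hypothetical registry with placeholder task strings.
demo_registry = {"math_500": "task-string-a", "aime24": "task-string-b"}

print(resolve_benchmarks(["all"], demo_registry))
print(resolve_benchmarks(["aime24"], demo_registry))
```

Failing fast on unknown names, before any job is submitted, avoids queueing a Slurm job that would only fail once it starts.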
Notes
For large models (≥ 30B) or MATH‑heavy runs, the job script increases GPU count and uses tensor parallelism.
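The resource rule in the note above might be expressed roughly as below. Only the 30B threshold comes from this page; the `choose_resources` function, the GPU counts (1 and 8), and detecting math-heavy runs by a substring match on the benchmark name are assumptions for illustration.

```python
from dataclasses import dataclass

# The ">= 30B" threshold from the note above.
LARGE_MODEL_PARAMS = 30_000_000_000


@dataclass
class EvalResources:
    num_gpus: int
    tensor_parallel: bool


def choose_resources(
    param_count: int,
    benchmark: str,
    base_gpus: int = 1,    # assumed default; site-specific in practice
    scaled_gpus: int = 8,  # assumed scaled-up count; site-specific in practice
) -> EvalResources:
    """Scale up GPUs and enable tensor parallelism for large or math-heavy runs."""
    # Assumption: a benchmark counts as math-heavy if its name contains "math".
    math_heavy = "math" in benchmark.lower()
    if param_count >= LARGE_MODEL_PARAMS or math_heavy:
        return EvalResources(num_gpus=scaled_gpus, tensor_parallel=True)
    return EvalResources(num_gpus=base_gpus, tensor_parallel=False)


print(choose_resources(7_000_000_000, "aime24"))
print(choose_resources(70_000_000_000, "aime24"))
```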
Set MAXENT_EVAL_SLURM_SCRIPT to your site-specific evaluation launcher (default path is ops/slurm/evaluate.slurm) if you want to customize cluster resources.
To evaluate a single suite locally without Slurm, adapt run_lighteval_job to spawn vllm and lighteval processes on your workstation.
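A minimal sketch of how the MAXENT_EVAL_SLURM_SCRIPT override could be resolved and turned into a submission command. The environment variable and default path come from this page; `resolve_slurm_script`, `build_sbatch_command`, and the choice of `sbatch` options are hypothetical, and the command is only built here, not executed.

```python
import os
from typing import List, Mapping

# Default launcher path stated in the note above.
DEFAULT_SLURM_SCRIPT = "ops/slurm/evaluate.slurm"


def resolve_slurm_script(env: Mapping[str, str] = os.environ) -> str:
    """Return the launcher path, preferring MAXENT_EVAL_SLURM_SCRIPT when set."""
    return env.get("MAXENT_EVAL_SLURM_SCRIPT") or DEFAULT_SLURM_SCRIPT


def build_sbatch_command(
    job_name: str,
    num_gpus: int,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Build (but do not run) an sbatch command for the resolved launcher."""
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--gres=gpu:{num_gpus}",
        resolve_slurm_script(env),
    ]


# An empty mapping falls back to the default launcher path.
print(build_sbatch_command("eval-aime24", 1, env={}))
```

Passing the environment as a mapping keeps the lookup easy to test without touching the real process environment.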