Evaluation ========== This project supports LightEval benchmarks with vLLM decoding and optional Slurm scheduling. Benchmarks ---------- Built‑in tasks are registered in ``core.evaluation.LIGHTEVAL_TASKS``: - ``math_500``, ``aime24``, ``aime25``, ``gpqa:diamond`` - LCB code generation variants (extended suite) Launching Jobs -------------- The helper ``core.evaluation.run_benchmark_jobs(training_args, model_args)`` resolves the requested benchmark names (or ``all``) and submits jobs via Slurm using the vLLM OpenAI server. Typical flow: 1) Train your model (or pick a Hub model id) 2) Ensure vLLM is available on your cluster image 3) From Python (or integrate in your pipeline), call: .. code-block:: python from core.evaluation import run_benchmark_jobs # training_args.hub_model_id / hub_model_revision drive evaluation targets run_benchmark_jobs(training_args, model_args) Notes ----- - For large models (≥ 30B) or MATH‑heavy runs, the job script increases GPU count and uses tensor parallelism. - Set ``MAXENT_EVAL_SLURM_SCRIPT`` to your site-specific evaluation launcher (default path is ``ops/slurm/evaluate.slurm``) if you want to customize cluster resources. - To evaluate a single suite locally without Slurm, adapt ``run_lighteval_job`` to spawn ``vllm`` and ``lighteval`` processes on your workstation.