Student Technical Paper
Technical Paper
Slurm Simulator: Improving Slurm Scheduler Performance on Large HPC systems by Utilization of Multiple Controllers and Node Sharing
Event Type
Student Technical Paper
Technical Paper
Facilitation Tags
HPC Facilitation
Time
Tuesday, July 24, 10:30am - 10:45am
Description
A Slurm simulator was used to study the potential benefits of using multiple Slurm controllers and node sharing on the TACC Stampede 2 system. Splitting a large cluster into smaller sub-clusters with separate Slurm controllers can offer better scheduling performance and responsiveness, because the added computational capacity makes the backfill scheduler more efficient. The disadvantages are additional hardware, more maintenance, and the inability to run jobs across sub-clusters. Node sharing can increase system throughput by allowing several sub-node jobs to run on the same node. However, node sharing is more computationally demanding for the scheduler and might not be advantageous on larger systems. The Slurm simulator makes it possible to estimate the benefits to be expected from these configurations before they are deployed. In this work, multiple Slurm controllers and node sharing were tested on a simulated TACC Stampede 2 system consisting of two distinct node types: 4,200 Intel Xeon Phi Knights Landing (KNL) nodes and 1,736 Intel Xeon Skylake-X (SLX) nodes. For this system, using separate controllers for the KNL and SLX nodes, with node sharing allowed on the SLX nodes, resulted in a 40% reduction in waiting times for jobs executed on the SLX nodes. This improvement can be attributed to better performance of the backfill scheduler: it scheduled 30% more SLX jobs, showed a 30% reduction in the fraction of cycles that hit the time limit, and nearly doubled the number of job scheduling attempts.
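
As a rough illustration of why node sharing can raise throughput, the sketch below compares how many nodes a batch of sub-node jobs occupies with and without sharing. The job sizes, the 48-core node size, the function names, and the first-fit-decreasing packing are all assumptions made for this example; it is not Slurm's scheduling algorithm and not the simulator described in the paper.

```python
from typing import List


def nodes_needed_exclusive(job_cores: List[int]) -> int:
    """Without node sharing, each job occupies a whole node."""
    return len(job_cores)


def nodes_needed_shared(job_cores: List[int], cores_per_node: int) -> int:
    """With node sharing, pack jobs onto nodes by core count.

    Uses first-fit decreasing, a simple packing heuristic chosen
    here for illustration only (hypothetical, not Slurm's method).
    """
    free_cores: List[int] = []  # remaining free cores on each node in use
    for cores in sorted(job_cores, reverse=True):
        if cores > cores_per_node:
            raise ValueError("job does not fit on a single node")
        for i, free in enumerate(free_cores):
            if cores <= free:
                free_cores[i] -= cores  # place job on an already used node
                break
        else:
            # No used node has room, so start a new one.
            free_cores.append(cores_per_node - cores)
    return len(free_cores)


# Hypothetical batch of sub-node jobs (cores requested per job).
jobs = [4, 8, 12, 16, 24, 24, 8]
exclusive = nodes_needed_exclusive(jobs)
shared = nodes_needed_shared(jobs, cores_per_node=48)
print(f"exclusive nodes: {exclusive}, shared nodes: {shared}")
```

The nodes freed by packing are available to other jobs, which is the throughput gain the abstract refers to. The extra search over partially filled nodes also hints at why node sharing costs the scheduler more computation.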