Towards Scalable Distributed Workload Manager with Monitoring-Based Weakly Consistent Resource Stealing

Authors:
  • Wang, Ke
  • Zhou, Xiaobing
  • Qiao, Kan
  • Lang, Michael
  • McClelland, Benjamin
  • Raicu, Ioan
One way to use the coming exascale machines efficiently is to support a mixture of applications from various domains, such as traditional large-scale HPC, ensemble runs, and fine-grained many-task computing (MTC). The need to deliver high performance in resource allocation, scheduling, and launching for all types of jobs has driven us to develop Slurm++, a distributed workload manager extended directly from the centralized Slurm production system. Slurm++ employs multiple controllers, each managing a partition of compute nodes and participating in resource allocation through resource balancing techniques. In this paper, we propose a monitoring-based weakly consistent resource stealing technique to achieve resource balancing in distributed HPC job launch, and implement the technique in Slurm++. We compare Slurm++ with Slurm using microbenchmark workloads with different job sizes. Slurm++ was 10X faster than Slurm in allocating resources and launching jobs, and we expect the performance gap to grow as job sizes and system scales increase in future high-end computing systems.
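
The general idea of stealing resources on the basis of monitored, possibly stale state can be illustrated with a small sketch. This is not the Slurm++ implementation, which the abstract does not detail; the names Controller, take_snapshot, and allocate are hypothetical, and the sketch assumes that a controller short of free nodes ranks its peers by a previously recorded snapshot of free-node counts, while each peer answers from its true current state.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller that owns one partition of compute nodes."""
    name: str
    free: set = field(default_factory=set)

    def release(self, wanted):
        """Hand over up to `wanted` free nodes; may return fewer than asked."""
        given = set(sorted(self.free)[:wanted])
        self.free -= given
        return given


def take_snapshot(controllers):
    """Monitoring step: record free-node counts, which may go stale later."""
    return {c.name: len(c.free) for c in controllers}


def allocate(local, peers, snapshot, job_size):
    """Allocate job_size nodes, stealing from peers ranked by the snapshot."""
    granted = local.release(job_size)
    # Rank victims by the (possibly outdated) monitored counts, largest first.
    ranked = sorted(peers, key=lambda p: snapshot.get(p.name, 0), reverse=True)
    for victim in ranked:
        if len(granted) == job_size:
            break
        # The victim answers from its true state, so a stale view only costs
        # an extra attempt; it never hands out a node twice.
        granted |= victim.release(job_size - len(granted))
    if len(granted) < job_size:
        # Assumed simplification: on failure the local controller keeps
        # whatever nodes it gathered instead of returning them to their owners.
        local.free |= granted
        return None
    return granted


a = Controller("a", {"a0", "a1"})
b = Controller("b", {"b0", "b1", "b2", "b3"})
c = Controller("c", {"c0"})
snapshot = take_snapshot([a, b, c])
b.release(3)  # b's state changes after the snapshot, so the view is stale
job = allocate(a, [b, c], snapshot, 4)
```

In this example the snapshot still reports four free nodes on controller b although only one remains, so the request to b yields less than expected and the allocation is completed from controller c. Tolerating that kind of outdated view, rather than synchronizing state on every request, is what the sketch means by weak consistency.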

© 2018 New Mexico Consortium