“Looking beyond GPUs for DNN Scheduling on Multi-Tenant Clusters” paper summary

Introduction

Training deep learning models is a computation- and memory-intensive task. Enterprises and research and development teams therefore share GPU clusters for this purpose. These clusters typically run a resource manager and scheduler (e.g., SLURM, LSF, Kubernetes, or Apache YARN) that receives jobs and allocates GPUs, CPUs, and system memory to the tasks submitted by different users, who may belong to the same team or to different teams. The common practice of these resource managers is to treat GPUs as the dominant resource and to allocate CPU and system memory (DRAM) proportionally to the number of GPUs a job requests. The authors of [1] propose a resource-sensitive scheduler for shared GPU clusters. This scheduler performs offline profiling to detect a job's sensitivity to its proportional CPU and memory allocation. Their evaluation shows that workload-aware CPU and memory allocation improves job completion time by up to 3.4x and yields higher cluster resource utilization.
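To make this baseline concrete, here is a minimal Python sketch of GPU-proportional allocation. The server shape matches the hardware described later in the paper's evaluation, but the class and function names are assumptions for illustration, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Server:
    gpus: int = 8
    cpus: int = 24
    dram_gb: int = 500

def proportional_share(job_gpus: int, server: Server) -> tuple[float, float]:
    """GPU-proportional policy: a job asking for k of the server's G GPUs
    gets k/G of the server's CPUs and DRAM, regardless of what the
    workload actually needs."""
    frac = job_gpus / server.gpus
    return frac * server.cpus, frac * server.dram_gb

# A 2-GPU job on an 8-GPU server gets 6 CPUs and 125 GB of DRAM,
# whether its input pipeline needs them or not.
print(proportional_share(2, Server()))  # (6.0, 125.0)
```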

Motivation and proposed mechanism

The training phase of deep learning models exhibits varied sensitivity to the amount of system resources (CPU cores and system memory, or DRAM). The paper shows that some vision models run up to 3x faster when given more system resources than the proportional amount, while language models like GNMT remain unaffected by the extra resources. This observation is the motivation for the proposal. The following figure shows the effect of resource allocation.

[Figure: job throughput under varying CPU and memory allocations; credit [1]]

Based on this motivation, the authors propose a scheduling mechanism that takes each task's sensitivity to CPU and memory allocation into account. The mechanism gives a job less than its proportional share when profiling shows the job will not suffer, freeing those resources for jobs that benefit from more. The profiling is done offline, before jobs execute, and is kept cheap in both time and resources by exploiting the predictable, repetitive nature of deep learning training iterations.
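The paper's actual profiling procedure is more involved, but the sketch below conveys the core idea: because training iterations are repetitive, a short run under each candidate CPU allocation predicts steady-state throughput, and the scheduler can then pick the smallest allocation that stays near peak speed. The run_iters interface, the candidate core counts, and the 10% slack threshold are all assumptions for illustration:

```python
def profile_cpu_sensitivity(run_iters, cpu_options, warmup=5, iters=50):
    """Offline profiling sketch (assumed interface, not the paper's code).
    run_iters(cpus, n) -> elapsed seconds for n training iterations
    when the job is restricted to `cpus` CPU cores."""
    profile = {}
    for cpus in cpu_options:
        run_iters(cpus, warmup)          # discard warmup iterations
        elapsed = run_iters(cpus, iters)
        profile[cpus] = iters / elapsed  # iterations per second
    return profile

def min_cpus_within(profile, slack=0.1):
    """Smallest CPU allocation whose throughput is within `slack` of the best."""
    best = max(profile.values())
    return min(c for c, tput in profile.items() if tput >= (1 - slack) * best)

# Toy usage: a simulated job whose input pipeline saturates at 12 cores.
def fake_run_iters(cpus, n):
    return n / min(cpus, 12)  # no speedup beyond 12 cores

prof = profile_cpu_sensitivity(fake_run_iters, cpu_options=[3, 6, 12, 24])
print(min_cpus_within(prof))  # 12 — not the proportional 24
```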

For the scheduling part, the proposal schedules in rounds, similar to other DNN schedulers. In each round it packs onto a server the jobs that fit, based on the information produced during the profiling phase. This is a multi-dimensional bin-packing problem, which is NP-hard, so the authors design and evaluate two heuristic algorithms; their evaluation shows that the second one is the better choice.
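To illustrate what the per-round packing step looks like, here is a simplified greedy first-fit sketch. It is an illustration of the multi-dimensional bin-packing problem, not the paper's heuristics; the job sizes and the largest-GPU-first ordering are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Rsrc:
    gpus: int
    cpus: float
    dram_gb: float

def fits(need: Rsrc, free: Rsrc) -> bool:
    return (need.gpus <= free.gpus and need.cpus <= free.cpus
            and need.dram_gb <= free.dram_gb)

def first_fit(jobs: list[Rsrc], servers: list[Rsrc]) -> dict[int, int]:
    """Greedy first-fit, largest-GPU-first: place each job on the first
    server with enough free GPUs, CPUs, and DRAM. Unplaced jobs wait
    for the next scheduling round."""
    placement = {}
    for j, need in sorted(enumerate(jobs), key=lambda x: -x[1].gpus):
        for s, free in enumerate(servers):
            if fits(need, free):
                placement[j] = s
                free.gpus -= need.gpus
                free.cpus -= need.cpus
                free.dram_gb -= need.dram_gb
                break
    return placement

# One 8-GPU server: the 4-GPU job and the lean 2-GPU job fit, but the
# CPU-heavy 2-GPU job (18 cores) overflows the remaining CPUs and waits.
jobs = [Rsrc(4, 12, 250), Rsrc(2, 18, 100), Rsrc(2, 6, 125)]
print(first_fit(jobs, [Rsrc(8, 24, 500)]))  # {0: 0, 2: 0}
```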

Results and Conclusion

The authors evaluate their proposal on a real-world cluster of 4 servers, each equipped with 8 V100 GPUs, 500 GB of DRAM, and 24 CPU cores. They also evaluate through simulation on two larger clusters: one with 128 GPUs across 16 servers, the other with 512 GPUs across 64 machines. The speedup comes from relieving the system-level bottleneck of workloads that need more CPU and DRAM, which reduces data stalls (iterations in which the GPU waits on the input data pipeline). The scheduler also achieves better utilization of the cluster's system resources.

A presentation video of the paper, given by one of its authors, can be found in the reference section.

Reference

[1] Mohan, Jayashree, et al. “Looking beyond GPUs for DNN Scheduling on Multi-Tenant Clusters.” 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 2022.
