“AntMan: Dynamic Scaling on GPU Clusters for Deep Learning” paper summary

Introduction

GPUs, the main accelerators for deep learning training tasks, suffer from under-utilization. Under-utilization first of all means spending money on a GPU that is not working for us at full capacity. Second, it means wasting energy. Third, if the hardware were used better, it could deliver higher performance! The authors of AntMan [1] propose a deep learning infrastructure that co-designs cluster schedulers (e.g., Kubernetes, SLURM, LSF) with deep learning frameworks (e.g., PyTorch, TensorFlow). It was built, deployed and tested at Alibaba. Their motivation for this work was the very low GPU utilization they observed on Alibaba's clusters. The mechanism uses spare GPU resources to co-execute multiple jobs on a shared GPU and increase GPU utilization. Building on the common behavior of deep learning training tasks, they propose dynamic scaling mechanisms for memory and computation inside the frameworks, which enable fine-grained coordination between jobs’ kernels and prevent interference between them. Their bottom-line evaluation results show 42% and 34% utilization improvements for GPU memory and GPU computation units, respectively.

Motivation and the proposal

They observe strikingly low GPU utilization on Alibaba’s servers and explain the reasons behind that observation, which motivates them to tackle the problem. The GPU under-utilization stems from the following causes:

  1. Some tasks are so small that they can never saturate a giant modern GPU’s memory and compute resources.
  2. Deep learning training tasks are a mixture of different execution phases, some of which are not parallel in nature, e.g., graph sampling in graph neural networks, feature extraction in advertisement models, and data augmentation in vision.
  3. When training is scaled out in a distributed manner, the gang scheduling required by synchronous gradient descent forces already-allocated GPUs to wait until all of the required resources are free and ready to fire.

The following figure shows how GPUs were utilized in their study.

credit [1] — GPU memory is normalized by the memory capacity

When the dataset is enormous, distributed training is the common practice. It means using more than one GPU for training, which requires gang scheduling. Under this scheduling, a job will not start training until all required GPUs are simultaneously available, which results in GPU idleness. The following figure shows the average idle time for GPUs; it increases as the number of GPUs involved in the training increases. The paper mentions that other jobs could be launched during those idle times, but this can hurt scheduling fairness. It also notes that elastic training, as an alternative, is rarely used in production because of the non-determinism it introduces.

credit [1]

They also mention that careless co-execution of training tasks can cause resource interference between jobs, which can lead to drastic performance degradation. Moreover, if there is contention on GPU memory, some jobs might even crash.

The main idea is to make the scheduler aware of framework-level information about a model’s fluctuating resource usage, so that the spare resources can be used by co-executing other jobs. They guarantee resources for jobs with higher priority, and schedule other jobs onto the spare resources to increase cluster utilization.

The authors observe that deep learning models on a production cluster are mostly small, and that per-mini-batch computation and memory usage follows a recurring pattern. With these observations, they built the AntMan system to schedule in a fine-grained manner at every mini-batch boundary.
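The mini-batch boundary matters because it is the natural point at which a framework can pause, measure, and report. Below is a minimal PyTorch sketch of that idea; `local_coordinator` and its `report` method are hypothetical stand-ins for AntMan’s local component, not the paper’s actual API.

```python
import time
import torch

# Minimal sketch (not AntMan's real API): a PyTorch training loop that reports
# per-mini-batch duration and peak GPU memory at every mini-batch boundary,
# the granularity at which AntMan makes its fine-grained decisions.
# `local_coordinator` and its `report` method are hypothetical stand-ins.
def train(model, loader, optimizer, loss_fn, local_coordinator):
    for inputs, targets in loader:
        torch.cuda.reset_peak_memory_stats()
        start = time.time()

        optimizer.zero_grad()
        loss = loss_fn(model(inputs.cuda()), targets.cuda())
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()  # make the timing reflect finished GPU work

        # Because consecutive mini-batches repeat the same pattern,
        # these numbers are a good predictor of the next iteration.
        local_coordinator.report(
            batch_time=time.time() - start,
            peak_memory_bytes=torch.cuda.max_memory_allocated(),
        )
```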

AntMan manages GPU memory cooperatively with CPU memory. It moves tensors between GPU and CPU memory to prevent failures due to GPU memory shortage. The following figure shows the difference between existing deep learning frameworks and AntMan.

credit [1]
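The memory side of this co-design can be pictured with a short sketch. The code below only illustrates the “shrink by spilling to host memory, grow by moving back” idea under an assumed per-job memory budget; `cached_tensors` and `memory_budget_bytes` are hypothetical names, and the real AntMan logic lives inside the framework’s memory allocator rather than in user code.

```python
import torch

# Illustrative sketch, not AntMan's implementation: when the GPU memory budget
# given to a job shrinks, spill cached tensors to CPU memory instead of
# crashing with an out-of-memory error; when the budget grows, move them back.
def rescale_gpu_memory(cached_tensors, memory_budget_bytes):
    if torch.cuda.memory_allocated() > memory_budget_bytes:
        # Shrink: evict tensors to host memory until the job fits its budget.
        for i, t in enumerate(cached_tensors):
            if torch.cuda.memory_allocated() <= memory_budget_bytes:
                break
            if t.is_cuda:
                cached_tensors[i] = t.cpu()
    else:
        # Grow: bring evicted tensors back while there is headroom.
        for i, t in enumerate(cached_tensors):
            size = t.element_size() * t.nelement()
            if t.is_cuda or torch.cuda.memory_allocated() + size > memory_budget_bytes:
                continue
            cached_tensors[i] = t.cuda()
```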

For managing GPU computation units, AntMan introduces a GPU operator manager (GPUOpManager) inside the deep learning framework. When a GPU operator is ready to execute, instead of being launched on the GPU directly, it is first added to this unit. The GPUOpManager monitors the GPU and launches queued operators in idle GPU time slots, using them to better utilize the hardware. The following figure shows the idea and contrasts (a) the exclusive policy, (b) naive sharing, and (c) the proposed unit.

credit [1]
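A toy version of the GPUOpManager idea can make this concrete. This is a simplified sketch under assumptions that are not in the paper: operators are plain callables, `get_gpu_utilization` is a stand-in for a real monitor (e.g., one built on NVML), and the idle threshold is purely illustrative.

```python
import queue
import threading
import time

# Simplified, hypothetical sketch of the GPUOpManager idea: operators from an
# opportunistic job are queued instead of launched immediately, and a
# background thread issues them only when the GPU currently has an idle slot.
class GPUOpManager:
    def __init__(self, get_gpu_utilization, idle_threshold=0.6):
        self.ops = queue.Queue()
        self.get_gpu_utilization = get_gpu_utilization  # e.g., an NVML-based probe
        self.idle_threshold = idle_threshold
        threading.Thread(target=self._launch_loop, daemon=True).start()

    def submit(self, op):
        # Called by the framework when a GPU operator is ready to execute.
        self.ops.put(op)

    def _launch_loop(self):
        while True:
            op = self.ops.get()
            # Back off while the resource-guarantee job is keeping the GPU busy.
            while self.get_gpu_utilization() > self.idle_threshold:
                time.sleep(0.001)
            op()  # launch the operator's GPU kernel(s)
```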

This paper classifies jobs as resource-guarantee jobs and opportunistic jobs. The performance of the first kind is guaranteed, as they are scheduled exclusively on GPUs. The second kind, on the other hand, exists to get more out of the cluster. AntMan has a hierarchical design in which a global scheduler makes decisions based on hardware statistics, while local coordinators use information provided by the deep learning frameworks to scale jobs and provide higher utilization. The following figure shows how AntMan orchestrates deep learning training tasks.

credit [1]
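To make the two job classes concrete, here is a toy placement policy. It is not the paper’s scheduling algorithm, only a greedy sketch of “guarantee first, then pack opportunistic jobs into spare capacity,” with simplified per-GPU statistics of the kind the local coordinators would report.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Toy sketch of the two job classes, not the paper's scheduling algorithm.
@dataclass
class Gpu:
    free_memory_gb: float                  # spare memory reported by the local coordinator
    exclusive: bool = False                # held by a resource-guarantee job
    opportunistic: List[str] = field(default_factory=list)

def place(job_name: str, job_kind: str, demand_gb: float, gpus: List[Gpu]) -> Optional[Gpu]:
    if job_kind == "resource-guarantee":
        # Gets a GPU that no other resource-guarantee job already holds.
        for gpu in gpus:
            if not gpu.exclusive:
                gpu.exclusive = True
                return gpu
    else:  # opportunistic: squeeze into any GPU with enough spare memory
        for gpu in gpus:
            if gpu.free_memory_gb >= demand_gb:
                gpu.opportunistic.append(job_name)
                gpu.free_memory_gb -= demand_gb
                return gpu
    return None  # keep the job queued until capacity appears
```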

Evaluation and results

AntMan is implemented in TensorFlow and PyTorch on top of the Kubernetes scheduler. For more details, reading the paper is suggested. The evaluations show, on average, 42% and 34% utilization improvements for GPU memory and GPU computation units, respectively.

The video presentation of this paper by one of its authors can be found below:

Reference

[1] Xiao, Wencong, et al. “AntMan: Dynamic Scaling on GPU Clusters for Deep Learning.” 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.
