"A Study of Checkpointing in Large Scale Training of Deep Neural Networks" paper summary

Introduction

Deep learning workloads typically demand large amounts of computation and memory, and much of that computation is embarrassingly parallel. This makes high-performance computing (HPC) systems a natural target for them. The paper observes that distributed training has been made easy by deep learning frameworks, but that fault tolerance has not received the same attention. The authors therefore experiment with and evaluate checkpointing, a common fault-tolerance practice in HPC systems, in three deep learning frameworks (Chainer, PyTorch, and TensorFlow). The evaluation covers computational cost, file formats, file sizes, the impact of scale, and deterministic checkpointing. Based on their experiments, the authors provide the community with tips on using checkpointing for deep learning training on HPC systems.

Motivation

Deep learning training is a long-running process with high computing and memory demands. HPC systems, being distributed, are susceptible to hardware and software failures that can throw away everything the network has learned so far. To deal with this, deep learning frameworks implement checkpointing as a fault-tolerance mechanism: the training state is saved periodically and restored if a failure happens. In deep learning, however, checkpointing is not only a fault-tolerance mechanism; practices such as transfer learning also rely on it.

However, checkpointing has received little attention from the deep learning community, and the mechanisms that exist today are not well suited to HPC systems. The authors argue that a suitable checkpointing mechanism is needed, but that a first step is to survey what is currently available and to characterize its requirements and costs.

Background

Distributed training

Major deep learning frameworks such as PyTorch and TensorFlow support distributed training on GPUs. HPC clusters have become an attractive place to run this training, and users tend to rely on those major frameworks and to target nodes with more than one GPU. In the paper, distributed training is implemented in two different ways:

(1) with native resources of deep learning frameworks (distributed data-parallel)

(2) with an external library such as Horovod [2], which uses all-reduce operations to combine gradient values before applying them to the model weights

Each framework implements distributed training differently: Chainer uses the ChainerMN library, PyTorch provides DistributedDataParallel (DDP), and Horovod wraps the native optimizers of PyTorch and TensorFlow with hvd.DistributedOptimizer(). Using checkpointing in these frameworks is straightforward but not automatic: the programmer has to add explicit instructions to the training code to save (and later restore) the training state.
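
As a rough sketch of what this looks like in practice (not the paper's actual code; the model, file names, and loop structure are hypothetical), a Horovod + PyTorch training script wraps the native optimizer and explicitly saves a checkpoint, typically only from rank 0:

    # Minimal sketch of Horovod + PyTorch training with explicit checkpointing.
    # Hypothetical example, not the paper's benchmark code.
    import torch
    import torchvision
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    model = torchvision.models.resnet50(num_classes=10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Horovod wraps the native optimizer; gradients are combined with all-reduce.
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    checkpoint_freq = 5  # epochs between checkpoints, as in the paper's setup

    for epoch in range(1, 101):
        # ... run one training epoch over the distributed data loader ...
        if epoch % checkpoint_freq == 0 and hvd.rank() == 0:
            # Checkpointing is not automatic: the programmer saves the state explicitly.
            torch.save({"epoch": epoch,
                        "model_state": model.state_dict(),
                        "optimizer_state": optimizer.state_dict()},
                       f"checkpoint_epoch_{epoch}.pt")

On restart, the saved dictionary would be loaded with torch.load, applied with load_state_dict, and broadcast to all ranks before training resumes.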

Methodology

Hardware systems

The systems on which the authors ran their experiments:

  1. Power9 cluster of the MareNostrum supercomputer (Spain): a cluster of 52 nodes, each with 2 IBM Power9 CPUs and 4 NVIDIA V100 16GB GPUs.
  2. ABCI supercomputer (Japan): 1088 FUJITSU PRIMERGY CX2570 M4 server nodes, each with two Intel Xeon Gold 6148 processors and four NVIDIA Tesla V100 16GB GPUs. Within a node, the GPUs are connected to the CPUs through PLX switches and PCIe Gen3 x16 links, and to each other through NVLink.

Experiments

Horovod with PyTorch and TensorFlow is used to implement distributed training.

  1. (on MareNostrum) Evaluates (a) the computational cost of checkpointing, (b) checkpoint file size and format, and (c) the deterministic behavior of the checkpoint-restart mechanisms of the different frameworks. Up to 32 GPUs were used, training ResNet50 on Cifar10 with a batch size of 64 for 100 epochs.
  2. (on ABCI) Evaluates the effect of scale on checkpointing, the behavior of different models, and how each framework performs at scale. Up to 256 GPUs were used, training ResNet50 (25.5M parameters) and VGG16 (128M parameters) on Cifar10 with a batch size of 32 for 100 epochs.

Checkpoint frequency: 5 epochs

Results

Computational cost of checkpointing

The following table shows the execution time of ResNet50 trained on Cifar10 for 100 epochs. Recall that the checkpoint frequency is 5 epochs, meaning the model is saved after every 5 epochs.

[1]

PyTorch scales best, up to 32 GPUs, while TensorFlow shows the lowest checkpointing overhead. TensorFlow's efficiency comes from its adoption of the HDF5 file format, whose serialization process is highly optimized. The results show that checkpointing does not change how the frameworks scale, but its cost is not negligible.
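
For reference, the format difference shows up in how each framework's save call is written (a hypothetical snippet, not the paper's benchmark code): Keras writes HDF5 when the target path ends in .h5 and its native TensorFlow checkpoint format otherwise, while PyTorch uses pickle-based serialization via torch.save.

    # Hypothetical snippet illustrating checkpoint file formats, not the paper's code.
    import tensorflow as tf
    import torch
    import torchvision

    keras_model = tf.keras.applications.ResNet50(weights=None, classes=10)
    keras_model.save_weights("resnet50_weights.h5")  # HDF5 format (path ends in .h5)
    keras_model.save_weights("resnet50_ckpt")        # native TensorFlow checkpoint format

    torch_model = torchvision.models.resnet50(num_classes=10)
    torch.save(torch_model.state_dict(), "resnet50.pt")  # pickle-based PyTorch format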

The following table shows the checkpoint file size and format for each framework studied in the paper. Keep in mind that the model and dataset used can affect the file sizes.

[1]

The following table provides more detail on how these checkpointing mechanisms affect performance as the number of nodes grows.

[1]

The results in the table show that TensorFlow is highly optimized for single-node execution but has not been optimized for large-scale distributed systems. Chainer, on the other hand, was conceived from the beginning to scale, and this can be observed in the results. However, Chainer's checkpointing cost is very high: checkpointing is a sequential operation performed by a single process while all the other processes wait. TensorFlow has the lowest checkpointing cost.
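
To see why a single writer is so costly at scale, consider a minimal sketch (a hypothetical mpi4py illustration, not Chainer's actual implementation): one rank serializes the state while every other rank waits at a synchronization point, so the checkpoint's wall-clock time is paid by all processes.

    # Hypothetical mpi4py illustration of single-writer checkpointing; not Chainer's code.
    import pickle
    import time
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    state = {"weights": [0.0] * 1_000_000}  # stand-in for model + optimizer state

    start = time.time()
    if rank == 0:
        with open("checkpoint.pkl", "wb") as f:  # only one process writes...
            pickle.dump(state, f)
    comm.Barrier()                               # ...while all other ranks sit idle here
    if rank == 0:
        print(f"checkpoint stalled all ranks for {time.time() - start:.2f}s")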

Deterministic checkpointing

The authors modified the training code to make the computations deterministic, meaning that re-running the same training produces the same results; both TensorFlow and PyTorch provide settings for enabling deterministic behavior.
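
As an illustration of what such determinism settings typically look like in PyTorch (a generic sketch, not the authors' exact modifications):

    # Generic PyTorch determinism setup; a sketch, not the authors' exact changes.
    import os
    import random
    import numpy as np
    import torch

    def make_deterministic(seed: int = 42) -> None:
        random.seed(seed)                 # Python RNG
        np.random.seed(seed)              # NumPy RNG
        torch.manual_seed(seed)           # seeds CPU and CUDA RNGs
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
        # Some CUDA ops require this when deterministic algorithms are enforced.
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
        torch.use_deterministic_algorithms(True)

TensorFlow offers analogous controls, for example seeding with tf.random.set_seed and, in recent versions, tf.config.experimental.enable_op_determinism().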

The following figure shows that checkpointing and restarting with PyTorch does not hurt model accuracy or loss.

[1]

The following table shows the cost of checkpointing with determinism enabled for the chosen frameworks.

[1]

Another point to learn from this paper:

Checkpoint implementations come with a significant overhead, since many GPUs remain idle while the checkpoint is being written.

References

[1] Rojas, Elvis, et al. "A Study of Checkpointing in Large Scale Training of Deep Neural Networks." arXiv preprint arXiv:2012.00825 (2020).

[2] https://github.com/horovod/horovod
