Gradient Compression in Distributed Deep Learning


Introduction to Deep Learning

Deep Learning (DL) has burst onto the scene as one of the most powerful tools we currently have for tackling problems in a range of domains, from machine translation and image captioning, to object and image detection, to speech recognition and much more. Making the distinction between deep learning and machine learning is an important one - much in the way that machine learning is one approach to investigating artificial intelligence, deep learning is a single implementation of machine learning. Particularly, DL focuses on artificial neural networks, abstractions of the function of the brain that aim to model data distributions through co-learning of layers of computational nodes, or neurons.

For a long time this concept wasn’t able to gain traction, since deep neural networks (DNNs) have many nuts and bolts to tune, and there just wasn’t the compute power or data available to do it with. However, along with advances in the two (especially practicality of GPU usage), DNNs began to achieve state of the art results in many fields. A landmark point was the emergence of convolutional neural networks (CNNs) in the field of image recognition in 2012 Krizhevsky et al.. Results on the ImageNet task were still taking close to weeks of compute time, but there was a lot of promise.

Today, this hunger for more data and more computation only grows. For example in the field of natural language processing, transformer architectures are shown to have a correlation between performance and model complexity, resulting in an almost arms-race for unthinkably large models that ask for more and more from us… questions of research accessibility of these models and impact of large-scale training on the environment aside, let’s ask the question of how we can even begin to make this kind of problem scalable.

Scaling Up Model Training

Introducing: distributed deep learning. The idea is that in training a DNN, we should be able to save both time and resources by distributing the workload from one single machine, to multiple machines. Teamwork makes the dream work.

There are many important dimensions to talk about when it comes to distributed deep learning. I want to briefly cover the two main aspects here before pushing into the main topic of compression techniques, however please do check out these extensive surveys to delve deeper into further considerations, such as system architecture and gradient aggregation methods. (Mayer et al., Chahal et al., Tang et al.)

Synchronisation

In the standard setting, a single model is trained to learn a set of parameters \(\theta\) that minimise an objective function defined on the task at hand. In doing this, the model should be generalisable - it has learned a way to map data from a given distribution (e.g. the pixels in an image of cats), to the target result (e.g. is it a cat). This process is called optimisation, and a popular algorithm for this is Stochastic Gradient Descent (SGD). Mathematically this iterative method takes the form:

\[\mathbf{\theta}_{t+1} = \mathbf{\theta}_t - \eta \nabla_{\theta} J(\mathbf{x}_t; \theta)\]

for objective function \(J\), and sample of our data \(\boldsymbol{x}_t\) at timestep \(t\). The way this sample is taken already allows us to determine some of the speed with which the model trains: moving from the stochastic setting of one data point to a larger batch size of many data points can lead to computational speedup due to parallelism in GPUs. Perfect! However there is a tradeoff, where this can harm the generalisability of the model.

Parallelism

How Costly is Update Communication?

Pretty costly. For example, the model size of ResNet-50 is over 110MB - calculating gradients for making gradient descent updates contain millions of floating point values, and so the size of these scale proportionally. Considering a standard Ethernet connection can have a speed of around 1Gbps, then this makes scaling up these distributed systems a problem, as communication between workers can prove to be a bottleneck or rate limiting factor in training models at scale, thus under-utilising computing resources.

There are three main methods in tackling reducing communication overhead in distributed deep learning. These are:

Gradient Quantisation

The idea behind gradient quantisation is to reduce the bit-representation of gradients to make them less costly to communicate. There are three popular approaches to quantisation:

Seide et al. demonstrate that gradients can be compressed to just 1-bit (a representation of their sign) and communicated to provide speedups of up to 100 times on a large 160 million parameter model. As with many techniques in the field, this uses an error-feedback scheme, where quantisation errors are accumulated and added to the next batch, which can achieve the same rate of convergence as SGD (Karimireddy et al.).

Wen et al. introduce TernGrad, a scheme that works by quantising on only three numerical levels {-1,0,1}. Unlike 1-bit quantisation, this is applicable also to gradients arising from convolutional layers. The authors use two techniques to be able to additionally prove the convergence of a model under this quantisation, under the assumption of a bound on the gradients.

Alistarh et al. introduce Quantised SGD (QSGD), which is a family of algorithms that are able to give a tradeoff between bits communicated and variance of the compressed gradients. This allows for tradeoff between bandwidth communication and model convergence time. In words, what QSGD does is takes a probabilistic approach on quantisation, using an encoding scheme that emphasises the likelihood of gradient values into its compression. Compared to TernGrad, this method has no hyperparameters and guarantees convergence under standard assumptions only. However, Wen et al. point out that eliminating accuracy loss in QSGD requires at least 16 levels (4-bit) of encoding, whereas this is achieved with only 3 levels using TernGrad.

Gradient Sparsification

The other family of compression techniques is called gradient sparsification, which involves dropping any gradient values which transmit little to no information in parameter updates. Note that the compression rate of previously mentioned quantisation methods is at best 32 times, due to the limitation in how many bits can be reduced from the standard 32-bit gradients. Therefore compression rates can be greatly helped by sparsification.

Aji et al. note that gradient updates are positively skewed, as most updates are near zero. This motivates the use of cheaper communication via sparse matrix transmission, which is done by mapping a large fraction of updates to zero. This fraction threshold is expensive to find, however can be approximated as 99% through sampling.

Strom et al. adopt a similar method, however choose a static threshold value above which gradients should be zero. This resulted in high speed gains of 54 times for only a 1.8% error on a particular speech recognition task, however it should be noted that the threshold was a hyperparameter that was difficult to tune. Dryden et al. on the other hand employ an adaptive threshold technique, which determines a threshold of gradients to communicate every iteration that results in a fixed compression ratio throughout training. This involves extra overhead compute time in sorting gradient vectors, which is addressed in more recent works.

Further Reading

This quick tour of distributed deep learning reflects on the richness in this area for future study. There are a host of more recent exciting advances, from Sketch communication Ivkin et al., to low-rank approximation gradient compression Vogels et al. and more. I’ve even left out of my description the current most used gradient compression technique, Deep Gradient Compression (DGC) Lin et al.. This technique enjoys impressive whole-model gradient compression ratios of up to 600 times(!) using momentum correction and masking to address issues with staleness in error accumulation and feedback.

Top