Calibration Techniques in Deep Neural Networks

Tata Ganesh
Published in Heartbeat · Jun 14, 2023

Introduction

Deep neural network classifiers have been shown to be mis-calibrated [1], i.e., their prediction probabilities are not reliable confidence estimates. For example, if a neural network classifies an image as a “dog” with probability p, p cannot be interpreted as the confidence of the network’s predicted class for the image. Further, neural network classifiers are often overconfident in their predictions [1]. A calibrated neural network classifier outputs class probabilities that match the true correctness likelihood of the class in the ground truth dataset [1]. Thus, if a calibrated classifier labels 100 input images as “dog” with 0.4 probability, then the number of images that are accurately classified as a dog should be approximately 40 (40% of 100) [2]. Calibrating neural networks is especially important in safety-critical applications where reliable confidence estimates are crucial for making informed decisions. Some of these applications include medical diagnosis in healthcare and self-driving cars.

Measuring Miscalibration

Reliability Diagrams

Reliability diagrams depict the gap between accuracy and calibration that results in miscalibration. For example, the figure below shows the reliability diagram for a Resnet-110 model trained on the CIFAR-100 dataset [1].

Reliability Diagram of an uncalibrated Resnet-110. X-axis: Confidence, Y-axis: Accuracy[1]

Guo et al. [1] explain that the reliability diagram is plotted as follows.
For each sample i, we use the following notation: yᵢ denotes the true label, ŷᵢ the predicted label, and p̂ᵢ the prediction probability (i.e., the confidence) of the predicted label [1].

The model predictions for all samples are grouped into M bins based on their prediction probability. Each bin Bₘ is a set of indices of samples whose prediction probability lies in the following interval

Bₘ = { i : p̂ᵢ ∈ ((m − 1)/M, m/M] }

The prediction probability interval for each bin [1]

The accuracy of a bin Bₘ is given by -

acc(Bₘ) = (1/|Bₘ|) ∑_{i ∈ Bₘ} 1(ŷᵢ = yᵢ)

Accuracy of a bin [1]

The confidence of a bin Bₘ is given by -

conf(Bₘ) = (1/|Bₘ|) ∑_{i ∈ Bₘ} p̂ᵢ

Confidence of a bin [1]

Thus, the reliability diagram plots the accuracy of M bins against their confidence. Perfect calibration is indicated by acc(Bₘ) = conf(Bₘ) for all M bins, i.e., the blue bars align perfectly with the diagonal line. The red bars indicate miscalibration, such that

  • If the blue bar is below the diagonal line, the bin’s accuracy is lower than the confidence of the bin. It indicates that the model is over-confident for samples in this bin, and the red bar depicts the amount of over-confidence for that bin.
  • On the other hand, if the blue bar is above the diagonal line, the bin’s accuracy is higher than the bin’s confidence. It indicates that the model is under-confident for samples in this bin, and the red bar depicts the amount of under-confidence for that bin.
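
To make this concrete, a reliability diagram can be built from just two arrays: the confidence of each prediction and whether it was correct. The following is a minimal NumPy/Matplotlib sketch; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(confidences, correct, n_bins=10):
    """Plot per-bin accuracy against confidence.

    confidences: predicted probability of the predicted class, shape (n,)
    correct:     1 if the prediction was right, else 0, shape (n,)
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    accuracies = np.zeros(n_bins)

    for m in range(n_bins):
        # Bin B_m holds samples whose confidence lies in ((m-1)/M, m/M]
        in_bin = (confidences > bin_edges[m]) & (confidences <= bin_edges[m + 1])
        if in_bin.any():
            accuracies[m] = correct[in_bin].mean()

    plt.bar(bin_centers, accuracies, width=1.0 / n_bins,
            edgecolor="black", label="Accuracy per bin")
    plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
    plt.xlabel("Confidence")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()
```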

Expected Calibration Error

Since reliability diagrams are only a visual tool for analyzing model miscalibration, Guo et al. propose the Expected Calibration Error (ECE) as a scalar metric that quantifies calibration error. It is the weighted average of the absolute difference between accuracy and confidence across the M bins, where each bin is weighted by the fraction of the n samples it contains.

ECE = ∑_{m=1}^{M} (|Bₘ|/n) · |acc(Bₘ) − conf(Bₘ)|

Expected Calibration Error [1]

For a perfectly calibrated model, ECE = 0; in practice, however, a perfectly calibrated model is unattainable. Refer to Nixon et al. [10] for additional calibration metrics.

ECE Calculation Example

Let us assume that we have 5 images in our dataset, each with a ground-truth label, a predicted label from a neural network, and the associated prediction probability.

With the number of bins M set to 4, the bins cover the intervals (0, 0.25], (0.25, 0.5], (0.5, 0.75], and (0.75, 1].

The accuracy and confidence of each non-empty bin are computed as defined above.

Hence, the ECE is the weighted sum of the per-bin gaps |acc(Bₘ) − conf(Bₘ)|.
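
Since the example tables are not reproduced here, the sketch below uses hypothetical predictions and labels for the 5 images with M = 4 bins; the function itself directly implements the ECE formula above.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=4):
    """ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)|."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for m in range(n_bins):
        in_bin = (confidences > bin_edges[m]) & (confidences <= bin_edges[m + 1])
        if in_bin.any():
            acc = correct[in_bin].mean()        # acc(B_m)
            conf = confidences[in_bin].mean()   # conf(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece

# Hypothetical example: 5 images, confidence of the predicted class
# and whether each prediction matched the ground truth.
confidences = np.array([0.95, 0.85, 0.70, 0.60, 0.30])
correct = np.array([1, 1, 0, 1, 0])
print(expected_calibration_error(confidences, correct, n_bins=4))
```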

Calibration Techniques

In this section, we explore a few common calibration techniques, including some that were not originally proposed for calibration but have since been shown to calibrate neural networks. Note that we do not cover every calibration technique in the literature.

Platt Scaling

Platt scaling [3] is a post-hoc calibration technique (i.e., applied after training the neural network) that uses the model’s output score (or logit) as an input feature to a logistic regression model that outputs calibrated probabilities. A binary classification model produces a single score, so the calibrated probability is obtained as follows

q̂ᵢ = σ(a·zᵢ + b), where zᵢ is the logit for sample i and a, b are learned scalar parameters

Platt Scaling for a binary classifier [1]

This can also be extended to multi-class classification, a method termed matrix scaling [1]. For a k-class classification problem, a linear transformation is applied to the logit vector zᵢ to obtain k transformed logits.

Wzᵢ + b, where W is a k × k matrix and b is a k-dimensional bias vector

Transformed logits using Matrix Scaling [1]

Using these transformed logits, the calibrated probability and the predicted class are obtained as follows

q̂ᵢ = maxₖ σ_SM(Wzᵢ + b)⁽ᵏ⁾ and ŷᵢ′ = argmaxₖ (Wzᵢ + b)⁽ᵏ⁾

Calibrated probability and predicted class using matrix scaling for a multi-class classifier [1]

Since the number of parameters in W grows quadratically with k, Guo et al. propose vector scaling, in which W is restricted to be a diagonal matrix. The parameters in all variants of Platt scaling are optimized by minimizing the cross-entropy loss on a held-out validation set (sometimes called the held-out calibration set). Note that the network’s weights are not updated when optimizing these parameters.
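
As a rough sketch of the binary case, Platt scaling amounts to fitting a logistic regression on the held-out set’s logits. The snippet below uses scikit-learn with hypothetical logits and labels; the names val_logits, val_labels, and test_logits are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out calibration set: one logit per sample (binary classifier)
# and the corresponding binary ground-truth labels.
val_logits = np.array([2.3, -1.1, 0.4, 3.0, -0.2, 1.7])
val_labels = np.array([1, 0, 1, 1, 0, 0])

# Platt scaling: fit sigmoid(a * z + b) via logistic regression on the logits.
# C is set very large to approximate an unregularized fit.
platt = LogisticRegression(C=1e6)
platt.fit(val_logits.reshape(-1, 1), val_labels)

# Calibrated probabilities for new (hypothetical) test logits.
test_logits = np.array([0.9, -2.5])
calibrated = platt.predict_proba(test_logits.reshape(-1, 1))[:, 1]
print(calibrated)
```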

Temperature Scaling

The temperature parameter T of the softmax function rescales the logits, which changes the shape of the output distribution.

σ_SM(zᵢ/T)⁽ᵏ⁾ = exp(zᵢ⁽ᵏ⁾/T) / ∑ⱼ exp(zᵢ⁽ʲ⁾/T)

Temperature parameter of softmax

T < 1 makes the output distribution peakier (reduces entropy), while T > 1 softens the output distribution (increases entropy). This Cross Validated post [11] provides further intuition behind T.
In Temperature Scaling [1], T is treated as a scalar parameter and optimized by minimizing the cross-entropy loss with respect to a held-out validation set. The network’s weights are not updated when optimizing for T.

q̂ᵢ = maxₖ σ_SM(zᵢ/T)⁽ᵏ⁾

Calibrated probability using Temperature Scaling [1]
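
A minimal PyTorch sketch of this optimization is shown below, assuming val_logits and val_labels come from a held-out validation set; these names, and the choice of L-BFGS, are illustrative rather than prescriptions from the paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=100):
    """Find T by minimizing NLL on a held-out set; network weights stay frozen."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Hypothetical validation logits (n samples x k classes) and integer labels.
val_logits = torch.randn(100, 10) * 3
val_labels = torch.randint(0, 10, (100,))
T = fit_temperature(val_logits, val_labels)
calibrated_probs = F.softmax(val_logits / T, dim=1)
```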

Guo et al. [1] show that tuning this single parameter T can significantly improve a neural network’s ECE. They also show that temperature scaling outperforms matrix and vector scaling on several image classification datasets across CNN architectures. Like Platt scaling, temperature scaling is a post-hoc calibration technique. Finally, the reliability diagrams below compare an uncalibrated Resnet-110 network with a Resnet-110 network calibrated with temperature scaling. For more temperature scaling and Platt scaling results, please refer to Table 1 of the paper.

Reliability Diagrams of calibrated and uncalibrated models [1]

Label Smoothing

Label Smoothing was introduced by Szegedy et al. [4] as a regularization technique for deep neural networks. In label smoothing, the targets for a classification model are softened by performing a weighted combination of the original target y and the uniform distribution over labels 1/K (which does not depend on training examples). The parameter ɑ controls the degree of smoothing.

y_LS = (1 − ɑ) · y + ɑ/K

Label Smoothing Equation [5]
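
As a quick illustration of this equation, the sketch below smooths one-hot targets in PyTorch; in practice the same effect is available via the label_smoothing argument of torch.nn.CrossEntropyLoss. The helper name smooth_targets is illustrative.

```python
import torch
import torch.nn.functional as F

def smooth_targets(labels, num_classes, alpha=0.1):
    """y_LS = (1 - alpha) * one_hot(y) + alpha / K"""
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - alpha) * one_hot + alpha / num_classes

# Hypothetical batch of integer labels for a 5-class problem.
labels = torch.tensor([0, 3, 1])
print(smooth_targets(labels, num_classes=5, alpha=0.1))

# Equivalent built-in option when training with hard integer labels.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```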

In their 2019 paper “When does label smoothing help?”, Müller et al. [5] show that applying label smoothing results in implicit calibration of neural networks. The authors claim that the softening of targets leads to softening of the softmax logits (i.e., the logits of the final layer), which helps reduce overconfidence and subsequently reduces the network’s ECE. Below, results are shown for image classification (Resnet-56 and Inception-v4 architectures) and machine translation (Transformer architecture) tasks. The table also indicates that label smoothing and temperature scaling have a similar effect on neural network calibration.

ECE of Label Smoothing for image classification and machine translation tasks [5]

Mixup

Mixup is a data augmentation technique in which new samples are generated by taking a convex combination of two randomly sampled images and their corresponding labels [6]. Combining the labels of two samples is similar to label smoothing, i.e., it yields soft labels for the newly generated images. The mixing weight λ is drawn from a Beta(ɑ, ɑ) distribution, so the parameter ɑ controls how strongly the two images and their labels are interpolated.

x̃ = λxᵢ + (1 − λ)xⱼ,  ỹ = λyᵢ + (1 − λ)yⱼ,  with λ ~ Beta(ɑ, ɑ)

Mixup equation [6]
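
A minimal PyTorch sketch of mixup applied to a single batch is shown below; the images and one-hot targets are hypothetical placeholders.

```python
import torch

def mixup_batch(images, targets, alpha=0.2):
    """Convex combination of a batch with a shuffled copy of itself.

    images:  (batch, ...) input tensor
    targets: (batch, num_classes) one-hot (or already soft) labels
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_images, mixed_targets

# Hypothetical batch: 8 RGB images of size 32x32 with 10 classes.
images = torch.randn(8, 3, 32, 32)
targets = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), 10).float()
mixed_x, mixed_y = mixup_batch(images, targets, alpha=0.2)
```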

In the paper “On mixup training: Improved calibration and predictive uncertainty for deep neural networks”, Thulasidasan et al. [7] state that mixup has a regularization effect on the network resulting in reduced overfitting and memorization, which curbs overconfidence of the network. The plots below show that mixup has the lowest ECE (calibration error) with a high test accuracy compared with other calibration techniques like label smoothing across four datasets on the image classification task.

Mixup outperforms other calibration methods [7]

Focal Loss

Focal loss was proposed to tackle class imbalance in vision tasks like object detection [8]. It modifies the cross-entropy loss with a multiplicative factor that down-weights easy samples: for easy-to-classify samples, the predicted probability of the true class is high, so the multiplicative term makes their loss very small, allowing the network to focus on hard samples with a higher loss.

FL(pₜ) = −(1 − pₜ)^Ɣ · log(pₜ), where pₜ is the predicted probability of the true class

Focal Loss Equation [9]
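
A minimal PyTorch sketch of the multi-class focal loss with a fixed Ɣ is shown below (the FLSD schedule from [9] is not included); the tensors are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=3.0):
    """FL = -(1 - p_t)^gamma * log(p_t), averaged over the batch.

    logits: (batch, num_classes) raw scores
    labels: (batch,) integer class indices
    """
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()                                             # p_t
    return (-(1.0 - pt) ** gamma * log_pt).mean()

# Hypothetical batch of logits for a 10-class problem.
logits = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
print(focal_loss(logits, labels, gamma=3.0))
```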

In the paper “Calibrating Deep Neural Networks using Focal Loss”, Mukhoti et al. [9] show that focal loss is an upper bound on a regularized KL divergence between the predicted (softmax) distribution and the target distribution, where the regularizer is the entropy of the predicted distribution.

L_f ≥ KL(q ‖ p̂) − Ɣ · ℍ(p̂), where q is the target distribution, p̂ is the predicted distribution, and ℍ denotes entropy

Relationship between Focal Loss, KL Divergence, and Entropy [9]

Thus, the authors posit that minimizing the focal loss requires minimizing the KL divergence between the two distributions and increasing the predicted distribution’s entropy. This increase in entropy softens the output distribution and curbs overconfident predictions in deep neural networks. The authors also propose a sample-dependent schedule called FLSD which dynamically assigns a value for the parameter Ɣ based on pre-defined ranges of the predicted probability. Moreover, it is shown that focal loss can be combined with temperature scaling to improve the network’s calibration further. Broadly, focal loss and FLSD perform better than baseline calibration methods like label smoothing across different architectures for CIFAR-10, CIFAR-100, and Tiny-Imagenet. Refer to Table 1 of the paper for an overview of all the results.

Conclusion

In this article, we introduced the concept of calibration in deep neural networks, discussed how reliability diagrams and ECE measure calibration error, and described a few calibration techniques that can enable neural networks to output reliable and interpretable confidence estimates.

References

[1] Guo, Chuan, et al. “On calibration of modern neural networks.” International Conference on Machine Learning. PMLR, 2017.

[2] Lin, Zhen, Shubhendu Trivedi, and Jimeng Sun. “Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks.” arXiv preprint arXiv:2202.07679 (2022).

[3] Platt, John. “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.” Advances in Large Margin Classifiers (1999).

[4] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[5] Müller, Rafael, Simon Kornblith, and Geoffrey E. Hinton. “When does label smoothing help?” Advances in Neural Information Processing Systems 32 (2019).

[6] Zhang, Hongyi, et al. “Mixup: Beyond empirical risk minimization.” arXiv preprint arXiv:1710.09412 (2017).

[7] Thulasidasan, Sunil, et al. “On mixup training: Improved calibration and predictive uncertainty for deep neural networks.” Advances in Neural Information Processing Systems 32 (2019).

[8] Lin, Tsung-Yi, et al. “Focal loss for dense object detection.” Proceedings of the IEEE international conference on computer vision. 2017.

[9] Mukhoti, Jishnu, et al. “Calibrating deep neural networks using focal loss.” Advances in Neural Information Processing Systems 33 (2020): 15288–15299.

[10] Nixon, Jeremy, et al. “Measuring Calibration in Deep Learning.” CVPR Workshops. Vol. 2. No. 7. 2019.

[11] “What is the role of temperature in Softmax?” Cross Validated.

