Double Descent Phenomenon

Pauline Ornela MEGNE CHOUDJA
5 min read · Apr 11, 2023

Machine learning’s main objective is to find a tradeoff between a model’s ability to fit the training data and its ability to generalize to new data. This is the so-called bias-variance tradeoff. In this blog we will briefly review the bias-variance tradeoff and then introduce the double descent phenomenon.

Bias-variance tradeoff derivation (for regression)

Let:

$$y^{(i)} = h^*(x^{(i)}) + \xi^{(i)}, \qquad (x^{(i)}, y^{(i)}) \in S$$

where S is the training data and h* is the best possible model, i.e. the model we would obtain if we had access to the distribution of the whole population. We further define:

  • h_s, the model obtained after training on S.
  • (x, y), a test example drawn from the same distribution, such that:

$$y = h^*(x) + \xi$$

$$\xi \sim \mathcal{N}(0, \sigma^2)$$

  • h_avg, the best model in the hypothesis class, obtained by averaging over random draws of the training set: $h_{avg}(x) = \mathbb{E}_S[h_s(x)]$.

The expected error/loss on the test example is:

$$\mathbb{E}\big[(y - h_s(x))^2\big] = \sigma^2 + \mathbb{E}\big[(h^*(x) - h_s(x))^2\big]$$

We now add and subtract the term $h_{avg}(x)$ inside the square; the cross term vanishes in expectation, leaving:

$$\mathbb{E}\big[(y - h_s(x))^2\big] = \sigma^2 + \big(h^*(x) - h_{avg}(x)\big)^2 + \mathbb{E}\big[(h_s(x) - h_{avg}(x))^2\big]$$

Where,

  • σ^2 is the irreducible or unavoidable error due to the measurement noise.
  • The bias is the difference between the best model in the hypothesis class (h_avg) and the best possible model (h*). It represents the error due to the lack of expressivity of the model.
  • The variance is the error due to the randomness of the data. It’s the difference between the best model we find (h_s) and the best model in the hypothesis class (h_avg).
Figure 1: Illustration of the bias and variance definition.
Figure 2: Bias-variance tradeoff

Figure 2 shows that a model with few parameters tends to have high bias (underfitting) and low variance, while a complex model with many parameters tends to have low bias and high variance (overfitting). Therefore, the goal is to find a model with an optimal balance between bias and variance that can generalize well to new data.
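To make the decomposition concrete, here is a minimal numpy sketch (entirely synthetic: the cubic target, noise level, and polynomial degrees are illustrative assumptions, not taken from the post) that estimates the bias² and variance of polynomial fits at a single test point by retraining on many fresh training sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def h_star(x):
    # Hypothetical ground-truth model h* (an arbitrary cubic).
    return x**3 - x

sigma = 0.3            # std of the measurement noise (irreducible error)
n, n_trials = 30, 200  # training-set size and number of resampled sets S
x_test = 0.5           # single test input, for clarity

results = {}
for degree in [1, 3, 9]:
    preds = []
    for _ in range(n_trials):
        # Draw a fresh training set S and fit h_s on it.
        x = rng.uniform(-1, 1, n)
        y = h_star(x) + rng.normal(0, sigma, n)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    h_avg = preds.mean()                    # estimate of h_avg(x_test)
    bias2 = (h_star(x_test) - h_avg) ** 2   # (h* - h_avg)^2
    variance = preds.var()                  # E[(h_s - h_avg)^2]
    results[degree] = (bias2, variance)
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={variance:.4f}")
```

Typically the degree-1 fit shows the largest bias² (underfitting) and the degree-9 fit the largest variance (overfitting), matching the picture in Figure 2.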

Underfitting occurs when the model is too simple to capture the complexity of the dataset, resulting in poor performance on both the training and test sets. To address this, one can increase the capacity of the model by selecting a more powerful class of functions, or tune some hyperparameters (number of iterations/epochs, learning rate, optimizer, etc.).

Overfitting, on the other hand, occurs when the model is too complex and fits noise in the training data, resulting in poor performance on test data. To address this we can:

  • Decrease the capacity of the model by selecting a smaller class of functions, or apply regularization (e.g. L1 or L2 regularization).
  • Increase the size of the training data.
  • Apply early stopping, i.e. stop training at the point where the model starts to overfit, as measured on a validation set.
  • Use cross-validation to obtain a more accurate estimate of the generalization error.
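As an illustration of early stopping, here is a minimal numpy sketch (the data, train/validation split, learning rate, and patience value are all hypothetical choices) that tracks validation error during gradient descent and keeps the best weights seen so far:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy linear data, split into train and validation sets.
n, d = 100, 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(0, 1.0, n)
X_tr, y_tr = X[:60], y[:60]
X_val, y_val = X[60:], y[60:]

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Gradient descent with early stopping: remember the weights with the
# lowest validation error, and stop once it has not improved for
# `patience` consecutive steps.
w = np.zeros(d)
best_w, best_val, since_best = w.copy(), np.inf, 0
lr, patience = 0.01, 50
for step in range(5000):
    grad = 2.0 / len(y_tr) * X_tr.T @ (X_tr @ w - y_tr)
    w -= lr * grad
    val = mse(w, X_val, y_val)
    if val < best_val:
        best_w, best_val, since_best = w.copy(), val, 0
    else:
        since_best += 1
        if since_best >= patience:
            break  # validation error stopped improving: early stop

print(f"best validation MSE: {best_val:.3f}")
```

The model returned is `best_w`, the snapshot taken before validation error started to rise, not the final iterate.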

We saw that the test error first decreases (in the low-complexity regime) and then increases as the model becomes more complex. However, several works [1] have observed that the test error can descend a second time, as described by the double descent phenomenon.

Double descent phenomenon

There are two different manifestations of the double descent phenomenon in machine learning.

Model-wise double descent

Model-wise double descent is observed by varying the complexity of the model, such as the number of parameters or the norm of the model (the norm of a model here refers to a mathematical function that measures the magnitude of the model’s parameters, e.g. the L1 or L2 norm). Conventionally, as the model complexity increases, the test error first decreases, then rises to a peak around the point where the model is just large enough to fit all the training data, and then decreases again in the so-called over-parameterized regime, where the number of parameters is larger than the number of data points [1].

Sample-wise double descent

Here the phenomenon is observed by varying the size of the training dataset. Some works [2] have observed that the generalization error does not strictly decrease as the sample size grows. Instead, it first decreases, then rises to a peak around the point where the number of samples is close to the number of parameters, and then decreases again.

Experimental test

To illustrate this phenomenon, we generated two toy datasets for a regression task.

The first dataset consists of 500 points and 1000 features. We built a simple linear regression model and evaluated its performance with respect to the complexity of the model by plotting the test error against the number of parameters (number of features used). Figure 3 shows that the test error rises and peaks when the number of samples (n) is approximately equal to the number of parameters (d), n ≈ d = 500, and then steadily decreases as the number of parameters becomes much larger than the number of samples, d ≫ n.

Figure 3: Model-wise double descent phenomenon for linear regression model

The second dataset, on the other hand, consists of 1000 points and 500 features. As before, we built a simple linear regression model and plotted the test error against the number of samples to assess its performance as a function of sample size. Similarly, the graph displayed in Figure 4 illustrates that the test error rises and reaches its maximum when the sample size (n) is roughly equal to the number of features (d), and finally drops when n ≫ d.

Figure 4: Sample-wise double descent phenomenon for linear regression model.

Conclusion

This post gives a brief overview of the double descent phenomenon, a concept described in 2019 [3]. It has been observed in several models, such as linear regression and neural networks [4], and remains an active area of research in machine learning and deep learning.

Hope this was helpful and enhanced your curiosity 😉.

References

[1]: https://cs229.stanford.edu/main_notes.pdf

[2]: More Data Can Hurt for Linear Regression: Sample-wise Double Descent

[3]: https://www.pnas.org/doi/full/10.1073/pnas.1903070116

[4]: Optimal Regularization Can Mitigate Double Descent
