Dive Into Deep Learning — Part 3

Nouran Ali
4 min read · Mar 7, 2023

In this part, I will summarize Section 3.6, Generalization. The sections from page 86 to page 113 are all about coding, and I found them very clear to understand, so I recommend reading them before reading this part.

Generalization

The authors give an example of two students preparing for an exam: student 1 memorizes the past exam questions, while student 2 discovers the patterns behind them. If the exam is:
1. Identical to past exams: student 1 outperforms student 2.
2. Made of fresh questions: student 2 performs better.

The goal of ML is to discover patterns, not simply to memorize our training data. The fundamental problem is how to discover patterns that generalize.

In real-life ML work, we fit models using a finite collection of data; even at the most extreme scale, the number of available data points remains small.

An important point to keep in mind: even after training and fitting, our model may have failed to discover a generalizable pattern in the data. This introduces two new concepts:

Overfitting: fitting closer to our training data than to the underlying distribution.
Regularization: techniques for preventing overfitting.

Training error and generalization error

In supervised learning, we assume that training data and test data satisfy the IID assumption: each example is drawn independently from the same underlying distribution.

training error Remp: a statistic calculated on the training data,

$$R_\text{emp}[\mathbf{X}, \mathbf{y}, f] = \frac{1}{n} \sum_{i=1}^{n} l\left(\mathbf{x}^{(i)}, y^{(i)}, f(\mathbf{x}^{(i)})\right)$$

generalization error R: an expectation taken with respect to the underlying distribution P(x, y),

$$R[p, f] = E_{(\mathbf{x}, y) \sim P}\left[l(\mathbf{x}, y, f(\mathbf{x}))\right] = \int\!\!\int l(\mathbf{x}, y, f(\mathbf{x}))\, p(\mathbf{x}, y)\; d\mathbf{x}\, dy$$

In practice, we can’t calculate the generalization error R exactly because:
1. Nobody tells us the precise form of P(x,y).
2. We can’t sample an infinite stream of data points.
So we estimate R, in practice by applying the model to an independent held-out sample.
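As a minimal sketch of what that estimation looks like, here is a toy setup with squared-error loss and a made-up quadratic ground truth (all specifics below are illustrative assumptions, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Draw (x, y) pairs IID from an assumed P(x, y): y = x^2 + Gaussian noise.
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(0, 0.1, n)
    return x, y

x_train, y_train = sample(20)
x_test, y_test = sample(10_000)   # a large held-out sample stands in for P

# Fit a degree-2 polynomial to the training data.
coeffs = np.polyfit(x_train, y_train, 2)

def empirical_risk(x, y):
    # Average squared-error loss over a finite sample.
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

print("training error Remp:", empirical_risk(x_train, y_train))
print("estimate of R      :", empirical_risk(x_test, y_test))
```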

Model Complexity

In theory, when we have simple models and abundant data, Remp and R tend to be close. With more complex models, this is not the case: the generalization gap grows.

We can’t conclude that our model has discovered a generalizable pattern based on its fit to the training data alone. However, if our model class is not capable of fitting arbitrary labels, then a good fit must mean it has discovered some pattern.

When a model is capable of fitting arbitrary labels, low training error does not necessarily imply low generalization error. However, it does not necessarily imply high generalization error either!
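Here is a hypothetical illustration of a model class that can fit arbitrary labels: a degree-(n − 1) polynomial can interpolate any n points with distinct inputs, so it reaches near-zero training error even on purely random labels (the setup is my own toy example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = np.linspace(-1, 1, n)
y_random = rng.normal(size=n)        # arbitrary labels, no pattern at all

# A degree-(n-1) polynomial has enough parameters to interpolate all n points.
coeffs = np.polyfit(x, y_random, n - 1)
train_error = np.mean((np.polyval(coeffs, x) - y_random) ** 2)
print(train_error)   # ~0: a perfect training fit, yet nothing was "learned"
```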

Underfitting and Overfitting

Both of them are related to model complexity.

Underfitting:
1. the generalization gap (R − Remp) is small.
2. the model is too simple to capture the patterns in the data, so even the training error is high.

Overfitting:
1. the training error is significantly lower than the validation error.

Note that overfitting is not always a bad thing in deep learning.

Polynomial curve fitting

Given training data consisting of a single feature x and a corresponding label y, we try to find a polynomial of degree d that estimates y.

A higher-order polynomial function is more complex than a lower-order one, since it has more parameters and can represent a wider range of functions.
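A quick sketch of this effect, assuming synthetic data drawn from a cubic ground truth (the degrees and sample sizes are illustrative choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = x ** 3 - x + rng.normal(0, 0.1, n)   # assumed cubic ground truth
    return x, y

x_train, y_train = make_data(15)
x_val, y_val = make_data(1000)

for d in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, d)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {d:2d}: train {train:.4f}  validation {val:.4f}")

# degree 1 underfits (both errors high), degree 3 is about right,
# degree 10 overfits (tiny training error, larger validation error).
```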

Influence of model complexity on underfitting and overfitting

Another big consideration to bear in mind is the dataset size: the fewer samples we have, the higher the chance of overfitting. As we increase the amount of training data, the generalization error typically decreases.

Model Selection

We select our final model only after evaluating multiple models that differ in various ways.

model selection: choosing a final model among many models.

Keep in mind that we should not touch our test set until after we have chosen all our hyperparameters; if we used the test set to select a model, we would risk overfitting to the test data.

Cross Validation

Incorporating a validation set, in addition to the train and test sets, helps us address the above problem and select a better model.
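One common way to carve out such a three-way split (the 70/15/15 proportions below are just a conventional choice, not prescribed by the book):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
indices = rng.permutation(n)   # shuffle before splitting

train_idx = indices[:700]      # 70% for fitting parameters
val_idx = indices[700:850]     # 15% for comparing models / hyperparameters
test_idx = indices[850:]       # 15%, touched exactly once, at the very end
```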

When the training data is not big enough, we may not be able to afford to hold out a proper validation set. In that case we use K-fold cross-validation (see the sketch after the steps below):

  1. The original training data is split into K non-overlapping subsets.
  2. Model training and validation are executed K times, each time training on (K − 1) subsets and validating on a different subset.
  3. The training and validation errors are estimated by averaging the results from the K experiments.
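A minimal sketch of these three steps, reusing the toy polynomial setup from earlier (the model and data are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def k_fold_validation_error(x, y, degree, k=5):
    # 1. Split the training indices into K non-overlapping subsets.
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        # 2. Train on the other K-1 subsets, validate on fold i.
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    # 3. Average the validation errors over the K experiments.
    return np.mean(errors)

x = rng.uniform(-1, 1, 50)
y = x ** 3 - x + rng.normal(0, 0.1, 50)
best = min(range(1, 8), key=lambda d: k_fold_validation_error(x, y, d))
print("selected degree:", best)
```

A real pipeline would fix the folds once before comparing models; this sketch re-shuffles inside the function for brevity.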

I hope you enjoyed this part as much as I did, please leave comments on any improvements or ideas you have.

