Machine Learning Model Training Mistakes: How to Avoid Them

Abinaya Mahendiran
Feb 2, 2023 · 5 min read
Mind Map: Mistakes in ML model training

This blog highlights some important mistakes that one can make while training a machine learning model. Machine Learning model training is the process of teaching a model how to recognize patterns in data. During the training phase, the model is exposed to a dataset and “learns” how to distinguish between different features and predict outcomes accurately. The goal of the training phase is to optimize the model’s performance so it can make accurate predictions when exposed to new data.

What can go wrong in ML model training?

There are a variety of issues that can arise during model training and affect the quality of the model. Common mistakes include overfitting, underfitting, faulty data preprocessing, class imbalance, and missing values. These mistakes can lead to inaccurate results and degraded performance. It’s important to be aware of these potential issues and take steps to avoid them when training ML models.

Let’s look at some of the most important mistakes that need to be avoided during ML model training. These mistakes can be broadly categorized into the following:

i) Data issues — There are many issues related to the data, including improper preprocessing, failure to address missing values, class imbalance, data leakage, and the non-availability of data to solve the problem at hand. Let’s look at the hard-to-catch mistakes in detail:

— Not looking for data leakage — Data leakage occurs when a model is given access to information that it should not have during training. This can lead to overly optimistic evaluation results and incorrect conclusions. Common causes of data leakage include using test data in the training process, using data from future time points, and using features that will not be available at prediction time. To prevent this mistake, divide the data into three splits — training, validation, and test — and perform all exploratory data analysis and preprocessing on the training split only. Additionally, fit preprocessing artifacts such as vectorizers on the training split alone and only re-use (apply) them to the validation and test sets, as in the sketch below.

Data Leakage
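As a rough sketch of the split-first, fit-on-train-only idea, assuming a scikit-learn text pipeline (the toy corpus, labels, and split ratios below are illustrative, not from the original post):

```python
# Sketch: fit preprocessing on the training split only, then re-use (transform)
# it on the validation and test splits so nothing leaks from them into training.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative toy corpus and labels.
texts = ["good product", "bad service", "great quality", "terrible support",
         "loved it", "hated it", "works fine", "broke quickly"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Three splits: training, validation, and test.
X_train, X_rest, y_train, y_rest = train_test_split(texts, labels, test_size=0.5, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # vocabulary/IDF learned from training data only
X_val_vec = vectorizer.transform(X_val)          # re-used, not re-fitted: no leakage
X_test_vec = vectorizer.transform(X_test)
```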

— Not using the appropriate test set — The test set measures how well the model generalizes. The model’s performance on the training set is almost meaningless, since any sufficiently large model can memorize the patterns of the training set and still fail to generalize to unseen data. To make sure the model generalizes well, the test set should be representative of the wider population. For example, if a model is trained and tested only on pictures taken on sunny days, the test set does not represent the broader range of weather conditions the model will encounter in the real world.
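A minimal sketch of one narrow check, assuming a labelled dataset and scikit-learn: a stratified split keeps the label distribution of the test set close to that of the training data. True representativeness (e.g. covering different weather conditions) still needs domain-aware data collection.

```python
# Sketch: a stratified hold-out split preserves class proportions across splits
# (the data below is randomly generated purely for illustration).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((1000, 5))                        # illustrative features
y = rng.choice([0, 1], size=1000, p=[0.9, 0.1])  # imbalanced labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("positive rate - train:", y_train.mean(), "test:", y_test.mean())
```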

ii) Model issues — Not choosing the right model (which eventually results in underfitting or overfitting), discarding the insights from the model’s predictions, and evaluating the model on the wrong metrics are some of the major issues related to the model. The following mistakes are easily missed:

Evaluating the model on the wrong metrics — It is important to use the right metrics when evaluating the performance of an ML model, as the wrong metric can be more damaging than having none at all. The most commonly used metric is accuracy, but it is not suitable for all use cases; on imbalanced datasets, metrics like precision, recall, and F1-score provide more meaningful insights. Choosing the right metric helps you track the performance of your model and make sure it is meeting the desired criteria.

Metrics for ML
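For example, here is a small illustrative sketch (not from the original post) of how accuracy can mislead on an imbalanced dataset while precision, recall, and F1-score expose the problem:

```python
# Sketch: a "model" that always predicts the majority class reaches 95% accuracy
# on an imbalanced dataset, but precision/recall/F1 for the minority class are 0.
from sklearn.metrics import accuracy_score, classification_report

y_true = [0] * 95 + [1] * 5   # 95 negatives, 5 positives (illustrative)
y_pred = [0] * 100            # always predict the majority class

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.95, misleadingly high
print(classification_report(y_true, y_pred, zero_division=0))
```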

Not looking at the model — Explaining an ML model helps you understand how it works and how it makes predictions. Techniques such as feature importance and decision trees can give you insights into the inner workings of a model, and methods such as saliency maps, LIME, and SHAP show how the model arrives at individual predictions. Doing so helps you identify potential issues or weaknesses in your model and take steps to improve its performance.

SHAP summary plot
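As a rough sketch of what producing such a summary plot can look like with the shap library (the dataset and model below are illustrative assumptions, not the ones behind the plot above):

```python
# Sketch: explain a tree-based model with SHAP and draw a summary (beeswarm) plot
# showing which features drive the model's predictions the most.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # per-feature contribution for every prediction
shap.summary_plot(shap_values, X)       # global view of feature impact
```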

iii) Process issues — Any non-technical issue related to decision making, not having the right KPI/metric for success, governance, bias and fairness, or the overall ML design process falls under process issues. These are human errors that can be easily avoided.

Not understanding the use case — Before beginning any ML project, it is important to make sure that the use case actually requires an ML solution. If ML isn’t required, don’t be afraid to build a solution without it. If ML is required, be sure to clearly define the problem, have sufficient data, consider the business KPI, determine the metrics to be used, and be aware of the limitations, costs, and expected impact of the model.

No governance — Failing to establish a process for how an organization controls access, implements policies, and tracks activity on models and their results will be detrimental to both the users and the organization. It’s imperative to define the overall governance process before you even start thinking about handling or implementing any ML use cases.

In conclusion, ML model training can be tricky, and mistakes can lead to inaccurate results. It is important to be aware of potential mistakes related to the data, the model, and the decision-making process, and to take steps to avoid them.

*Note: This blog highlights only some aspects of model training mistakes in detail. There is plenty of free information on the other aspects mentioned in the mind map, which can be easily googled!

Thank you for reading; I appreciate your feedback. See you with another interesting topic soon!
