A Comprehensive Guide to Error Analysis in Machine Learning

A step-by-step error analysis for a classification problem, including data analysis and recommendations

Kemal Toprak Uçar
8 min read · Apr 17, 2023

When it comes to artificial intelligence interviews, one of the most important questions is, “What are the phases of an AI model’s life cycle?” Developing a machine learning model involves several steps: problem formulation, data collection, preprocessing, feature engineering, model building, deployment, and monitoring. In this post, we will take a deep dive into the model-building step, specifically focusing on model validation. Before deploying a model to production, it is crucial to validate its performance. For classification models, metrics such as the F-score, precision, and recall are typically used to judge the model’s validity during validation. Beyond these aggregate metrics, error analysis helps evaluate the model’s behavior in terms of the data itself.

image by author

Introduction

Error analysis is a vital process in diagnosing errors made by an ML model during its training and testing steps. It enables data scientists or ML engineers to evaluate their models’ performance and identify areas for improvement. By investigating a model’s errors, practitioners can acquire insights into the quality and relevancy of their data, the complexity of their problem, and the effectiveness of their feature engineering and model selection techniques.

In short, error analysis can help us to:

  1. recognize sources of errors
  2. assess and enhance model performance
  3. maintain model relevance

Here are some examples of how error analysis can be utilized in various areas:

In image classification, error analysis examines misclassified images and determines why the model failed to classify them correctly. For instance, if a model trained to distinguish different fruits misclassifies an image of an apple as a pear, we can scrutinize the features that distinguish apples from pears and understand why the model missed those features in the image.

In speech recognition, error analysis involves investigating audio recordings and identifying patterns in the model’s errors. For example, if a model misidentifies the word “Beckham” as “Beckenbauer,” we can look at the phonetic resemblance between the two words and determine why the model made that mistake.

In sentiment analysis, error analysis analyzes misclassified text examples. For instance, if a model classifies customer reviews and mislabels a positive review as a negative review, we can study the specific words and phrases that led to the misclassification and determine why the model failed.

Error analysis in tabular data introduces distinctive challenges compared to other data types. One reason is that the features in tabular data are often less intuitive, making it difficult to understand why the model makes predictions based on the input features. Furthermore, the number of features can be large, and it can be challenging to identify which ones contribute to errors. The associations between features can also be intricate, further complicating the identification of root causes of errors. In text data, features such as length, the number of specific characters, or the number of syntax errors can be extracted. However, in tabular data, feature extraction is often more limited, particularly for date and text-based features. Despite these challenges, it is still possible to conduct error analysis on tabular data, and in this post, I will explore some methods for doing so.

Checks Before the Error Analysis

Before proceeding to error analysis, I want to discuss model complexity and dataset size. The relationship between data size and model complexity is a tradeoff between the ability to capture complex relationships in the data and the risk of overfitting. As a general rule, a larger dataset can support a more complex model, but it’s essential to monitor the model’s performance on validation or test data to ensure it generalizes well to new data. The dataset size is critical because we will try to extract insights from the dataset, and both the training and validation datasets should be good samples that represent the population well. In this post, we will cover error analysis of a gradient boosting model using a dataset of roughly 7,000 rows.

Besides the size, the choice of model metrics matters and must be made carefully. Even if the dataset is thoroughly balanced, the accuracy metric alone will not comprehensively explain the model’s performance, since it does not account for per-class behavior. Looking at a confusion matrix, the ROC curve, and the F1 score gives a much better basis for challenging model performance before model selection. In addition to the distribution of the target value, it is also worth knowing how much tolerance there is for false negatives or false positives.
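
For example, a quick look beyond plain accuracy could be sketched as follows with scikit-learn; y_val, y_pred, and y_proba are placeholders for the validation labels (assumed to be encoded as 0/1), the predicted labels, and the predicted probabilities of the positive class:

from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# y_val: true 0/1 validation labels, y_pred: predicted labels,
# y_proba: predicted probability of the positive class (all placeholders)
print(confusion_matrix(y_val, y_pred))        # error counts per class
print(classification_report(y_val, y_pred))   # per-class precision, recall, F1
print(roc_auc_score(y_val, y_proba))          # threshold-independent view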

As a last step before the error analysis, we should ensure the labels are sufficiently reliable. If the labels do not represent reality well, we should stop working on modeling and go back to fixing the data collection step. If scenarios such as a six-year-old kid buying an apartment (quality issue) or a vegetarian person buying a hundred kilograms of entrecôte for her restaurant (outlier) exist in your dataset, these issues will affect your model’s performance.

In cases where the data contains missing values, outliers, or categorical variables that need special handling, it’s important to address these issues before training the model so that it can learn from the data effectively. By addressing them before conducting error analysis, it’s more likely that any observed errors can be attributed to the model rather than the data itself.

If everything regarding the dataset looks good, we can move on to the error analysis! The goal is to identify the most problematic data subsets, the ones on which the model fails, and to extract insights from them.

Model Building

Before I provide an example, I would like to mention that you can access the entire code in the repository.

In this example, we will be training and evaluating a churn prediction model using a Kaggle dataset. The dataset contains 19 mostly categorical features, an ID column, and a target column indicating whether a customer has churned or not.
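
Loading and preparing the data might look roughly like the sketch below; the file name is an assumption based on the public Telco churn CSV on Kaggle, and the exact preprocessing lives in the repository:

import pandas as pd

# assumed file name of the Kaggle Telco customer churn CSV
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# TotalCharges arrives as text in the raw file; coerce it to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# cast the non-numeric columns (including SeniorCitizen) to pandas categoricals
for col in df.columns.difference(['tenure', 'MonthlyCharges', 'TotalCharges']):
    df[col] = df[col].astype('category')

print(df.dtypes)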

customerID          category
gender              category
SeniorCitizen       category
Partner             category
Dependents          category
tenure              int64
PhoneService        category
MultipleLines       category
InternetService     category
OnlineSecurity      category
OnlineBackup        category
DeviceProtection    category
TechSupport         category
StreamingTV         category
StreamingMovies     category
Contract            category
PaperlessBilling    category
PaymentMethod       category
MonthlyCharges      float64
TotalCharges        float64
Churn               category

As is typical in churn prediction problems, the dataset is imbalanced. In this dataset, the proportion of not churned customers is 73%. After performing brief data analysis, we identified certain features that are likely to help us better distinguish customers who have churned.

predictors = ['Contract',
'OnlineSecurity',
'TechSupport',
'InternetService',
'PaymentMethod',
'OnlineBackup',
'DeviceProtection',
'StreamingMovies',
'StreamingTV',
'PaperlessBilling',
'Dependents',
'SeniorCitizen',
'Partner',
'tenure']

To build the model, we used the LightGBM classifier and employed Optuna for hyperparameter tuning. The final model achieved an F-score of 0.77, but we won’t go into further detail on the model building here. If you’re interested, you can find more information in the repository.
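
In rough strokes, that step could look like the sketch below; the search space, the number of trials, and the split variables (X_train, y_train, X_val, y_val) are illustrative rather than the exact choices made in the repository:

import lightgbm as lgb
import optuna
from sklearn.metrics import classification_report, f1_score

def objective(trial):
    # illustrative search space; the real one lives in the repository
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 16, 256),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train[predictors], y_train)
    return f1_score(y_val, model.predict(X_val[predictors]), pos_label='Yes')

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

# refit with the best hyperparameters and inspect the validation report
best_model = lgb.LGBMClassifier(**study.best_params)
best_model.fit(X_train[predictors], y_train)
print(classification_report(y_val, best_model.predict(X_val[predictors])))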

              precision    recall  f1-score   support

          No       0.84      0.89      0.86      1466
         Yes       0.71      0.63      0.67       647

    accuracy                           0.81      2113
   macro avg       0.78      0.76      0.77      2113
weighted avg       0.80      0.81      0.80      2113

Error Analysis

To carry out the error analysis, we created a new dataset that contains the target and predicted values, as well as the difference between the target and the predicted probability. This dataset is used to calculate metrics for categorical and continuous features separately.

For categorical features, we grouped each category and calculated the mean of the target-prediction match and the mean difference between the target and the predicted probability. Based on this analysis, we identified certain categories that performed poorly in distinguishing churned customers.
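
In code, building that frame and aggregating over one categorical feature could be sketched like this; the variable names carry over from the earlier, illustrative snippets, and 'Yes' is assumed to be the positive class:

# assemble an error-analysis frame from the validation split
error_df = X_val[predictors].copy()
error_df['target'] = y_val.values
error_df['prediction'] = best_model.predict(X_val[predictors])
error_df['churn_proba'] = best_model.predict_proba(X_val[predictors])[:, 1]

# 1 where the prediction matches the target, 0 otherwise
error_df['is_correct'] = (error_df['target'] == error_df['prediction']).astype(int)
# absolute gap between the true label (as 0/1) and the predicted churn probability
error_df['proba_diff'] = ((error_df['target'] == 'Yes').astype(int) - error_df['churn_proba']).abs()

# mean accuracy and mean probability gap per category of a single feature
print(error_df.groupby('Contract', observed=True)[['is_correct', 'proba_diff']].mean())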

As shown in the graph below, the model achieves a 70% accuracy rate in identifying churned customers when the contract type is Month-to-month.

image by author

For continuous features, grouping by raw values is not meaningful because of their distribution, so we discretized the values and used the resulting bins to obtain insights.
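
Reusing the error frame from above, a quantile-based binning of tenure might be sketched as follows (the number of bins is arbitrary):

import pandas as pd

# discretize tenure into quantile-based bins and aggregate per bin
error_df['tenure_bin'] = pd.qcut(error_df['tenure'], q=5, duplicates='drop')
print(error_df.groupby('tenure_bin', observed=True)[['is_correct', 'proba_diff']].mean())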

The graph illustrates that the model’s performance is poor for the lower values of the tenure feature.

After the analysis, we identified categories that had poor performance in distinguishing churned customers.

to_be_analyzed_categories = ['Contract', 
'PaymentMethod',
'DeviceProtection',
'SeniorCitizen',
'TechSupport',
'StreamingMovies',
'InternetService',
'OnlineSecurity']

As a first step, we compared the distributions of these categories in the training and validation datasets. Since the distributions were similar, we concluded that the poor performance was not due to differences in the distributions of these categories.
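
The comparison itself boils down to normalized value counts on both splits, roughly as sketched below; X_train and X_val again refer to the illustrative split from the earlier snippets:

# compare the category shares of the suspect features across the two splits
for col in to_be_analyzed_categories:
    print(f'Column: {col}')
    print('Training')
    print(X_train[col].value_counts(normalize=True))
    print('Test')
    print(X_val[col].value_counts(normalize=True))
    print('-' * 63)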

Column: Contract
Training
Month-to-month 0.545639
Two year 0.240568
One year 0.213793
Test
Month-to-month 0.560814
Two year 0.240890
One year 0.198296
---------------------------------------------------------------
Column: PaymentMethod
Training
Electronic check 0.337931
Mailed check 0.225761
Bank transfer (automatic) 0.220284
Credit card (automatic) 0.216024
Test
Electronic check 0.330809
Mailed check 0.236157
Bank transfer (automatic) 0.216753
Credit card (automatic) 0.216280
---------------------------------------------------------------
Column: DeviceProtection
Training
No 0.442191
Yes 0.339351
No internet service 0.218458
Test
No 0.433034
Yes 0.354472
No internet service 0.212494
---------------------------------------------------------------
Column: SeniorCitizen
Training
0 0.839757
1 0.160243
Test
0 0.833412
1 0.166588
---------------------------------------------------------------
Column: TechSupport
Training
No 0.491886
Yes 0.289655
No internet service 0.218458
Test
No 0.495977
Yes 0.291529
No internet service 0.212494
---------------------------------------------------------------
Column: StreamingMovies
Training
No 0.391075
Yes 0.390467
No internet service 0.218458
Test
No 0.405584
Yes 0.381921
No internet service 0.212494
---------------------------------------------------------------
Column: InternetService
Training
Fiber optic 0.437931
DSL 0.343611
No 0.218458
Test
Fiber optic 0.443445
DSL 0.344061
No 0.212494
---------------------------------------------------------------
Column: OnlineSecurity
Training
No 0.494726
Yes 0.286815
No internet service 0.218458
Test
No 0.501183
Yes 0.286323
No internet service 0.212494
---------------------------------------------------------------

Suggestions

Next, we focused on the importance of these categories. For example, if there is very low tolerance for wrong churn predictions for clients whose PaymentMethod is Electronic check, we should focus on improving the model’s performance for that category.

One approach to improving the model’s performance is to try different sets of features for hyperparameter tuning without changing the training pipeline. If this approach does not yield good results, we can revisit the feature selection step. We may also consider the correlation between predictors, which we did not fully explore in this analysis.

If these approaches do not yield satisfactory results, changing the classifier may be a good second option. In this experiment, we used LGBM, but other classifiers like CatBoost, XGBoost, or Artificial Neural Networks (ANNs) could be useful for further experiments.

If all these approaches fail to yield satisfactory results, then gathering more data for the model may be necessary. Besides the data and model suggestions, consulting with domain experts in the company would also be a good step. Sharing this analysis, which shows the model’s performance on specific feature categories, with domain experts may help identify patterns and suggest other features to consider.

Final Words

In this post, I aimed to provide a brief overview of error analysis and demonstrated its application in a real-world example. While I understand that error analysis can be highly domain-specific, I have tried my best to provide more general suggestions that can be applied to a wide range of problems. Please keep in mind that the suggestions I provided may not be applicable to all situations, and consulting with domain experts may provide more tailored and specific advice.
