Improving ML Datasets with Cleanlab, a Standard Framework for Data-Centric AI

ODSC - Open Data Science
Mar 22, 2023

Editor’s note: Jonas Mueller is a speaker for ODSC East this May 9th-11th. Be sure to check out his session, “Improving ML Datasets with Cleanlab, a Standard Framework for Data-Centric AI,” there!

Anybody who has worked on a real-world ML project knows how messy data can be. While most ML classes teach students about modeling a fixed dataset, experienced data scientists know that improving data brings higher ROI than tinkering with models. However, the process of finding and fixing problems in a dataset is highly manual (based on ad hoc ideas guided by fuzzy intuition).

Cleanlab is an open-source software library that helps make this process more efficient (via novel algorithms that automatically detect certain issues in data) and systematic (with better coverage to detect different types of issues). Our goal is to enable all developers to find and fix data issues as effectively as today’s best data scientists.

What exactly is data-centric AI?

A common gripe I hear is: “Garbage in, garbage out. Everybody knows you need to clean your data to get good ML performance. Is this just another buzzword, or what exactly is new here?”

Most of these folks are thinking about manually improving a dataset, which of course remains vital in real-world ML (to utilize your domain knowledge). Data-centric AI instead asks how we can systematically engineer better data through algorithms/automation. To learn more, check out the first-ever class on this subject that we recently taught at MIT and made freely available: https://dcai.csail.mit.edu/

A common pipeline to produce a good ML model looks like this:

  1. Explore the data, fix fundamental issues, and transform it into a format appropriate for ML.
  2. Train a baseline ML model on the properly formatted dataset.
  3. Utilize this model to detect additional issues and further improve the dataset (via data-centric AI techniques).
  4. Try different modeling techniques on the improved dataset to obtain the best model.

Step 3 is the key step here, and it is where exciting new science is happening. Unfortunately, many data scientists assume that all data issues must be addressed manually in Step 1 and skip Step 3 entirely, diving straight into modeling improvements (different loss functions, training tricks, hyperparameter values, and so on).

But try running cleanlab on your data: you can achieve big gains in ML performance without any change to the modeling code, simply by finding and fixing the issues it detects algorithmically with the help of the initial baseline model.
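To make that concrete, here is a minimal sketch of the idea, assuming cleanlab v2’s CleanLearning wrapper and a toy scikit-learn dataset standing in for your own data. The baseline model’s code is untouched; cleanlab handles issue detection and pruning around it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Toy data standing in for your own dataset (features X, possibly-noisy labels y).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = LogisticRegression()   # your existing baseline model, unchanged
cl = CleanLearning(clf)      # wraps the model with automatic label-issue handling

# Internally cross-validates the baseline model, flags likely label errors,
# prunes them, and re-trains the same model on the cleaned data.
cl.fit(X, y)

label_issues = cl.get_label_issues()   # per-example flags and label-quality scores
print(label_issues.head())
```

Any scikit-learn-compatible classifier can be substituted for LogisticRegression here.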

How does cleanlab work?

A core cleanlab principle is to take the predictions/representations from an already-trained ML model and apply algorithms that automatically estimate various data issues. Once these issues are identified, the data can be improved to train a better version of the same type of model! The library works with almost any supervised ML model (no matter how it was trained) and any type of data (image, text, tabular, audio, etc.). With one line of code, cleanlab can automatically do all of the following (a couple of these one-liners are sketched in code after the list):

  - find mislabeled data + train robust models
  - detect outliers and out-of-distribution data
  - estimate consensus + annotator quality for multi-annotator datasets
  - suggest which data is most informative to (re)label next (active learning with multiple annotators)
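As a rough illustration of two of those one-liners, here is a sketch assuming cleanlab v2’s rank and outlier modules. The labels, predicted probabilities, and features below are random placeholders that you would replace with outputs from your own model and dataset.

```python
import numpy as np
from cleanlab.rank import get_label_quality_scores
from cleanlab.outlier import OutOfDistribution

# Placeholder inputs: given labels, out-of-sample predicted probabilities from
# any trained classifier, and numeric feature embeddings for each example.
rng = np.random.default_rng(0)
n_examples, n_classes = 200, 3
labels = rng.integers(0, n_classes, size=n_examples)
pred_probs = rng.dirichlet(np.ones(n_classes), size=n_examples)
features = rng.normal(size=(n_examples, 5))

# One line: per-example label quality scores (lower = more likely mislabeled).
label_quality = get_label_quality_scores(labels, pred_probs)

# One line: per-example outlier scores (lower = more out-of-distribution).
ood_scores = OutOfDistribution().fit_score(features=features)
```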

Let’s consider the issue of incorrect labels in supervised learning data. A recent report by CloudFactory found that human annotators have error rates between 7% and 80% when labeling data (depending on task difficulty and how much annotators are paid). Cleanlab has been used to find millions of label errors in the most famous ML datasets: https://labelerrors.com/

In this Towards AI article, an XGBoost model was trained on a tabular dataset of student grades that contained mislabeled examples, achieving 79% accuracy on a test set with validated labels. Cleanlab was then run on the training data to automatically detect label issues, and the flagged examples were filtered out. This step is entirely automated, and when the same XGBoost model was re-trained on the cleaned data, it achieved 83% accuracy (with zero change to the modeling code). Finally, the labels of the automatically flagged datapoints were corrected with a human in the loop (using Cleanlab Studio, an efficient no-code solution for correcting data). Re-training the same XGBoost model on this fully corrected data raised accuracy to 93%, still with zero change to the modeling code, cutting the original model’s error rate from 21% to 7%.
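The filter-and-retrain workflow from that article can be sketched roughly as follows. This is not the article’s exact code: the data below is synthetic, standing in for the student-grades dataset, and the cleanlab call assumed here is find_label_issues applied to out-of-sample predicted probabilities obtained via cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from xgboost import XGBClassifier
from cleanlab.filter import find_label_issues

# Synthetic stand-in for the tabular dataset: flip ~10% of the training labels.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)
flip = rng.random(len(y_train)) < 0.10
y_train[flip] = rng.integers(0, 3, size=flip.sum())

model = XGBClassifier()  # identical modeling code before and after cleaning

# 1. Out-of-sample predicted probabilities from the baseline model (cross-validation).
pred_probs = cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")

# 2. Automatically flag training examples whose labels are likely wrong.
issue_mask = find_label_issues(y_train, pred_probs)  # boolean mask, True = likely mislabeled

# 3. Re-train the same model with the flagged examples filtered out (or relabeled).
model.fit(X_train[~issue_mask], y_train[~issue_mask])
print("test accuracy after cleaning:", model.score(X_test, y_test))
```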

Learn more about the data-centric AI techniques that power Cleanlab at our upcoming talk at ODSC East 2023.

About the author/ODSC East 2023 speaker:

Jonas Mueller is Chief Scientist and Co-Founder at Cleanlab, a company providing data-centric AI software to improve ML datasets. Previously, he was a senior scientist at Amazon Web Services developing AutoML and Deep Learning algorithms that now power ML applications at hundreds of companies. In 2018, he completed his PhD in Machine Learning at MIT, and has since helped create two of the fastest-growing open-source libraries for AutoML (https://github.com/awslabs/autogluon) and data-centric AI (https://github.com/cleanlab/cleanlab).

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Subscribe to our fast-growing Medium Publication too, the ODSC Journal, and inquire about becoming a writer.
