ParanorML activity: Avoiding ML casualties

How to effectively safeguard your ML experiments

Sairam Sundaresan
9 min read · Feb 16, 2023

Many years ago…

I was staring at my screen, scrutinizing every little wiggle on my TensorBoard. Something was off. Problem was, I couldn’t put my finger on it. I had been trying to train a model for a couple of weeks (yes, that’s not a typo. Weeks). The validation accuracy would rise consistently for the first 30 epochs of training and then BAM! It would plunge into the depths of NaN-sanity.

It made no sense. I had checked and re-checked everything too many times. The model architecture, the hyperparameters, the dataset, the training script, and whatever else you can imagine. No matter what I changed, the same behavior persisted. At the 31st epoch, things would go ballistic.

At this point, half exasperated, I turned to a colleague and asked them to look through it. Fresh eyes spot things that fatigued retinas miss. My colleague didn’t see anything out of the ordinary but asked me a question that changed my approach to training these models forever.

“What are your tests telling you?”

Yes, I had logged everything meticulously, visualized all the key plots and metrics, and documented the code to within an inch of its life.

But I hadn’t comprehensively tested it.

Eventually, I figured out the issue (it involved a stray random tensor that was conditionally invoked at the 30th epoch). But, I lost weeks of time and effort in the process. There began a practice that I continue to this day and would like to share with you so that you can avoid the frustrations and head-banging (not the fun kind) that I went through.

The art of ParanorML Practitioning.

Before we dive into what it is, let’s understand why it’s needed in the first place.

Traditional software vs Machine learning

In traditional software engineering, you write a piece of code, and you know what output to expect. If you don’t get said output, there are deterministic ways to figure out why and fix any errors in the process.

Why? You write the instructions/rules for the computer to follow. If it doesn’t work, your instructions are wrong. The logic to solve the task is the input.

With machine learning, you write a piece of code to define a model, then repeatedly feed it lots of data along with the expected output. If you don’t get said output, there isn’t a deterministic way to figure out why.

Why? The model learns the instructions/rules to follow by itself from the data you provide. If it doesn’t work, your implementation could be wrong, the data might be wrong, the output might be underspecified, or a ton more reasons. The logic to solve the task isn’t the input anymore.


In the words of Andrej Karpathy, “It fails silently”.

Silent failures cost you. Not just in terms of the result but the time and cost to run these experiments. Most experiments take a few days to run, if not weeks, and small issues are incredibly hard to uncover until much later. Compute power isn’t cheap either and the less said about deadlines, the better. Planning machine learning experiments within a fixed timeframe is a logistical nightmare.

Blindly trusting that everything will work is like watching a cursed videotape. Your ML experiment will perish in 7 days.

The most amazing thing about modern machine learning is that it somehow finds a way to work. Even if there’s a problem.

The most infuriating thing about modern machine learning is that it somehow finds a way to work. Even if there’s a problem.

So how do you tackle this devilish beast? Who you gonna call?

The art of being paranoid

The core of the problem here is that failure in ML is silent, non-deterministic, and costs a lot of time. To combat it, we must strive to make even the slightest error loud and as deterministic as possible while minimizing turnaround time.

The best way I’ve found to do this is by being paranoid when designing experiments.

I do this through three overarching principles:

  1. Test-driven development
  2. Embrace simplicity and delay abstraction until needed
  3. Prioritize small and quick experiments before scaling up

Now, these aren’t new and definitely not the best thing since sliced bread, but by adopting these principles (which I call ParanorML Practitioning), I’ve cut down wasted time and effort from days to hours to minutes.

A test is the first user of your code

The heading above is one of my favorite quotes from the book “The Pragmatic Programmer”. In it, the authors argue that by shining the light of a test on your code, things become clearer. Test-driven development or TDD is a school of programming that focuses on writing tests first, seeing them fail, and then writing code to make the tests pass. The intuition behind this is that if thinking about testing is so beneficial, then why not write them upfront?

Diving into the nuances of TDD is beyond the scope of this article. Instead, I’ll focus on how I use it for my ML work.

Let’s say I’m building an image classifier. Years ago, I would have jumped straight into model architecture, training, and scaling up the pipeline to run on many compute devices.

Today, I know better. I start with some questions. What will the end use case look like? Who will call the API I write (another team, a customer, my grandma)? I write a test to verify that call. Obviously, it will fail since there’s no actual model code written yet. Then, I write code to make this test pass.
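Here’s a minimal sketch of what that first test might look like, written with pytest. The module name classifier, the function classify_image, and the label set are hypothetical stand-ins for whatever API your caller actually expects.

```python
# Test-first sketch: `classifier`, `classify_image`, and the labels below are
# hypothetical placeholders for your real API.
import numpy as np

from classifier import classify_image  # doesn't exist yet, so this fails first


def test_classify_image_returns_a_known_label():
    # The caller hands over an RGB image and expects one of the agreed labels back.
    fake_image = np.zeros((224, 224, 3), dtype=np.uint8)
    label = classify_image(fake_image)
    assert label in {"airplane", "ship", "truck"}
```

Only once this fails do I write the simplest model code that makes it pass.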


I look at the data next and validate some hypotheses:

  • The data is balanced
  • It is unbiased (LOL, this is never true, but I do it just for vibes)
  • The labels are correct
  • The data splits are stratified
  • There is no data leakage

The list goes on…
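Here’s a rough sketch of how a couple of those checks might look as pytest-style tests, assuming the dataset metadata lives in pandas DataFrames with hypothetical label and filename columns (wire the DataFrames in via fixtures or however your project loads them):

```python
# Sketch of two data checks; column names and the imbalance threshold are assumptions.
import pandas as pd


def test_classes_are_roughly_balanced(train_df: pd.DataFrame):
    counts = train_df["label"].value_counts()
    # Flag anything where the rarest class has less than half as many
    # examples as the most common one; pick a threshold that suits you.
    assert counts.min() / counts.max() >= 0.5, f"Class imbalance:\n{counts}"


def test_no_leakage_between_splits(train_df: pd.DataFrame, val_df: pd.DataFrame):
    # The same file appearing in both splits is a classic source of inflated metrics.
    overlap = set(train_df["filename"]) & set(val_df["filename"])
    assert not overlap, f"{len(overlap)} files appear in both train and val"
```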

By being paranoid about the tiniest things that can go wrong, you can save yourself from a world of pain later. Say your dataset has images of airplanes, ships, and mitochondria (I don’t know, it’s a weird dataset). I can bet you a fortune that the model will confuse images of airplanes with ships. Why? The background for images of ships is usually blue since they float on the ocean. The background for airplanes in flight is usually blue since they are surrounded by the blue sky. Now your model doesn’t know a priori that both sky and water are blue. Thus, when it sees the color blue, it biases toward either a plane or a ship depending on which class has more images.

When the inevitable model failure occurs, you’ll be busy debugging the code when the issue is actually with the data. Be paranoid instead.

The same holds true when you write code for the model and the overall pipeline.

When you code up the model architecture, check if the tensor shapes are correct. You can do that by passing a dummy tensor through the model. If you use an activation function before returning the output from the network, make sure you use the correct loss function. You can write simple unit tests for each of these situations and verify that things work as they should.
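Here’s a small sketch of that kind of sanity check in PyTorch; the stand-in architecture, input size, and class count are placeholders for your own setup.

```python
import torch
import torch.nn as nn

# Stand-in model; swap in your own architecture here.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),  # 10 classes
)

# Pass a dummy batch through the network and confirm the output shape.
dummy_batch = torch.randn(4, 3, 224, 224)  # (batch, channels, height, width)
logits = model(dummy_batch)
assert logits.shape == (4, 10), f"Unexpected output shape: {logits.shape}"

# nn.CrossEntropyLoss expects raw logits. If the network already ends in a
# softmax, you'd be applying softmax twice, which quietly hurts training;
# return logits instead, or use nn.NLLLoss on log-probabilities.
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (4,)))
assert torch.isfinite(loss), "Loss is NaN/inf on a dummy batch"
```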

Your tests serve as the scaffolding around which you build your architecture (pun intended).

Given how “templatized” training and validation scripts have become, it can be tempting to copy over the pipeline from another project and just run with it. Regardless of whether you give in to that temptation or not, add a bunch of tests to check that the pipeline is doing exactly what you want. For example: is the data preprocessed correctly? Why check? Because if you use augmentations, you want to make sure that things still look right after the data has been modified. It’s also generally good practice to normalize the data, so, naturally, check that it’s been normalized properly. Are the metrics the right ones to use? Are the outputs properly generated? Don’t assume. Write tests first and validate!
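As one example, here’s a rough sketch of a preprocessing check, assuming a torchvision-style transform; the normalization stats and image size are placeholders for your own dataset’s values.

```python
import torch
from torchvision import transforms

# Placeholder stats; substitute your dataset's real mean/std.
preprocess = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])


def test_preprocessing_output_is_sane():
    fake_image = torch.rand(3, 224, 224)  # fake RGB image with values in [0, 1]
    out = preprocess(fake_image)
    # Augmentation shouldn't change the tensor shape for a classifier.
    assert out.shape == fake_image.shape
    # With mean=std=0.5, normalized values should land roughly in [-1, 1].
    assert out.min().item() >= -1.001
    assert out.max().item() <= 1.001
```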

I can keep on listing things like these but you get the drift.

We live in a world where Copilot and ChatGPT have become our rubber duckies. They write code for us as soon as we think of which keys to press on our fancy Keychron Q1 Pro mechanical keyboards. Resist the temptation to blindly trust their memorized wisdom. Use them, but with a healthy serving of ParanorML Practitioning.

Delayed gratification abstraction

The second trap I used to fall into was to “fancify” every piece of code I wrote. Forgive me, but I come from a C++ background, where beautiful object-oriented code is celebrated by the four horsemen who gave us design patterns.

Python, at least to my knowledge, doesn’t exactly play by those rules. Members of a class can be manipulated from anywhere if we’re not careful. This can lead to unexplained behavior like my stray tensor issue from years ago. Writing code with the sheer aim of making it beautiful and abstract is like garnishing a moldy paella with fresh organic cilantro. From a distance, it looks rustic and artisan. Up close, the code smells.


These days, I focus on writing functions and then spruce them up depending on the use case. Combined with my paranoid test-driven approach, this yields consistently reliable results. When you need to refactor code from a Jupyter notebook to beautiful OOP files, the tests you’ve written will help validate your conversion efforts. Win-win!

Now, I’m not saying abstracted code is bad by any stretch of the imagination. I’m just saying it’s better to use it when needed. When starting out, always bias toward simplicity and gradually take the ferry towards complexity island.

The unreasonable effectiveness of small bets

Each day, ML research unveils shiny new models that do incredible things. Do you remember the time when we were gobsmacked by a model generating a high-resolution image of a burger? How naïve were we?

The problem, though, is that these new models take up tons of compute. Even with the equivalent of half a nation’s servers, they take weeks to train before producing meaningful results. Naturally, replicating these experiments or trying new ones ourselves at this scale is infeasible and often leads to wasted effort, many cups of coffee (not a bad thing), and raccoon eyes.

The solution is to hedge our bets on smaller-scale experiments: try a bunch of things and see what works. Then take the most promising ideas and scale them up. I’ve found this saves me so much time and allows me to iterate very quickly given my limited compute resources and time.

Jeremy Howard, the creator of fast.ai, religiously follows this approach. It’s something he taught students in his recently concluded Stable Diffusion course. Here’s an excerpt from a podcast where he talks about this approach.

What I’ve been working on is exactly what you describe. Which is, how to train and play with a state-of-the-art image-generative model in a notebook on a single GPU. I’m doing all my work — just about — on the Fashion-MNIST dataset. Which, rather than being 512x512 pixel images of literally anything in the world, including artworks, in three channels, Fashion-MNIST is 28x28, single-channel images of 1 of 10 types of clothing. I always tell people — whether you’re doing a Kaggle competition, or a project at work, or whatever — the most important two steps are to “Create a rapid feedback loop where you can iterate and test fast”, and to “Have a test which is highly correlated with the final thing you’re going to be doing”. If you have those two things, you can quickly try lots of ideas, and see if they’re probably going to work on the bigger dataset, or the harder problem, or whatever.

By working on small-scale experiments, you enable a rapid feedback loop. It also allows you to verify that you have tests that correlate with your end goal and make adjustments as needed.
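If you work in PyTorch, one way to wire up that feedback loop is to carve out a small slice of an already-small dataset (Fashion-MNIST here, following Jeremy’s example) and iterate on that before touching the full-scale run. A rough sketch:

```python
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

full_train = datasets.FashionMNIST(
    root="data", train=True, download=True, transform=transforms.ToTensor()
)

# Keep just a slice for quick experiments; scale up once an idea looks promising.
small_train = Subset(full_train, range(5_000))
loader = DataLoader(small_train, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
print(images.shape, labels.shape)  # torch.Size([64, 1, 28, 28]) torch.Size([64])
```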

So there you have it. My secret sauce for fewer machine learning casualties. Test first, embrace simplicity, and bet heavily on small-scale experiments.

Most importantly, be paranoid and become a ParanorML Practitioner.

🤖💪 Want more ideas to be a productive ML practitioner?

Each week, I send out a newsletter with practical tips and resources to level up as a machine learning practitioner. Join here for free →

