Image generated by Stable Diffusion

Efficient Machine Learning Pipelines with DVC and MLFlow

George Kamtziridis

--

Developing machine learning applications is a task that encapsulates many distinct steps. It starts with framing the problem and collecting, cleaning and analyzing the required data, and only then do we move on to building and deploying the model. From experience, these latter steps account for approximately 10–20% of the overall time and effort we put into the process. Of course, the steps are not strictly sequential: usually, if not always, after running some experiments we go back and rework the data in order to achieve better performance. We may also alter the experiments themselves, for example by trying different model parameters or other methods for tuning the hyperparameters. This cycle goes on and on until we reach the desired results.

The need to track experiments does not stem only from the desire for a more structured model-building process; it can affect the entire product in which the machine learning application lives. For instance, think of a scenario where the CMO of your company wants to use, for the summer of 2023, the exact same model we used during the summer of 2022. And by “same” we mean a model with the same parameters trained on the exact same data. This is a business decision that we, as engineers, must pull off. If the experiments were tracked appropriately, re-launching the model should be a matter of minutes. In the worst case, we would have to re-train the model, but we would know for sure that we have the correct settings. If the experiments were not tracked, then… good luck.

It should be clear by now that this cycle can end up being quite messy. We train a bunch of machine learning models with different configurations and data, and we compare them through error metrics to find the one that suits us best. The question is: how can we efficiently manage this flow? Is there a method or a tool that can help bring order to this chaotic situation? Luckily, the answer is “Yes”, and the methodology is called Machine Learning Operations, or MLOps for short. You have probably heard the term DevOps in conventional software development. Data scientists and machine learning engineers, in an act of “jealousy”, adopted the concept and changed the term to MLOps.

The main goal of MLOps is to meticulously track the process of training and deploying a model into production. In this article, I’m going to focus on how to track models during training. A lot of people start their experiments in a Jupyter notebook, which is really helpful for testing a few configurations or confirming some assumptions. However, when it comes to running full-scale experiments, a more structured approach is needed. While working on different ML research projects and ideas, I realized that in every project I pretty much reinvented the wheel with regard to MLOps. Time after time, I was creating similar boilerplate in order to properly track my experiments. So, during my last project, I decided to isolate this boilerplate and tweak it to make it more generic. My goal was to construct a reusable and extensible scaffold that can be used by anyone who wants to monitor the data and the models during training.

Data Version Control (DVC) and MLFlow are the two main pillars of the solution. DVC is a very powerful and practical open-source tool that helps with versioning huge datasets. It is built on top of Git and works in a very similar way. DVC uses external storage, such as Azure Blob Storage, Amazon S3, Google Cloud Storage or even a basic Google Drive folder, to store the version history of large data. MLFlow is also an open-source platform; it manages the entire machine learning workflow, from tracking experiments and packaging code into reproducible runs, all the way to sharing and deploying models.

Essentially, I’ve combined these two, alongside other useful libraries such as Sklearn, to build a really flexible workbench capable of hosting fully trackable experiments. The entire solution is available on GitHub, where I outline its structure and usage, as well as how it can be extended. For the rest of this article, I’ll go through a slightly more comprehensive analysis of the scaffold.

To better understand the flow, let’s say that we’re working with the well-known diabetes toy dataset, where we want to predict how the disease progresses. We are dealing with a regression problem, since our target is a continuous variable. For the sake of this walkthrough, we will use a decision tree, which is a pretty basic regressor. It goes without saying that you’re free to choose any model from the Sklearn library or, in general, any model that can be used through the Sklearn API. So, for our decision tree we will need to create a very primitive script:
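A minimal sketch of what this script might look like is shown below, assuming the scaffold exposes a `run_experiment` entry point. The import path, argument names and hyperparameter values are illustrative, not the definitive API; the data file, target column and search mode are supplied later through command-line arguments.

```python
# decision_trees_experiment.py -- illustrative sketch, not the definitive script
from sklearn.tree import DecisionTreeRegressor

# Assumed location of the scaffold's entry point; check the repository
# for the actual module path and signature.
from core.experiment import run_experiment

# 1. Initialize the model.
model = DecisionTreeRegressor()

# 2. Define the hyperparameter grid that the grid/random search will explore.
parameters = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# 3. Hand everything over to the scaffold, which takes care of tuning,
#    MLFlow tracking and artifact logging.
run_experiment(model=model, parameters=parameters)
```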

The script consists of 3 distinct phases: the initialization of the model, the parameter setup and the `run_experiment` call. The initialization should be straightforward to all Sklearn users. The parameters are set so that they can be fed into a grid search process that handles the hyperparameter tuning. Besides grid search, we support random search and Bayesian optimization. For random search the parameters are set in exactly the same way. However, for Bayesian optimization they are configured a bit differently, since we use the scikit-optimize library:
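Roughly speaking, the Bayesian search space is declared with scikit-optimize dimension objects instead of plain lists of candidate values. The parameter names and ranges below are purely illustrative:

```python
from skopt.space import Categorical, Integer

# Search space for Bayesian optimization: each hyperparameter becomes a
# scikit-optimize dimension rather than a list of candidate values.
parameters = {
    "max_depth": Integer(2, 10),
    "min_samples_split": Integer(2, 20),
    "min_samples_leaf": Integer(1, 10),
    "criterion": Categorical(["squared_error", "friedman_mse"]),
}
```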

Congratulations! You are now ready to run any experiment you like. Before doing so, make sure to save your dataset in the `data` folder, which is fully tracked by DVC, and run `pip install .` to install the required dependencies (I strongly recommend creating a virtual environment first). The experiments can be started via the command line. For instance, to run an experiment named “Test” with grid search optimization we can do:

py .\decision_trees_experiment.py --data-file "./data/diabetes.csv" --mode "grid_search" --target "Y" --experiment-name "Test"

When the experiment is complete, you will find a new folder inside `mlruns`, named after the id of the experiment that was conducted; every experiment is stored there. To visualize the results, run `mlflow ui` and then open `http://localhost:5000` in your browser. There you will find every little detail about all the experiments and will be able to compare them effectively. For more details on this, make sure to check the official documentation. Additionally, in the `results` folder we store specific details for each experiment, such as error metrics, the model itself and some informative plots, like feature importance, partial dependence and permutation importance plots. Both the `mlruns` and `results` folders are tracked by DVC. The results can be enhanced at will by adding the corresponding evaluation metrics in the `core/evaluation.py` script. Also, since this is a regression task, all the error metrics are regression oriented, meaning that for a classification task we would need to adjust them.
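As a sketch of such an extension, an extra metric could be added roughly like this. The function name and return structure are assumptions about how `core/evaluation.py` is organized; only the scikit-learn calls themselves are standard:

```python
# core/evaluation.py -- hypothetical extension
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Compute the regression metrics stored in the results folder."""
    return {
        "mse": mean_squared_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),  # newly added metric
    }
```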

To test how Bayesian optimization works we will need to create a new run in the context of the previous experiment. To do so, run:

py .\decision_trees_experiment.py --data-file "./data/diabetes.csv" --mode "bayesian" --target "Y" --experiment-id {add_experiment_id_here}

The entire argument list is explained in the readme file. New arguments can be added through the `core/utilities.py` file. Another useful argument is `--scale`, which dictates which scaling should be applied to the data. In the `core/scale.py` file we have predefined two different scalers: standard and min-max. To test how the standard scaler affects the results, run:

py .\decision_trees_experiment.py --data-file "./data/diabetes.csv" --mode "bayesian" --target "Y" --scale "standard" --experiment-id {add_experiment_id_here}

New scalers or new ways of applying scaling can be added in the `scale.py` script.
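For instance, a robust scaler could be registered next to the predefined ones. The dictionary-based registry below is an assumption about how the scaffold maps the `--scale` argument to a scaler; only the scikit-learn classes are standard:

```python
# core/scale.py -- hypothetical extension
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Mapping from the value of --scale to the corresponding scikit-learn scaler.
SCALERS = {
    "standard": StandardScaler,
    "min_max": MinMaxScaler,
    "robust": RobustScaler,  # new option, e.g. --scale "robust"
}
```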

Last but not least, it’s really important to push the results to your repository. To do so, make sure you have DVC installed on your machine and then follow the instructions to set up DVC remote storage. Then you will be able to use `dvc add` and `dvc push` to push the results and allow your collaborators to see your progress.
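For reference, a typical session could look like the following; the remote name and bucket URL are placeholders, and the exact set of tracked paths depends on your setup:

```
dvc remote add -d storage s3://my-bucket/dvc-store   # or Azure, GCS, Google Drive, ...
dvc add data/diabetes.csv mlruns results
git add data/diabetes.csv.dvc mlruns.dvc results.dvc .gitignore
git commit -m "Track data and experiment artifacts with DVC"
dvc push
```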

And that’s pretty much it! Feel free to fork the repository to use it as is or change any part you like to make it fit your needs better. If you think that I’ve missed something in the flow, please open a PR and propose your solution/fix/enhancement. I would be glad to merge it.

Github repository: https://github.com/gkamtzir/mlops-scaffold

