
Why is Git Not the Best for ML Model Version Control

8 min
27th July, 2023

These days, enterprises are sitting on vast pools of data and increasingly employ machine learning and deep learning algorithms across industries and domains to forecast sales, predict customer churn, detect fraud, and more.

Data science practitioners experiment with algorithms, data, and hyperparameters to develop a model that generates business insights. However, the increasing scale of experiments and projects, especially in mid to large-size enterprises, requires effective model management. Data science teams currently struggle with managing multiple experiments and models and need an efficient way to store, retrieve, and utilize details like model versions, hyperparameters, and performance metrics.

In this article, you will learn about:

  • the challenges plaguing the ML space
  • and why conventional tools are not the right answer to them.

It will further build on the need to compare experiments, which calls for reproducibility, visibility, and collaboration across data science teams. You will learn where Git falls short in maintaining different model versions and will be introduced to tools that provide the required capabilities.

ML model versioning: where are we at?

The short answer is that we are in the middle of a data revolution. All the key data-driven offerings, like model training on text documents or images, leverage advanced language and vision algorithms. Interestingly, the mathematical concept of neural networks has existed for a long time, but it is only now that training a model with billions of parameters has become possible. Let's understand these breakthrough developments through a couple of examples.

Latest algorithmic advancements call for increased parameter search space

ImageNet, the popular image dataset, has played a pivotal role in the development of deep neural networks. From AlexNet with 8 layers in 2012 to ResNet with 152 layers in 2015, deep neural networks have become deeper with time. Deeper networks mean more hyperparameters, more experiments, and, in turn, more model information to save in a form that can be easily retrieved when needed.

ILSVRC winners
Winners of the ILSVRC | Source

GPT-3 (175 billion parameters) and DALL-E (12 billion parameters) have been dwarfed by MT-NLG (530 billion parameters) and Switch Transformers (over a trillion parameters). These large models require an extensive hyperparameter search, including the number of hidden layers, neurons, dropout rates, activations, optimizers, epochs, etc.

Code repository expansion at a large organization 

Let us understand the scale of AI initiatives from the wide range of products and services offered by Google. Most of its products use machine learning or deep learning models for some or all of their features. The chart below showcases the number of commits to Google's central repository during 2000-15.

The chart with the number of commits to Google's central repository
The number of commits to Google's central repository during 2000-15 | Source

Given that organizations like Google are working on cutting-edge innovation in AI, their repositories are expanding exponentially. This means that, given the scale of experimentation, finding the performance metrics for model A in Project Z is difficult. Not only the performance metrics but also the hyperparameters that produced them are a challenge to find, let alone reproduce.

Managing the ever-expanding universe of model experiments

  • The premise of building any algorithm is to generate business value, i.e., enterprises strive to take experiments to production. The complete pipeline, from data collection to model output in the production environment, is key to the success of the endeavor.
  • Multiple scenarios can play out at different stages of this pipeline to diminish the potential gain from the exercise. Even during the model training stage, the storage, retrieval, and interpretation of hyperparameters and performance metrics can adversely impact the selection of the best model.
  • It is not only important to choose the best performance metrics for the model under development, but it's also vital to store these values. A simple revision in evaluation metrics, e.g., from "F1 score" to "ROC," calls for re-running the pipeline and storing the results.
  • The mere availability of off-the-shelf algorithms does not imply they can be used as is. Significant effort goes into data preparation, exploration, processing, and experimentation with algorithms and hyperparameters. These algorithms have shown great results on benchmark datasets, whereas your business problem, and hence your data, is different. One of them might work better than the others, but you still need to find the best configuration. The best configuration, both the right architecture and the right hyperparameters, is derived from extensive experimentation, as the sketch below illustrates.
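
To make this concrete, here is a minimal sketch of what one round of such experimentation might look like with scikit-learn's GridSearchCV. The estimator, search space, and scoring metric below are illustrative assumptions, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search space: the "best configuration" only emerges from
# trying many combinations against your own data, not from library defaults.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",  # the metric you optimize also needs to be recorded
    cv=5,
)

# search.fit(X_train, y_train)  # X_train, y_train: your own dataset
# print(search.best_params_, search.best_score_)
```

Every such run produces a combination of hyperparameters and metrics that has to be stored somewhere retrievable, which is exactly the gap discussed in the rest of this article.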

Need for the right tools and processes to achieve business value

The right set of tools and processes is crucial to facilitate knowledge dissemination across the team. Further, maintaining model versions removes the risk of losing model details in case the original model developer is no longer working on the project. You also need to store model metadata and document details like the configuration, flow, and intent of the experiments.

The three most important traits of such a tool are visibility, reproducibility, and collaboration. 

Visibility

A simple way to understand the meaning of visibility is by asking the following questions. These questions will also assist you in assessing the right tool for your project.

  • 1 Can the project manager view what models are being trained?
  • 2 Which model version is running in production?
  • 3 What is the modelā€™s performance in development vs production?
  • 4 Which metric was used to optimize the model parameters?
  • 5 Which data version was used to train the model?
  • 6 What hyperparameters produced the best metrics?

Now that we understand that visibility surfaces vital details of the model, let us look at the barriers to visibility:

  • Decoupled pieces: The data, code, configuration, and results are generated at different steps during the project.
  • Different tools: Your stack consists of multiple tools, libraries, and infrastructure providers like Azure, AWS, and GCP.
  • Outdated and sparse documentation: Often, the documentation is decoupled from the model artifacts and stored at a different place. The team has to make an effort to maintain it, and thus in no time, it might become outdated.
  • Model breaks in production: Oftentimes, when models graduate from the development environment to the production environment, they don't perform as well as they did during model training. At other times they just break. There could be multiple possible reasons, but it mainly occurs because of missing model and data trails between the two environments.

Reproducibility

You need to be able to repeat the steps of an experiment and produce the same results.

The following factors can impact the reproducibility quotient of an experiment:

  • Variation in incoming data: Any change in the training, validation, or test data can lead to unfamiliar results, thus delaying deployment and value realization for an organization. Some of the scenarios where this variation can happen are:
    • Additional training data missing from the root data path
    • Dataset shuffling, which can change the results of mini-batch or stochastic gradient descent optimization
    • Different random seeds during training and testing (see the seeding sketch after this list)
  • Inconsistent hyperparameters: Hyperparameter values can change the model architecture (e.g., tree depth or network structure) and need to be stored to reproduce results.
  • Changes in ML frameworks: Using different versions of an ML framework, or updating a version, can produce inconsistent results. Packages like Keras can produce non-identical results when used with different backends (PyTorch or TensorFlow).
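
As a small illustration of the random-seed point above, here is a minimal sketch of pinning the common sources of randomness in a PyTorch-based project. The exact set of calls is an assumption that depends on the libraries you use, not an exhaustive recipe.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Fix the common random number generators so data shuffling, weight
    initialization, and dropout behave the same way across runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


set_seed(42)  # call once, before building datasets and models
```

The seed value itself is part of the experiment's metadata and should be versioned together with the hyperparameters.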

Collaboration

With a growing team size, collaboration also grows beyond emails, messengers, and shared drives. Oftentimes, organizations manage communication over Slack channels or Microsoft Teams. But there is a fairly good chance that developers will not remember all the details off the top of their heads or may simply forget to communicate them. Such reliance on sharing the latest code version via chat invites the risk of misinformation, which can cost the project in quality or time delays.

There can be multiple hurdles to collaboration, as outlined below.

  • Team members may be working on different versions of the code: the latest version is either not committed, or a co-developer forgot to pull it from the server.
  • The model development and deployment teams may not be using the latest model.
  • One factor that is often ignored is the data. Data keeps getting generated throughout model development, and the model development team finds it difficult to log which dataset was used for which model version.

Is Git the solution for versioning ML models?

Git is a great tool that supports code versioning and collaboration in traditional software development, but it was not built with machine learning models in mind. It is not purposed for a machine learning workflow that involves data, models, artifacts, and more alongside code.

The limitations of Git are listed below:

  • Git does not save model details like model versions, hyperparameters, performance metrics, data versions, etc. You might push model results and metadata after every experiment run, but retrieval, comparison, and analysis of this data and metadata will soon become a pain as the number of experiments grows.
  • Git also cannot automatically log each experiment. Isn't it ideal that each of your experiments automatically logs itself to a repository with all the information associated with it?
  • Choosing Git for logging the requisite details comes with the manual overhead of maintaining separate documentation for the intent and purpose of the experiment, the choice of algorithms and hyperparameters, and the results, as sketched below. Ideally, the experiment should constitute its own documentation, i.e., an explanation of the what, why, and how of the experiment.
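
To see what that manual overhead looks like in practice, here is a hedged sketch of the do-it-yourself approach: appending each run's metadata to a file that is committed next to the code. The file name, fields, and values are made up for illustration; the pain shows up later, when you try to query and compare hundreds of such entries.

```python
import json
import subprocess
from datetime import datetime, timezone

# Hand-rolled experiment record, committed to Git alongside the code.
record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip(),
    "intent": "baseline churn model with class weights",  # why the run exists
    "hyperparameters": {"learning_rate": 0.001, "max_depth": 6},
    "metrics": {"f1": 0.87, "roc_auc": 0.91},
}

with open("experiments.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")

# ...and then: git add experiments.jsonl && git commit -- for every single run.
```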

Why should we look beyond Git?

It is important to note the differences between software development and machine learning solution development and deployment, as they form the crux of the limitations of Git for ML. 

software development life cycle vs. ML project life cycle
Differentiation between software development life cycle and ML project life cycle | Source
  1. Conventional software development follows a simpler approach: a developer builds an algorithm that processes an input to produce the desired output for a business problem. On the other hand, machine learning developers utilize data to train a model, and the learned model is then deployed to make inferences.
  2. One clear difference between software development and machine learning development is that the output of the first is deterministic, whereas the second produces a probabilistic output. Thus it becomes imperative for the model to learn continuously by incorporating new information as and when a deviation from the prior learning is discovered.
  3. Coming back to the learning and re-learning requirement of machine learning models, any solution involving training and inference also requires rigorous experimentation, mainly because the rules aren't hand-crafted.
  4. Git is designed around the code-centric iteration typical of software development projects. Any iteration in the code is well captured by the system. But what happens when iterations aren't limited to code, and you need to version code, data, models, performance metrics, and hyperparameters all at once as a bundle? This is when Git looks underwhelming, and you need to look beyond it.

All these factors contribute to the need for tools customized for Machine Learning workflow.

Traits of an ideal tool for Machine Learning experimentation

  • Tracking: The bare minimum ask from every data practitioner is to log the results of each experiment. But why stop there? Tracking different models' performance becomes far easier when you go one step further and visualize the performance of different experiments. This helps in generating reports and dashboards for all stakeholders to monitor progress, provide feedback, and present the results to the business.
  • Versioning: Consider a case where a developer has successfully built a model and logged the metrics to prove its worth. Now, this best-performing model gets deployed in the production environment. But as with any other code, it is bound to break. Even if the code works as expected, the output might not. Owing to the black-box nature of machine learning algorithms, silent failures are a common sight.

The degraded model performance calls for the original author to look at the underlying cause and bring the model back up. The developer runs the model in the dev environment to reproduce the results but fails to find the corresponding code base. This highlights the need to log the entire metadata set, which can help the developer trace the production model back to its replica in the dev environment. An ideal tool tracks and maintains different model versions along with their entire metadata.

  • Documentation: The code's original author might not always be available to share the gory details of the deployed model (or any other archived model, if ever the need arises). Hence, an ideal tool provides a platform that maintains a log of all the relevant details for each model run. It saves manual documentation effort and supports the iterative nature of ML projects: learning from the data and comparing results.
  • Platform agnostic: It should seamlessly integrate and work with any infrastructure, tool, or library.

Alternatives to Git for ML model versioning

Alternative tools for Git
Alternative tools for Git | Source

While Git isn't a perfect tool for machine learning pipelines and solutions, there are a few tools that solve some or all of the challenges faced by machine learning teams. They are discussed below.

neptune.ai

neptune.ai supports research as well as production environments and is a go-to metadata store for all your machine learning workflows. It is built on the premise that machine learning experiments are inherently iterative. The scale of experimentation requires data scientists to compare a myriad of models, and Neptune helps by making it easy to monitor and visualize their performance.

Neptune's UI-quickstart
Neptune’s UI | Source: Neptune
  • Provides the platform to track these experiments to prevent repeat implementation and facilitate reproducibility. 
  • Promotes cross-collaboration and allows different team members to work together on different projects by sharing the UI links. 
  • Supports both on-premise and cloud versions.
  • Offers multiple advantages, such as logging and displaying all metadata like model weights, hyperparameters, etc. Best of all is its user-friendly UI, which provides a seamless experience for comparing and analyzing multiple experiments. A minimal logging sketch follows this list.
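
As a rough sketch of how logging to Neptune typically looks; the project name, parameter names, and metric values are placeholders, and the exact API may differ between client versions:

```python
import neptune

# Assumes the NEPTUNE_API_TOKEN environment variable is set;
# "my-workspace/churn-model" is a hypothetical project.
run = neptune.init_run(project="my-workspace/churn-model")

run["parameters"] = {"learning_rate": 0.001, "epochs": 10}

for epoch in range(10):
    run["train/accuracy"].append(0.80 + epoch * 0.01)  # dummy metric values

run["data/version"] = "train_v3"  # free-form metadata, e.g., the dataset version
run.stop()
```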

MLflow

MLflow is an open-source framework that streamlines the end-to-end machine learning flow, including but not limited to model training runs, storing and loading the model in production, reproducing results, etc. It comes with lightweight APIs that support any machine learning library and programming language and can integrate with any code easily.

MLflow's UI
MLflow’s UI | Source: MLflow Tutorial

The forte of MLflow lies in its wide community reach and support, which is essential for any open-source platform. It has four key components:

  • MLflow Tracking: Logs and compares code versions, model parameters, and outputs (see the tracking sketch after this list)
  • MLflow Models: Packages the code for downstream usage – real-time serving or batch results for inference purposes
  • MLflow Projects: Organizes the code to reproduce results, deploy the model to production, or share across the team
  • MLflow Registry: A model store that provides model lineage, model versions, metadata, tags for the various model development stages, etc.
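
A minimal sketch of the tracking component mentioned above; the experiment name, parameter names, metric values, and artifact path are illustrative only:

```python
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_metric("f1", 0.87)
    mlflow.log_metric("roc_auc", 0.91)
    mlflow.log_artifact("model.pkl")  # assumes this file exists locally
```

Runs logged this way can then be browsed and compared side by side in the MLflow UI.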

May be useful

Check an in-depth comparison between MLflow and Neptune.

DVC 

DVC supports machine learning project version control by versioning models, datasets, and intermediate files. 

DVC's UI
DVC’s UI | Source: DVC
  • Supports a wide range of storage systems including but not limited to Amazon S3, Azure Blob Storage, Google Drive, Google Cloud Storage, etc. 
  • Caches intermediate artifacts, which makes it easy to iterate over multiple experiments and reuse archived code, data, or models.
  • Instead of storing large files in Git, it stores lightweight metadata files in Git while the actual data and model artifacts live in external storage.
  • What stands out is its GitOps support, i.e., it works on top of Git repositories and connects machine learning projects with Git workflows; a short data-access sketch follows below.
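
Because DVC rides on top of Git, a dataset version can be pinned to a Git revision. Here is a hedged sketch using DVC's Python API; the file path, repository, and tag are assumptions for illustration:

```python
import dvc.api

# Read the version of the dataset that was committed at Git tag "v1.2".
# The small .dvc metadata file lives in Git; the data itself lives in remote storage.
with dvc.api.open("data/train.csv", repo=".", rev="v1.2") as f:
    header = f.readline()

# Resolve where the actual file is stored (e.g., an S3 bucket).
url = dvc.api.get_url("data/train.csv", repo=".", rev="v1.2")
print(url)
```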

Weights & Biases

Weights & Biases speeds up the model development process by enabling easy model management, experiment tracking, and dataset versioning.

Weights and Biases' UI
Weights and Biases’ UI | Source: Weights and Biases
  • Provides seamless integration with your machine learning code and displays the key metrics and statistics on a dashboard.
  • Lets you visualize the model outcomes across different model versions and author those findings through collaborative reports. 
  • Supports PyTorch Lightning, Keras, TensorFlow, and Hugging Face, to name a few. Unlike Neptune, Weights & Biases does not offer a free version beyond free academic and open-source projects.

One of the distinguishing features of Weights & Biases is that it automatically duplicates the logged datasets and versions them too, as the artifact example below sketches.
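
A minimal sketch of experiment tracking plus dataset versioning with Weights & Biases; the project name, metric values, and file path are placeholders:

```python
import wandb

run = wandb.init(project="churn-model", config={"learning_rate": 1e-3, "epochs": 10})

for epoch in range(run.config.epochs):
    wandb.log({"epoch": epoch, "val_loss": 0.5 - 0.01 * epoch})  # dummy values

# Version the training data as an artifact; W&B stores and tracks it per version.
artifact = wandb.Artifact("training-data", type="dataset")
artifact.add_file("data/train.csv")  # assumes this file exists
run.log_artifact(artifact)

run.finish()
```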

May be useful

Check an in-depth comparison between Weights & Biases and Neptune.

Comet

Comet helps organizations accelerate and optimize the model-building process by providing experiment management, model management, and monitoring of deployed models.

Comet's UI
Comet’s UI | Source: Comet
  • Integrates with your current tech stack and streamlines the cumbersome model-building process end to end, from model training cycles to production runs.
  • Supports custom visualizations using Plotly and Matplotlib in addition to 30+ built-in visualizations.
  • Identifies model or data drift, raises alerts, and presents the deviations relative to the baseline model performance recorded during training (a minimal logging sketch follows this list).
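
A rough sketch of Comet's experiment logging; the project, workspace, and values are hypothetical, and the API key is assumed to be configured via environment variables:

```python
from comet_ml import Experiment

# Assumes COMET_API_KEY is set in the environment;
# project and workspace names are hypothetical.
experiment = Experiment(project_name="churn-model", workspace="my-team")

experiment.log_parameters({"learning_rate": 1e-3, "epochs": 10})

for step in range(10):
    experiment.log_metric("val_auc", 0.85 + 0.005 * step, step=step)  # dummy values

experiment.end()
```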

Sacred 

Sacred is an open-source Python tool to configure, organize, log, and reproduce machine learning experiments. It aims to relieve model developers of the overhead of maintaining parameters and settings across different experiments, thereby allowing them to focus on the key aspects of model development.

Sacred's UI
Sacred’s UI | Source: Official Documentation
  • It can be deployed on the cloud or on-prem but is not provided as a managed service.
  • Given that it is available for free, it does not come with chat or email assistance for enterprise needs.
  • You can visualize model metrics and logs from multiple experiments through a web dashboard called Omniboard. Guild AI is one of the useful alternatives to Sacred + Omniboard. A minimal Sacred sketch follows below.
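
A minimal Sacred experiment sketch; the experiment name, config values, and the MongoDB observer (the store that Omniboard reads from) are assumptions for illustration:

```python
from sacred import Experiment
from sacred.observers import MongoObserver

ex = Experiment("churn_model")
ex.observers.append(MongoObserver())  # logs runs to MongoDB; Omniboard visualizes them

@ex.config
def config():
    learning_rate = 0.001  # captured automatically as the run's configuration
    epochs = 10

@ex.automain
def train(learning_rate, epochs):
    # Training code goes here; the returned value is stored as the run's result.
    return 0.87
```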

How to choose the right tool for versioning ML models?

While this post has listed multiple platforms that support ease of model monitoring and maintenance, there is no standard way to choose one over another. Some of the factors that can help you make the decision, and that should be part of your checklist, are listed below:

  • 1 Ease of use
  • 2 Future releases
  • 3 Product vision and enhancements
  • 4 Admin rights and access
  • 5 Ease of setup and getting started
  • 6 Customer support
  • 7 Framework and programming language agnostic or not?
  • 8 Potential to scale

Wrapping up!

In this post, we discussed the limitations of Git from the purview of machine learning projects. By understanding why Git falls short and what an ideal tool would look like, we explored the current options in the market and how they compare against each other. I hope this article has provided you with clarity about the need for alternatives to Git and what you should keep in mind to pick the right one.
