MLOps Blog

How to Build an Experiment Tracking Tool [Learnings From Engineers Behind Neptune]

11 min
13th September, 2023

As an MLOps engineer on your team, you are often tasked with improving the workflow of your data scientists by adding capabilities to your ML platform or by building standalone tools for them to use. 

Experiment tracking is one such capability. And since you are reading this article, the data scientists you support have probably reached out for help. The experiments they run are scaling and becoming increasingly complex; keeping track of their experiments and ensuring they are reproducible have gotten harder. 

Building a tool for managing experiments can help your data scientists:

  1. Keep track of experiments across different projects,
  2. Save experiment-related metadata,
  3. Reproduce and compare results over time,
  4. Share results with teammates,
  5. Push experiment outputs to downstream systems.

This article is a summary of what we’ve learned from building and maintaining one of the most popular experiment trackers for the past five years. 

Based on insights from our very own Piotr Łusakowski (architect), Adam Nieżurawski (back-end technical lead), and other engineers at neptune.ai, you’ll learn:

  • How to develop requirements for your experiment tracking tool,
  • What the components of an ideal experiment tracking tool are, and how they satisfy the requirements,
  • How to architect the backend layer of an experiment tracking tool,
  • What technical considerations to make when building an experiment tracking tool.

The focus of this guide is to give you the necessary building blocks to build a tool that works for your team. This article does not cover specific technology choices for building an experiment tracking tool, nor does it walk through the code to build one.

We will focus on the building blocks because any code we wrote would stop being relevant within a week, and any specific tool would likely be forgotten after six months.

Developing the requirements for an experiment tracking tool

On the development side, there are three major problems you solve when you build an ML experiment tracking tool:

  • Helping your data scientists handle metadata and artifact lineage from data and model origins.
  • Giving your data scientists an interface to monitor and evaluate experiment performance for effective decision-making and debugging.
  • Giving your data scientists a platform to track the progress of their ML projects.
Three reasons you need to build an experiment tracking tool

Handling metadata and artifact lineage from data and model origins

An experiment tracking tool can help your data scientists trace the lineage of experiment artifacts from their data and model origins, store the resulting metadata, and manage it. It should be possible to locate where the data and models for an experiment came from, so your data scientists can explore the events of the experiment and the processes that led to them.

This unlocks two significant benefits:

  • Reproducibility: Ensuring every experiment your data scientists run is reproducible.
  • Explainability: Making sure they can explain their experiment results.

Ensuring reproducible experiment results

The results of an experiment should be easy to reproduce so that your data scientists can collaborate better with each other and with other teams and maintain an efficient workflow. Think of reproducibility as running the same code with the same environment configuration on the same data and getting the same or similar experiment results.

To make reproducibility work, you need to build components that keep track of the experiment metadata (such as the parameters, results, configuration files, model and data versions, and so on), code changes, and the data scientists’ training environment (or infrastructure) configurations.
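To make this concrete, here is a minimal sketch (in Python) of what capturing such a reproducibility snapshot could look like. The function and its fields are illustrative assumptions, not part of any existing tool:

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def capture_run_snapshot(params: dict, data_version: str) -> dict:
    """Collect the minimum metadata needed to reproduce a run.

    `params` and `data_version` are supplied by the training script;
    everything else is read from the environment.
    """
    try:
        git_commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        git_commit = "unknown"  # e.g., running outside a git checkout

    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "git_commit": git_commit,
        "python_version": sys.version,
        "platform": platform.platform(),
        "installed_packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True,
        ).stdout.splitlines(),
        "params": params,
        "data_version": data_version,
    }


if __name__ == "__main__":
    snapshot = capture_run_snapshot(
        params={"lr": 3e-4, "batch_size": 64}, data_version="v2.1"
    )
    print(json.dumps(snapshot, indent=2)[:500])
```

A snapshot like this, stored alongside the run's metrics, is what lets a teammate rerun the experiment later under the same conditions.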

Without end-to-end traceability and tracking of the lineage of data, it’s almost impossible for data scientists to reproduce models and debug errors in their pipelines.

Your users should be able to track changes to the model development codebase (data processing code, pipeline code, utility scripts, et cetera) that directly influence how they run experiments and the corresponding results.

Making sure data scientists can explain experiment results

When data scientists run experiments and build models that meet expected performance requirements, they also need to understand the results to judge why their models make certain predictions. This isn’t necessary in every situation, but in circumstances where they need to understand how and why a model makes its predictions, “ML explainability” becomes crucial.

You can’t add explainability to their workflow if you can’t track where the experiment data originates from (its lineage), how it was processed, what parameters they used to run experiments, and, of course, what the results of those experiments were.

An experiment tracking tool should allow your data scientists to:

  • Examine other people’s experiments and easily share theirs.
  • Compare the behavior of any of the created experiments.
  • Trace and audit every experiment for unwanted bias and other problems.
  • Debug and compare experiments for which the training data, code, or parameters are missing.

Legal compliance is another reason why explainability is essential. For example, GDPR requires your organization to collect and keep track of metadata about the datasets and to document and report how the resulting model(s) from experiments work.

Monitor and evaluate experiment performance for effective decision-making  

Most of the time, it makes sense to compare the results of experiments done with different dataset versions and parameters. An experiment tracking solution helps your data scientists measure the impact of changing model parameters on experiments. They will see how a model’s performance changes with different data versions.

This, of course, helps them build robust and high-performing machine learning models. They can’t be sure that a trained model (or models) will generalize to unseen data without monitoring and evaluating their experiments. The data science team can use this information to choose the best model, parameters, and performance metrics.

Track the progress of a machine learning project

Using an experiment tracking solution, the data science team and other concerned stakeholders can check the progress of a project and see if it’s heading toward the expected performance requirements.

Functional and non-functional requirements

I would be preaching to the choir if I said a lot of thought goes into developing effective requirements for any software tool. First, you’d have to find out what the requirements are in relation to the business and product usage. Then you must specify, analyze, test, and manage them throughout the software development lifecycle. 

Creating user stories, analyzing them, and validating requirements are all parts of requirements development that deserve their own article. This section provides an overview of the most important functional and non-functional requirements for an ideal experiment tracking tool.

Understanding your users

Depending on your team structure and organizational setup, you might have different users that require an experiment tracking tool. But ideally, data scientists would be the users of your experiment tracking tool. At a high level, here are the jobs your data scientists would want to do with an experiment tracking tool:

  • See model training runs live: When training models on remote servers or away from their computer, they want to see model training runs live so they can react quickly when runs fail or analyze results when they complete.
  • See all model training metadata in one place: When working on a project with a team or by themselves, they want to have all the model-building metadata in one location so they can quickly find the best model metadata whenever they need it and have the assurance that it will always be there.
  • Compare model training runs: When they have different versions of models trained, they want to compare models and see which ones performed best, which parameters worked, and what inputs/outputs were different.


Functional requirements

In the previous section, you learned about the problems you solve with an experiment tracking tool; these are also the jobs to be done to build a functional experiment tracking tool.

To begin designing experiment tracking software, you must develop functional requirements that represent what an ideal experiment tracking tool should do. I have categorized the functional requirements in the table below, showing the need based on the jobs your users have to do and what the resulting feature should look like.

Need: Seamless integration with tools in the ecosystem.
Features:
  • Integrate with ML frameworks you leverage for experimentation (model training and data tools).
  • Integrate with workflow orchestrators and CI/CD tools (if your stack is at this level).

Need: Support for multiple data types for metadata logging (see the sketch after this table).
Features:
  • Record simple data types like integer, float, string, et cetera.
  • Record complex data types like series of floats, strings, images, files, and file directories.

Need: Consume the logged metadata (both programmatically and via the UI).
Features:
  • View a list of runs and run details, such as the data version it was trained on, model parameters, performance metrics, artifacts, owner, duration, and the time it was created.
  • Sort and filter experiments by attributes. For example, you may want it to show only runs trained on version X of dataset Y with accuracy over 0.93, group them by the users that created them, and sort by creation time.
  • Compare different experiments to see how different parameters, model architectures, and datasets affect model accuracy, cost of training, and hardware utilization.
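To make the "multiple data types" rows above concrete, here is a minimal sketch of what the logging interface could look like from the data scientist's side. The `Run` class and its methods are hypothetical and kept in memory purely for illustration; a real client would send these values to a backend:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class Run:
    """In-memory stand-in for a tracked run; a real client would ship
    these values to the backend instead of keeping them locally."""
    run_id: str
    attributes: dict[str, Any] = field(default_factory=dict)
    series: dict[str, list[float]] = field(default_factory=dict)
    files: dict[str, Path] = field(default_factory=dict)

    def log(self, name: str, value: Any) -> None:
        # Simple types: int, float, str, bool, ...
        self.attributes[name] = value

    def append(self, name: str, value: float) -> None:
        # Series of floats, e.g. a loss curve logged every step.
        self.series.setdefault(name, []).append(value)

    def upload(self, name: str, path: str | Path) -> None:
        # Files and directories: configs, data samples, model weights.
        self.files[name] = Path(path)


run = Run(run_id="EX-42")
run.log("params/lr", 3e-4)
run.log("data/version", "v2.1")
for step, loss in enumerate([0.9, 0.7, 0.55]):
    run.append("metrics/train_loss", loss)
run.upload("config", "config.yaml")  # hypothetical path
print(run.attributes, run.series)
```

The namespaced attribute paths ("params/…", "metrics/…") are one simple convention for keeping simple values, series, and files organized under a single run.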

Non-functional requirements

The non-functional requirements for an experiment tracking tool should include:

  • Reliability: You don’t want your experiment tracking tool blowing up training jobs; that could be costly and, of course, should be a deal breaker.
  • Performance: The APIs and integrations need low latency so that the tool doesn’t slow down training jobs and ML pipelines (which cost money).
  • Efficiency: The architecture and technologies should be optimized for cost-effectiveness because your users can run and track many experiments, and the costs can quickly add up.
  • Scalability: ML will only grow in importance within your organization. You don’t want to end up rewriting the system because of shortcuts you took early on, when only one data scientist was using it.
  • Robustness: You need an elastic data model that supports:
      • Varying team sizes and structures (a single data scientist only, or a team of one data scientist, 4 machine learning engineers, 2 DevOps engineers, etc.).
      • Varying workflows, so users can decide what they want to track. Some will only track the post-training phase, some will track entire pipelines of data transformations, and others will monitor models in production.
      • An ever-changing landscape: the data model needs to be able to support every new ML framework and tool in the ecosystem. Without that, your integrations could quickly become hacky and unmaintainable.

Requirements for external integration: The integration should be set up so that the software can collect metadata about the datasets that users will use for experiments.
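As one illustration of what collecting metadata about datasets could mean in practice, here is a minimal sketch that fingerprints a dataset directory so a run can record exactly which data it used. The function and paths are hypothetical assumptions, not part of any specific tool:

```python
import hashlib
from pathlib import Path


def fingerprint_dataset(root: str | Path) -> dict:
    """Hash every file under `root` so the same bytes always yield the
    same fingerprint, regardless of where the dataset is mounted."""
    root = Path(root)
    digest = hashlib.sha256()
    n_files, n_bytes = 0, 0
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
            n_files += 1
            n_bytes += path.stat().st_size
    return {
        "dataset_root": str(root),
        "sha256": digest.hexdigest(),
        "num_files": n_files,
        "total_bytes": n_bytes,
    }


# Usage: log the returned dict as run attributes, e.g. under "data/".
# print(fingerprint_dataset("data/train"))  # hypothetical path
```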

Architecture of the experiment tracking system

An ideal experiment tracking system will have three layers:

  1. Frontend/UI.
  2. Backend.
  3. Client library (API / ecosystem integration).

Once you understand what these components do and why we need them, you’ll be able to build a system tailored to your experiment tracking needs.

Interactions between different layers of the experiment tracking software

Building the frontend layer of the experiment system

The frontend receives most user requests and forwards them to the backend servers, which run the experiment tracking logic. Since most requests and responses have to go through the front-end layer, it will get a lot of traffic and needs to be able to handle the highest concurrency level.

The front-end layer is also how you can visualize and interact with the experiments you run. What are the most important parts of the front end of a system for tracking experiments?

Visualize experiment metadata

A lot of experimentation in data science deals with visualizations, from visualizing data to real-time monitoring of trained models and their performance. The front-end layer must be able to display all kinds of experiment metadata, from simple strings to embedded Jupyter Notebooks, source code, videos, and custom reports.

Display hundreds of runs with their attributes

You want to be able to look at experiment details at any point, both during and after a run, including the related properties and logged metadata. Such metadata includes:

  • Algorithms used.
  • Performance metrics and results.
  • Experiment duration.
  • Input dataset.
  • Time the experiment started.
  • Unique identifier for the run.
  • Other properties you think may be necessary. 

You would also need to compare runs based on their results and potentially across experiments.

If it’s essential to your use case, you may want to add explainability features to your experiments based on those attributes. And, of course, you may also want to promote or download your models from this view.

Runs table in neptune.ai

Managing state and optimizing performance are two of the most complex parts of building the UI component. Comparing, say, ten runs with thousands of attributes each, many of which need to be displayed on dynamic charts, can cause lots of headaches. Even medium-sized projects may experience constant browser freezes if you approach this naively.

Aside from performance, there are other UI tweaks that can let you show only a subset of a project’s attributes, group runs by specific attributes, sort by others, use a filter query editor with hints and completion, and so on.

System backend

The system backend supports the logic of your experiment tracking solution. This layer is where you encode the rules of the experiment tracking domain and determine how data is created, stored, and modified. 

The front end is one of the clients for this layer. You can have other clients, like integrations with a model registry, data quality monitoring components, etc. As with most traditional software, an effective approach is to create services in this layer and in the API layer, which you’ll learn about in the following section.

For a basic experiment tracking tool, you need to implement two primary components of the system backend:

  1. A user, project, and workspace registry.
  2. The actual tracking component.

User, project, and workspace registry

This component helps you manage users of the experiment tracking tool and track their activity across the experiments they run. The main things this component needs to do are:

  • Handle authentication and authorization,
  • Manage project administration and permissions,
  • Enforce quotas (number of requests per second, amount of storage) per project or workspace.

What’s the level of permission detail you want to implement? You can choose between granular permissions, custom roles, and coarse, predefined roles.
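To make the granularity question tangible, here is a minimal sketch of the coarse, predefined-roles option, with hypothetical role and action names; a granular model would replace the fixed role-to-action mapping with per-resource permissions:

```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    MEMBER = "member"
    ADMIN = "admin"


# Coarse, predefined roles: each role maps to a fixed set of actions.
ROLE_ACTIONS: dict[Role, set[str]] = {
    Role.VIEWER: {"read_runs"},
    Role.MEMBER: {"read_runs", "create_runs", "log_metadata"},
    Role.ADMIN: {"read_runs", "create_runs", "log_metadata",
                 "manage_members", "delete_project"},
}


def is_allowed(role: Role, action: str) -> bool:
    return action in ROLE_ACTIONS.get(role, set())


assert is_allowed(Role.MEMBER, "log_metadata")
assert not is_allowed(Role.VIEWER, "delete_project")
```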

Tracking component

The tracking component is the actual experiment tracking logic you need to implement. Here are some pieces you should consider implementing:

  • Attribute storage.
  • Blob and file storage.
  • Series storage.
  • Querying engine.

Attribute storage

Your runs have attributes (parameters, metrics, data samples, etc.), and you need a way to associate this data with the runs. This is where attribute storage and general tabular data organization come into play, so that data lookups are easy for your users to perform. A relational database is a natural fit here.

What is the level of consistency you want? Can you accept eventual consistency? Or would you rather have strong consistency at the cost of higher latency at the API layer?
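Whichever consistency model you pick, the relational layout itself can stay simple. Here is a minimal sketch using SQLite and a generic run/attribute pair of tables; the schema and attribute paths are illustrative assumptions, not a production recommendation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (
    run_id     TEXT PRIMARY KEY,
    project_id TEXT NOT NULL,
    created_at TEXT NOT NULL,
    owner      TEXT NOT NULL
);

-- One row per (run, attribute path); typed value columns keep queries
-- simple while leaving the set of attribute paths open-ended.
CREATE TABLE run_attributes (
    run_id       TEXT NOT NULL REFERENCES runs(run_id),
    path         TEXT NOT NULL,          -- e.g. 'params/lr'
    value_float  REAL,
    value_string TEXT,
    PRIMARY KEY (run_id, path)
);
""")

conn.execute("INSERT INTO runs VALUES ('EX-42', 'churn', '2023-09-13', 'ada')")
conn.execute(
    "INSERT INTO run_attributes (run_id, path, value_float) VALUES (?, ?, ?)",
    ("EX-42", "metrics/accuracy", 0.94),
)

# Lookup: all runs in a project with accuracy above a threshold.
rows = conn.execute("""
    SELECT r.run_id, a.value_float
    FROM runs r JOIN run_attributes a USING (run_id)
    WHERE r.project_id = 'churn'
      AND a.path = 'metrics/accuracy' AND a.value_float > 0.93
""").fetchall()
print(rows)  # [('EX-42', 0.94)]
```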

Blob and file storage

Some attributes don’t easily fit into a database field, and you need a data model to handle this. Blob storage is a highly cost-effective solution; its main advantage is that you can store a vast amount of unstructured data. Your users might want to store source code, data samples (CSVs, images, pickled DataFrames, etc.), model weights, configuration files, and more, and blob storage handles all of these well.

The key considerations here are the storage service’s long-term cost-effectiveness and low-latency access.
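On the cost side, a common pattern is content-addressed storage: a blob's hash becomes its key, so identical files are stored only once. Here is a minimal local-filesystem sketch; an object store such as S3 would follow the same idea with the hash as the object key. The paths and helper names are hypothetical:

```python
import hashlib
import shutil
from pathlib import Path

BLOB_ROOT = Path("blob_store")  # stand-in for a bucket or volume


def put_blob(path: str | Path) -> str:
    """Copy a file into the store under its content hash and return the key."""
    path = Path(path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    dest = BLOB_ROOT / digest[:2] / digest  # fan out to avoid huge directories
    if not dest.exists():                   # identical content is stored once
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest)
    return digest


def get_blob(key: str) -> bytes:
    return (BLOB_ROOT / key[:2] / key).read_bytes()


# Usage: store the returned key as a run attribute,
# e.g. under 'artifacts/model_weights'.
# key = put_blob("model.pt")  # hypothetical file
```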

Series storage

You need to determine a way to store series, especially numeric series, which are attributes that require special attention. Depending on your use case, they could have tens to millions of elements, and it can be challenging to store them in a way that lets the user access the data in the UI. You can also limit the length of the series you support to, say, 1,000 elements, which is enough for many use cases.

The key considerations are:

  1. Long-term storage cost-effectiveness.
  2. The tradeoff between functionality and relative implementation simplicity.
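If you do support long series, the UI rarely needs every point. Here is a minimal sketch of bucketed downsampling that caps what the frontend has to render; the 1,000-point cap mirrors the limit mentioned above and is an illustrative choice, not a prescription:

```python
def downsample(values: list[float], max_points: int = 1000) -> list[tuple[int, float]]:
    """Return at most `max_points` (step, value) pairs, keeping the last
    element of each bucket so the curve's overall trend is preserved."""
    if len(values) <= max_points:
        return list(enumerate(values))
    bucket = len(values) / max_points
    sampled = []
    for i in range(max_points):
        step = min(int((i + 1) * bucket) - 1, len(values) - 1)
        sampled.append((step, values[step]))
    return sampled


loss = [1.0 / (s + 1) for s in range(50_000)]   # 50k logged steps
points = downsample(loss)
print(len(points), points[0], points[-1])        # 1000 points, last at step 49999
```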

Querying engine

You also need to let users filter runs that can have very different structures, which means you need a robust database engine that can handle these kinds of queries. This is something a simple relational database cannot do effectively once the amount of data is non-trivial. An alternative is to severely limit the number of experiment attributes users can filter or group by. If you go more low-level, a few database hacks and tricks will be enough to work your way around this.

The key consideration here is the tradeoff between the number of attributes a user can filter, sort, or group by and the implementation simplicity.
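If you take the "limit the filterable attributes" route, the query engine can stay simple. Here is a minimal in-memory sketch that applies a small set of filter clauses over run attributes; the attribute names and filter format are hypothetical:

```python
from typing import Any

# A hypothetical, deliberately small set of filterable attributes.
FILTERABLE = {"data/version", "metrics/accuracy", "owner"}

OPS = {
    ">": lambda a, b: a is not None and a > b,
    "=": lambda a, b: a == b,
}


def filter_runs(runs: list[dict[str, Any]], clauses: list[tuple[str, str, Any]]):
    """Keep runs matching every (attribute, operator, value) clause."""
    for attr, _, _ in clauses:
        if attr not in FILTERABLE:
            raise ValueError(f"filtering on '{attr}' is not supported")
    return [
        run for run in runs
        if all(OPS[op](run.get(attr), value) for attr, op, value in clauses)
    ]


runs = [
    {"run_id": "EX-41", "data/version": "X", "metrics/accuracy": 0.91, "owner": "ada"},
    {"run_id": "EX-42", "data/version": "X", "metrics/accuracy": 0.94, "owner": "ada"},
]
best = filter_runs(runs, [("data/version", "=", "X"), ("metrics/accuracy", ">", 0.93)])
print([r["run_id"] for r in best])  # ['EX-42']
```

This mirrors the earlier filtering example (runs on dataset version X with accuracy over 0.93) while keeping the engine trivially simple.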

Client library (API and ecosystem integration)

At a high level, the API layer shields clients from knowing the structure and organization of the backend, or even which service exposes a specific operation. It shouldn’t modify data or run business logic the way the backend layer does. Instead, it should offer a standard proxy interface that exposes the service endpoints and API operations you configure it to expose.

When building an experiment tracking tool, a raw (native) API usually isn’t enough. For the solution to be usable by users, it needs to be integrated seamlessly with their code. If you define the API layer first, clients will have minimal, if any, changes to make in response to the underlying refactoring of the codebase, as long as the API contract doesn’t change. 

This, in practice, means you can have a library (preferably Python) do the heavy lifting of communicating with the backend servers for logging and querying data. It handles retries and backoffs; you probably want to implement a persistent and asynchronous queue from the start—persistent for data durability and asynchronous so that it doesn’t slow down the model training process for your users. 
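Here is a minimal sketch of the asynchronous part: a background worker drains an in-memory queue and retries with exponential backoff, so logging calls return immediately and never block training. A real client would also persist the queue to disk for durability; the `send_batch` function is a hypothetical stand-in for the HTTP call to your backend:

```python
import queue
import threading
import time


def send_batch(batch: list[dict]) -> None:
    """Hypothetical stand-in for an HTTP call to the tracking backend."""
    print(f"sent {len(batch)} records")


class AsyncLogger:
    def __init__(self, flush_every: float = 1.0, max_retries: int = 5):
        self._queue: queue.Queue[dict] = queue.Queue()
        self._flush_every = flush_every
        self._max_retries = max_retries
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, record: dict) -> None:
        self._queue.put(record)   # returns immediately; never blocks training

    def _run(self) -> None:
        while not self._stop.is_set() or not self._queue.empty():
            batch = []
            while not self._queue.empty():
                batch.append(self._queue.get_nowait())
            if batch:
                for attempt in range(self._max_retries):
                    try:
                        send_batch(batch)
                        break
                    except Exception:
                        time.sleep(2 ** attempt)   # exponential backoff
            time.sleep(self._flush_every)

    def close(self) -> None:
        self._stop.set()
        self._worker.join()


logger = AsyncLogger()
for step in range(3):
    logger.log({"path": "metrics/loss", "step": step, "value": 1.0 / (step + 1)})
logger.close()
```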

Since the experiment tracking tool will also need to work with your data, training, and model promotion tools, among others, your API layer needs to integrate with the tools in the ML and data ecosystem.

The ML and data ecosystem evolves, so building an integration is not the end. It needs to be tested with new versions of the tools it works with and updated when APIs become deprecated or, more often, when they change without warning. You can also solve this by directing users to the legacy versions of the integration without forcing them to make changes to their experimentation code. 

Considerations for experiment tracking software architecture (backend)

Evaluating feasibility is a significant part of building the structure of your experiment tracking system and assessing if you are building it right. This means developing a high-level architecture based on the requirements you’ve established.

The architectural design should show how the layers and components you have analyzed fit together in a layered architecture. By “layered architecture,” I mean distinguishing your frontend architecture from your backend architecture. This article focuses on the considerations for the backend architecture, which is where the logic for experiment tracking is encoded.

Once you understand your backend architecture, you can also follow domain-driven design principles to build a frontend architecture.

Backend architecture layer

To build the system architecture, follow software architectural principles that help make the structure as simple and efficient as possible. One of those principles is modularity: you want the software to be modular and well-structured so that you can quickly understand the source code, save time building the system, and possibly reduce technical debt.

Your tool for keeping track of experiments will almost certainly evolve, so when you develop the first architecture, it will be rough and just something that works. Because your architecture will change over time, it will need to follow consistent design patterns to save time on maintenance and adding new features. 

Here’s what the backend architecture layer for a basic experiment tracking solution looks like, taking the requirements and components listed earlier into account:

Backend architecture of an ideal experiment tracking tool

Using the parts explained earlier, you can find the different modules in the architecture and see how they work together.

Architectural consideration: separate authentication and authorization

You may have noticed in the architecture that authentication is separate from authorization. Since you may have different users, you would want them to validate their credentials through the authentication component.

Through the authorization component, the software administrator can manage permissions and access levels for each user. You can read more about the difference between authentication and authorization in this article.

The quota management part of the user management component would help manage the storage limit available to users.
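Here is a minimal sketch of how the storage side of such a quota could be enforced; the limit and the in-memory usage counter are illustrative stand-ins for real accounting:

```python
WORKSPACE_STORAGE_QUOTA_BYTES = 50 * 1024**3   # hypothetical 50 GB per workspace
_usage_bytes: dict[str, int] = {}               # workspace_id -> bytes used


class QuotaExceeded(Exception):
    pass


def charge_storage(workspace_id: str, upload_size: int) -> None:
    """Reject the upload if it would push the workspace over its quota."""
    used = _usage_bytes.get(workspace_id, 0)
    if used + upload_size > WORKSPACE_STORAGE_QUOTA_BYTES:
        raise QuotaExceeded(
            f"workspace {workspace_id} would exceed its storage quota"
        )
    _usage_bytes[workspace_id] = used + upload_size


charge_storage("team-a", 10 * 1024**2)   # a 10 MB upload is accepted
```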

Considerations to make before building an experiment tracking tool

At this point, you have learned at a high level what it takes to build experiment tracking software. The natural next question then becomes, “What considerations must I make before building an experiment tracking tool?”

Well, maybe that’s a silly question and not one you’d ask if your organization’s strategic software—the core product offering—is an experiment tracking tool. If it isn’t, you should keep reading! 

You are probably familiar with this software development dilemma already: build vs. buy. If the experiment tracking tool is your organization’s operational software (supporting the regular operations of the data and ML teams), the next important consideration is your organization’s level of maturity in terms of:

  • Experience,
  • Talent, 
  • And resources (time and money). 
Considerations when deciding whether to build or buy an experiment tracking tool

Let’s look at some of the questions you’ll need to ask yourself when trying to make this decision.

Has your organization ever developed an experiment tracking tool?

For example, if your organization has never developed an experiment tracker, it would take more time, trial, and error to get up to speed on industry standards, best practices, security, and compliance. If a “foundation” to build upon doesn’t exist, especially from your engineering team, hacky builds may fall short of industry standards and company expectations.

You and other stakeholders must consider whether a company can afford the trial and error or whether an effective, safer, and more reliable off-the-shelf solution is required.

What talent is available to build? 

If you are a product or software company, there’s a chance that you already have software developers working on your strategic offering. You have to consider the opportunity cost of putting your internal developers’ skills into building an experiment tracking tool instead of utilizing their skills to improve your main product.

Developing an experiment tracker in-house could significantly improve your product or the efficiency of your ML team’s workflow. Still, it could also take the skills and time of your developers away from building other things that would be more meaningful differentiators or end up wasting time and effort.

What would it cost us to build?

The costs of building an experiment tracker in-house are mostly maintenance costs. Can the organization bear the cost of keeping the software up to date, fixing bugs, and adding new features? 

The budget must cover the infrastructure needed to keep the tool running and the cost of hiring new people in case the original developers leave. Think about the long-term effects of a pricey and time-consuming software development project, not just the short-term savings.

A big part of lowering maintenance costs is reducing the chance of high, unexpected expenses coming up all at once. In other words, as with general maintenance costs, the best time to manage risk is early in the software lifecycle when determining whether something needs to be built or outsourced.

How long would it take to build the experiment tracking tool?

You have to consider the opportunity costs. For example, if it takes two months to make a custom tool, what else will you be able to build in that time? Would it take longer than two months to implement the tracking component? How many development cycles would it take to build the tool from a POC to a final, stable release?

If you’re part of a larger enterprise and have enough development cycles to build an experiment tracker, it may not be a problem. But what if you don’t, and experiment tracking isn’t a problem unique to your use case? Getting an off-the-shelf solution that integrates with your stack so you can focus on building your strategic software may work better.

It’s unlikely that you’ll have enough cycles to get to the level of sophistication and feature richness you need from a standard experiment management tool, so it may not be worth dedicating your development cycles to it.

For example, at neptune.ai, we have spent the past five years solely focusing on building a robust metadata store to manage all the model-building metadata for your users and track their experiments. We have learned from customers and the ML community and continuously improved the product to be robust across various use cases and workloads.

Faster coding at the expense of design and architecture is almost always the wrong choice. This is especially the case if the software is operational because you have less time to focus on good architecture and build the features it must have. After all, it’s not strategic software that directly impacts or defines your organization.

Final thoughts

We have covered quite a lot in this article; let’s recap a few key takeaways:

  • Experiment tracking is an intense activity. Using the right tools and automation makes it user-friendly and efficient.
  • Developing effective requirements for an experiment tracking tool requires considering which users will leverage the software and how it’ll integrate with your data stack and your downstream services. Basically, think about the bare minimum requirements needed to get you up and running; no sophistication involved.
  • The backend layer of the experiment tracking software is the most essential layer to implement. You need to make sure that you implement the tracking component and workspace registry to manage user sessions smoothly.

Your objective for building an experiment tracker is to make sure you provide the components for your users to log experiment data, track those experiments, and collaborate on them securely. If they can do all of that with little to no overhead, then you have likely built something that works for your team.

In most cases, we see that building the first version is only the first step—especially if it’s not a core software component for your organization. You may find it challenging to add new features as the list of user requirements increases because of newer problems. 

You and the relevant stakeholders involved would need to consider the value of dedicating resources to keep up with the needs of your users and, potentially, industry standards.

Next steps

So where do you go from here? Excited to build? Or do you think taking some off-the-shelf solution would do? 

“The majority of model registry platforms we considered assumed a specific shape for storing models. This took the form of experiments—each model was the next in a series. In Stitch Fix’s case, our models do not follow a linear targeting pattern. They could be applicable to specific regions, business lines, experiments, etc., and all be somewhat interchangeable with each other. Easy management of these dimensions was paramount to how data scientists needed to access their models.”

Elijah Ben Izzy and Stefan Krawczyk in “Deployment for Free: A Machine Learning Platform for Stitch Fix’s Data Scientists”

That was Stefan Krawczyk talking about why they decided to build a model registry instead of using existing solutions. In the same context, unless you have special user requirements that existing open source or paid solutions do not meet, building and maintaining an experiment tracking tool may not be the most efficient use of developer time and effort.

Would you like to delve deeper or just chat about building an experiment tracker? Reach out to us; we’d love to exchange experiences. 

Of course, if you decide to skip building and use Neptune instead, feel free to sign up and give it a try first. Get in touch with us if you have any questions!

