You’ll probably underestimate the duration of your data science project: here’s why

Antonio Campello
6 min read · Nov 19, 2022
Credits to Aron Visuals on Unsplash

Very often a data science practitioner will become an accidental project manager. Perhaps their project team is too small to have a full-time project manager, or perhaps, even in the presence of one, the data scientist will still need to set, and be accountable for, granular tasks associated with a higher-level project milestone. For instance, in a big project team the data scientist might be responsible for the task of “developing a predictive pricing model”, which itself consists of smaller tasks, experiments, nuances and dependencies that are too granular to fit a traditional project management structure.

When the time comes to act as an unofficial project manager, this usually entails navigating multiple moving parts, such as stakeholder engagement, setting expectations, understanding the business value of the project or task, and establishing success metrics. In my experience, data scientists with good soft skills will usually thrive at the above. By contrast, one factor I’ve seen data scientists repeatedly underestimate is the duration of a task.

It will happen to all of us. We become excited about a project, and the main stakeholder casually fires away: “How long do you think this is going to take to complete?”. Almost invariably, we underestimate the time to completion.

Below are two fictional characters based on true stories that exemplify common pitfalls in time estimation.

Chris the Coder

A senior stakeholder wanted a data science colleague (let’s call him Chris the Coder) to train a model to classify relevant and non-relevant news articles from a database their company purchased.

“Hey Chris, how long does it take to train a model to identify relevant and non-relevant news?”

“Well, depending on the size of the data, I need one to two days to process it, then a couple more days to fine-tune the model (I’ll start with a baseline). From that, I can generate predictions so the main stakeholders can double-check them. Based on those iterations, a couple more days of fine-tuning, and one day to document it and wrap it into a reusable piece of code.”

Based on Chris’s estimate, it would take about 7 days to train the model, or perhaps 10 days with a safety multiplier. However, this specific project took five months to complete and deliver its final results, to the great disappointment of the senior stakeholder.

How could Chris have mis-estimated the time to completion of this project so badly? As it happens, Chris ignored the old project management axiom that work is different from duration. Chris estimated how much time his own work was going to take. He did not account for the fact that to get the news data, he would need access to a table that only two people could grant him, after approval at an engineering meeting that happens once a quarter. He did not account for the fact that, once he got the dataset, the tables were so disorganised that he had to arrange meetings with the asset owners for an explanation. He did not account for the fact that the dataset was so big that even visualising it took a whole day, so he had to resort to one of his work “allies”, a busy data engineer, to pre-process it for him. Ultimately, he did not estimate the duration of dependencies, which increased the total time to completion of the project by an order of magnitude!

Ellie the Estimator

Another common example of mis-estimating duration is when someone is brought into a project mid-way to “help” a colleague who will perhaps be on leave for a long time. Or perhaps an old project needs to be revived, and the team that worked on it is long gone. This scenario happened to my friend Ellie, who is known to be well-versed in project management. Here is the conversation with her manager.

“Last year we did a project about time-series forecasting of impressions. The person who trained the model moved on, but now one of the senior stakeholders wants to update the project to include new data. I feel like it should take just one or two days, what do you think?”

“I need to have a look at the project for at least two days to try and understand the steps to update the data, and then I can come back to you with an estimate.”

“Charting goals and progress” by Isaac Smith on Unsplash

Ellie’s answer caused a bit of disappointment to her manager; however, it saved her time in the long run. What could have been seen as “wasting two days” just trying to understand the project actually provided a good picture of the task. Had she not “wasted two days”, she would not have noticed that:

  • The code lacked clear documentation. She was not very familiar with the model, and the accompanying PowerPoint presentations did not help clarify how to reproduce the results.
  • Some of the features seemed to have been hardcoded, with very little explanation of why they had to be that way.
  • The trained model seemed to be specific to that year, and needed to be retrained if she wanted it to be accurate.
  • One part of the code said “manually verified by Carlos”, which implied that another stakeholder had to verify the data once updated.

Each of those findings is an anticipated risk that adds to the total duration of the project (and affects its probability of success altogether). None of them is a complete blocker, but each has to be acted upon. By communicating them clearly, she could give a more precise deadline to her manager and reset expectations.

These two examples show us that estimating your own work is usually only one element of assessing the expected duration of your project. We know how long we take to train a machine learning model, to produce a graph, or to develop a basic browser application. However, to estimate the duration of a task we need to consider factors which are often out of our control. Here is a list of common themes, by no means exhaustive:

  • How long is it going to take to get access to the data?
  • Do I have the necessary permissions? Who are the people who can grant me access, and how swiftly can they do so?
  • How organised is the data? Is there a dictionary? Do I have access to asset owners that understand it?
  • When arriving at a project mid-way, how well-documented is the project?
  • If someone else in your team has to provide input to the process, how busy are they and how reliable are their deadlines?
  • If you need input from stakeholders, ditto.

It is often useful to write the tasks down in a table and identify which ones depend on others, a technique project managers like to call dependency mapping. The specific software you use is not relevant; different teams will use different software to record tasks (or if you are a post-it person, go for it!). Below is a high-level example of how dependencies can be considered:

Simple high-level overview of a data science dependency map
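The same idea can be sketched in a few lines of code. Below is a minimal, hypothetical version of a dependency map (the task names and durations are illustrative, loosely mirroring Chris’s project): each task records its estimated duration in days and its prerequisites, and the earliest finish date of the whole project is the longest path through the chain. Notice how the hands-on coding estimate and the end-to-end duration diverge once dependencies are counted.

```python
# Hypothetical dependency map: task -> (estimated duration in days, prerequisites).
tasks = {
    "get_data_access": (10, []),                    # approvals: duration, not your work
    "clean_data": (2, ["get_data_access"]),
    "train_baseline": (2, ["clean_data"]),
    "stakeholder_review": (5, ["train_baseline"]),  # someone else's calendar
    "fine_tune": (2, ["stakeholder_review"]),
    "document_and_ship": (1, ["fine_tune"]),
}

def earliest_finish(task, memo=None):
    """Earliest day `task` can finish: its own duration plus its slowest prerequisite."""
    if memo is None:
        memo = {}
    if task not in memo:
        duration, deps = tasks[task]
        memo[task] = duration + max((earliest_finish(d, memo) for d in deps), default=0)
    return memo[task]

coding_tasks = {"clean_data", "train_baseline", "fine_tune", "document_and_ship"}
coding_days = sum(dur for name, (dur, _) in tasks.items() if name in coding_tasks)
project_days = max(earliest_finish(t) for t in tasks)

print(f"Hands-on work: {coding_days} days; end-to-end duration: {project_days} days")
# Hands-on work: 7 days; end-to-end duration: 22 days
```

Even in this toy example, the “7 days of work” answer triples once access requests and review cycles enter the map, which is exactly the gap between work and duration.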

Building an accurate dependency map is hard, and you will likely need to update it periodically, adjusting for unforeseen circumstances. On the other hand, not having a map all but guarantees you will underestimate the duration of the project. The map has to be visible to the main stakeholders and reflect your best guess for each task. Rest assured, the initial time invested thinking about your tasks is time very well spent!

You can be the fastest data scientist on earth at writing code and still miss your deadlines if you don’t factor in the duration of your dependencies.

If you have any other tools to help you estimate the duration of a data science project, please comment below!

NB: The terms “accidental” and “unofficial” project manager come from two books I highly recommend: Project Management for the Unofficial Project Manager by Kory Kogon, Suzette Blakemore and James Wood, and Accidental Agile Project Manager: Zero to Hero in 7 Iterations by Ray Frohnhoefer and other authors.

