
How to Build ML Model Training Pipeline

10 min
7th June, 2023

Hands up if you’ve ever lost hours untangling messy scripts or felt like you’re hunting a ghost while trying to fix that elusive bug, all while your models are taking forever to train. We’ve all been there, right? But now, picture a different scenario: Clean code. Streamlined workflows. Efficient model training. Too good to be true? Not at all. In fact, that’s exactly what we’re about to dive into. We’re about to learn how to create a clean, maintainable, and fully reproducible machine learning model training pipeline. 

In this guide, I’ll give you a step-by-step process for building a model training pipeline and share practical solutions and considerations for tackling common challenges in model training, such as:

  1. Building a versatile pipeline that can be adapted to various environments, including research and university settings like SLURM.
  2. Creating a centralized source of truth for experiments, fostering collaboration and organization.
  3. Integrating Hyperparameter Optimization (HPO) seamlessly when required.

Complete ML model training pipeline workflow | Source

But before we delve into the step-by-step model training pipeline, it's essential to understand the basics, architecture, motivations, and challenges associated with ML pipelines, as well as a few tools that you will need to work with. So let's begin with a quick overview of all of these.

Bookmark for later

Building MLOps Pipeline for NLP: Machine Translation Task [Tutorial]

Building MLOps Pipeline for Time Series Prediction [Tutorial]

Why do we need a model training pipeline?

There are several reasons to build an ML model training pipeline (trust me!):

  • Efficiency: Pipelines automate repetitive tasks, reducing manual intervention and saving time.
  • Consistency: By defining a fixed workflow, pipelines ensure that preprocessing and model training steps remain consistent throughout the project, making it easy to transition from development to production environments.
  • Modularity: Pipelines enable the easy addition, removal, or modification of components without disrupting the entire workflow. 
  • Experimentation: With a structured pipeline, it's easier to track experiments and compare different models or algorithms. It makes training iterations faster and more trustworthy.
  • Scalability: Pipelines can be designed to accommodate large datasets and scale as the project grows.

ML model training pipeline architecture

An ML model training pipeline typically consists of several interconnected components or stages. These stages form a directed acyclic graph (DAG) that represents the order of execution. A typical pipeline includes the following stages (a minimal code sketch of them follows the list):

  1. Data Ingestion: The process begins with ingesting raw data from different sources, such as databases, files, or APIs. This step is crucial to ensure that the pipeline has access to relevant and up-to-date information.
  2. Data Preprocessing: Raw data often contains noise, missing values, or inconsistencies. The preprocessing stage involves cleaning, transforming, and encoding the data, making it suitable for machine learning algorithms. Common preprocessing tasks include handling missing data, normalization, and categorical encoding.
  3. Feature Engineering: In this stage, new features are created from the existing data to improve model performance. Techniques such as dimensionality reduction, feature selection, or feature extraction can be employed to identify and create the most informative features for the ML algorithm. Business knowledge can come in handy at this step of the pipeline.
  4. Model Training: The preprocessed data is fed into the chosen ML algorithm to train the model. The training process involves adjusting the model's parameters to minimize a predefined loss function, which measures the difference between the model's predictions and the actual values.
  5. Model Validation: To evaluate the model's performance, a validation dataset (a portion of the data that the model never saw) is used. Metrics such as accuracy, precision, recall, or F1-score can be employed to assess how well the model generalizes to new (unseen) data in classification problems.
  6. Hyperparameter Tuning: Hyperparameters are the parameters of the ML algorithm that are not learned during the training process but are set before training begins. Tuning hyperparameters involves searching for the set of values that minimizes the validation error and helps achieve the best possible model performance.
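
To make these stages concrete, here is a minimal, illustrative sketch of such a pipeline expressed as plain Python functions chained in DAG order. The function names and the CSV path are hypothetical; in a real project, each function would typically become a node in your orchestrator of choice.

import pandas as pd

def ingest(path: str) -> pd.DataFrame:
    # Data ingestion: read raw records from a file, database, or API
    return pd.read_csv(path)

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Data preprocessing: handle missing values, fix types, encode categories
    return df.dropna()

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    # Feature engineering: derive more informative columns from the raw ones
    return df

def train_and_validate(df: pd.DataFrame) -> dict:
    # Model training + validation: fit a model and return evaluation metrics
    return {"accuracy": 0.0}  # placeholder metrics

# Topological order: ingest -> preprocess -> engineer_features -> train_and_validate
metrics = train_and_validate(engineer_features(preprocess(ingest("raw_data.csv"))))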

Check also

MLOps Architecture Guide

Model training pipeline tools

There are various options for implementing training pipelines, each with its own set of features, advantages, and use cases. When choosing a training pipeline option, consider factors such as your project’s scale, complexity, and requirements, as well as your familiarity with the tools and technologies. 

Here, we'll explore some common pipeline options, including built-in libraries, custom pipelines, and end-to-end platforms.

  1. Built-in libraries: Many machine learning libraries come with built-in support for creating pipelines. For example, Scikit-learn, a popular Python library, offers the Pipeline class to streamline preprocessing and model training. This option is beneficial for smaller projects or when you're already familiar with a specific library.
  2. Custom pipelines: In some cases, you might need to build a custom pipeline tailored to your project's unique requirements. This can involve writing your own Python scripts or utilizing general-purpose libraries like Kedro or Metaflow. Custom pipelines offer the flexibility to accommodate specific data sources, preprocessing steps, or deployment scenarios.
  3. End-to-end platforms: For large-scale or complex projects, end-to-end machine learning platforms can be advantageous. These platforms provide comprehensive solutions for building, deploying, and managing ML pipelines, often incorporating features such as data versioning, experiment tracking, and model monitoring. Some popular end-to-end platforms include:
  • TensorFlow Extended (TFX): An end-to-end platform developed by Google, TFX offers a suite of components for building production-ready ML pipelines with TensorFlow.
  • Kubeflow Pipelines: Kubeflow is an open-source platform designed to run on Kubernetes, providing scalable and reproducible ML workflows. Kubeflow Pipelines offers a platform to build, deploy, and manage complex ML pipelines with ease.
  • MLflow: Developed by Databricks, MLflow is an open-source platform that simplifies the machine learning lifecycle. It offers tools for managing experiments, reproducibility, and deployment of ML models.

May be useful

If you'd like to avoid setting up and maintaining MLflow yourself, you can check neptune.ai. It's an out-of-the-box experiment tracker offering user access management (a great alternative if you work in a highly collaborative environment).

You can check the differences between MLflow and neptune.ai here.

  • Apache Airflow: Although not exclusively designed for machine learning, Apache Airflow is a popular workflow management platform that can be used to create and manage ML pipelines. Airflow provides a scalable solution for orchestrating workflows, allowing you to define tasks, dependencies, and schedules using Python scripts.
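
To give a flavor of the orchestration-style option, here is a minimal sketch of what a two-step training workflow might look like as an Airflow DAG (a sketch assuming Airflow 2.x; the task functions are placeholders you would replace with your own preprocessing and training code).

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def preprocess_data():
    # Placeholder: load and clean the raw data
    pass

def train_model():
    # Placeholder: fit the model on the preprocessed data
    pass

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually instead of on a schedule
    catchup=False,
) as dag:
    preprocess_task = PythonOperator(task_id="preprocess", python_callable=preprocess_data)
    train_task = PythonOperator(task_id="train", python_callable=train_model)

    preprocess_task >> train_task  # the DAG edge: train only after preprocessing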

While there are various options for creating a pipeline, most of them don't offer a built-in way to monitor your pipelines and models or log your experiments. To address this, you can connect a flexible experiment tracking tool to your existing model training setup. This approach provides enhanced visibility and debugging capabilities with minimal additional effort.

We will build something exactly like this in the upcoming section.

Challenges around building model training pipelines

Despite the advantages, there are some challenges when building an ML model training pipeline:

  • Complexity: Designing a pipeline requires understanding the dependencies between components and managing intricate workflows.
  • Tool selection: Choosing the right tools and libraries can be overwhelming due to the vast number of options available.
  • Integration: Combining different tools and technologies may require custom solutions or adapters, which can be time-consuming to develop.
  • Debugging: Identifying and fixing issues within a pipeline can be difficult due to the interconnected nature of the components.

You may also like

Building Machine Learning Pipelines: Common Pitfalls

How to build an ML model training pipeline?

In this section, we will walk through a step-by-step tutorial on how to build an ML model training pipeline. We will use Python and the popular Scikit-learn library. Then we will use Optuna to optimize the model's hyperparameters, and finally, we'll use neptune.ai to log our experiments.

For each step of the tutorial, I'll explain what is being done and break down the code to make it easier to understand. The code will follow machine learning best practices, which means that it will be optimized and completely reproducible. Also, since this example uses a static dataset, I won't be performing operations such as data ingestion or feature engineering.

Let's get started!

1. Install and import the required libraries.

  • This step installs the necessary libraries for the project, such as NumPy, pandas, scikit-learn, Optuna, and Neptune. It then imports these libraries into the script, making their functions and classes available for use in the tutorial.

Install the required Python packages using pip.

pip install --quiet numpy==1.22.4 optuna==3.1.0 pandas==1.4.4 scikit-learn==1.2.2 neptune-client==0.16.16

Import the necessary libraries for data manipulation, preprocessing, model training, evaluation, hyperparameter optimization, and logging.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import optuna
from functools import partial
import neptune.new as neptune

2. Initialize the Neptune run and connect to your project.

  • Here, we initialize a new run in Neptune, connecting it to a Neptune project. This allows us to log experiment data and track your progress. 

You'll need to replace the placeholder values with your API token and project name.

run = neptune.init_run(api_token='your_api_token', project='username/project_name')

3. Load the dataset.

  • In this step, we load the Titanic dataset from a CSV file into a pandas DataFrame. This dataset contains information about passengers on the Titanic, including their survival status.
data = pd.read_csv("train.csv")

4. Perform some basic preprocessing, such as dropping unnecessary columns.

  • Here, we drop columns that are not relevant to the machine learning model, such as PassengerId, Name, Ticket, and Cabin. This simplifies the dataset and reduces the risk of overfitting.
data = data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)

5. Split the data into features and labels.

  • We separate the dataset into input features (X) and the target label (y). The input features are the independent variables that the model will use to make predictions, while the target label is the "Survived" column, indicating whether a passenger survived the Titanic disaster.
X = data.drop("Survived", axis=1)

y = data["Survived"]

6. Split the data into training and testing sets.

  • We split the data into training and testing sets using the train_test_split function from scikit-learn. This ensures that we have separate data for training the model and evaluating its performance. The stratify parameter is used to maintain the proportion of classes in both the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

7. Define the preprocessing steps.

  • We create a ColumnTransformer that preprocesses numerical and categorical features separately. 
  • Numerical features are processed using a pipeline that imputes missing values with the mean and scales the data using standardization. 
  • Categorical features are processed using a pipeline that imputes missing values with the most frequent category and encodes them using one-hot encoding. 
numerical_features = ["Age", "Fare"]
categorical_features = ["Pclass", "Sex", "Embarked"]

num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numerical_features),
        ('cat', cat_pipeline, categorical_features)
    ],
    remainder='passthrough'
)

8. Create the ML model.

  • In this step, we create a RandomForestClassifier model from scikit-learn. This is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
model = RandomForestClassifier(random_state=42)

9. Build the pipeline.

  • We create a Pipeline object that includes the preprocessing steps defined in step 7 and the model created in step 8. 
  • The pipeline automates the entire process of preprocessing the data and training the model, making the workflow more efficient and easier to maintain.
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)
])

10. Perform cross-validation using StratifiedKFold.

  • We perform cross-validation using the StratifiedKFold method, which splits the training data into K folds, maintaining the proportion of classes in each fold. 
  • The model is trained K times, using K-1 folds for training and one fold for validation. This gives a more robust estimate of the model's performance.
  • We save each of the scores and the mean on our Neptune run.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')

run["cross_val_accuracy_scores"] = cv_scores

run["mean_cross_val_accuracy_scores"] = np.mean(cv_scores)

11. Train the pipeline on the entire training set.

  • We train the model through this pipeline, using the entire training dataset. 
pipeline.fit(X_train, y_train)

Here's a snapshot of what we created.

Workflow of the model training pipeline made on the example | Source: Author

12. Evaluate the pipeline with multiple metrics.

  • We evaluate the pipeline on the test set using various performance metrics, such as accuracy, precision, recall, and F1-score. These metrics provide a comprehensive view of the model's performance and can help identify areas for improvement.
  • We save each of the scores on our Neptune run.
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

run["accuracy"] = accuracy
run["precision"] = precision
run["recall"] = recall
run["f1"] = f1

13. Define the hyperparameter search space using Optuna.

  • We create an objective function that receives a trial and trains and evaluates the model based on the hyperparameters sampled during the trial. 
  • The objective function is the heart of the optimization process. It takes the trial object, which contains the hyperparameter values sampled by Optuna, and trains the pipeline with these hyperparameters. The cross-validated accuracy score is then returned as the objective value to be optimized. 
def objective(X_train, y_train, pipeline, cv, trial: optuna.Trial):
    params = {
        'classifier__n_estimators': trial.suggest_int('classifier__n_estimators', 10, 200),
        'classifier__max_depth': trial.suggest_int('classifier__max_depth', 10, 50),
        'classifier__min_samples_split': trial.suggest_int('classifier__min_samples_split', 2, 10),
        'classifier__min_samples_leaf': trial.suggest_int('classifier__min_samples_leaf', 1, 5),
        'classifier__max_features': trial.suggest_categorical('classifier__max_features', ['sqrt', 'log2'])  # 'auto' is deprecated in recent scikit-learn versions and equivalent to 'sqrt' for classifiers
    }
    
    pipeline.set_params(**params)
    
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy', n_jobs=-1)
    mean_score = np.mean(scores)
    
    return mean_score

If you found the code above overwhelming, here's a quick breakdown of it:

  • Define the hyperparameters using the trial.suggest_* methods. These methods tell Optuna the search space for each hyperparameter. For example, trial.suggest_int('classifier__n_estimators', 10, 200) specifies an integer search space for the n_estimators parameter, ranging from 10 to 200.
  • Set the pipeline's hyperparameters using the pipeline.set_params(**params) method. This method takes the dictionary params containing the sampled hyperparameters and sets them for the pipeline.
  • Calculate the cross-validated accuracy score using the cross_val_score function. This function trains and evaluates the pipeline using cross-validation with the specified cv object and the scoring metric (in this case, 'accuracy').
  • Calculate the mean of the cross-validated scores using np.mean(scores) and return this value as the objective value to be maximized by Optuna.

14. Perform hyperparameter tuning with Optuna.

  • We create a study with a specified direction (maximize) and sampler (TPE sampler). 
  • Then, we call study.optimize with the objective function, the number of trials, and any other desired options. 
  • Optuna will run multiple trials, each with different hyperparameter values, to find the best combination that maximizes the objective function (mean cross-validated accuracy score).
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))

study.optimize(partial(objective, X_train, y_train, pipeline, cv), n_trials=50, timeout=None, gc_after_trial=True)

15. Set the best parameters and train the pipeline.

  • After Optuna finds the best hyperparameters, we set these parameters in the pipeline and retrain it using the entire training dataset. This ensures that the model is trained with the optimized hyperparameters.
pipeline.set_params(**study.best_trial.params)

pipeline.fit(X_train, y_train)

16. Evaluate the best model with multiple metrics.

  • We evaluate the performance of the optimized model on the test set using the same performance metrics as before (accuracy, precision, recall, and F1-score). This allows you to compare the performance of the optimized model with the initial model.
  • We save each of the scores of the tuned model on our Neptune run.
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

run["accuracy_tuned"] = accuracy
run["precision_tuned"] = precision
run["recall_tuned"] = recall
run["f1_tuned"] = f1
  • If you run this code and look only at these test-set metrics, you might think that the tuned model is worse than before. However, if you look at the mean cross-validated score, a more robust way to evaluate your model, you'll see that the tuned model performs well across the whole dataset, making it more reliable.

17. Log the hyperparameters, best trial parameters, and the best score on Neptune.

  • You log the best trial parameters and the corresponding best score in Neptune, enabling you to keep track of your experiment's progress and results.
run['parameters'] = study.best_trial.params
run['best_trial'] = study.best_trial.number
run['best_score'] = study.best_value

18. Log the classification report and confusion matrix.

  • You log the classification report and confusion matrix for the model, providing a detailed view of the model's performance for each class. This can help you identify areas where the model may be underperforming and guide further improvements.
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px

y_pred = pipeline.predict(X_test)

# Log classification report
report = classification_report(y_test, y_pred, output_dict=True)
for label, metrics in report.items():
    if isinstance(metrics, dict):
        for metric, value in metrics.items():
            run[f'classification_report/{label}/{metric}'] = value
    else:
        run[f'classification_report/{label}'] = metrics

# Log confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
conf_mat_plot = px.imshow(conf_mat, labels=dict(x="Predict", y="Target"), x=[x+1 for x in range(len(conf_mat[0]))], y=[x+1 for x in range(len(conf_mat[0]))])
run['confusion_matrix'].upload(neptune.types.File.as_html(conf_mat_plot))

19. Log the pipeline as a pickle file.

  • You save the pipeline as a pickle file and upload it to Neptune. This allows you to easily share, reuse, and deploy the trained model.
import joblib

joblib.dump(pipeline, 'optimized_pipeline.pkl')
run['optimized_pipeline'].upload('optimized_pipeline.pkl')

20. Stop the Neptune run.

  • Finally, you stop the Neptune run, signalling that the experiment is complete. This ensures that all data is saved and all resources are freed up.
run.stop()

Here's a dashboard you can build using Neptune. As you can see, it contains information about our model (hyperparameters), classification report metrics, and the confusion matrix.

The general dashboard in Neptune with an example experiment's data logged | Play with this project live

To demonstrate the power of using a tool like Neptune for tracking and comparing your training experiments, we created another run by changing the scoring parameter to ‘recall’ in the Optuna objective function. Here is a comparison of both runs.

The compare runs feature in Neptune | Play with this project live

Such a comparison puts everything in one place and lets you make informed decisions based on the performance of each pipeline iteration.

If you made it this far, you have probably implemented the training pipeline with all the necessary accessories.

This particular example showed how an experiment tracking tool can be integrated with your training pipeline, offering a personalized view for your project and increased productivity. 

If you’re interested in replicating this approach, you can explore solutions like the combination of Kedro and Neptune, which work well together for creating and tracking pipelines. Here you can find examples and documentation on how to use Kedro with Neptune.

Dig deeper

Here’s a nice case study on how ReSpo.Vision tracks their pipelines with Neptune

To sum it all up, here is a small flowchart of all the steps we took to create and optimize our pipeline and to track the metrics it generated. Irrespective of the problem you are trying to solve, the major steps remain the same in any such exercise.

Steps to create and optimize the model training pipeline and to track the metrics generated by it | Source: Author

Training your ML model in a distributed fashion

So far, we have talked about how to create a pipeline for training your model. But what if you are working with large datasets or complex models? In that case, you might want to look at distributed training.

By distributing the training process across multiple devices, you can significantly speed up the training process and make it more efficient. In this section, we will briefly touch upon the concept of distributed training and how you can incorporate it into your pipeline.

  1. Choose a distributed training framework: There are several distributed training frameworks available, such as TensorFlow's tf.distribute, PyTorch's torch.distributed, or Horovod. Select the one that is compatible with your ML library and best suits your needs.
  2. Set up your local cluster: To train your model on a local cluster, you need to configure your computing resources appropriately. This includes setting up a network of devices (such as GPUs or CPUs) and ensuring they can communicate efficiently.
  3. Adapt your training code: Modify your existing training code to utilize the chosen distributed training framework. This may involve changes to the way you initialize your model, handle data loading, or perform gradient updates (see the minimal PyTorch sketch after this list).
  4. Monitor and manage the distributed training process: Keep track of the performance and resource usage of your distributed training process. This can help you identify bottlenecks, ensure efficient resource utilization, and maintain stability during the training.
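
As a concrete illustration of step 3, here is a minimal, illustrative PyTorch DistributedDataParallel (DDP) sketch. It assumes a single node with multiple GPUs and a launch via torchrun (for example, torchrun --nproc_per_node=2 train_ddp.py); the model and dataset below are placeholders, not part of the tutorial above.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(20, 2).cuda(local_rank)    # placeholder model
    model = DDP(model, device_ids=[local_rank])        # wrap for gradient synchronization

    # Placeholder dataset; the sampler shards it across processes
    dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                       # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                            # DDP all-reduces the gradients
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()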

While this topic is beyond the scope of this article, it’s essential to be aware of the complexities and considerations of distributed training when building ML model training pipelines in case you want to move towards it in the future. To effectively incorporate distributed training in your ML model training pipelines, here are some useful resources:

  1. For TensorFlow users: Distributed training with TensorFlow
  2. For PyTorch users: Getting Started with Distributed Data Parallel
  3. For Horovod users: Horovod’s Official Documentation
  4. For a general overview: Neptune’s Distributed Training: Guide for Data Scientists
  5. If you’re planning to work with distributed training on a specific cloud platform, make sure to consult the relevant tutorials available in the platform’s documentation.

These resources will help you enhance your ML model training pipelines by enabling you to leverage the power of distributed training.

Best practices you should consider when building model training pipelines

A well-designed training pipeline ensures reproducibility and maintainability throughout the machine learning process. In this section, we'll explore a few best practices for creating effective, efficient, and easily adaptable pipelines for different projects.

  • Split your data before any manipulation: It is crucial to split your data into training and testing sets before doing any preprocessing or feature engineering. This ensures that your model evaluation is unbiased and that you are not inadvertently leaking information from the test set into the training set, which could lead to overly optimistic performance estimates. 
  • Separate data preprocessing, feature engineering, and model training steps: Breaking down the pipeline into these distinct steps makes the code easier to understand, maintain, and modify. This modularity allows you to easily change or extend any part of the pipeline without affecting the others.
  • Use cross-validation to estimate model performance: Cross-validation helps you get a better estimate of your model's performance on unseen data. By dividing the training data into multiple folds and iteratively training and evaluating the model on different combinations of these folds, you can get a more accurate and reliable estimate of the model's true performance.
  • Stratify your data during train-test splitting and cross-validation: Stratification ensures that each split or fold has a similar distribution of the target variable, which helps to maintain a more representative sample of the data for training and evaluation. This is particularly important when dealing with imbalanced datasets, as stratification helps to avoid creating splits with very few examples of the minority class.
  • Use a consistent random seed for reproducibility: By setting a consistent random seed in your code, you ensure that the random number generation used in your pipeline is the same every time the code is run. This makes your results reproducible and easier to debug, as well as allowing other researchers to reproduce your experiments and validate your findings.
  • Optimize hyperparameters using a search method: Hyperparameter tuning is an essential step to improve the performance of your model. Grid search, random search, and Bayesian optimization are common methods to explore the hyperparameter search space and find the best combination of hyperparameters for your model. Optuna is a powerful library that can be used for hyperparameter optimization.
  • Use a version control system and log experiments: Version control systems like Git help you keep track of changes in your code, making it easier to collaborate with others and revert to previous versions if needed. Experiment tracking tools like Neptune help you log and visualize the results of your experiments, track the evolution of model performance, and compare different models and hyperparameter settings.
  • Document your pipeline and results: Good documentation makes your work more accessible to others and helps you understand your own work better. Write clear and concise comments in your code, explaining the purpose of each step and function. Use tools like Jupyter Notebooks, Markdown, or even comments in the code to document your pipeline, methodology, and results.
  • Automate repetitive tasks: Use scripting and automation tools to streamline repetitive tasks like data preprocessing, feature engineering, and hyperparameter tuning. This not only saves you time but also reduces the risk of errors and inconsistencies in your pipeline.
  • Test your pipeline: Write unit tests to ensure that your pipeline is working as expected and to catch errors before they propagate through the entire pipeline. This can help you identify issues early and maintain a high-quality codebase (a minimal pytest sketch follows this list).
  • Periodically review and refine your pipeline during training: As your data evolves or your problem domain changes, it’s crucial to review your pipeline to ensure its performance and effectiveness. This proactive approach keeps your pipeline current and adaptive, maintaining its efficiency in the face of changing data and problem domains.
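
To illustrate the testing point above, here is a minimal pytest sketch for the Titanic pipeline built in the tutorial. It assumes you have extracted the pipeline construction into a hypothetical build_pipeline() factory (for example, in my_project/pipeline.py); the tiny in-memory dataset below mimics the training schema.

import pandas as pd
import pytest

from my_project.pipeline import build_pipeline  # hypothetical factory returning the sklearn Pipeline

@pytest.fixture
def tiny_dataset():
    # A handful of rows with the same schema as the training data,
    # including a missing Age value to exercise the imputer.
    X = pd.DataFrame({
        "Pclass": [1, 3, 2, 3],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, None, 35.0, 28.0],
        "SibSp": [0, 1, 0, 0],
        "Parch": [0, 0, 1, 0],
        "Fare": [70.0, 7.25, 26.0, 8.05],
        "Embarked": ["S", "C", "S", "Q"],
    })
    y = pd.Series([1, 0, 1, 0], name="Survived")
    return X, y

def test_pipeline_trains_and_predicts(tiny_dataset):
    X, y = tiny_dataset
    pipeline = build_pipeline()
    pipeline.fit(X, y)
    preds = pipeline.predict(X)
    assert len(preds) == len(X)       # one prediction per row
    assert set(preds) <= {0, 1}       # only valid class labels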

Building ML Pipeline: 6 Problems & Solutions [From a Data Scientist's Experience]

Conclusion

In this tutorial, we have covered the essential components of building a machine learning training pipeline using Scikit-learn and other useful tools such as Optuna and Neptune. We demonstrated how to preprocess data, create a model, perform cross-validation, optimize hyperparameters, and evaluate model performance on the Titanic dataset. By logging the results to Neptune, you can easily track and compare your experiments to improve your models further.

By following these guidelines and best practices, you can create efficient, maintainable, and adaptable pipelines for your Machine Learning projects. Whether you are working with the Titanic dataset or any other dataset, these principles will help you streamline the process and ensure reproducibility across different iterations of your work.

