MLOps Blog

How to Build an End-To-End ML Pipeline

12 min
3rd January, 2024

One of the most prevalent complaints we hear from ML engineers in the community is how costly and error-prone it is to manually go through the ML workflow of building and deploying models. They run scripts manually to preprocess their training data, rerun the deployment scripts, manually tune their models, and spend their working hours keeping previously developed models up to date. 

Building end-to-end machine learning pipelines lets ML engineers build once, rerun, and reuse many times. It lets them focus more on deploying new models than maintaining existing ones.

If all goes well, of course 😉

In this article, you will: 

  1. Explore what the architecture of an ML pipeline looks like, including its components.
  2. Learn the essential steps and best practices machine learning engineers can follow to build robust, scalable, end-to-end machine learning pipelines.
  3. Quickly build and deploy an end-to-end ML pipeline with Kubeflow Pipelines on AWS.
  4. Learn the challenges of building end-to-end ML pipelines and the best practices for overcoming them.

What is a machine learning pipeline?

Machine learning pipelines are composed of a sequence of linked components or steps that define the machine learning workflow to solve specific problems. The pipelines let you orchestrate the steps of your ML workflow that can be automated. Orchestration here means that the dependencies and data flow between the workflow steps are executed in the proper order.

A "standard" ML workflow has three phases: data acquisition and feature management, experiment management and model development, and model management | Source: Author

You would build a pipeline to:

  • Achieve reproducibility in your workflow (running the pipeline repeatedly on similar inputs will provide similar outputs).
  • Simplify the end-to-end orchestration of the multiple steps in the machine learning workflow for projects with little to no intervention (automation) from the ML team.
  • Reduce the time it takes for data and models to move from the experimentation phase to the production phase.
  • Allow your team to focus more on developing new solutions than maintaining existing ones using modular components that offer automation for your workflow.
  • Make it easy to reuse components (a specific step in the machine learning workflow) to create and deploy end-to-end solutions that integrate with external systems without rebuilding each time.

Machine learning pipeline vs machine learning platform

The ML pipeline is part of the broader ML platform. It is used to streamline, orchestrate, and automate the machine learning workflow within the ML platform.

Pipelines and platforms are related concepts in MLOps, but they refer to different aspects of the machine learning workflow. An ML platform is an environment that standardizes the technology stack for your ML/AI team and provides tools, libraries, and infrastructure for developing, deploying, and operationalizing machine learning applications. 

The platform typically includes components for the ML ecosystem like data management, feature stores, experiment trackers, a model registry, a testing environment, model serving, and model management. It is designed to provide a unified and integrated environment, primarily for data scientists and ML engineers (MLEs), to develop and deploy models, manage data, and streamline the machine learning workflow.

The architecture of a machine learning pipeline

The machine learning pipeline architecture can be a real-time (online) or batch (offline) construct, depending on the use case and production requirements. To keep concepts simple in this article, you will learn what a typical pipeline looks like without the nuances of real-time or batch constructs. 

Semi Koen’s article gives detailed insight into machine learning pipeline architectures.

A typical machine learning pipeline with various stages highlighted | Source: Author

Common types of machine learning pipelines

In line with the stages of the ML workflow (data, model, and production), an ML pipeline comprises three different pipelines that solve different workflow stages. They include:

  1. Data (or input) pipeline.
  2. Model (or training) pipeline.
  3. Serving (or production) pipeline.

In large organizations, two or more teams would likely handle each pipeline due to its functionality and scale. The pipelines are interoperable to build a working system:

Data (input) pipeline (data acquisition and feature management steps)

This pipeline transports raw data from one location to another. It covers the entire data movement process, from where the data is collected, for example, through data streams or batch processing, to downstream applications like data lakes or machine learning models. 

Model training pipeline

This pipeline trains one or more models on the training data with preset hyperparameters. It evaluates them, fine-tunes them, and packages the optimal model before sending it downstream to applications like the model registry or serving pipeline.

Serving pipeline

This pipeline deploys the model as a prediction (or scoring) service in production and uses another service to enable performance monitoring.

This article classifies the different pipelines as "machine learning pipelines" because they enable ML applications based on their function in the workflow. Moreover, they are interoperable to enable production applications, especially during maintenance (retraining and continuous testing).

You may also like

How to Build ML Model Training Pipeline

Elements of a machine learning pipeline

Some pipeline frameworks provide high-level abstractions for the pipeline's components through three elements (illustrated in the sketch after this list):

  1. Transformer: an algorithm able to transform one dataset into another. 
  2. Estimator: an algorithm trained on a dataset to produce a transformer. 
  3. Evaluator: to examine the accuracy of the trained model.
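
To make these elements concrete, here is a minimal sketch using scikit-learn, whose API maps loosely onto the same ideas (the toy data is made up): a StandardScaler plays the transformer, LogisticRegression is the estimator whose fit() call produces a trained model, and a metric function acts as the evaluator.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Toy dataset: 100 samples, 3 features, binary labels.
rng = np.random.default_rng(42)
X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

# Transformer: turns one dataset into another (here, a scaled copy).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Estimator: training it on a dataset produces a fitted model.
estimator = LogisticRegression()
model = estimator.fit(X_scaled, y)

# Evaluator: examines the accuracy of the trained model.
print("accuracy:", accuracy_score(y, model.predict(X_scaled)))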

Components of the machine learning pipeline

A pipeline component is one step in the machine learning workflow that performs a specific task by taking input, processing it, and producing an output. The components implement the automatable steps of the otherwise manual workflow, including:

  • Data ingestion (extraction and versioning).
  • Data validation (writing tests to check for data quality).
  • Data preprocessing.
  • Model training and tuning, given a select number of algorithms to explore and a range of hyperparameters to use during experimentation.
  • Model performance analysis and evaluation.
  • Model packaging and registration.
  • Model deployment.
  • Model scoring.
  • Model performance monitoring.

With most tools, the pipeline components will contain executable code that can be containerized (to eliminate dependency issues). Each step can be managed with an orchestration tool such as Kubeflow Pipelines, Metaflow, or ZenML.

Let's briefly go over each of the components below.

Data ingestion, extraction, and versioning

This component ingests data from a source external to the machine learning pipeline as input. It then transforms the dataset into a format (e.g., CSV, Parquet) that will be used in the next steps of the pipeline. At this step, the raw data is also versioned so that its lineage is easier to trace.

Data validation

This step collects the transformed data as input and, through a series of tests and validators, ensures that it meets the criteria for the next component. It checks the data for quality issues and detects outliers and anomalies. This component also checks for signs of data drift or potential training-serving skew to send logs to other components or alert the data scientist in charge.

If the validation tests pass, the data is sent to the next component, and if it fails, the error is logged, and the execution stops.
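
As a rough illustration, the sketch below shows what a few such checks could look like with plain pandas; the column names, thresholds, and drift heuristic are hypothetical stand-ins for your own schema and a dedicated validation library.

import pandas as pd

def validate_batch(df: pd.DataFrame, reference: pd.DataFrame) -> None:
    """Raise if the incoming batch fails basic quality and drift checks."""
    # Schema check: required columns must be present (hypothetical schema).
    required = {"age", "fare", "survived"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Quality checks: nulls and out-of-range values.
    if df["age"].isna().mean() > 0.1:
        raise ValueError("More than 10% of 'age' values are missing")
    if (df["fare"] < 0).any():
        raise ValueError("'fare' contains negative values")

    # Naive drift check: compare the batch mean to the reference data.
    drift = abs(df["fare"].mean() - reference["fare"].mean())
    if drift > 2 * reference["fare"].std():
        raise ValueError("Possible data drift detected on 'fare'")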

Data preprocessing and feature engineering

The data cleaning, segregation, and feature engineering steps take the validated and transformed data from the previous component as input. The processes involved in this step depend on the problem you are solving and the data. Processes here may include:

  • Feature selection: Select the most appropriate features to be cleaned and engineered.
  • Feature cleaning: Treating missing feature values and handling outliers, for example by capping or flooring them, depending on the implementation.
  • Feature transformation: Transforming skewed features in the data (if applicable).
  • Feature creation: Creating new features from existing ones or combining different features to create a new one.
  • Data segregation: Splitting data into training, testing, and validation sets.
  • Feature standardization/normalization: Converting the feature values into similar scale and distribution values.
  • Publishing features to a feature store to be used for training and inference by the entire organization.

Again, what goes on in this component depends on the data scientist's initial (manual) data preparation process, the problem, and the data used. One possible combination is sketched below.
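
As an illustration, a minimal version of this component, assuming a tabular dataset with hypothetical Titanic-style columns, could combine feature creation, imputation, scaling, encoding, and segregation with scikit-learn:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def preprocess(df: pd.DataFrame):
    """Clean, engineer, and segregate features (column names are illustrative)."""
    # Feature creation: combine existing columns into a new one.
    df = df.assign(family_size=df["sibsp"] + df["parch"] + 1)

    X, y = df.drop(columns=["survived"]), df["survived"]

    # Impute + scale numeric features; impute + one-hot encode categoricals.
    numeric = ["age", "fare", "family_size"]
    categorical = ["sex", "embarked"]
    transformer = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ])

    # Data segregation: split into training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Fit on the training split only, then transform both splits.
    return transformer.fit_transform(X_train), transformer.transform(X_val), y_train, y_val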

Model training and tuning

This component can retrieve prepared features from the feature store or get the prepared dataset (training and validation sets) as input from the previous component. 

This component uses a range of pre-set hyperparameters to train the model (using grid-search CV, Neural Architecture Search, or other techniques). It can also train several models in parallel with different sets of hyperparameter values. The trained model is sent to the next component as an artifact.
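
A bare-bones sketch of this step with scikit-learn, using an arbitrary preset hyperparameter grid, might look like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def train_and_tune(X_train, y_train):
    """Grid-search over a preset hyperparameter range and return the best model."""
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [4, 8, None],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid=param_grid,
        cv=5,
        scoring="roc_auc",
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    # The fitted best estimator is the artifact passed downstream.
    return search.best_estimator_, search.best_params_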

Model evaluation

The trained model is the input for this component and is evaluated on the validation set. You can analyze the results for each model based on metrics such as ROC AUC, precision, recall, and accuracy; which metrics you use usually depends on the problem. Those metrics are then logged for future analysis.
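
Continuing that sketch, the evaluation component could compute and log a handful of these metrics on the validation set (where you log them depends on your tracking setup):

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def evaluate(model, X_val, y_val) -> dict:
    """Score the trained model on the validation set and return the metrics."""
    preds = model.predict(X_val)
    proba = model.predict_proba(X_val)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_val, preds),
        "precision": precision_score(y_val, preds),
        "recall": recall_score(y_val, preds),
        "roc_auc": roc_auc_score(y_val, proba),
    }
    # Log metrics for future analysis (e.g., to an experiment tracker).
    print(metrics)
    return metrics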

Model analysis and validation

This component:

  1. Gauges the model's ability to generalize to unseen data.
  2. Analyzes the model's interpretability/explainability to help you understand the quality and biases of the model or models you plan to deploy. It examines how well the model performs on data slices and the model's feature importance. Is it a black-box model, or can the decisions be explained?

If you train multiple models, the component can also evaluate each model on the test set and provide the option to select an optimal model. 

Here, the component will also return statistics and metadata that help you understand if the model suits the target deployment environment. For example:

  • Is it too large to fit the infrastructure requirements? 
  • How long does it take to return a prediction? 
  • How many resources (CPU, memory, etc.) does it consume when it makes a prediction? 

If your pipeline is in deployment, this component can also help you compare the trained model's metrics to the ones in production and alert you if they are significantly lower.
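
A rough way to gather some of these statistics locally, assuming a scikit-learn-style model and ignoring serving-stack overhead, is sketched below:

import pickle
import time

def deployment_stats(model, X_sample, n_requests: int = 100) -> dict:
    """Estimate artifact size and single-prediction latency (rough, local-only numbers)."""
    # Approximate artifact size from the serialized model.
    size_mb = len(pickle.dumps(model)) / 1e6

    # Average latency over repeated single-row predictions.
    start = time.perf_counter()
    for _ in range(n_requests):
        model.predict(X_sample[:1])
    latency_ms = (time.perf_counter() - start) / n_requests * 1000

    return {"model_size_mb": size_mb, "avg_latency_ms": latency_ms}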

Model packaging and registering

This component packages your model for deployment to the staging or production environments. The model artifacts and necessary configuration files are packaged, versioned, and sent to the model registry.

Containers are one helpful technique for packaging models. They encapsulate the deployed model to run anywhere as a separate scoring service. Other deployment options are available, such as rewriting the deployed code in the language for the production environment. It is most common to use containers for machine learning pipelines.
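
For example, if MLflow happens to be your registry, packaging the model artifact together with its metrics and registering a new version could look roughly like the sketch below; the tracking URI and model name are placeholders.

import mlflow
import mlflow.sklearn

def package_and_register(model, metrics: dict):
    """Log the model artifact plus its metrics and register a new model version."""
    mlflow.set_tracking_uri("http://your-mlflow-server:5000")  # placeholder URI
    with mlflow.start_run():
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name="titanic-survival-classifier",  # hypothetical name
        )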

Model deployment

You can deploy the packaged and registered model to a staging environment (as with traditional software and DevOps) or to the production environment. The staging environment is the first production-like environment and is used for integration testing: the model is tested together with the other services in the system that enable the application. For example, you might deploy a recommendation service and test it with the backend server that routes client requests to the service.

Some organizations might opt for staging on a container orchestration platform like Kubernetes. It depends on what tool you are using for pipeline orchestration.

Although not recommended, you can also deploy models that have been packaged and registered directly into the production environment.

Model scoring service

The deployed model serves predictions for client requests in real time (for online systems) or in batches (for offline systems). The predictions are logged to a monitoring service or an online evaluation store to track the model's predictive performance, especially for concept/model drift.

You can adopt deployment strategies such as canary deployment, shadow mode deployment, and A/B testing with the scoring service. For example, you may deploy multiple challenger models with the champion model in production. They will all receive the same prediction requests from clients, but only the champion model will return prediction results. The others will log their predictions with the monitoring service.
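
The routing logic behind such a champion/challenger (shadow mode) setup can be small; the sketch below is a generic illustration in which the model objects and the log_prediction function stand in for your serving and monitoring stack.

def score_request(features, champion, challengers, log_prediction):
    """Return the champion's prediction; challengers only log theirs."""
    result = champion.predict([features])[0]
    log_prediction(model="champion", features=features, prediction=result)

    # Challenger models see the same request but never answer the client.
    for name, challenger in challengers.items():
        shadow = challenger.predict([features])[0]
        log_prediction(model=name, features=features, prediction=shadow)

    return result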

Performance monitoring and pipeline feedback loop

The final piece in the pipeline is the monitoring component, which runs checks on the data. It also tracks the collected inference evaluation scores (model metrics or other proxy metrics) to measure the performance of the models in production. 

Some monitoring components also monitor the pipelineā€™s operational efficiency, including:

  • pipeline health, 
  • API calls, 
  • request timeouts, 
  • resource usage, and so on.

For a fully automated machine learning pipeline, continuous integration (CI), continuous delivery (CD), and continuous training (CT) become crucial. Pipelines can be scheduled to carry out CI, CD, or CT. They can also be triggered by:

  1. model drift,
  2. data drift,
  3. the data scientist in charge, on demand.

Automating your ML pipeline becomes a crucial productivity decision if you run many models in production.
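
As an illustration, a data-drift trigger can be as simple as a two-sample test between the training distribution and recent production data; the threshold and the retraining call below are placeholders for whatever your orchestrator exposes.

from scipy.stats import ks_2samp

def should_retrain(train_feature, live_feature, p_threshold: float = 0.01) -> bool:
    """Trigger retraining when the live distribution drifts from the training one."""
    statistic, p_value = ks_2samp(train_feature, live_feature)
    return p_value < p_threshold

# if should_retrain(train_df["fare"], live_df["fare"]):
#     trigger_training_pipeline()  # placeholder for your orchestrator's API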

How to build an end-to-end machine learning pipeline

You build most pipelines in the following sequence:

  1. Define the code implementation of the component as modular functions in a script or reuse pre-existing code implementations.
  2. Containerize the modular scripts so their implementations are independent and separate.
  3. Package the implementations and deploy them on a platform.

Modular scripts

Defining your components as modular functions that take in inputs and return output is one way to build each component of your ML pipeline. It depends on the language you use to develop your machine learning pipeline. The components are chained with a domain-specific language (DSL) to form the pipeline.

See an example of such a script, written in the Kubeflow Pipelines DSL, below:

import kfp.dsl as dsl

def my_pipeline_step(step_name, param1, param2):  # add further parameters as needed
    """Wrap a containerized script as a single Kubeflow Pipelines (v1) step."""
    return dsl.ContainerOp(
        name=step_name,
        image='<path to my container image>',
        arguments=[
            '--param1', param1,
            '--param2', param2,
            # ... additional arguments
        ],
        file_outputs={
            'output1': '/output1.txt',
            'output2': '/output2.json',
            # ... additional outputs
        },
    )

Packages and containers

You could decide to use a container tool such as Docker or another method to ensure your code can run anywhere.

Orchestration platforms and tools

Pipeline orchestration platforms and tools can help manage your packaged scripts and containers into a DAG or an orchestrated end-to-end workflow that can run the steps in sequence.

Machine Learning pipeline tools

The following are examples of machine learning pipeline orchestration tools and platforms:

  1. Metaflow.
  2. Kedro pipelines.
  3. ZenML.
  4. Flyte.
  5. Kubeflow Pipelines.

Metaflow

Metaflow, originally a Netflix project, is a cloud-native framework that couples all the pieces of the ML stack together, from orchestration to versioning, modeling, deployment, and other stages. Metaflow allows you to specify a pipeline as a DAG of computations relating to your workflow. Netflix runs hundreds to thousands of machine learning projects on Metaflow, which is a testament to how scalable it is.

Metaflow differs from other pipelining frameworks because it can load and store artifacts (such as data and models) as regular Python instance variables. Anyone with a working knowledge of Python can use it without learning other domain-specific languages (DSLs).

How Metaflow structures different pieces of the ML stack into a flow written in arbitrary Python code. | Source: What is Metaflow | Metaflow Docs

Learn more about Metaflow in the documentation and get started through the tutorials or resource pages.

Kedro

Kedro is a Python library for building modular data science pipelines. Kedro assists you in creating data science workflows composed of reusable components, each with a “single responsibility,” to speed up data pipelining, improve data science prototyping, and promote pipeline reproducibility.

Kedro nodes (squares), datasets (round-edge rectangles), and pipelines (the interconnection between them) | Source: Kedro Docs, Visualise pipelines page

Learn how you can build ML pipelines with Kedro in this article.

ZenML

ZenML is an extensible, open-source MLOps framework for building portable, production-ready MLOps pipelines. It’s built for data scientists and MLOps engineers to collaborate as they develop for production.

Create reproducible ML pipelines with ZenML. | Source: ZenML's website homepage

Learn more about the core concepts of ZenML in the documentation.

Kedro vs. ZenML vs. Metaflow: Which Pipeline Orchestration Tool Should You Choose?

Flyte

Flyte is a platform for orchestrating ML pipelines at scale. You can use Flyte for deployment, maintenance, lifecycle management, version control, and training. You can also use it with platforms like Feast, PyTorch, TensorFlow, and whylogs to do tasks for the whole model lifecycle.

The architecture of the Flyte platform. | Source: Flyte: MLOps Simplified

This article by Samhita Alla, a software engineer and tech evangelist at Union.ai, provides a simplified walkthrough of the applications of Flyte in MLOps. Check out the documentation to get started.

Kubeflow Pipelines

Kubeflow Pipelines is an orchestration tool for building and deploying portable, scalable, and reproducible end-to-end machine learning workflows directly on Kubernetes clusters. You can define Kubeflow Pipelines with the following steps:

Step 1: Write the code implementation for each component as an executable file/script or reuse pre-built components.

Step 2: Define the pipeline using a domain-specific language (DSL).

Step 3: Build and compile the workflow you have just defined.

Step 4: Run the pipeline. Compiling in step 3 produces a static YAML file that you can upload and trigger through the Kubeflow Pipelines UI or the Python SDK, as sketched below.
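
With the KFP v1 SDK (the same API as the ContainerOp example earlier), steps 2 through 4 might look roughly like the sketch below; the pipeline function, the argument values, and the host URL are placeholders you would replace with your own.

import kfp
from kfp import dsl

@dsl.pipeline(name="titanic-pipeline", description="Example pipeline definition")
def titanic_pipeline():
    # Chain two instances of the step defined in the earlier snippet (dummy arguments).
    step1 = my_pipeline_step("preprocess", "raw.csv", "clean.csv")
    step2 = my_pipeline_step("train", step1.outputs["output1"], "model")

# Step 3: compile the pipeline definition into a static YAML file.
kfp.compiler.Compiler().compile(titanic_pipeline, "titanic_pipeline.yaml")

# Step 4: upload and trigger it through the Python SDK (or the Pipelines UI).
client = kfp.Client(host="http://<your-kubeflow-pipelines-host>")  # placeholder host
client.create_run_from_pipeline_func(titanic_pipeline, arguments={})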

Kubeflow is notably complex, and its development iteration cycles are slow; other K8s-based platforms, such as Flyte, aim to make building pipelines easier. Deploying Kubeflow on a cloud-managed Kubernetes service such as Google Kubernetes Engine (GKE) can also reduce the operational burden.

Read also

Experiment Tracking in Kubeflow Pipelines

There are others, such as Prefect and Argo, that you can also look at. This article might be useful, as it compares more than 10 orchestration tools: Best Workflow and Pipeline Orchestration Tools.

DEMO: End-to-end ML pipeline example

In this example, you will build an ML pipeline with Kubeflow Pipelines based on the well-known Titanic ML competition on Kaggle. This project uses machine learning to create a model that predicts which passengers survived the Titanic shipwreck. 

The dataset also provides information on the fate of passengers on the Titanic, summarized according to economic status (class), sex, age, and survival.

Pre-requisites

  • In this demo, you will use MiniKF to set up Kubeflow on AWS. Arrikto MiniKF is the fastest and easiest way to get started with Kubeflow. You can also use MiniKF to set up Kubeflow anywhere, including your local computer. You can learn more about how to set up Kubeflow with MiniKF on Google Cloud and your local computer in the documentation.
  • If you don't already have an AWS account, create one.
  • Using Arrikto MiniKF via the AWS Marketplace costs $0.509/hr as of the time of writing. The demo takes less than an hour to complete, so you shouldn't spend more than $3 following it.
  • This demo uses Arrikto MiniKF v20210428.0.1 and this version installs the following:
    • Kubeflow v1.3.
    • Kale v0.7.0. – An orchestration and workflow tool for Kubeflow that enables you to run complete data science workflows starting from a notebook.
    • Kubernetes (using Minikube v1.22.0).

The demo steps also work with the latest Arrikto MiniKF v20221221.0.0 at the time of writing this. You can follow this tutorial in the official documentation to learn how to deploy Kubeflow with MiniKF through the AWS Marketplace.

If you have deployed Kubeflow with MiniKF, let's jump into the Kubeflow dashboard to set up the project:

Kubeflow dashboard

To get started, click on (1) "Notebooks" and (2) "+NEW SERVER".

Specify a name for your notebook server:

Specifying a name for your notebook server

Leave the other settings as default (depending on your requirements, of course) and click "ADD VOLUME" under the Data Volumes category:

 Adding a new data volume

You will now see a new data volume added with the name you specified for your server and "-vol-1/" as a suffix:

New data volume

You can now launch the notebook server:

Launching the notebook server

This might take a couple of minutes to set up, depending on the number of resources you specified. When you see the green checkmark, click on "CONNECT":

Connecting the notebook server

That should take you to the Jupyterlab launcher, where you can create a new notebook and access the terminal:

Accessing the terminal

When you launch the terminal, enter the following command (remember to enter your data volume name):

$ cd <ENTER YOUR DATA VOLUME NAME HERE>
$ git clone https://github.com/NonMundaneDev/layer-demo-kubeflow.git

(3) Launch the `layer_kubeflow_titanic_demo.ipynb` notebook:

Launching the `layer_kubeflow_titanic_demo.ipynb` notebook

After running the first code cell, restart your kernel so that the changes can take effect in the current kernel:

Kale helps compile the steps in your notebook into a machine learning pipeline that can be run with Kubeflow Pipelines. To turn the notebook into an ML pipeline, (1) click the Kale icon, and then (2) click "Enable":

 Turning the notebook into an ML pipeline

Kale will automatically detect the steps it should run and the ones it should skip as part of the exploratory process in the notebook. In this notebook, Kale treats every step as a pipeline component, since each takes an input and returns an output artifact.

(1) You can now edit the description of your pipeline and other details. When you are done, (2) click on "COMPILE AND RUN":

Editing the description of your pipeline

If all goes well, you should see a visual like the one below. Click on "View" beside "Running pipeline...", and a new tab will open:

Opening a new tab

You should be able to view a pipeline run and see the DAG (Directed Acyclic Graph) of the Kubeflow Pipeline you just executed with Kale through the Pipeline UI:

 View of a pipeline run

Now, to see the result your model returned for the serving step, click on the "randomforest" step, go to "Visualizations", scroll down to the "Static HTML" section, and view the prediction result for the last cell:

 Seeing the results the model returned for the serving step

In this case, based on the dummy data passed in the serving step for the notebook, the model predicted that this particular passenger would not survive the shipwreck.

You can also get the URL endpoint serving your model by taking the following steps:

Getting the URL endpoint serving your model

Click "Models" in the sidebar and observe that a model is already being served. Note the Predictor, Runtime, and Protocol. Click on the model name.

You will see a dashboard to view the details of the model you are serving in production. 

(1) Monitor your model in production with metrics and logs to debug errors. You can also see the (2) "URL external" and (3) "URL internal" endpoints, where you can access your model from any other service request or client. The "URL external" can be re-routed to your custom URL.

For now, we will access the model via the terminal through the "URL internal" endpoint. Copy the endpoint and head back to your JupyterLab terminal. Save the endpoint in a variable and query it with the following commands:

$ export MODEL_DEPLOYMENT_URL=<ENTER YOUR INTERNAL URL ENDPOINT HERE>
$ curl --header "Content-Type: application/json; format=pandas-records" \
    --request POST \
    --data '{"instances": [[3, 0, 4, 1, 2, 3, 0, 1, 0, 8, 3, 6, 2]]}' \
    $MODEL_DEPLOYMENT_URL

You should get the same response as the one from the Pipeline notebook:

Built an end-to-end Pipeline with Kubeflow.

Congratulations! You have built an end-to-end Pipeline with Kubeflow.

Challenges associated with ML pipelines

Some challenges you will likely encounter as you work with ML pipelines include the following:

  1. Infrastructure and scaling requirements.
  2. Complex workflow interdependencies.
  3. Scheduling workflows is a dilemma.
  4. Pipeline reproducibility.
  5. Experiment tracking.

Infrastructure and scaling requirements

The promise of machine learning pipelines only materializes when you have good infrastructure for them to run on. Companies such as Uber and Airbnb host their own infrastructure and have the budget to build it in-house. This is unrealistic for most smaller companies and startups that rely on cloud infrastructure to get their products to market. 

Using cloud infrastructure to run data, training, and production pipelines can lead to runaway costs and bills if you don't monitor them appropriately. You may also encounter situations where different workflow components require significantly different infrastructure needs.

Machine learning pipelines allow you to run experiments efficiently and at scale, but this purpose might be defeated if resources and budget limit you.

Complex workflow interdependencies

Implementing pipeline workflows can be complicated due to the complex interdependence of pipeline steps, which can grow and become difficult to manage. 

Scaling complex workflow interdependencies can also be an issue, as some components might require more computational resources than others. For example, model training can use more computing resources than data transformation.

Workflow scheduling dilemma

Scheduling the workflows in a machine learning pipeline and providing resiliency against errors and unplanned situations can be very tricky. When you use a workflow scheduler, it can be difficult to specify all the actions the orchestrator should take when a job fails. 

Pipeline reproducibility

Running tens to hundreds of pipelines at scale, with multiple interconnected stages that may involve various data transformations, algorithmic parameters, and software dependencies, can affect pipeline reproducibility.

Often forgotten, but the infrastructure, code, and configuration that are used to produce the models are not correctly versioned and are in a non-consumable, reproducible state. – Ketan Umare, Co-Founder and CEO at Union.ai, in an AMA session at MLOps.community 2022.

In other cases, you may build your pipelines with specific hardware configurations running on an operating system and varying library dependencies. But when compiling the pipeline to run in a different environment, these environmental differences can impact the reproducibility of machine learning pipelines.

Best practices for building ML pipelines

From sifting through community conversations, talking to engineers at companies like Brainly and Hypefactors, and distilling learnings from Netflix, Lyft, Spotify, and others, here are some of the best practices for building ML pipelines.

Track your machine learning pipelines

We automatically attach an experiment tracker to every pipeline we launch without our users noticing. For us, this ensures at least a minimum set of parameters being tracked... In principle, we see experiment tracking as a tool that should be used with the pipeline. We recommend using a pipeline to track your experiments – that's how you'll ensure they are reproducible. – Simon Stiebellehner, MLOps Lead Engineer and MLE Craft Lead at TMNL, in "Differences Between Shipping Classic Software and Operating ML Models" on MLOps LIVE.

You want to leverage techniques and technologies to make your pipeline reproducible and debuggable. This involves exploring practices, including:

  • Version control – to manage code, data, configuration, library dependencies, pipeline metadata, and artifacts, allowing for easy tracking and comparison of pipeline versions.
  • Implementing system governance. Depending on the steps in your pipeline, you can analyze the metadata of pipeline runs and the lineage of ML artifacts to answer system governance questions. For example, you could use metadata to determine which version of your model was in production at a given time.
  • Using dedicated tools and frameworks that support tracking and management of pipelines, such as neptune.ai or MLflow, can provide comprehensive tracking and monitoring capabilities.

The tracking tools allow you to: 

  • log experiment results,
  • visualize pipeline components,
  • document details of the steps to facilitate collaboration among team members,
  • monitor pipeline performance during execution, making it easier to track the evolution of the pipeline over time,
  • manage the pipeline’s progress.
Dig deeper

Here’s a nice case study on how ReSpo.Vision tracks their pipelines with neptune.ai

ReSpo.Vision uses ML in sports data analysis to extract 3D data from single-view camera sports broadcast videos. They run a lot of kedro pipelines in the process.

Wojtek Rosiński, CTO at ReSpo.Vision, says: "When we use Neptune with kedro, we can easily track the progress of pipelines being run on many machines, because often we run many pipelines concurrently, so comfortably tracking each of them becomes almost impossible. With Neptune, we can also easily run several pipelines using different parameters and then compare the results via UI."

Below, you can see an example of what it looks like in Neptune's UI.

Neptune natively integrates with tools like Kedro and ZenML. But even without an out-of-the-box integration, you can use it with any other pipeline tool you have in place.


Compose your pipeline components into smaller functions

Use pipelining tools and their SDKs to build your pipeline with reusable components (defined as small functions). See an example that follows the ZenML pipeline workflow:

import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.svm import SVC

from zenml import step


@step
def svc_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a sklearn SVC classifier."""
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model

This way, you can implement your workflow by building custom components or reusing pre-built ones. This can make it easier and quicker to build new pipelines, debug existing ones, and integrate them with other organizational tech services.
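
Chaining such steps into a pipeline then stays in plain Python. The sketch below assumes a recent ZenML release and adds a hypothetical data-loading step alongside the svc_trainer step from above:

from typing import Tuple

import numpy as np
from sklearn.datasets import load_digits
from zenml import pipeline, step


@step
def digits_loader() -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical loading step that returns training features and labels."""
    digits = load_digits()
    return digits.data, digits.target


@pipeline
def training_pipeline():
    # Chain the loader with the svc_trainer step defined above.
    X_train, y_train = digits_loader()
    svc_trainer(X_train=X_train, y_train=y_train)


if __name__ == "__main__":
    # Calling the decorated function runs the pipeline on the active ZenML stack.
    training_pipeline()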

Do not load things at the module level; this is often a bad thing. You don't want the module load to take forever and fail. – Ketan Umare, Co-Founder and CEO at Union.ai, in an AMA session at MLOps.community 2022.

Below is another example of a step defined as a function with the Prefect orchestration tool:

import pandas as pd
from prefect import task
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


@task
def split_data(data: pd.DataFrame):
    # Split the dataset randomly into 70% for training and 30% for testing.
    X = data.drop("rented_bikes", axis=1)
    y = data.rented_bikes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.7, test_size=0.3, random_state=42
    )
    return X_train, X_test, y_train, y_test


@task
def train_model(X_train: pd.DataFrame, y_train: pd.DataFrame):
    # Create the model instance: GBRT (Gradient Boosted Regression Tree).
    model = GradientBoostingRegressor()
    # Model training.
    model.fit(X_train, y_train)
    return model
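
In Prefect 2.x, these tasks can then be composed into a flow; a minimal sketch reusing the two tasks above, with a placeholder CSV path as the data source, could look like this:

import pandas as pd
from prefect import flow


@flow
def training_flow(data_path: str):
    # Placeholder data source; swap in wherever your dataset actually lives.
    data = pd.read_csv(data_path)
    X_train, X_test, y_train, y_test = split_data(data)
    model = train_model(X_train, y_train)
    return model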

Write pipeline tests

Another best practice is to ensure you build a test suite that covers each aspect of your pipeline, from the functions that make up the components to the entire pipeline run. If possible (and depending on the use case), be willing to automate these tests.

To guarantee that models continue to work as expected during continuous changes to the underlying training or serving container images, we have a unique family of tests applicable to LyftLearn Serving called model self-tests. – Mihir Mathur, Product Manager at Lyft, in the "Powering Millions of Real-Time Decisions with LyftLearn Serving" blog post (2023).

Composing your pipeline components into smaller functions can make them easier to test. See an example from Lyft's model self-tests, where they specify a small number of sample model inputs and expected outputs in a property called `test_data`:

import pandas as pd
# TrainableModel is LyftLearn Serving's base class (internal to Lyft's stack).

class SampleNeuralNetworkModel(TrainableModel):
    @property
    def test_data(self) -> pd.DataFrame:
        return pd.DataFrame(
            [
                # input `[1, 0, 0]` should generate output close to `[1]`
                [[1, 0, 0], 1],
                [[1, 1, 0], 1],
            ],
            columns=["input", "score"],
        )

Write tests that you can run locally: in cases where your stack and setup make local testing impossible, you (and your users) will likely end up testing in production. Containerizing your steps can make it easier to test your pipelines locally or in another environment before deploying them to production.

What are the pipeline tests you should write? Eugene Yan, in his article, listed a scope map for what effective pipeline tests should look like, including unit tests, integration tests, function tests, end-to-end tests, and so on. Check out the extensive article.
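
For example, a unit test for the split_data task from the earlier Prefect snippet might look like the sketch below; the synthetic columns are made up, and .fn calls the plain Python function behind the Prefect task.

import pandas as pd
# from pipelines.tasks import split_data  # hypothetical import path for the task above


def test_split_data_preserves_rows_and_ratio():
    # Tiny synthetic dataset with the target column the task expects.
    data = pd.DataFrame({
        "temperature": range(100),
        "humidity": range(100),
        "rented_bikes": range(100),
    })

    # Call the plain function behind the Prefect task.
    X_train, X_test, y_train, y_test = split_data.fn(data)

    assert len(X_train) + len(X_test) == len(data)
    assert len(X_train) == 70          # 70% train split
    assert "rented_bikes" not in X_train.columns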

Conclusion

Building end-to-end machine learning pipelines is a critical skill for modern machine learning engineers. By following best practices such as thorough testing and validation, monitoring and tracking, automation, and scheduling, you can ensure the reliability and efficiency of pipelines. 

With a solid understanding of each pipeline stage’s components, structure, and challenges, you can build robust and scalable pipelines that streamline your ML workflow. 

Happy pipelining!

