MLOps Blog

How to Build ETL Data Pipeline in ML

7 min
19th May, 2023

From data processing to quick insights, robust pipelines are a must for any ML system. Often the Data Team, comprising Data and ML Engineers, needs to build this infrastructure, and this experience can be painful. However, efficient use of ETL pipelines in ML can help make their life much easier.

This article explores the importance of ETL pipelines in machine learning, walks through a hands-on example of building one with a popular tool, and suggests best ways for data engineers to enhance and sustain their pipelines. We also discuss different types of ETL pipelines for ML use cases and provide real-world examples of their use to help data engineers choose the right one.

Before delving into the technical details, let’s review some fundamental concepts.

What is an ETL data pipeline in ML?

An ETL data pipeline is a collection of tools and activities for Extracting (E), Transforming (T), and Loading (L) the required data.

ETL pipeline
ETL pipeline | Source: Author

These activities involve extracting data from one system, transforming it, and then loading it into a target system where it can be stored and managed.

ML heavily relies on ETL pipelines as the accuracy and effectiveness of a model are directly impacted by the quality of the training data. These pipelines assist data scientists in saving time and effort by ensuring that the data is clean, properly formatted, and ready for use in machine learning tasks.

Moreover, ETL pipelines play a crucial role in breaking down data silos and establishing a single source of truth. Let's look at the importance of ETL pipelines in detail.

Why do we need an ETL pipeline in machine learning?

The significance of ETL pipelines lies in the fact that they enable organizations to derive valuable insights from large and complex data sets. Here are some specific reasons why they are important:

  • Data Integration: Organizations can integrate data from various sources using ETL pipelines. This provides data scientists with a unified view of the data and helps them decide how the model should be trained, values for hyperparameters, etc.
  • Data Quality Check: As the data flows through the integration step, ETL pipelines can then help improve the quality of data by standardizing, cleaning, and validating it. This ensures that the data which will be used for ML is accurate, reliable, and consistent.
  • Save Time: ETL pipelines automate the three major steps of Extract, Transform, and Load, which saves a lot of time and reduces the likelihood of human error. This allows data scientists to keep their focus on creating models or continuously improving them.
  • Scalable: Modern ETL pipelines are scalable, i.e., they can be scaled up or down depending on the amount of data they need to process. They come with the flexibility and agility to accommodate changes based on business needs.

Check also

In-Depth ETL in Machine Learning Tutorial – Case Study With Neptune

What is the difference between ETL and data pipeline?

Data pipeline is an umbrella term for the category of moving data between different systems, and ETL data pipeline is a type of data pipeline. – Xoriant

It is common to use ETL data pipeline and data pipeline interchangeably. Even though both terms refer to processes of moving data from various sources to a single repository, they are not the same. Let's explore why we should not use them synonymously.

Terminology:
  • ETL Pipeline: As the abbreviation suggests, ETL involves a series of processes: extracting the data, transforming it, and finally loading it into the target system.
  • Data Pipeline: A data pipeline also involves moving data from one source to another but does not necessarily include a transformation step.

Focus Area:
  • ETL Pipeline: ETL transforms raw data into a structured format that data scientists can readily use to build models and interpret for data-driven decisions.
  • Data Pipeline: A data pipeline is created with the focus of transferring data from a variety of sources into a data warehouse. Downstream processes or workflows can then use this data to build business intelligence and analytics solutions.

Operation:
  • ETL Pipeline: ETL pipelines run on a schedule, e.g., daily, weekly, or monthly. Basic ETL pipelines are batch-oriented, where data is moved in chunks on a specified schedule.
  • Data Pipeline: Data pipelines often run real-time processing: data is updated continuously and supports real-time reporting and analysis.

In summary, ETL pipelines are a type of data pipeline specifically designed to extract data from multiple sources, transform it into a common format, and load it into a data warehouse or other storage system. While data pipelines come in many forms, the ETL pipeline is one specific subset.

In the following sections, we will go through the basic architecture of an ETL pipeline, see how each step can serve different purposes, and look at the tools we can choose from to complete each step. The exact ETL architecture and pipeline type differ from organization to organization, as each has its own tech stack, data sources, and business requirements.

What are the different types of ETL pipelines in ML?

ETL pipelines can be categorized based on the type of data being processed and how it is being processed. Here are some of the types:

  • Batch ETL Pipeline: This is a traditional ETL approach that involves the processing of large amounts of data at once in batches. The data is extracted from one or more sources, transformed into the desired format, and loaded into a target system, such as a data warehouse. Batch ETL is particularly useful for training models on historical data or running periodic batch processing jobs.
  • Real-time ETL Pipeline: This processes data as it arrives, in near-real-time or real-time. Processing data continuously means less processing capacity is required at any one time, and spikes in usage can be avoided. Stream/real-time ETL is particularly useful for applications such as fraud detection, where real-time processing is critical. Real-time ETL pipelines require tools and technologies like stream processing engines and messaging systems.
  • Incremental ETL Pipeline: These pipelines only extract and process data that has changed since the last run instead of processing the entire dataset. They are useful when the source data changes frequently but the target system only needs the latest data, e.g., recommendation systems, where the data changes frequently but not in real-time (a minimal sketch of this pattern follows this list).
  • Cloud ETL Pipeline: Cloud ETL pipeline for ML involves using cloud-based services to extract, transform, and load data into an ML system for training and deployment. Cloud providers such as AWS, Microsoft Azure, and GCP offer a range of tools and services that can be used to build these pipelines. For example, AWS provides services such as AWS Glue for ETL, Amazon S3 for data storage, and Amazon SageMaker for ML training and deployment.
  • Hybrid ETL Pipeline: These pipelines combine batch and real-time processing, leveraging the strengths of both approaches. Hybrid ETL pipelines can process large batches of data at predetermined intervals and also capture real-time updates to the data as they arrive. Hybrid ETL is particularly useful for applications such as predictive maintenance, where a combination of real-time and historical data is needed to train models.
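To make the incremental pattern concrete, here is a minimal sketch that keeps a "last run" watermark and only pulls rows newer than it. The SQLite source, the events table, and the updated_at column are assumptions for illustration, not part of any specific system.

import sqlite3
from datetime import datetime, timezone

import pandas as pd

STATE_FILE = "last_run.txt"  # where the previous run's watermark is stored (assumed convention)


def read_last_run():
    # Return the timestamp of the previous run, or a very old date on the first run
    try:
        with open(STATE_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01T00:00:00+00:00"


def incremental_extract(db_path):
    last_run = read_last_run()
    conn = sqlite3.connect(db_path)
    # Only pull rows that changed since the last run
    df = pd.read_sql_query(
        "SELECT * FROM events WHERE updated_at > ?", conn, params=(last_run,)
    )
    conn.close()
    # Persist the new watermark for the next run
    with open(STATE_FILE, "w") as f:
        f.write(datetime.now(timezone.utc).isoformat())
    return df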

ETL pipeline tools

To create an ETL pipeline, as discussed in the previous sections, we need tools that cover the basic ETL architecture steps. There are several tools available on the market; they are commonly compared along the features listed below.

  • Cloud-based
  • Pre-built connectors
  • Serverless
  • Pre-built transformation options
  • API support
  • Fully managed

You may also like

Comparing Tools For Data Processing Pipelines

How to build an ML ETL pipeline?

In the previous section, we briefly explored some basic ETL concepts and tools. In this section, we will discuss how we can leverage them to build an ETL pipeline. First, let's talk about its architecture.

ETL architecture

The distinctive feature of the ETL architecture is that data goes through all required preparation procedures before it reaches the warehouse. As a result, the final repository contains clean, complete, and trustworthy data to be used further without amendments. – Coupler

ETL architecture is often illustrated with a diagram that outlines the flow of information in the ETL pipeline from the data sources to the final destination. It comprises three main areas: the landing area, the staging area, and the data warehouse area.

  • The Landing Area is the first destination for data after being extracted from the source location. It can store multiple batches of data before moving it through the ETL pipeline.
  • The Staging Area is an intermediate location for performing ETL transformations.
  • The Data Warehouse Area is the final destination for data in an ETL pipeline. It is used for analyzing data to obtain valuable insights and make better business decisions. 

The ETL data pipeline architecture is layered. Each subsystem is essential, and each one feeds sequentially into the next until the data reaches its destination.

ETL data pipeline architecture
ETL data pipeline architecture | Source: Author
  1. Data Discovery: Data can be sourced from various types of systems, such as databases, file systems, APIs, or streaming sources. We also need data profiling, i.e., data discovery, to understand whether the data is appropriate for ETL. This involves looking at the data structure, relationships, and content.
  2. Ingestion: Pull the data from the various data sources into a staging area or data lake. Extraction can be done using techniques such as APIs, direct database connections, or file transfers. The data can be extracted all at once (e.g., from a database), incrementally (e.g., via APIs), or on change (e.g., from cloud storage like S3 on a trigger).
  3. Transformations: This stage involves cleaning, enriching, and shaping the data to fit the target system requirements. Data can be manipulated using techniques such as filtering, aggregating, joining, or applying complex business rules. Before manipulating the data, we also need to clean it by eliminating duplicate entries, dropping irrelevant data, and identifying erroneous data. This improves data accuracy and reliability for ML algorithms (a minimal end-to-end sketch of these stages follows this list).
  4. Data Storage: Store the transformed data in a format suitable for the ML models. The storage system could be a database, a data warehouse, or a cloud-based object store. The data can be stored in a structured or unstructured format, depending on the system's requirements.
  5. Feature Engineering: Feature engineering involves selecting, transforming, and combining raw data to create meaningful features that can be used for ML models. It directly impacts the accuracy and interpretability of the model. Effective feature engineering requires domain knowledge, creativity, and iterative experimentation to determine the optimal set of features for a particular problem.
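To make these stages concrete, here is a minimal, generic sketch of an extract, transform, and load flow using pandas. The file names and the id and amount columns are assumptions for illustration, not part of any specific system.

import pandas as pd

# Extract: pull raw data from a source system (here, a local CSV file)
def extract(source_path):
    return pd.read_csv(source_path)

# Transform: clean and reshape the data to fit the target schema
def transform(df):
    df = df.drop_duplicates()                             # remove duplicate entries
    df = df.dropna(subset=["id"])                         # drop rows missing a key field (assumed column)
    df["amount"] = df["amount"].fillna(0).astype(float)   # standardize a numeric column (assumed column)
    return df

# Load: write the transformed data to the target store (a local CSV standing in for a warehouse)
def load(df, target_path):
    df.to_csv(target_path, index=False)

if __name__ == "__main__":
    raw = extract("raw_orders.csv")
    clean = transform(raw)
    load(clean, "clean_orders.csv")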

Let's build our own ETL pipeline now using one of the discussed tools!

Building an ETL pipeline using Airflow

Imagine we want to create a machine learning classification model that can classify flowers into three categories: Setosa, Versicolor, and Virginica. We are going to use a dataset that gets updated, say, every week. This sounds like a job for a batch ETL data pipeline.

To set up a batch ETL data pipeline, we are going to use Apache Airflow, an open-source workflow management system that offers an easy way to write, schedule, and monitor ETL workflows.

Here are the generic steps to create an ETL workflow in Airflow:

  1. Set up an Airflow environment: Install and configure Airflow on your system. You can refer to the installation steps here.
  2. Define the DAG and configure the workflow: Define a Directed Acyclic Graph (DAG) in Airflow to orchestrate the ETL pipeline for our ML classifier. The DAG is a collection of tasks with dependencies between them. For this exercise, we use the PythonOperator to define the tasks and keep the DAG's schedule as None, since we will run the pipeline manually.

Create a DAG file – airflow_classification_ml_pipeline.py with the below code:

from airflow import DAG
from airflow.operators.python import PythonOperator

# Task callables are defined in a separate module (etl_functions.py, created in the next step)
from etl_functions import download_dataset
from etl_functions import data_preprocessing
from etl_functions import ml_training_classification

default_args = {
    'owner': 'airflow',
}

with DAG(
    dag_id='airflow_classification_ml_pipeline',  # Name of the DAG run
    default_args=default_args,
    description='Classification ML pipeline',
    schedule=None,  # no schedule; we will trigger the pipeline manually
) as dag:

    # Task 1 - Download the dataset
    task_download_dataset = PythonOperator(
        task_id='download_dataset',
        python_callable=download_dataset,
    )

    # Task 2 - Transform the data
    task_data_preprocessing = PythonOperator(
        task_id='data_preprocessing',
        python_callable=data_preprocessing,
    )

    # Task 3 - Train an ML model
    task_ml_training_classification = PythonOperator(
        task_id='ml_training_classification',
        python_callable=ml_training_classification,
    )

    # Define the workflow order
    task_download_dataset >> task_data_preprocessing >> task_ml_training_classification
  3. Implement the ETL tasks: Implement each task defined in the DAG. These tasks include loading the Iris dataset from the scikit-learn datasets package, transforming the data, and using the refined dataframe to train a machine learning model.

Create a Python file that contains all the ETL task functions – etl_functions.py:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

import pandas as pd
import numpy as np


def download_dataset():
    # Extract: load the Iris dataset and persist it as a CSV file
    iris = load_iris()
    iris_df = pd.DataFrame(
        data=np.c_[iris['data'], iris['target']],
        columns=iris['feature_names'] + ['target'])

    iris_df.to_csv("iris_dataset.csv")


def data_preprocessing():
    # Transform: fill missing feature values with the column mean
    iris_transform_df = pd.read_csv("iris_dataset.csv", index_col=0)
    cols = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]
    iris_transform_df[cols] = iris_transform_df[cols].fillna(
        iris_transform_df[cols].mean())
    iris_transform_df.to_csv("clean_iris_dataset.csv")
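The DAG also imports a third task, ml_training_classification, which is not shown above. A minimal sketch of that function, consistent with the scikit-learn imports at the top of etl_functions.py, could look like the following; the 80/20 split and the default RandomForestClassifier settings are assumptions for illustration.

def ml_training_classification():
    # Load the cleaned dataset produced by data_preprocessing
    clean_df = pd.read_csv("clean_iris_dataset.csv", index_col=0)
    X = clean_df.drop(columns=["target"])
    y = clean_df["target"]

    # Split into train and test sets (80/20 split is an assumption)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Train a simple classifier and report its performance
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    print("Accuracy:", accuracy_score(y_test, predictions))
    print("Confusion matrix:\n", confusion_matrix(y_test, predictions))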
  4. Monitor and manage the pipeline: Now that the DAG and workflow code are ready, we can monitor the entire ETL pipeline for ML on the Airflow server.

1. Get the DAG listed in the Airflow server.

DAG listed in the Airflow Server
DAG listed in the Airflow server | Source: Author

2. Check the workflow graph and Run the pipeline (Trigger the DAG):

The workflow graph in Airflow Server
The workflow graph | Source: Author

3. Monitor & check logs: After you trigger the DAG, you can monitor its progress in the UI. The image below shows that all three steps were successful.

Monitoring the progress of DAG in the UI
Monitoring the progress of DAG in the UI | Source: Author

You can also check how much time each task took using the Gantt chart in the UI:

Checking how much time each task has taken using Gantt chart
Checking how much time each task has taken using a Gantt chart | Source: Author

In this exercise, we created an ETL workflow using a DAG and did not set any schedule, but you can set the schedule to whatever you like and monitor the pipeline; a small example follows below. You can also try using a dataset that gets updated frequently and choose the schedule based on that.
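For example, a weekly schedule could look like this (assuming Airflow 2.4+, where the schedule parameter accepts cron presets and a start_date is required once a schedule is set):

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id='airflow_classification_ml_pipeline',
    description='Classification ML pipeline',
    schedule='@weekly',                 # run once a week instead of triggering manually
    start_date=datetime(2023, 5, 1),    # anchor date for the schedule
    catchup=False,                      # do not backfill missed runs
) as dag:
    ...  # same task definitions as before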

You can also scale Airflow orchestration by trying different operators and executors. If you are interested in exploring real-time ETL data pipelines, please follow this tutorial.

Best practices around building ETL pipelines in ML

For data-driven organizations, a robust ETL pipeline is essential. This involves:

  • Managing data sources effectively
  • Ensuring data quality and accuracy
  • Optimizing data flow for efficient processing


Integrating machine learning models with data analytics empowers organizations with advanced capabilities to predict demand with enhanced accuracy.

There are several best practices for building an ETL (Extract, Transform, Load) pipeline for machine learning applications. Here are some of the most important ones:

  • Start with a clear understanding of the requirements. Identify the data sources you will need to support the machine learning model, and ensure you are using appropriate data types so the data is correctly formatted for ML algorithms to process efficiently. Start with a subset of data and gradually scale up; this helps keep downstream tasks and processes in check.
  • Correct or remove inaccuracies and inconsistencies from the data. This is important because ML algorithms can be sensitive to inconsistencies and outliers in the data (a minimal validation sketch follows this list).
  • Secure your data and implement access control to ensure role-based access.
  • Make use of distributed file systems, parallelism, staging tables, or caching techniques where possible. This can speed up data processing, help optimize your pipeline, and ultimately improve the performance of the ML model.
  • Schedule or automate the data-driven workflows that move and transform data across sources.
  • Monitor and log the ETL data that will be used by your machine learning models, e.g., to keep track of any data drift that might affect model performance.
  • Maintain version control of your ETL code base. This helps you track changes, collaborate with other developers, and ensure that the pipeline keeps running as expected without impacting your model's performance.
  • If you are using cloud-based services, use their ETL templates to save time over creating everything from scratch.
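As a minimal sketch of the data quality point above, a few simple pandas checks can catch missing columns, duplicates, and invalid values before the data reaches the model. The column names and the non-negativity rule are assumptions based on the Iris example.

import pandas as pd


def validate_iris_data(df):
    # Drop exact duplicate rows
    df = df.drop_duplicates()

    # Fail fast if required columns are missing
    required = ["sepal length (cm)", "sepal width (cm)",
                "petal length (cm)", "petal width (cm)", "target"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")

    # Drop rows with negative measurements, which are physically impossible for these features
    feature_cols = required[:-1]
    bad_rows = (df[feature_cols] < 0).any(axis=1)
    if bad_rows.any():
        print(f"Dropping {bad_rows.sum()} rows with negative measurements")
        df = df[~bad_rows]

    return df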


Conclusion

Throughout this article, we walked through different aspects of ETL data pipelines in ML:

  • Why an ETL pipeline is important for creating a good machine learning model.
  • How to set up an ETL architecture and choose between different types of ETL data pipelines, depending on the data and the requirements.
  • How to build a batch ETL pipeline with Airflow, automating the ETL processes and logging and monitoring the workflows to keep a watch on everything that goes on.
  • How to create scalable and efficient ETL data pipelines.

I hope this article was useful for you. By referring to this article and the hands-on exercise of creating the batch pipeline, you should be able to create one on your own. You can choose any tool mentioned in the article and start your journey.

Happy Learning!
