
ETL Pipelines With Python Azure Functions

George Kamtziridis
5 min read · Jul 8, 2023


Managing data workflows isn’t something new. We’ve been doing it for years by utilizing various tools and platforms, and there are lots of different approaches to follow depending on the scenario you’re working on. One of them is Azure Functions. In this article we’re going to look at what an Azure function is and how we can employ it to create a basic extract, transform and load (ETL) pipeline with minimal code.

Extract, Transform and Load

Before we begin, let’s shed some light on what an ETL pipeline essentially is. As pointed out previously, ETL stands for extract, transform and load, and refers to the process that extracts data from a source point, transforms the data in some way and loads the output data to a target point. The source and target points can be any storage service, for instance an Azure Blob Storage container, an AWS S3 bucket or a database system, to name a few.

There are mainly two types of ETL: batch and streaming. A batch ETL runs on a predefined schedule, where the data are processed at specific points in time. On the other hand, a streaming ETL runs very frequently, as new data arrive. For example, Internet of Things (IoT) devices broadcast data in a continuous manner, so in order to monitor them we would need a streaming ETL.

Besides ETL, there are other similar processes, such as EL and ELT. EL stands for extract and load, and its primary goal is simply to move the data from one place to another, where the destination is usually a Data Warehouse or a Data Lake. ELT stands for extract, load and transform. The most fundamental difference between ELT and ETL is that the former first loads the data into the target storage and then processes them. We won’t delve into more detail on the different pipelines in this article. However, if you want to learn more, let me know in the comments.

Azure Functions

As Microsoft states, Azure Functions is a serverless compute service that allows users to run code without having to manage their own infrastructure. An Azure function contains code written in a programming language, for instance Python, which is triggered on demand. Azure supports multiple triggers, such as HTTP requests and data storage changes. You can think of functions as relatively small units of work, where each one does something specific. Their cost depends on the resources consumed during execution, following the pay-as-you-go pricing model.

Task

Our task here is to create an ETL pipeline that will be triggered each time a new JSON file appears. The file will contain user transactions consisting of 3 columns: UserId, Amount and Date. The pipeline must read the data, aggregate the amounts per user and, finally, load the output data to another storage unit. The initial file will be provided by, let’s say, the web development team of our company in an Azure Blob storage container. We, the data engineering team, will have to extract the data, preprocess them and, finally, load the data into some other Blob storage container.

Project Setup

The first step to setting up our project is to have a Microsoft Azure account. If you don’t have one, by creating a new account you’ll get $200 in credits for free, which are more than enough for the purposes of our pipeline. Then, we will need to create a new resource group, where all of our resources will reside. After this, create a new storage account with 2 containers. I’ve named the first container `webdev` and the second one `dataengineering`. In the `webdev` container the web development team will upload the JSON files, while in the `dataengineering` container we will store the preprocessed files.

Now that we have a basic infrastructure, let’s create our Azure function workspace. The easiest way to do this is to use the Azure extension in VSCode. You’ll need to log in to your Azure account, hit the `Create function` button and follow the wizard. In this article we are going to use the Python v2 programming model with a blob storage trigger.

The project structure has 3 primary files:

  • function_app.py: Contains the core implementation of the Azure function
  • local.settings.json: Contains environment variables such as the Azure Blob Storage connection string
  • requirements.txt: Contains the dependencies of our application

For our scenario, we need the azure-functions package in order to build our function, the azure-storage-blob to communicate with our blob storage and the pandas package to manipulate our data.
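A minimal requirements.txt for this setup could therefore look like the following (version pins are up to you):

```
azure-functions
azure-storage-blob
pandas
```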

My local.settings.json has a property called AzureWebJobsStorage, which is essentially the connection string to my storage account. You should put your own connection string there:
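For reference, a typical local.settings.json for the Python v2 model looks roughly like this (replace the placeholder with your own connection string):

```json
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "<your-storage-account-connection-string>",
    "FUNCTIONS_WORKER_RUNTIME": "python"
  }
}
```

Depending on your Azure Functions Core Tools version, the v2 programming model may also require the AzureWebJobsFeatureFlags setting to be set to EnableWorkerIndexing.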

Implementation

Azure functions make it very easy for us to implement relatively complex data orchestrations with just a few lines of code, without the overhead of managing our own infrastructure. And just to show you how easy and clean it is, take a look at the function_app.py file:
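A minimal sketch of what this function can look like with the Python v2 programming model is shown below; the function and argument names are illustrative, but the flow follows the description that comes next:

```python
import io
import logging
import time

import azure.functions as func
import pandas as pd

from utilities import preprocess, save_dataframe_to_blob

app = func.FunctionApp()


@app.blob_trigger(arg_name="inputblob", path="webdev/{name}",
                  connection="AzureWebJobsStorage")
def transactions_etl(inputblob: func.InputStream):
    # Extract: read the incoming JSON blob into a pandas dataframe
    df = pd.read_json(io.BytesIO(inputblob.read()))
    logging.info("Processing %s with %d rows", inputblob.name, len(df))

    # Transform: aggregate the transaction amounts per user
    preprocessed_df = preprocess(df)

    # Load: store the result in the dataengineering container under a unique name
    output_name = f"transactions_{int(time.time())}_preprocessed.csv"
    save_dataframe_to_blob(preprocessed_df, output_name)
```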

The blob_trigger decorator indicates that our function will be triggered by our webdev container, which is located in the storage account referenced by the AzureWebJobsStorage connection string. Our function receives the blob file, and the very first thing we do is convert it into a pandas dataframe. Then we preprocess our data and create a unique file name for the new dataset. Eventually, we save our dataframe to the dataengineering container of our storage service. To keep our function clean and tidy, we’ve moved the preprocessing flow and the instructions that save our dataframe into a separate file called utilities.py.
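A sketch of utilities.py along these lines could look as follows (again, treat it as illustrative rather than the exact original):

```python
import os

import pandas as pd
from azure.storage.blob import BlobServiceClient


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Group the transactions by user and sum the amounts;
    # reset_index keeps UserId as a regular column in the output
    return df.groupby("UserId")["Amount"].sum().reset_index()


def save_dataframe_to_blob(df: pd.DataFrame, blob_name: str) -> None:
    # Serialize the dataframe to CSV in memory
    csv_data = df.to_csv(index=False)

    # Connect to the storage account and the dataengineering container
    connection_string = os.environ["AzureWebJobsStorage"]
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client("dataengineering")

    # Upload the CSV file to the container
    container_client.upload_blob(name=blob_name, data=csv_data, overwrite=True)
```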

The preprocess function is self-explanatory. It uses the UserId column to group the rows and then calculates the total amount of each user’s transactions. It is important to reset the index in order to include UserId as a column in the output dataframe. The save_dataframe_to_blob function converts the dataframe to a CSV file, creates a connection to our dataengineering container and, lastly, uploads the file.

We can run the entire flow in debug mode in VSCode, just like we would in any other project, without needing to register an actual Azure function in our Azure account. After you hit the Run button, try to manually upload the following file into the webdev container via the Azure portal:
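Any JSON array with the three expected columns will do; a hypothetical transactions.json could be:

```json
[
  { "UserId": 1, "Amount": 100.50, "Date": "2023-07-01" },
  { "UserId": 2, "Amount": 200.00, "Date": "2023-07-02" },
  { "UserId": 1, "Amount": 50.25, "Date": "2023-07-03" }
]
```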

If everything runs smoothly, you’ll see the initial dataframe size logged in the VSCode terminal and, of course, a new file in the dataengineering container with a name similar to the one I got: transactions_1688807481_preprocessed.csv.
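For the hypothetical input above, the aggregated output would contain one row per user, roughly:

```
UserId,Amount
1,150.75
2,200.0
```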

Conclusion

And that’s basically it! These minimal functions will not only get our job done, but they can also support more sophisticated data operations. Furthermore, Azure functions can be used in an Azure Data Factory pipeline, which greatly extends what can be accomplished in terms of data workflows.

Feel free to play around with the current implementation and let me know in the comments if you’d like to see a more advanced pipeline built in an Azure Data Factory.


George Kamtziridis

Full Stack Software Engineer and Data Scientist at fromScratch Studio. BEng, MEng (Electrical Engineering/Computer Engineering) MSc (Artificial Intelligence)