Data Analysis at Warp Speed: Explore the World of Polars
Empowering Data Scientists and Engineers with Lightning-Fast Data Analysis and Transformation Capabilities
🎯Goal
The objective of this post is to demonstrate that Polars performs much better than other open-source data frame libraries across a variety of data analysis tasks, such as data cleaning, data wrangling, and data visualization.
📑Abstract
- Polars is a fast-growing open-source data frame library that is rapidly becoming the preferred choice for data scientists and data engineers in Python.
- It is available in multiple languages: Python, Rust, and NodeJS.
- While Pandas is widely adopted, it has limitations in terms of performance and syntax.
- Polars has a syntax similar to Pandas, but it is more concise and expressive.
- It is a powerful in-memory library built with Rust, a high-performance systems programming language, and it leverages Apache Arrow, a high-performance columnar data format.
If you are looking for a fast and intuitive data frame library for Python, then Polars is a great option.
Open Source Data frame libraries
Several open-source data frame libraries are available for Python. In this post, we compare four of them — Pandas, Polars, Modin, and PyArrow — on the same workload.
🤷‍♂️Why use Polars?
- It makes use of all available cores on our computer or server, unlike Pandas, which mostly runs on a single core.
- Fast flexible analysis with the Expression API and easy parallel computations.
- Automatic query optimization in lazy mode.
- Streaming to work with larger-than-memory datasets.
⏱️Performance benchmarking
Let’s try it on a Kaggle competition dataset based on the 2016 NYC Yellow Cab trip record data and compare the numbers across the different libraries.
Install the libraries
Let’s install the pandas, polars, modin, and pyarrow libraries using pip.
%pip install pandas # pandas==2.0.3
%pip install polars # polars==0.18.4
%pip install modin # modin==0.22.2
%pip install pyarrow # pyarrow==12.0.1
Import the libraries
We are going to import the pandas, polars, modin, and pyarrow libraries, set up Modin, and start analyzing the NYC cab dataset.
We need to create a Dask Client because Modin uses it to run computations in parallel.
import pandas as pd
import polars as pl
import modin.pandas as mpd
from pyarrow import csv
# Setting up Modin
from distributed import Client
client = Client()
Dataset
Let’s use the NYC Yellow Cab trip training dataset for data analysis.
train_set = "train.csv"
Reading with Polars
Let’s read the training dataset using Polars. Wow — it takes only 266 ms to read the dataset, and like Pandas it holds all records in system memory.
We are using the read_csv function, which is Polars’ eager mode of execution.
%%timeit -n1 -r1
pl.read_csv(train_set)
Reading with Pandas
Let’s read the training dataset using Pandas; it takes 3.94 s.
%%timeit -n1 -r1
pd.read_csv(train_set)
Reading with PyArrow
PyArrow is a Python library for working with Apache Arrow. Apache Arrow is a high-performance columnar data format that can be used to store and process data. It also provides a variety of functions to work with Apache Arrow data.
Let’s read the training dataset using PyArrow; it takes 1.29 s.
%%timeit -n1 -r1
csv.read_csv(train_set)
Reading with Modin
Modin runs Pandas in parallel. Let’s read the training dataset using Modin; it takes 3.62 s.
%%timeit -n1 -r1
mpd.read_csv(train_set)
Comparing these four data frame libraries, Polars is far faster than the others with its default configuration.
Query Optimization
Let’s run the same aggregation query with these libraries and see which one is faster.
With Polars
The NYC Cab training dataset has 1.4 million records. In eager mode — in both Pandas and Polars — even if we only want the top 5 rows, we load all 1.4 M records into memory before printing them.
With Polars’ lazy mode, however, we can load only the rows we need and print them, which is much faster than eager mode.
We read the dataset in Polars’ lazy mode using the scan_csv function. Polars builds an optimized plan, loads only what is needed into system memory, groups by ‘passenger_count’, and aggregates the ‘max’, ‘min’, and ‘mean’ of the ‘trip_duration’ column.
When a query contains multiple expressions like the ones below, Polars runs them in parallel.
Awesome — it takes only 270 ms.
%%timeit -n1 -r1
(
pl.scan_csv(train_set)
.groupby('passenger_count')
.agg(
[
pl.col('trip_duration').mean().suffix('_mean'),
pl.col('trip_duration').max().suffix('_max'),
pl.col('trip_duration').min().suffix('_min'),
]
)
.collect()
)
With Pandas
We see that Pandas takes 4.04 s for the same query.
%%timeit -n1 -r1
(
pd.read_csv(train_set)
.groupby('passenger_count')
.agg(
{
'trip_duration': ['mean', 'max', 'min']
}
)
)
With Pandas, with a little more optimization
There are 11 columns in the NYC cab dataset. We can optimize the above query by selecting only the required columns with usecols; Pandas then takes 1.21 s for the same query.
%%timeit -n1 -r1
(
pd.read_csv(train_set, usecols=['passenger_count', 'trip_duration'])
.groupby('passenger_count')
.agg(
{
'trip_duration': ['mean', 'max', 'min']
}
)
)
With Modin
We see that Modin takes 3.26 s for the same query.
%%timeit -n1 -r1
(
mpd.read_csv(train_set)
.groupby('passenger_count')
.agg(
{
'trip_duration': ['mean', 'max', 'min']
}
)
)
We can see that Polars executes the above query at blazing speed compared to the other libraries.
Polars Lazy mode and query optimization
Polars has a powerful feature called lazy mode. In this mode, Polars looks at a query as a whole to make a query path. Before running the query Polars passes the query graph through its query optimizer to see if there are ways to make the query faster.
When working with a CSV, we can switch from eager mode to lazy mode by replacing read_csv with scan_csv:
(
pl.scan_csv(train_set)
.groupby('passenger_count')
.agg(
[
pl.col('trip_duration').mean().suffix('_mean')
]
)
)
The output of the above lazy query is a LazyFrame, and printing it shows the unoptimized, or naive, query plan. In the naive plan, Polars considers all columns (*/11).
Query optimizer
We can see the optimized query plan that Polars will actually run by adding explain at the end of the query.
print(
pl.scan_csv(train_set)
.groupby('passenger_count')
.agg(
[
pl.col('trip_duration').mean().suffix('_mean')
]
).explain()
)
In this example, Polars has identified an optimization:
PROJECT 2/11 COLUMNS
There are 11 columns in the CSV file, but the query optimizer sees that only two of them are required for the query. When the query is evaluated, Polars will PROJECT 2 out of 11 columns, reading only those 2 columns from the CSV. This projection saves memory and computation time.
To execute the optimized query plan, call the collect() function at the end of the query to get the results.
📝Summary
In this post, I have shown you how Polars outperforms its open-source counterparts. However, the ideal choice of a data frame library ultimately hinges on your specific requirements. If you’re working with vast datasets that demand scalability across multiple machines, then Dask, RAPIDS, and Spark emerge as formidable contenders. On the other hand, if you need a library with a syntax similar to Pandas, an extensive feature set, and lightning-fast query execution, Polars is a great option. Dask helps scale Pandas, and one day it might help scale Polars too.
📚References
I recommend going through the following references to start learning about the Polars library and using it in your projects.
🙏Acknowledgments
A big shout-out and heartfelt thanks to Suman Debnath, Principal Developer Advocate at AWS, for introducing me to the incredible Polars library and inspiring me to write this post.