Data Analysis at Warp Speed: Explore the World of Polars

Empowering Data Scientists and Engineers with Lightning-Fast Data Analysis and Transformation Capabilities

Vinayak Shanawad
Jul 1, 2023

🎯Goal

The objective of this post is to show how Polars outperforms other open-source data frame libraries on common data analysis tasks, such as reading data and running aggregation queries.

📑Abstract

  • Polars is a fast-growing open-source data frame library that is rapidly becoming the preferred choice for data scientists and data engineers in Python.
  • It is available in multiple languages: Python, Rust, and NodeJS.
  • While Pandas is widely adopted, it has limitations in terms of performance and syntax.
  • Polars has a syntax similar to Pandas, but more concise and expressive.
  • It is a powerful in-memory library built with Rust, a high-performance systems programming language, and leverages Arrow, a high-performance columnar data format.

If you are looking for a fast and intuitive data frame library for Python, then Polars is a great option.

Open Source Data frame libraries

Data frame libraries (Image by Author)

Let’s look at a detailed comparison of the five libraries.

Comparison of data frame libraries (Image by Author)

🤷‍♂️Why use Polars?

  • It makes use of all available cores on your machine, whereas Pandas mostly runs on a single core.
  • Fast, flexible analysis with the Expression API and easy parallel computation.
  • Automatic query optimization in lazy mode.
  • Streaming to work with larger-than-memory datasets (see the sketch after this list).
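
To make the streaming point concrete, here is a minimal sketch; it assumes a local train.csv with the passenger_count and trip_duration columns used later in this post, and a Polars version that supports collect(streaming=True).

import polars as pl

# Streaming execution: Polars processes the file in batches instead of
# loading everything into memory at once.
result = (
    pl.scan_csv("train.csv")                   # lazy: nothing is read yet
    .filter(pl.col("passenger_count") > 1)
    .select(pl.col("trip_duration").mean())
    .collect(streaming=True)                   # run the plan in streaming batches
)
print(result)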

⏱️Performance benchmarking

Let’s try it on a Kaggle competition dataset based on the 2016 NYC Yellow Cab trip record data and compare the numbers across different libraries.

Install the libraries

Let’s install the pandas, polars, modin, and pyarrow libraries using pip.

%pip install pandas # pandas==2.0.3
%pip install polars # polars==0.18.4
%pip install modin # modin==0.22.2
%pip install pyarrow # pyarrow==12.0.1

Import the libraries

We are going to import the polars, modin, and pyarrow libraries, set up Modin, and start analyzing the NYC cab dataset.

Modin needs a cluster client to distribute work; here we create a Dask client (from the distributed package).

import pandas as pd
import polars as pl
import modin.pandas as mpd
from pyarrow import csv

# Setting up Modin
from distributed import Client
client = Client()
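
Note: Modin picks its compute engine automatically based on what is installed. If you want to pin it to Dask explicitly (an optional step, not required for this benchmark), modin.config supports that:

import modin.config as cfg

# Tell Modin to use the Dask engine backed by the client created above.
cfg.Engine.put("dask")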

Dataset

Let’s use the NYC Yellow Cab trip training dataset for data analysis.

train_set = "train.csv"

Reading with Polars

Let’s read the training dataset using Polars. Wow, it takes only 266 ms, and like Pandas it holds all the records in system memory.

We are using the read_csv function, which executes eagerly.

%%timeit -n1 -r1
pl.read_csv(train_set)
Polars reading time (Image by Author)

Reading with Pandas

Let’s read the training dataset using Pandas; it takes 3.94 s.

%%timeit -n1 -r1
pd.read_csv(train_set)
Pandas reading time (Image by Author)

Reading with pyArrow

PyArrow is a Python library for working with Apache Arrow, a high-performance columnar data format for storing and processing data. PyArrow provides a variety of functions for reading and manipulating Arrow data.

Let’s read the training dataset using PyArrow; it takes 1.29 s.

%%timeit -n1 -r1
csv.read_csv(train_set)
PyArrow reading time (Image by Author)
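
Note that csv.read_csv returns a pyarrow.Table rather than a Pandas DataFrame. A short sketch of converting it when you need the Pandas API:

# The Arrow table can be converted to a Pandas DataFrame on demand.
tbl = csv.read_csv(train_set)
df = tbl.to_pandas()
print(df.shape)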

Reading with Modin, which runs Pandas in parallel

Let’s read the training dataset using Modin; it takes 3.62 s.

%%timeit -n1 -r1
mpd.read_csv(train_set)
Modin reading time (Image by Author)

Comparing these four data frame libraries, Polars is far faster than the others with its default configuration.

Query Optimization

Let’s run the same query with each of these libraries and see which one is fastest.

With Polars

The NYC Cab training dataset contains about 1.4 million records. If we want only the top 5 rows, both Pandas and eager-mode Polars still load all 1.4 M records into memory before printing them.

With Polars lazy mode, however, we can load only the rows we need into memory, which is much faster than eager mode.
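
As a quick illustration (a sketch, not part of the benchmark), compare the two modes side by side; in lazy mode the limit is pushed down into the CSV reader, so only the requested rows are materialized.

# Eager: reads all ~1.4 M rows into memory, then slices the top 5.
pl.read_csv(train_set).head(5)

# Lazy: the optimizer pushes the limit down, so only 5 rows are read.
pl.scan_csv(train_set).head(5).collect()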

We read the dataset in Polars lazy mode using the scan_csv function; Polars builds an optimized plan, loads only what it needs into memory, groups by the 'passenger_count' column, and aggregates the 'max', 'min', and 'mean' of the 'trip_duration' column.

When a query contains multiple expressions like the ones below, Polars runs them in parallel.

Awesome, it takes just 270 ms.

%%timeit -n1 -r1

(
    pl.scan_csv(train_set)
    .groupby('passenger_count')
    .agg(
        [
            pl.col('trip_duration').mean().suffix('_mean'),
            pl.col('trip_duration').max().suffix('_max'),
            pl.col('trip_duration').min().suffix('_min'),
        ]
    )
    .collect()
)
Polars query time (Image by Author)

With Pandas

We see that Pandas takes 4.04 s for the same query.

%%timeit -n1 -r1

(
    pd.read_csv(train_set)
    .groupby('passenger_count')
    .agg({'trip_duration': ['mean', 'max', 'min']})
)
Pandas query time (Image by Author)

With Pandas, with a little more optimization

There are 11 columns in the NYC cab dataset. We can optimize the query by selecting only the required columns, which brings Pandas down to 1.21 s for the same query.

%%timeit -n1 -r1

(
    pd.read_csv(train_set, usecols=['passenger_count', 'trip_duration'])
    .groupby('passenger_count')
    .agg({'trip_duration': ['mean', 'max', 'min']})
)
Pandas query time with little optimization (Image by Author)

With Modin

We see that Modin takes 3.26 s for the same query.

%%timeit -n1 -r1

(
    mpd.read_csv(train_set)
    .groupby('passenger_count')
    .agg({'trip_duration': ['mean', 'max', 'min']})
)
Modin query time (Image by Author)

Notice that Polars executes the above query at blazing speed compared to the other libraries.

Polars Lazy mode and query optimization

Polars has a powerful feature called lazy mode. In this mode, Polars looks at a query as a whole to build a query plan. Before running the query, Polars passes the plan through its query optimizer to see if there are ways to make the query faster.

When working with a CSV, we can switch from eager mode to lazy mode by replacing read_csv with scan_csv:

(
    pl.scan_csv(train_set)
    .groupby('passenger_count')
    .agg(
        [
            pl.col('trip_duration').mean().suffix('_mean')
        ]
    )
)

The output of the above lazy query is a LazyFrame, and executing the code displays the unoptimized, or naive, query plan, in which Polars still considers all 11 columns (*/11).

Polars Naive Query Plan (Image by Author)
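
To inspect the naive plan yourself, print the LazyFrame; assuming your Polars version supports the optimized flag, explain(optimized=False) shows the same thing.

query = (
    pl.scan_csv(train_set)
    .groupby('passenger_count')
    .agg([pl.col('trip_duration').mean().suffix('_mean')])
)
print(query)                             # displays the naive (unoptimized) plan
# print(query.explain(optimized=False))  # equivalent, if the flag is supported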

Query optimizer

We can see the optimized query plan that Polars will actually run by calling explain() at the end of the query.

print(
    pl.scan_csv(train_set)
    .groupby('passenger_count')
    .agg(
        [
            pl.col('trip_duration').mean().suffix('_mean')
        ]
    )
    .explain()
)
Polars Optimized Query Plan (Image by Author)

In this example, Polars has identified an optimization:

PROJECT 2/11 COLUMNS

There are 11 columns in the CSV file, but the query optimizer sees that only two of them are required for the query. When the query is evaluated, Polars will PROJECT 2 of the 11 columns, reading only those two from the CSV. This projection saves memory and computation time.

To execute the optimized query plan and get the results, call the collect() function.
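
Putting it all together, the full round trip looks like this; collect() triggers the optimized plan and returns a regular DataFrame:

result = (
    pl.scan_csv(train_set)
    .groupby('passenger_count')
    .agg([pl.col('trip_duration').mean().suffix('_mean')])
    .collect()   # runs the optimized plan and materializes a DataFrame
)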

📝Summary

In this post, I have shown how Polars outperforms its open-source counterparts. However, the ideal choice of data frame library ultimately hinges on your specific requirements. If you are working with vast datasets that demand scalability across multiple machines, then Dask, RAPIDS, and Spark are formidable contenders. On the other hand, if you need a library with a Pandas-like syntax, an extensive feature set, and lightning-fast query execution, Polars is a great option. Dask scales Pandas today, and one day it may help scale Polars as well.


🙏Acknowledgments

A big shout-out and heartfelt thanks to Suman Debnath, Principal Developer Advocate at AWS, for introducing me to the incredible Polars library and inspiring me to write this post.

