Analyzing Real-Time Data Streams with Window Functions in Apache Spark

Write your first code to learn how window functions work with streaming data

Necati Demir
3 min read · Jul 20, 2023

In the growing world of Big Data, streaming data processing is a key component in dealing with real-time analytics. There are many frameworks out there, but Apache Spark holds an important place in this world. In this article, we will take a deep dive into a crucial part of PySpark Streaming: window functions.

By the end of this article, you’ll be able to leverage PySpark’s window functions to gain useful insights from your streaming data. We’ll be focusing on structured streaming in this article.

What Is Apache Spark and What Are the Basics of Streaming?

I have been writing about Apache Spark recently and have published several articles on the topic. If you are new to Spark or to streaming, I highly recommend going through those articles first.

Window Functions in Streaming

Window functions in streaming allow the user to define a “window”, a specific time range over the streaming data. By defining a window, users can perform operations such as aggregations on that slice of the data rather than on the entire data stream.

Let’s see how this works with an example. Say you run an e-commerce website and you are streaming data about which user viewed which product, and you want to calculate the most viewed product every 15 minutes.
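Before jumping into Spark, the mechanics of a tumbling window can be sketched in plain Python (illustration only, not Spark code; the sample events are made up): each event falls into exactly one 15-minute bucket, and views are counted per product within each bucket.

```python
from collections import Counter, defaultdict
from datetime import datetime

def window_start(ts: datetime) -> datetime:
    # Floor the timestamp to the start of its 15-minute window.
    return ts.replace(minute=(ts.minute // 15) * 15, second=0, microsecond=0)

# (user_id, product_id, timestamp) events -- hypothetical sample data
events = [
    ("u1", "pA", datetime(2023, 7, 20, 10, 2)),
    ("u2", "pA", datetime(2023, 7, 20, 10, 9)),
    ("u3", "pB", datetime(2023, 7, 20, 10, 14)),
    ("u1", "pB", datetime(2023, 7, 20, 10, 16)),  # falls into the next window
]

# Count views per product within each window
counts = defaultdict(Counter)
for user_id, product_id, ts in events:
    counts[window_start(ts)][product_id] += 1

# Report the most viewed product per window
for start, per_product in sorted(counts.items()):
    top_product, top_views = per_product.most_common(1)[0]
    print(start, "->", top_product, top_views)
# 2023-07-20 10:00:00 -> pA 2
# 2023-07-20 10:15:00 -> pB 1
```

This bucketing is exactly what Spark's `window()` function does for us, except Spark also handles the grouping and incremental updates across a live stream.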

Setting Up Streaming and Using Window Functions

Before we dive into window functions, we need to set up a PySpark structured streaming environment. In this example we will assume data arrives over a socket connection, but in real-world deployments a source such as Kafka would be a better choice.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, window, current_timestamp

# Set up the Spark session
spark = SparkSession.builder \
    .appName("PySpark Structured Streaming with Window Functions") \
    .getOrCreate()

# Define a socket (netcat) data source for structured streaming
df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into fields, with comma as the separator
df = df.select(split(df.value, ",").alias("data"))

# Transform the DataFrame to have "user_id" and "product_id" columns
df = df.select(
    df.data.getItem(0).alias("user_id"),
    df.data.getItem(1).alias("product_id"),
)

# Cast "user_id" and "product_id" to the appropriate data types
df = df.withColumn("user_id", df["user_id"].cast("string"))
df = df.withColumn("product_id", df["product_id"].cast("string"))

# Add a processing-time timestamp column to the DataFrame
df = df.withColumn("timestamp", current_timestamp())

# Apply the window function - a tumbling window of 15 minutes
df_with_window = df.groupBy(
    window(df.timestamp, "15 minutes"),
    df.product_id
).count()

# Write the stream out to the console
query = df_with_window \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

In this setup, we’re reading data from port 9999. After setting up the stream, we split the incoming string data and apply our schema to it.

Once that is in place, we can apply window functions. Here we apply a window function that gives us the count of product views for every 15-minute window.
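The example uses tumbling windows, where each event belongs to exactly one window. Spark’s `window()` also accepts an optional slide duration; when the slide is shorter than the window length, windows overlap and each event is counted in several of them. Here is a plain-Python sketch (illustration only, not the Spark API) of which windows contain a given event:

```python
from datetime import datetime, timedelta

def containing_windows(ts, duration, slide, origin=datetime(1970, 1, 1)):
    """Return the start of every [start, start + duration) window containing ts."""
    step = slide.total_seconds()
    elapsed = (ts - origin).total_seconds()
    # Most recent slide boundary at or before ts
    start = origin + timedelta(seconds=(elapsed // step) * step)
    starts = []
    # Walk backwards one slide at a time while the window still covers ts
    while start > ts - duration:
        starts.append(start)
        start -= slide
    return sorted(starts)

ts = datetime(2023, 7, 20, 10, 7)
wins = containing_windows(ts, timedelta(minutes=15), timedelta(minutes=5))
print([w.strftime("%H:%M") for w in wins])  # ['09:55', '10:00', '10:05']
```

In Spark, the equivalent would simply be `window(df.timestamp, "15 minutes", "5 minutes")` inside the `groupBy`.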

The outputMode("complete") setting means that the entire updated Result Table is written to the console sink after every trigger. This is suitable for aggregations whose result is expected to stay small, such as counts.

Finally, query.awaitTermination() is called to wait for the streaming computation to stop.

Summary & Key Takeaways

In this article, we explored window functions in Apache Spark Structured Streaming for real-time data analytics. Window functions give us the ability to run operations on defined time ranges as data arrives. We also walked through a hands-on example that demonstrates setting up a structured streaming environment, manipulating data, and applying window functions to track product views every 15 minutes. Here are some key takeaways:

  1. Window functions in streaming allow users to define a “window”, a specific time range in the data stream, over which to perform operations such as aggregations.
  2. In a real-world scenario, data could potentially be streaming from various sources, like Kafka.
  3. After setting up the stream and schema, window functions can be applied. The example demonstrates calculating the most viewed product every 15 minutes.

