The Art of Forecasting in the Retail Industry

Multivariate Time Series Forecasting

Part I : Exploratory Data Analysis & Time Series Analysis

Cem ÖZÇELİK

--

In this article, I will conduct exploratory data analysis and time series analysis using a dataset consisting of product sales in different categories from a store in the US between 2015 and 2018. Following these steps, I will preprocess the data and perform a multivariate forecasting study.

Pixabay, Pexels

Hello, today I am thrilled to finally sit in front of my computer and delve into an exciting data science project that I have been putting off for quite some time. In this article, I will share the first part of my two-phase project. Through exploratory data analysis and time series analysis, I will embark on a captivating journey into the realm of my dataset. Here, I will uncover the secrets behind sales trends in different categories of products. By visualizing the data, we will witness the fluctuations in sales figures and then delve deeply into the intricate dynamics of these patterns over time.

I utilized the Plotly library as a visualization tool to gain insights from my dataset. Plotly proved to be immensely helpful in creating interactive graphs for visualizing the data. I highly recommend using this library for data visualization purposes. Given that my dataset is a time series dataset, this article includes an exploratory data analysis section covering various aspects of time series analysis, such as stationarity, seasonality, and cyclicality.

Now I can start by importing the libraries that I will be using in the study.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime as dt


import plotly.express as px
import plotly.graph_objects as go


import itertools

import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.tsa.api as smt

import warnings
warnings.filterwarnings('ignore')
#pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.f' % x)
pd.set_option('display.expand_frame_repr', False)

After importing the necessary libraries, I proceed to import the dataset that I will be using for this project. The work presented in this article is based on my notebook published on Kaggle, where I shared my findings and analysis. By leveraging the power of data science techniques, I aim to offer valuable insights and illustrate the step-by-step process involved in this project. The dataset serves as a valuable resource to demonstrate the practical application of various data analysis and forecasting techniques.

dataset = pd.read_csv('/kaggle/input/sales-forecasting/train.csv')
df = dataset.copy()

Well, once I’ve imported the dataset, I begin my work. In the first step, I define a function called ‘check_df’. This function is highly valuable for comprehending the overall structure of the dataset: it reports the dataset’s dimensions, feature names, and data types, shows the first 5 rows, and counts missing values. Furthermore, I aim to uncover valuable patterns and trends within the dataset by analyzing the statistical properties of the numerical variables.

def check_df(dataset, head=5):
    print('#'*30 + 'Shape of Dataset' + '#'*30, end='\n'*2)
    print(dataset.shape, end='\n'*2)
    print('#'*30 + 'General informations about to Dataset' + '#'*30, end='\n'*2)
    print(dataset.info(), end='\n'*2)
    print('#'*30 + 'First 5 Lines Of Dataset' + '#'*30, end='\n'*2)
    print(dataset.head(head), end='\n'*2)
    print('#'*30 + 'NaN values of Features' + '#'*30, end='\n'*2)
    print(dataset.isnull().sum(), end='\n'*2)
    print('#'*30 + 'Descriptive Statistics of Numerical Features' + '#'*30, end='\n'*2)
    print(dataset.describe().T, end='\n'*2)
    print('#'*30 + 'Quantiles of Numerical Features' + '#'*30, end='\n'*2)
    print(dataset.describe([0, 0.10, 0.25, 0.50, 0.75, 0.99]).T, end='\n'*2)

check_df(df)

When I run the function called check_df that I created, the general view of the dataset is as follows:

##############################Shape of Dataset##############################

(9800, 18)

##############################General informations about to Dataset##############################

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9800 non-null int64
1 Order ID 9800 non-null object
2 Order Date 9800 non-null object
3 Ship Date 9800 non-null object
4 Ship Mode 9800 non-null object
5 Customer ID 9800 non-null object
6 Customer Name 9800 non-null object
7 Segment 9800 non-null object
8 Country 9800 non-null object
9 City 9800 non-null object
10 State 9800 non-null object
11 Postal Code 9789 non-null float64
12 Region 9800 non-null object
13 Product ID 9800 non-null object
14 Category 9800 non-null object
15 Sub-Category 9800 non-null object
16 Product Name 9800 non-null object
17 Sales 9800 non-null float64
dtypes: float64(2), int64(1), object(15)
memory usage: 1.3+ MB
None

##############################First 5 Lines Of Dataset##############################

Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City State Postal Code Region Product ID Category Sub-Category Product Name Sales
0 1 CA-2017-152156 08/11/2017 11/11/2017 Second Class CG-12520 Claire Gute Consumer United States Henderson Kentucky 42420 South FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 262
1 2 CA-2017-152156 08/11/2017 11/11/2017 Second Class CG-12520 Claire Gute Consumer United States Henderson Kentucky 42420 South FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 732
2 3 CA-2017-138688 12/06/2017 16/06/2017 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles California 90036 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 15
3 4 US-2016-108966 11/10/2016 18/10/2016 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale Florida 33311 South FUR-TA-10000577 Furniture Tables Bretford CR4500 Series Slim Rectangular Table 958
4 5 US-2016-108966 11/10/2016 18/10/2016 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale Florida 33311 South OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22

##############################NaN values of Features##############################

Row ID 0
Order ID 0
Order Date 0
Ship Date 0
Ship Mode 0
Customer ID 0
Customer Name 0
Segment 0
Country 0
City 0
State 0
Postal Code 11
Region 0
Product ID 0
Category 0
Sub-Category 0
Product Name 0
Sales 0
dtype: int64

##############################Descriptive Statistics of Numerical Features##############################

count mean std min 25% 50% 75% max
Row ID 9800 4900 2829 1 2451 4900 7350 9800
Postal Code 9789 55273 32041 1040 23223 58103 90008 99301
Sales 9800 231 627 0 17 54 211 22638

##############################Quantiles of Numerical Features##############################

count mean std min 0% 10% 25% 50% 75% 99% max
Row ID 9800 4900 2829 1 1 981 2451 4900 7350 9702 9800
Postal Code 9789 55273 32041 1040 1040 10024 23223 58103 90008 98115 99301
Sales 9800 231 627 0 0 8 17 54 211 2480 22638

When we observe the general structure of the dataset, we see that we are working with 9,800 observation rows and 18 variables. We also detect missing values in the ‘Postal Code’ variable; by examining them, we will develop a strategy for filling these values. In addition, a few variables carry numeric values but are categorical by nature: Postal Code and Row ID, for example, are stored as numbers even though they identify categories rather than measure quantities. We will convert these variables to categorical types.

df.loc[(df["Postal Code"].isnull()), :][["City", "State"]].value_counts()

## Output ##

City State
Burlington Vermont 11
dtype: int64

All 11 missing Postal Code values come from the same State-City pair: Burlington, Vermont. With a simple Google search, I found the postal code for this area and used it to fill the missing observations. To continue our work, we also need to change the types of some variables. The ‘Ship Date’ and ‘Order Date’ columns are stored with the ‘object’ data type, but since they contain date information, I will convert them to ‘datetime’ format.

df.drop(columns = ['Row ID'], inplace = True) 

df.loc[(df["Postal Code"].isnull()), "Postal Code"] = 5401

df["Postal Code"] = df["Postal Code"].astype("str")

def convert_datetime(dataframe):
    date_cols = [col for col in dataframe.columns if 'Date' in col]

    for col in date_cols:
        # dates are stored as day/month/year strings (e.g. 08/11/2017), so parse them day-first
        dataframe[col] = pd.to_datetime(dataframe[col], format='%d/%m/%Y', errors='coerce')
    return dataframe.info()

convert_datetime(df)



## Output ## :

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 9800 non-null object
1 Order Date 9800 non-null datetime64[ns]
2 Ship Date 9800 non-null datetime64[ns]
3 Ship Mode 9800 non-null object
4 Customer ID 9800 non-null object
5 Customer Name 9800 non-null object
6 Segment 9800 non-null object
7 Country 9800 non-null object
8 City 9800 non-null object
9 State 9800 non-null object
10 Postal Code 9800 non-null object
11 Region 9800 non-null object
12 Product ID 9800 non-null object
13 Category 9800 non-null object
14 Sub-Category 9800 non-null object
15 Product Name 9800 non-null object
16 Sales 9800 non-null float64
dtypes: datetime64[ns](2), float64(1), object(14)
memory usage: 1.3+ MB

We have fixed the missing observations and corrected the wrong data types. After converting the date variables from the object data type to the datetime data type, my next step is to categorize the variables in the dataset into different groups: categorical, numerical, high-cardinality, and date columns. Following this categorization, I will employ various visualizations to examine the class frequencies of the categorical variables.

def grab_col_name(dataframe, cat_th=5, car_th=49):
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].dtypes != "O" and dataframe[col].nunique() < cat_th]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].dtypes == "O" and dataframe[col].nunique() > car_th]

    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes in ["int64", "float64"] and "ID" not in col.upper()]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    date_cols = [col for col in dataframe.columns if 'Date' in col]

    for col in date_cols:
        dataframe[col] = pd.to_datetime(dataframe[col], errors='coerce')

    return cat_cols, num_cols, cat_but_car, date_cols


grab_col_name(df)

## Output :

(['Ship Mode', 'Segment', 'State', 'Region', 'Category', 'Sub-Category'],
['Sales'],
['Row ID',
'Order ID',
'Customer Name',
'City',
'Postal Code',
'Product ID',
'Product Name'],
['Order Date', 'Ship Date'])

##

cat_cols, num_cols, cat_but_car, date_cols = grab_col_name(df)

I am currently categorizing the variables in the dataset based on their types. Next, I will define a function to visualize the frequency of category classes in my categorical data.

# Function for Categorical Variables:

def cat_summary(dataframe, plot=False, categorical_columns=cat_cols, threshold=10):
    for col_name in categorical_columns:
        print("#" * 30 + " Unique Values Of " + col_name + " column " + "#" * 30)
        print(col_name + ": " + str(dataframe[col_name].nunique()))

        print("#" * 30 + " Frequencies of Class Values " + "#" * 30)
        value_counts = dataframe[col_name].value_counts()
        if dataframe[col_name].nunique() > threshold:
            top_n_values = value_counts.head(threshold)
            #other_count = value_counts[threshold:].sum()
            #top_n_values['Others'] = other_count
            value_counts = top_n_values

        print(pd.DataFrame({col_name: value_counts,
                            "Ratio": value_counts / len(dataframe[col_name])}))

        if plot:
            rgb_values = sns.color_palette("Set2", 6)
            sns.set_theme(style="darkgrid")

            if dataframe[col_name].nunique() > threshold:
                plt.figure(figsize=(13, 6))

            ax = sns.barplot(x=value_counts.index, y=value_counts.values, palette=rgb_values)
            ax.set_xticklabels(ax.get_xticklabels(), rotation=60)
            for p in ax.patches:
                ax.annotate(f'\n{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='top', color='white', size=10)
            plt.show()

# Function for Numerical Variables :

def num_summary(dataframe, plot=False):
    quantiles = [0, 0.25, 0.50, 0.75, 1.0]

    for col_name in num_cols:
        print("#"*30 + " Distribution of " + col_name + " column" + "#"*30)
        print(dataframe[col_name].describe(quantiles).T)

        if plot:
            sns.histplot(data=dataframe, x=col_name)
            plt.xlabel(col_name)
            plt.title("The Distribution of " + col_name)
            plt.grid(True, alpha=0.5)
            plt.show(block=True)

# Function For Percent Format

def autopct_format(values):
    def my_format(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return ' ${v:d}'.format(v=val)
    return my_format


categorical_columns = ['Ship Mode', 'Segment', 'State', 'Region',
                       'Category', 'Sub-Category', 'Customer Name', 'City']

cat_summary(df, True, categorical_columns=categorical_columns, threshold=10)
############################## Unique Values Of Ship Mode column ##############################
Ship Mode: 4
############################## Frequencies of Class Values ##############################
Ship Mode Ratio
Standard Class 5859 0.60
Second Class 1902 0.19
First Class 1501 0.15
Same Day 538 0.05

When we examine the Ship Mode variable, we see that Standard Class dominates the dataset: orders are generally shipped with standard delivery.

############################## Unique Values Of Segment column ##############################
Segment: 3
############################## Frequencies of Class Values ##############################
Segment Ratio
Consumer 5101 0.52
Corporate 2953 0.30
Home Office 1746 0.18

In the Segment variable, the Consumer customer group is dominant, followed by Corporate in second place.

############################## Unique Values Of State column ##############################
State: 49
############################## Frequencies of Class Values ##############################
State Ratio
California 1946 0.20
New York 1097 0.11
Texas 973 0.10
Pennsylvania 582 0.06
Washington 504 0.05
Illinois 483 0.05
Ohio 454 0.05
Florida 373 0.04
Michigan 253 0.03
North Carolina 247 0.03

When we look at the frequencies of the states in the data set, we see that California is dominant.

############################## Unique Values Of Region column ##############################
Region: 4
############################## Frequencies of Class Values ##############################
Region Ratio
West 3140 0.32
East 2785 0.28
Central 2277 0.23
South 1598 0.16

When we look at the Region variable, the distribution is fairly homogeneous and the class frequencies are close to each other, with the West region slightly ahead.

############################## Unique Values Of Category column ##############################
Category: 3
############################## Frequencies of Class Values ##############################
Category Ratio
Office Supplies 5909 0.60
Furniture 2078 0.21
Technology 1813 0.18

There are 3 classes in the Category variable, which stores the categories of the products sold. Office Supplies is the dominant class, with Furniture in second place.

############################## Unique Values Of Sub-Category column ##############################
Sub-Category: 17
############################## Frequencies of Class Values ##############################
Sub-Category Ratio
Binders 1492 0.15
Paper 1338 0.14
Furnishings 931 0.10
Phones 876 0.09
Storage 832 0.08
Art 785 0.08
Accessories 756 0.08
Chairs 607 0.06
Appliances 459 0.05
Labels 357 0.04

Looking at the class frequencies of the Sub-Category variable, we see that Binders and Paper are the most frequently sold sub-categories.

############################## Unique Values Of City column ##############################
City: 529
############################## Frequencies of Class Values ##############################
City Ratio
New York City 891 0.09
Los Angeles 728 0.07
Philadelphia 532 0.05
San Francisco 500 0.05
Seattle 426 0.04
Houston 374 0.04
Chicago 308 0.03
Columbus 221 0.02
San Diego 170 0.02
Springfield 161 0.02

When we look at the City variable, we see that the class named New York City is dominant.

I utilize both histograms and box plots to visualize the distribution of sales values. When we examine the statistical characteristics of the sales values using the ‘check_df’ function, we find that the series has a mean of 231, a standard deviation of 627, and a median value of 54. Additionally, the maximum value observed is 22,638. Based on this data, it’s evident that our dataset exhibits a skewed distribution. Given that the median of the values in the dataset is smaller than the mean, we expect a right-skewed histogram. In such cases, it’s more appropriate to use the median as a measure of central tendency to represent our dataset.
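Before plotting, a quick numeric check makes the same point. The snippet below is my own addition rather than part of the original notebook; it assumes the df frame prepared above.

mean_sales = df['Sales'].mean()      # around 231, as reported by check_df
median_sales = df['Sales'].median()  # around 54
skewness = df['Sales'].skew()        # a clearly positive value indicates a right-skewed distribution
print(mean_sales, median_sales, skewness)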

import plotly.graph_objects as go
import plotly.subplots as sp

fig_histogram = go.Figure()

fig_histogram.add_trace(go.Histogram(x=df['Sales'], name='Sales', histnorm='probability', opacity=0.7))

fig_histogram.update_layout(
title='Sales Distribution (Histogram)',
xaxis_title='Sales',
yaxis_title='Probability',
barmode='overlay',
bargap=0.1,
legend=dict(x=0.7, y=0.95),
autosize=False,
width=500,
height=400
)


fig_boxplot = go.Figure()

fig_boxplot.add_trace(go.Box(y=df['Sales'], name='Sales', boxpoints='all', jitter=0.3, pointpos=-1.8))

fig_boxplot.update_layout(
title='Sales Distribution (Box Plot)',
yaxis_title='Sales',
autosize=False,
width=500,
height=400
)


mean = df['Sales'].mean()
median = df['Sales'].median()
max_value = df['Sales'].max()

fig_histogram.add_shape(
type='line',
x0=mean,
x1=mean,
y0=0,
y1=1,
line=dict(color='red', width=3, dash='dash'),
name='Mean'
)

fig_histogram.add_shape(
type='line',
x0=median,
x1=median,
y0=0,
y1=1,
line=dict(color='green', width=3, dash='dash'),
name='Median'
)

fig_histogram.add_shape(
type='line',
x0=max_value,
x1=max_value,
y0=0,
y1=1,
line=dict(color='blue', width=3, dash='dash'),
name='Max'
)


fig_dashboard = sp.make_subplots(rows=1, cols=2, subplot_titles=("Histogram", "Box Plot"))

fig_dashboard.add_trace(fig_histogram['data'][0], row=1, col=1)
fig_dashboard.add_trace(fig_boxplot['data'][0], row=1, col=2)

fig_dashboard.update_layout(height=600, width=1200, title_text="Sales Distribution Dashboard")


fig_dashboard.show()

As we predicted above, our dataset exhibits a right-skewed distribution. Another remarkable point is the maximum sales value: it is extreme compared to the overall distribution and might even be considered an outlier. However, it would not be correct to dismiss it simply as contrary to the general distribution. Consider an example scenario with the Category and Sub-Category variables: some sales within these categories may be high-priced products purchased in large quantities, which would produce exactly the values we observe. At this point, it is important to understand the relationship between the numerical and categorical variables, and we will carry out the next steps with this focus. Now I will explore the categorical variables that interact with the target variable, Sales.

from matplotlib import pyplot as plt

def desc_stats(dataframe):
    desc = dataframe.describe(percentiles=[0, 0.25, 0.50, 0.75, 0.85, 0.95, 0.99, 1]).T
    desc_df = pd.DataFrame(index=dataframe.columns,
                           columns=desc.columns,
                           data=desc)

    f, ax = plt.subplots(figsize=(20, desc_df.shape[0]*0.78))
    sns.heatmap(desc_df,
                annot=True,
                cmap="Wistia",
                fmt='.2f',
                ax=ax,
                linecolor='white',
                linewidths=1.3,
                cbar=False,
                annot_kws={"size": 12})
    plt.xticks(size=18)
    plt.yticks(size=14, rotation=0)
    plt.title("Descriptive Statistics", size=14)
    plt.show()


desc_stats(df[[col for col in df.columns if df[col].dtype in ['int64', 'float64']]])

As we observed, the maximum value of the Sales variable is much larger than the other quantiles. Sales values are generally low but rise sharply above the 85th percentile: values up to the 75th percentile stay in a moderate range, while roughly a three-fold increase occurs between the 85th and 95th percentiles. This suggests that very high sales do occur, although rarely, likely driven by customers who shop in large volumes. By examining this relationship in more detail, we will try to understand the factors that affect the sales value.
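To read the same jump directly from the numbers rather than from the heatmap, a short quantile check can be run. This is a sketch I am adding for illustration, assuming the same df as above.

# Quantiles of the Sales column; note the jump above the 85th percentile
print(df['Sales'].quantile([0.50, 0.75, 0.85, 0.95, 0.99]))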

Now I will create a bar plot to visualize the 10 highest revenue-generating customers, coloring each customer’s bar by the magnitude of its sales value. This highlights the customers with the highest sales and makes it easy to identify the most important customers in terms of revenue, helping us pinpoint the business’s main sources of income and its most valuable customers.

customer_sales = df.groupby('Customer Name').agg({'Sales':"sum"}).reset_index().sort_values(by = "Sales", ascending = False)[:10]


fig = px.bar(customer_sales, x='Customer Name', y='Sales', hover_data=['Customer Name', 'Sales'],
labels={'Customer Name': 'Customer Name', 'Sales': 'Sales'},
color = 'Sales')


fig.update_traces(hovertemplate='Customer Name: %{x}<br>Sales: $%{y}<extra></extra>')

fig.update_layout(title='Top 10 Customer',width=1200, height=600)

fig.update_traces(text=customer_sales['Sales'],
texttemplate='$%{text:.2f}',
textfont=dict(color='gray', size=14, family='Arial'),
textangle=0,textposition='outside')


fig.show()

I create a bar plot to examine the Sales variable by State. There are 49 different states in the dataset, but I will show only the 10 states with the highest sales.

state_sales = df.groupby('State').agg({'Sales':"sum", "Order ID": lambda orderid: orderid.nunique()}).reset_index().sort_values(by = "Sales", ascending = False)[:10]



fig = px.bar(state_sales, x='State', y='Sales', hover_data=['State', 'Sales'],
labels={'State': 'State', 'Sales': 'Sales'},
color = 'Sales')


fig.update_traces(hovertemplate='State: %{x}<br>Sales: $%{y}<extra></extra>')

fig.update_layout(title='Total Sales by State',width=1200, height=600)

fig.update_traces(text=state_sales['Sales'],
texttemplate='$%{text:.2f}',
textfont=dict(color='gray', size=14, family='Arial'),
textangle=0,textposition='outside')


fig.show()
city_sales = df.groupby('City').agg({'Sales':"sum", "Order ID": lambda orderid: orderid.nunique()}).reset_index().sort_values(by = "Sales", ascending = False)[:10]



fig = px.bar(city_sales, x='City', y='Sales', hover_data=['City', 'Sales'],
labels={'City': 'City', 'Sales': 'Sales'},
color = 'Sales')


fig.update_traces(hovertemplate='City: %{x}<br>Sales: $%{y}<extra></extra>')

fig.update_layout(title='Total Sales by City',width=1200, height=600)

fig.update_traces(text=city_sales['Sales'],
texttemplate='$%{text:.2f}',
textfont=dict(color='gray', size=14, family='Arial'),
textangle=0,textposition='outside')


fig.show()

When I examine the visualization, I see that the highest city-level sales occurred in New York City. However, the California cities of Los Angeles, San Francisco and San Diego also have significant sales. A notable point is that California ranks above New York at the state level: New York City leads when compared city by city, but California tops the list when sales are aggregated by state, because several of its cities each contribute a significant share of overall sales.

Alongside the sales values, I would like to analyze the number of orders received by the store for products purchased from these cities. I aim to determine whether there is a correlation between the sales value and the number of orders.
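As a rough numeric check of that relationship, the city-level totals can be correlated directly. This snippet is my own sketch rather than part of the original notebook and assumes the df frame used above.

# Correlation between city-level revenue and distinct order counts
city_stats = df.groupby('City').agg(total_sales=('Sales', 'sum'),
                                    order_count=('Order ID', 'nunique'))
print(city_stats['total_sales'].corr(city_stats['order_count']))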

We examined sales values by State and City. Finally, I create a bar plot of the number of orders received, this time at the City level.

city_sales = df.groupby('City').agg({"Order ID": lambda orderid: orderid.nunique()}).rename(columns = {'Order ID' : 'Order Count'}).reset_index().sort_values(by = "Order Count", ascending = False)[:10]



fig = px.bar(city_sales, x='City', y='Order Count', hover_data=['City', 'Order Count'],
labels={'City': 'City', 'Order Count': 'Order Count'},
color = 'Order Count')


fig.update_traces(hovertemplate='City: %{x}<br>Order Count: %{y}<extra></extra>')

fig.update_layout(title='Order Count by City', width=1200, height=600)

fig.update_traces(text=city_sales['Order Count'],
                  texttemplate='%{text}',
                  textfont=dict(color='gray', size=14, family='Arial'),
                  textangle=0, textposition='outside')


fig.show()

I will create a donut chart to visualize the total amounts of invoices representing our customer segments. There are three main customer segments in our dataset: Consumer customers (Consumer), Corporate customers (Corporate) and Home Office segment. This pie chart will be used to show the ratio of invoice amount to total revenue for each customer segment. Thus, we will be able to see more clearly which customer segment generates the bulk of the operating income. By highlighting different segments with colors, we will be able to make a visual comparison of customer segments through each slice on the chart. This pie chart will help us understand the revenue contributions of customer segments and better shape our business strategies.

segment_sales = df.groupby('Segment').agg({"Sales":"sum"}).sort_values(by = "Sales", ascending = False)
segment_sales = segment_sales[['Sales']]
segment_sales.reset_index(inplace = True)
total_revenues = segment_sales["Sales"].sum()
total_revenues = str(int(total_revenues))
total_revenues = '$' + total_revenues


plt.rcParams["figure.figsize"] = (10,6)
plt.rcParams['font.size'] = 11.0
plt.rcParams['font.weight'] = 6
colors = ['#BC243C','#FE840E','#C62168']
explode = (0.05,0.05,0.05)
fig1, ax1 = plt.subplots()
ax1.pie(segment_sales['Sales'], colors = colors, labels=segment_sales['Segment'], autopct= autopct_format(segment_sales['Sales']),startangle=90,explode=explode)
centre_circle = plt.Circle((0,0),0.85,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
ax1.axis('equal')
label = ax1.annotate('Total Revenue \n'+str(total_revenues),color = 'red', xy=(0, 0), fontsize=12, ha="center")
plt.tight_layout()
plt.show()

In our dataset, there are 3 segments: Corporate, Home Office and Consumer. These customer segments indicate the nature of the customers who shop in the store. (Another possibility may be specifying the type of invoices the store has issued.)

When we look at the distribution in the dataset, we see that individual (Consumer) customers generate the largest share of revenue, followed by the Corporate segment with $688,494.

Well, I wonder in which categories and in which sub-categories these sales took place.

topcat_subcat = df.groupby(['Category','Sub-Category']).agg({'Sales':'sum'}).sort_values(by = "Sales", ascending=False)[:10]
topcat_subcat = topcat_subcat[["Sales"]].astype(int)
topcat_subcat = topcat_subcat.sort_values("Category")
topcat_subcat.reset_index(inplace=True)
topcat_subcat_1 = topcat_subcat.groupby(['Category']).agg({'Sales':'sum'})
topcat_subcat_1.reset_index(inplace=True)


plt.rcParams["figure.figsize"] = (15,10)
fig, ax = plt.subplots()
ax.axis('equal')
width = 0.1
outer_colors = ['#FE840E','#009B77','#BC243C']
inner_colors = ['Orangered','tomato','coral',"darkturquoise","mediumturquoise","paleturquoise","lightpink","pink","hotpink","deeppink"]
pie = ax.pie(topcat_subcat_1['Sales'], radius=1, labels=topcat_subcat_1['Category'],colors=outer_colors,wedgeprops=dict(edgecolor='w'))
pie2 = ax.pie(topcat_subcat['Sales'], radius=1-width, labels=topcat_subcat['Sub-Category'],autopct= autopct_format(topcat_subcat['Sales']),labeldistance=0.7,colors=inner_colors,wedgeprops=dict(edgecolor='w'), pctdistance=0.53,rotatelabels =True)
fraction_text_list = pie2[2]
for text in fraction_text_list:
    text.set_rotation(315)
centre_circle = plt.Circle((0,0),0.6,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
ax.axis('equal')
plt.tight_layout()
plt.show()

I am creating a donut chart to look at the distribution of Sales values according to the Ship mode variable in the data set.

Top_shipping = df.groupby(["Ship Mode"]).agg({'Sales':'sum'}).sort_values(by = "Sales", ascending=False) 
Top_shipping = Top_shipping[["Sales"]]
Top_shipping.reset_index(inplace=True)
total_revenue_ship = Top_shipping["Sales"].sum()
total_revenue_ship = str(int(total_revenue_ship))
total_revenue_ship = '$' + total_revenue_ship



plt.rcParams["figure.figsize"] = (13,5)
plt.rcParams['font.size'] = 12.0
plt.rcParams['font.weight'] = 6
colors = ['#BC243C','#FE840E','#C62168',"limegreen"]
fig1, ax1 = plt.subplots()
ax1.pie(Top_shipping['Sales'], colors = colors, labels=Top_shipping['Ship Mode'], autopct= autopct_format(Top_shipping['Sales']), startangle=90)
centre_circle = plt.Circle((0,0),0.82,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

ax1.axis('equal')
label = ax1.annotate('Total Revenue \n'+str(total_revenue_ship),color = 'red', xy=(0, 0), fontsize=12, ha="center")
plt.tight_layout()
plt.show()

When we examine purchases by shipment type, most revenue comes from orders shipped Standard Class. The highest Technology spending, on the other hand, occurs in orders shipped First Class and Same Day.
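The donut chart alone does not break revenue down by product category, so a quick cross-tabulation can be used to sanity-check the observation about Technology spending. This is a sketch I am adding, assuming the same df as above.

# Revenue per Category within each Ship Mode
print(df.groupby(['Ship Mode', 'Category'])['Sales'].sum().unstack())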

We examined the distribution of the Sales variable across the various categorical variables we have. Now I want to examine the Product Name variable, which contains many unique values. I will list the products with the highest order counts and the highest revenue, and then visualize the top-revenue products with a donut chart.

df_pop_items = df[['Product Name', 'Sales', 'Order ID']].groupby(['Product Name']).agg({'Sales' : 'sum','Order ID' : 'count'}).rename(columns = {'Order ID' : 'Order Count', 'Sales' : 'Total Sales'}).reset_index().sort_values(by  = 'Order Count', ascending = False)[:10]
df_pop_items2 = df[['Product Name', 'Sales', 'Order ID']].groupby(['Product Name']).agg({'Sales' : 'sum','Order ID' : 'count'}).rename(columns = {'Order ID' : 'Order Count', 'Sales' : 'Total Sales'}).reset_index().sort_values(by = 'Total Sales', ascending = False)[:10]

print('Best Selling Products (By Order Amount)' , df_pop_items, sep = '\n', end = '\n\n\n')
print('Best Selling Products (By Sales Amount)' , df_pop_items2, sep = '\n', end = '\n\n')



# Output :

Best Selling Products (By Order Amount)
Product Name Total Sales Order Count
1492 Staple envelope 1675.65 47
1498 Staples 755.47 46
537 Easy-staple paper 2414.16 44
259 Avery Non-Stick Binders 217.32 20
1495 Staple remover 263.09 18
1499 Staples in misc. colors 459.66 18
941 KI Adjustable-Height Table 4466.66 17
1510 Storex Dura Pro Binders 278.59 17
1496 Staple-based wall hangings 422.29 16
992 Logitech 910-002974 M325 Wireless Mouse for We... 1409.53 15


Best Selling Products (By Sales Amount)
Product Name Total Sales Order Count
404 Canon imageCLASS 2200 Advanced Copier 61599.82 5
649 Fellowes PB500 Electric Punch Plastic Comb Bin... 27453.38 10
444 Cisco TelePresence System EX90 Videoconferenci... 22638.48 1
785 HON 5400 Series Task Chairs for Big and Tall 21870.58 8
685 GBC DocuBind TL300 Electric Binding System 19823.48 11
687 GBC Ibimaster 500 Manual ProClick Binding System 19024.50 9
804 Hewlett Packard LaserJet 3310 Copier 18839.69 8
786 HP Designjet T520 Inkjet Large Format Printer ... 18374.90 3
682 GBC DocuBind P400 Electric Binding System 17965.07 6
812 High Speed Automatic Electric Letter Opener 17030.31 3

In the lists above, we see the products with the highest revenue (Total Sales) and the highest number of purchases (Order Count). By order count, ‘Staple envelope’ tops the list.

When we sort by revenue instead, the effect of individual high-priced products becomes much clearer. For example, the product named ‘Cisco TelePresence System EX90 Videoconferenci…’ generated $22,638 from a single purchase; it is a very expensive item.

Top_products = df.groupby(["Product Name"]).agg({'Sales':'sum'}).sort_values("Sales",ascending=False).head(8) 
Top_products = Top_products[["Sales"]].round(2)
Top_products.reset_index(inplace=True)
total_revenue_products = '$' +str(int(Top_products["Sales"].sum()))

plt.rcParams["figure.figsize"] = (13,7)
plt.rcParams['font.size'] = 12.0
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99','#55B4B0','#E15D44','#009B77','#B565A7'] # colors are defined for the pie chart
explode = (0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05)
fig1, ax1 = plt.subplots()
ax1.pie(Top_products['Sales'], colors = colors, labels=Top_products['Product Name'], autopct= autopct_format(Top_products['Sales']), startangle=90,explode=explode)
centre_circle = plt.Circle((0,0),0.80,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

ax1.axis('equal')
label = ax1.annotate('Total Revenue \n of these products \n'+str(total_revenue_products),color = 'red', xy=(0, 0), fontsize=12, ha="center")
plt.tight_layout()
plt.show()

We have completed our exploratory data analysis of the categorical variables in the dataset. If you remember, we have two date variables, Ship Date and Order Date. At this stage, I want to analyze how our Sales variable changes over time according to the Order Date. Next, I will once again use line charts to visualize how the Sales variable changes over time across different categorical variables.

df_time = df.copy()

df_time['Order Date'] = pd.to_datetime(df['Order Date'], format='%d/%m/%Y')
df_time['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%d/%m/%Y')

df_time['Month-Year'] = df_time['Order Date'].dt.strftime('%b-%Y')

monthly_sales = df_time.groupby('Month-Year').agg({'Sales': 'sum'}).reset_index()
monthly_sales['Order'] = pd.to_datetime(monthly_sales['Month-Year'], format='%b-%Y')
monthly_sales = monthly_sales.sort_values('Order')


fig = px.line(monthly_sales, x='Month-Year', y='Sales', title='Monthly Sales',
labels={'Month-Year': 'Month-Year', 'Sales': 'Sales'})

fig.update_traces(hovertemplate='Date: %{x}<br>Sales: %{y:,.2f}<extra></extra>')


fig.update_xaxes(tickangle=90, tickmode='array', tickvals=monthly_sales['Month-Year'],
ticktext=monthly_sales['Month-Year'])


fig.update_layout(width=900, height=600)


fig.show()

When we analyze the change in sales values over time, it initially appears that there is a trend in the series. However, it would be premature to draw conclusions based solely on appearance. To provide a more accurate interpretation, we can refer to the Augmented Dickey-Fuller (ADF) test. I will proceed with our time series analysis.

def time_breakdown(dataframe=dataset, date_col='Order Date', categorical_columns=cat_cols, threshold=15):
    dataframe[f'Month_Year_{date_col}'] = dataframe[date_col].dt.strftime('%b-%Y')
    for cat_col in categorical_columns:
        if dataframe[cat_col].nunique() <= threshold:
            monthly_sales = dataframe.groupby([f'Month_Year_{date_col}', cat_col]).agg({'Sales': 'sum'}).reset_index()
            monthly_sales['Order'] = pd.to_datetime(monthly_sales[f'Month_Year_{date_col}'], format='%b-%Y')
            monthly_sales = monthly_sales.sort_values('Order')

            fig = px.line(monthly_sales, x=f'Month_Year_{date_col}', y='Sales', title='Monthly Sales',
                          labels={f'Month_Year_{date_col}': f'Month_Year_{date_col}', 'Sales': 'Sales'}, color=cat_col)

            fig.update_traces(hovertemplate='Date: %{x}<br>Sales: %{y:,.2f}<extra></extra>')

            fig.update_xaxes(tickangle=90, tickmode='array', tickvals=monthly_sales[f'Month_Year_{date_col}'],
                             ticktext=monthly_sales[f'Month_Year_{date_col}'])

            fig.update_layout(width=1000, height=600)

            fig.show()


time_breakdown(df_time, 'Order Date', categorical_columns=categorical_columns, threshold=10)

I create a function to analyze the variation in sales values over time based on the categorical variables in our dataset. For each of the specified categorical columns, the function plots monthly sales broken down by that column’s classes.

Now, I am generating new time variables by utilizing the Order Date column we possess. Subsequently, I will explore the fluctuations in product sales over time.

df_time['WeekDay'] = df_time['Order Date'].dt.day_name()

start_date = df_time['Order Date'].min()
df_time['Week'] = ((((df_time['Order Date'] - start_date).dt.days)/7) +1).astype(int)

We previously looked at how Sales changes with the Order Date, but that chart was aggregated by month rather than by day. Now it’s time to recreate it at the daily level.

df_plot = df_time.groupby('Order Date').agg({'Sales':'sum'}).rename(columns = {'Sales':'Total Sales'}).reset_index().sort_values(by = 'Order Date', ascending = True)

fig = px.line(df_plot, x="Order Date", y="Total Sales", title='Total Sales by Date')
fig.show()

As observed in the figure, there is no apparent pattern in daily sales with respect to the number of orders, although upon closer examination we can suspect some seasonality. When we create a line plot based on the total sales value, we do not discern a clear seasonality pattern, and there is no distinct daily trend in the data. We will conduct tests to ascertain the stationarity and seasonality of the data. My next objective is to investigate whether sales vary across the days of the week; in other words, I plan to group by weekday to determine whether sales peak on specific days.

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df_plot_dayofweek = df_time[['WeekDay', 'Sales', 'Order ID']].groupby(['WeekDay']).agg({'Sales': 'sum', 'Order ID': 'count'}).rename(columns = {'Sales': 'Total Sales', 'Order ID':'Order Count'}).reset_index()

df_plot_dayofweek = df_plot_dayofweek.set_index('WeekDay').loc[day_order].reset_index()

fig_dayofweek = px.bar(df_plot_dayofweek, x = 'WeekDay', y = 'Order Count',hover_data=['WeekDay', 'Order Count'],
labels={'WeekDay': 'WeekDay', 'Order Count': 'Order Count'},
color = 'Order Count')

fig_dayofweek.update_traces(hovertemplate='WeekDay: %{x}<br>Order Count: %{y}<extra></extra>')

fig_dayofweek.update_layout(title='Order Count by Day Of Week', width=1200, height=600)

fig_dayofweek.update_traces(text=df_plot_dayofweek['Order Count'],
                            texttemplate='%{text}',
                            textfont=dict(color='gray', size=14, family='Arial'),
                            textangle=0, textposition='outside')
fig_dayofweek.show()

As the chart shows, order counts peak on Tuesday, gradually decrease until Thursday, and begin to rise again toward the weekend.

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df_plot_dayofweek = df_time[['WeekDay', 'Sales', 'Order ID']].groupby(['WeekDay']).agg({'Sales': 'sum', 'Order ID': 'count'}).rename(columns = {'Sales': 'Total Sales', 'Order ID':'Order Count'}).reset_index()

df_plot_dayofweek = df_plot_dayofweek.set_index('WeekDay').loc[day_order].reset_index()

fig_dayofweek = px.bar(df_plot_dayofweek, x = 'WeekDay', y = 'Total Sales',hover_data=['WeekDay', 'Total Sales'],
labels={'WeekDay': 'WeekDay', 'Total Sales': 'Total Sales'},
color = 'Total Sales')

fig_dayofweek.update_traces(hovertemplate='WeekDay: %{x}<br>Total Sales: $%{y}<extra></extra>')

fig_dayofweek.update_layout(title='Total Sales by Day Of Week',width=1200, height=600)

fig_dayofweek.update_traces(text=df_plot_dayofweek['Total Sales'],
texttemplate='$%{text:.2f}',
textfont=dict(color='gray', size=14, family='Arial'),
textangle=0,textposition='outside')
fig_dayofweek.show()

When we examine the number of orders and the sales values separately by day, we notice that the highest order counts occur on Tuesday and Sunday, followed by Saturday and Monday. In terms of revenue, however, the strongest days are Saturday, Tuesday, Monday, and Wednesday. Interestingly, Sunday, which ranks near the top in order count, falls to a much lower rank in terms of revenue.

Based on this observation, we can conclude that items with higher monetary value are typically ordered on Saturdays. Since Tuesday ranks near the top of both charts, products sold on Tuesdays are not only numerous but also valuable. On Wednesdays, similar to Saturdays, there is a lower volume of orders, but those orders tend to be more valuable.
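One way to quantify “more valuable orders” is the average revenue per order for each weekday. The snippet below is my own sketch rather than part of the original notebook; it assumes df_time (with the WeekDay column) and day_order as defined above.

# Average order value per weekday
aov = df_time.groupby('WeekDay').agg(total_sales=('Sales', 'sum'),
                                     order_count=('Order ID', 'nunique'))
aov['avg_order_value'] = aov['total_sales'] / aov['order_count']
print(aov.loc[day_order, 'avg_order_value'])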

Now I want to narrow the lens of my time series analysis and look at the course of sales within a single year. I will continue by creating a variable named SINGLE_YEAR and, picking 2017 from the series, examining the sales made in that year.

SINGLE_YEAR = 2017

df_sales_single_year = df_time[df_time['Order Date'].dt.year == SINGLE_YEAR]
df_plot_single_year = df_sales_single_year[['Order Date','Sales','Order ID']].groupby(['Order Date']).agg({'Sales': 'sum', 'Order ID':'count'}).rename(columns = {'Sales':'Total Sales', 'Order ID': 'Order Count'}).reset_index()
fig = px.bar(df_plot_single_year, x='Order Date', y='Order Count',facet_col_spacing=0, title=f'Total Sales by Day {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600, width = 1200)
fig.show()
df_plot = df_sales_single_year[['Order Date','Category', 'Sales','Order ID']].groupby(['Order Date','Category']).agg({'Sales':'sum', 'Order ID':'count'}).rename(columns = {'Sales': 'Total Sales', 'Order ID': 'Order Count'}).reset_index()
fig = px.bar(df_plot, x='Order Date', y='Total Sales',color="Category", title=f'Item Sales by Date - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600, width = 1200, showlegend = True)
fig.show()
df_plot_weekly = df_sales_single_year[['Week','Category','Order ID','Sales']].groupby(['Week','Category']).agg({'Sales':'sum', 'Order ID': 'count'}).rename(columns = {'Sales': 'Total Sales', 'Order ID' : 'Order Count'}).reset_index()
fig = px.bar(df_plot_weekly, x='Week', y='Total Sales',color="Category", title=f'Total Sales by Week - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600 , width = 1200, showlegend = True)
fig.show()
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

df_plot_daily = df_time[['WeekDay','Category','Sales']].groupby(['WeekDay','Category']).agg({'Sales':'sum'}).rename(columns = {'Sales':'Total Sales'}).reset_index()
df_plot_daily = df_plot_daily.set_index('WeekDay').loc[day_order].reset_index()
fig = px.bar(df_plot_daily, x='WeekDay', y='Total Sales',color="Category", title=f'Total Sales by Day of Week - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600 , width = 900, showlegend = True)
fig.show()
df_pop_cats = df_time[['Sub-Category', 'Sales', 'Order ID']].groupby(['Sub-Category']).agg({'Sales' : 'sum',"Order ID": lambda orderid: orderid.nunique()}).rename(columns = {'Order ID' : 'Order Amount', 'Sales' : 'Total Sales'}).reset_index().sort_values(by  = 'Total Sales', ascending = False)
df_pop_cats[:10]
df_plot_sc = df_sales_single_year[['Order Date','Sub-Category', 'Sales']].groupby(['Order Date','Sub-Category']).agg({'Sales':'sum'}).rename(columns = {'Sales': 'Total Sales'}).reset_index()
fig = px.bar(df_plot_sc, x='Order Date', y='Total Sales',color="Sub-Category", title=f'Item Sales by Date - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600, width = 900)
fig.show()
df_plot_sc_weekly = df_sales_single_year[['Week','Sub-Category','Sales']].groupby(['Week','Sub-Category']).agg({'Sales':'sum'}).rename(columns = {'Sales': 'Total Sales'}).reset_index()
fig = px.bar(df_plot_sc_weekly, x='Week', y='Total Sales',color="Sub-Category", title=f'Total Sales by Week - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600 , width = 900)
fig.show()
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

df_plot_sc_daily = df_time[['WeekDay','Sub-Category','Sales']].groupby(['WeekDay','Sub-Category']).agg({'Sales':'sum'}).rename(columns = {'Sales':'Total Sales'}).reset_index()
df_plot_sc_daily = df_plot_sc_daily.set_index('WeekDay').loc[day_order].reset_index()
fig = px.bar(df_plot_sc_daily, x='WeekDay', y='Total Sales',color="Sub-Category", title=f'Total Sales by Day of Week - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600 , width = 900)
fig.show()

When analyzing sales changes in the Phones category, which boasts the highest sales value among sub-categories, we notice that peak phone sales occurred on Monday and the following Tuesday. Thursday experienced the lowest phone sales due to a general decline in store sales, but it’s noteworthy that the phone subcategory still represented the largest share of daily sales on Thursdays.

Another distinct sales pattern emerges in the Machines subcategory, where machine sales appear to reach their peak on Wednesdays and Sundays.

Within the furniture section, the Chairs subcategory exhibits a consistent sales trend. Upon closer inspection, it becomes evident that Tuesdays and Saturdays are the days when the most chairs are sold.

Before proceeding to the modeling and preprocessing stages, I intend to evaluate the stationarity and seasonality of the time series data as part of the time series analysis. My aim is to determine whether any seasonality or stationarity exists within our data.

I completed my time series analysis studies based on the Order Date variable. Now I want to test the stationarity of the time series.

df_time['Month-Year'] = df_time['Order Date'].dt.strftime('%b-%Y')

monthly_sales = df_time.groupby('Month-Year').agg({'Sales': 'sum'}).reset_index()
monthly_sales['Order'] = pd.to_datetime(monthly_sales['Month-Year'], format='%b-%Y')
monthly_sales = monthly_sales.sort_values('Order')


fig = px.line(monthly_sales, x='Month-Year', y='Sales', title='Monthly Sales',
labels={'Month-Year': 'Month-Year', 'Sales': 'Sales'})

fig.update_traces(hovertemplate='Date: %{x}<br>Sales: $%{y:,.2f}<extra></extra>')


fig.update_xaxes(tickangle=90, tickmode='array', tickvals=monthly_sales['Month-Year'],
ticktext=monthly_sales['Month-Year'])


fig.update_layout(width=900, height=600)


fig.show()

At first glance, we can readily observe a clear trend in the time series data. When we group the sales data by Month-Year, it becomes evident that sales are consistently increasing over time.

The other tests I will conduct include assessments of seasonality, cyclical patterns, and stationarity. In the initial overview, the series appears to exhibit seasonality, reaching peak points at regular intervals. Cyclicality is a related concept: it also describes a repeating pattern, but one that is not tied to a fixed calendar period, and although it may not have a substantial impact on the forecasting process, understanding it can be valuable. Seasonality, by contrast, repeats on a fixed calendar schedule; for instance, we expect ice cream sales to rise during the summer months, and a model of ice cream sales would find the same pattern every summer.
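One simple way to probe for seasonality before running any formal test is to average total sales by calendar month across the years in the data. The snippet below is a sketch of mine, assuming the df_time frame (with its datetime Order Date column) created earlier; recurring peaks in the same months would hint at a seasonal pattern.

# Average monthly revenue across years
monthly_pattern = (df_time
                   .groupby([df_time['Order Date'].dt.year.rename('Year'),
                             df_time['Order Date'].dt.month.rename('Month')])['Sales']
                   .sum()
                   .groupby('Month')
                   .mean())
print(monthly_pattern)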

plt.figure(figsize = (15,6))
plt.style.use('seaborn-colorblind')
plt.grid(True, alpha = 0.6)

sns.kdeplot(df.sort_values(by= ['Order Date'], ascending=True).loc[(df['Order Date'] == '2015-03-24'), 'Sales'], label = '2015-March')
sns.kdeplot(df.sort_values(by= ['Order Date'], ascending=True).loc[(df['Order Date'] == '2016-03-24'), 'Sales'], label = '2016-March')
sns.kdeplot(df.sort_values(by= ['Order Date'], ascending=True).loc[(df['Order Date'] == '2017-03-24'), 'Sales'], label = '2017-March')

plt.xlim(left = 0, right = 5000)
plt.show()

As you can see, when I compare the distribution of sales on the same calendar day across 3 different years, the data is not normally distributed. That is why, when resampling to a daily series, I want to use the median instead of the mean to summarize the sales within each day.

df['Order Date'] = pd.to_datetime(df['Order Date'], format='%d/%m/%Y')
df_sorted = df.sort_values(by= ['Order Date'], ascending=True)

df_sorted = df_sorted.set_index("Order Date")


df_time = pd.DataFrame(df_sorted['Sales'])

df_time = pd.DataFrame(df_time['Sales'].resample('D').median())
df_time = df_time.interpolate(method='linear')
from statsmodels.tsa.stattools import adfuller

adf = adfuller(df_time['Sales'])

print(adf)
print('\nADF = ', str(adf[0]))
print('\np-value = ', str(adf[1]))
print('\nCritical Values: ')

for key, val in adf[4].items():  # print the 1%, 5% and 10% critical values
    print(key, ':', val)

# compare the test statistic with the 5% critical value
if adf[0] < adf[4]['5%']:
    print('Null Hypothesis Rejected. Time Series is Stationary')
else:
    print('Failed to Reject the Null Hypothesis. Time Series is not Stationary')

To test whether the series is stationary, we use the adfuller (Augmented Dickey-Fuller) test from the statsmodels library. According to this test, our daily series is stationary. This is interesting, because when we drew the monthly line graph the dataset appeared to be in an upward trend. Let’s run our other checks.

import matplotlib.pyplot as plt

def is_stationary(y):

    # "H0: Non-stationary"
    # "H1: Stationary"

    p_value = sm.tsa.stattools.adfuller(y)[1]
    if p_value < 0.05:
        print(F"Result: Stationary (H0: non-stationary, p-value: {round(p_value, 3)})")
    else:
        print(F"Result: Non-Stationary (H0: non-stationary, p-value: {round(p_value, 3)})")


def ts_decompose(y, model="additive", stationary=False):
    result = sm.tsa.seasonal_decompose(y, model=model)
    fig, axes = plt.subplots(4, 1, sharex=True, sharey=False)
    fig.set_figheight(8)
    fig.set_figwidth(10)

    axes[0].set_title("Decomposition for " + model + " model")
    axes[0].plot(y, 'k', label='Original ' + model)
    axes[0].legend(loc='upper left')

    axes[1].plot(result.trend, label='Trend')
    axes[1].legend(loc='upper left')

    axes[2].plot(result.seasonal, 'g', label='Seasonality & Mean: ' + str(round(result.seasonal.mean(), 4)))
    axes[2].legend(loc='upper left')

    axes[3].plot(result.resid, 'r', label='Residuals & Mean: ' + str(round(result.resid.mean(), 4)))
    axes[3].legend(loc='upper left')
    plt.show(block=True)

    if stationary:
        is_stationary(y)
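The decomposition chart discussed below comes from calling ts_decompose on the daily series. The exact call is not shown in the original text, so the line below is my assumption of how it would be invoked, using the daily df_time series built in the resampling step above.

# Decompose the daily median sales series and re-run the ADF-based stationarity check
ts_decompose(df_time['Sales'], model="additive", stationary=True)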

We can now agree that the series is stationary. The tests we have done show that our earlier impression that ‘the data set is in an upward trend’ does not hold for the daily series.

Let me explain the decomposition chart above briefly. The top panel shows the observed level of the series and lets us form an opinion about whether it looks stationary. The second panel shows the trend component and the third the seasonal component. Finally, the residuals panel shows what remains after removing trend and seasonality; ideally, the residuals fluctuate around zero with a mean of 0.
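For reference, the additive model assumes that each observed value is simply the sum of these components: Sales(t) = Trend(t) + Seasonality(t) + Residual(t). The multiplicative alternative replaces the sums with products, which suits series whose seasonal swings grow with the level.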

That wraps up the EDA & time series analysis. In the next part, I will build some ML or DL models. See you soon :)
