The Art of Forecasting in the Retail Industry

Multivariate Time Series Forecasting

Part I : Exploratory Data Analysis & Time Series Analysis

Cem ÖZÇELİK

--

In this article, I will conduct exploratory data analysis and time series analysis using a dataset consisting of product sales in different categories from a store in the US between 2015 and 2018. Following these steps, I will preprocess the data and perform a multivariate forecasting study.

Pixabay, Pexels

Hello, today I am thrilled to finally sit in front of my computer and delve into an exciting data science project that I have been putting off for quite some time. In this article, I will share the first part of my two-phase project. Through exploratory data analysis and time series analysis, I will embark on a captivating journey into the realm of my dataset. Here, I will uncover the secrets behind sales trends in different categories of products. By visualizing the data, we will witness the fluctuations in sales figures and then delve deeply into the intricate dynamics of these patterns over time.

I utilized the Plotly library as a visualization tool to gain insights from my dataset. Plotly proved to be immensely helpful in creating interactive graphs for visualizing the data. I highly recommend using this library for data visualization purposes. Given that my dataset is a time series dataset, this article includes an exploratory data analysis section covering various aspects of time series analysis, such as stationarity, seasonality, and cyclicality.

Now I can start by importing the libraries that I will be using in the study.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime as dt


import plotly.express as px
import plotly.graph_objects as go


import itertools

import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.tsa.api as smt

import warnings
warnings.filterwarnings('ignore')
#pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.f' % x)
pd.set_option('display.expand_frame_repr', False)

After importing the necessary libraries, I proceed to import the dataset that I will be using for this project. The work presented in this article is based on my notebook published on Kaggle, where I shared my findings and analysis. By leveraging the power of data science techniques, I aim to offer valuable insights and illustrate the step-by-step process involved in this project. The dataset serves as a valuable resource to demonstrate the practical application of various data analysis and forecasting techniques.

dataset = pd.read_csv('/kaggle/input/sales-forecasting/train.csv')
df = dataset.copy()

Well, once I’ve imported the dataset, I begin my work. In the first step, I define a function called ‘check_df’. This function is highly valuable for comprehending the overall structure of the dataset: it reports the dataset’s dimensions, feature names, and data types, shows the first 5 rows, and counts missing values. Furthermore, I aim to uncover valuable patterns and trends within the dataset by analyzing the statistical properties of the numerical variables.

def check_df(dataset, head=5):
    print('#'*30 + 'Shape of Dataset' + '#'*30, end='\n'*2)
    print(dataset.shape, end='\n'*2)
    print('#'*30 + 'General informations about to Dataset' + '#'*30, end='\n'*2)
    print(dataset.info(), end='\n'*2)
    print('#'*30 + 'First 5 Lines Of Dataset' + '#'*30, end='\n'*2)
    print(dataset.head(head), end='\n'*2)
    print('#'*30 + 'NaN values of Features' + '#'*30, end='\n'*2)
    print(dataset.isnull().sum(), end='\n'*2)
    print('#'*30 + 'Descriptive Statistics of Numerical Features' + '#'*30, end='\n'*2)
    print(dataset.describe().T, end='\n'*2)
    print('#'*30 + 'Quantiles of Numerical Features' + '#'*30, end='\n'*2)
    print(dataset.describe([0, 0.10, 0.25, 0.50, 0.75, 0.99]).T, end='\n'*2)

check_df(df)

When I run the function called check_df that I created, the general view of the dataset is as follows:

##############################Shape of Dataset##############################

(9800, 18)

##############################General informations about to Dataset##############################

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9800 non-null int64
1 Order ID 9800 non-null object
2 Order Date 9800 non-null object
3 Ship Date 9800 non-null object
4 Ship Mode 9800 non-null object
5 Customer ID 9800 non-null object
6 Customer Name 9800 non-null object
7 Segment 9800 non-null object
8 Country 9800 non-null object
9 City 9800 non-null object
10 State 9800 non-null object
11 Postal Code 9789 non-null float64
12 Region 9800 non-null object
13 Product ID 9800 non-null object
14 Category 9800 non-null object
15 Sub-Category 9800 non-null object
16 Product Name 9800 non-null object
17 Sales 9800 non-null float64
dtypes: float64(2), int64(1), object(15)
memory usage: 1.3+ MB
None

##############################First 5 Lines Of Dataset##############################

Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City State Postal Code Region Product ID Category Sub-Category Product Name Sales
0 1 CA-2017-152156 08/11/2017 11/11/2017 Second Class CG-12520 Claire Gute Consumer United States Henderson Kentucky 42420 South FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 262
1 2 CA-2017-152156 08/11/2017 11/11/2017 Second Class CG-12520 Claire Gute Consumer United States Henderson Kentucky 42420 South FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 732
2 3 CA-2017-138688 12/06/2017 16/06/2017 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles California 90036 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 15
3 4 US-2016-108966 11/10/2016 18/10/2016 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale Florida 33311 South FUR-TA-10000577 Furniture Tables Bretford CR4500 Series Slim Rectangular Table 958
4 5 US-2016-108966 11/10/2016 18/10/2016 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale Florida 33311 South OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22

##############################NaN values of Features##############################

Row ID 0
Order ID 0
Order Date 0
Ship Date 0
Ship Mode 0
Customer ID 0
Customer Name 0
Segment 0
Country 0
City 0
State 0
Postal Code 11
Region 0
Product ID 0
Category 0
Sub-Category 0
Product Name 0
Sales 0
dtype: int64

##############################Descriptive Statistics of Numerical Features##############################

count mean std min 25% 50% 75% max
Row ID 9800 4900 2829 1 2451 4900 7350 9800
Postal Code 9789 55273 32041 1040 23223 58103 90008 99301
Sales 9800 231 627 0 17 54 211 22638

##############################Quantiles of Numerical Features##############################

count mean std min 0% 10% 25% 50% 75% 99% max
Row ID 9800 4900 2829 1 1 981 2451 4900 7350 9702 9800
Postal Code 9789 55273 32041 1040 1040 10024 23223 58103 90008 98115 99301
Sales 9800 231 627 0 0 8 17 54 211 2480 22638

When we observe the general structure of the dataset, we see that we are working with 9,800 observation rows and 18 variables. We also detect missing values in the ‘Postal Code’ variable; by examining them, we will develop a strategy for filling these values. In addition, a few variables carry numeric values but are categorical by nature: Postal Code and Row ID, for example, are stored as numbers even though they identify categories rather than measure quantities. We will convert these variables to categorical types.

df.loc[(df["Postal Code"].isnull()), :][["City", "State"]].value_counts()

## Output ##

City State
Burlington Vermont 11
dtype: int64

All 11 missing Postal Code values come from the same State-City pair: Burlington, Vermont. With a simple Google search, I found the postal code for this area and used it to fill the missing observations. To continue our work, we also need to change the types of some variables. The ‘Ship Date’ and ‘Order Date’ columns are stored with the ‘object’ data type, but since they contain date information, I will convert them to ‘datetime’ format.

df.drop(columns = ['Row ID'], inplace = True) 

df.loc[(df["Postal Code"].isnull()), "Postal Code"] = 5401

df["Postal Code"] = df["Postal Code"].astype("str")

def convert_datetime(dataframe):
    date_cols = [col for col in dataframe.columns if 'Date' in col]

    for col in date_cols:
        # dates are stored as day/month/year strings (e.g. 08/11/2017), so parse them day-first
        dataframe[col] = pd.to_datetime(dataframe[col], format='%d/%m/%Y', errors='coerce')
    return dataframe.info()

convert_datetime(df)



## Output ## :

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 9800 non-null object
1 Order Date 9800 non-null datetime64[ns]
2 Ship Date 9800 non-null datetime64[ns]
3 Ship Mode 9800 non-null object
4 Customer ID 9800 non-null object
5 Customer Name 9800 non-null object
6 Segment 9800 non-null object
7 Country 9800 non-null object
8 City 9800 non-null object
9 State 9800 non-null object
10 Postal Code 9800 non-null object
11 Region 9800 non-null object
12 Product ID 9800 non-null object
13 Category 9800 non-null object
14 Sub-Category 9800 non-null object
15 Product Name 9800 non-null object
16 Sales 9800 non-null float64
dtypes: datetime64[ns](2), float64(1), object(14)
memory usage: 1.3+ MB

We have fixed the missing observations and corrected the wrong data types. After converting the date variables from the object data type to the datetime data type, my next step is to categorize the variables in the dataset into different groups: categorical, numerical, high-cardinality, and date columns. Following this categorization, I will employ various visualizations to examine the class frequencies of the categorical variables.

def grab_col_name(dataframe, cat_th=5, car_th=49):
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].dtypes != "O" and dataframe[col].nunique() < cat_th]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].dtypes == "O" and dataframe[col].nunique() > car_th]

    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes in ["int64", "float64"] and "ID" not in col.upper()]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    date_cols = [col for col in dataframe.columns if 'Date' in col]

    for col in date_cols:
        dataframe[col] = pd.to_datetime(dataframe[col], errors='coerce')

    return cat_cols, num_cols, cat_but_car, date_cols


grab_col_name(df)

## Output :

(['Ship Mode', 'Segment', 'State', 'Region', 'Category', 'Sub-Category'],
['Sales'],
['Row ID',
'Order ID',
'Customer Name',
'City',
'Postal Code',
'Product ID',
'Product Name'],
['Order Date', 'Ship Date'])

##

cat_cols, num_cols, cat_but_car, date_cols = grab_col_name(df)

I am currently categorizing the variables in the dataset based on their types. Next, I will define a function to visualize the frequency of category classes in my categorical data.

# Function for Categorical Variables:

def cat_summary(dataframe, plot=False, categorical_columns=cat_cols, threshold=10):
    for col_name in categorical_columns:
        print("#" * 30 + " Unique Values Of " + col_name + " column " + "#" * 30)
        print(col_name + ": " + str(dataframe[col_name].nunique()))

        print("#" * 30 + " Frequencies of Class Values " + "#" * 30)
        value_counts = dataframe[col_name].value_counts()
        if dataframe[col_name].nunique() > threshold:
            top_n_values = value_counts.head(threshold)
            #other_count = value_counts[threshold:].sum()
            #top_n_values['Others'] = other_count
            value_counts = top_n_values

        print(pd.DataFrame({col_name: value_counts,
                            "Ratio": value_counts / len(dataframe[col_name])}))

        if plot:
            rgb_values = sns.color_palette("Set2", 6)
            sns.set_theme(style="darkgrid")

            if dataframe[col_name].nunique() > threshold:
                plt.figure(figsize=(13, 6))

            ax = sns.barplot(x=value_counts.index, y=value_counts.values, palette=rgb_values)
            ax.set_xticklabels(ax.get_xticklabels(), rotation=60)
            for p in ax.patches:
                ax.annotate(f'\n{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='top', color='white', size=10)
            plt.show()

# Function for Numerical Variables :

def num_summary(dataframe, plot=False):
    quantiles = [0, 0.25, 0.50, 0.75, 1.0]

    for col_name in num_cols:
        print("#"*30 + " Distribution of " + col_name + " column" + "#"*30)
        print(dataframe[col_name].describe(quantiles).T)

        if plot:
            sns.histplot(data=dataframe, x=col_name)
            plt.xlabel(col_name)
            plt.title("The Distribution of " + col_name)
            plt.grid(True, alpha=0.5)
            plt.show(block=True)

# Function For Percent Format

def autopct_format(values):
    def my_format(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return ' ${v:d}'.format(v=val)
    return my_format


categorical_columns = ['Ship Mode', 'Segment', 'State', 'Region',
                       'Category', 'Sub-Category', 'Customer Name', 'City']

cat_summary(df, True, categorical_columns=categorical_columns, threshold=10)
############################## Unique Values Of Ship Mode column ##############################
Ship Mode: 4
############################## Frequencies of Class Values ##############################
Ship Mode Ratio
Standard Class 5859 0.60
Second Class 1902 0.19
First Class 1501 0.15
Same Day 538 0.05

When we examine the Ship Mode variable, we see that Standard Class dominates the dataset: orders are generally shipped with standard delivery.

############################## Unique Values Of Segment column ##############################
Segment: 3
############################## Frequencies of Class Values ##############################
Segment Ratio
Consumer 5101 0.52
Corporate 2953 0.30
Home Office 1746 0.18

In the Segment variable, the Consumer customer group is dominant, followed by Corporate in second place.

############################## Unique Values Of State column ##############################
State: 49
############################## Frequencies of Class Values ##############################
State Ratio
California 1946 0.20
New York 1097 0.11
Texas 973 0.10
Pennsylvania 582 0.06
Washington 504 0.05
Illinois 483 0.05
Ohio 454 0.05
Florida 373 0.04
Michigan 253 0.03
North Carolina 247 0.03

When we look at the frequencies of the states in the data set, we see that California is dominant.

############################## Unique Values Of Region column ##############################
Region: 4
############################## Frequencies of Class Values ##############################
Region Ratio
West 3140 0.32
East 2785 0.28
Central 2277 0.23
South 1598 0.16

When we look at the Region variable, the distribution is fairly homogeneous and the class frequencies are close to each other, with the West region slightly ahead.

############################## Unique Values Of Category column ##############################
Category: 3
############################## Frequencies of Class Values ##############################
Category Ratio
Office Supplies 5909 0.60
Furniture 2078 0.21
Technology 1813 0.18

There are 3 classes in the Category variable, which stores the categories of the products sold. Office Supplies is the dominant class, with Furniture in second place.

############################## Unique Values Of Sub-Category column ##############################
Sub-Category: 17
############################## Frequencies of Class Values ##############################
Sub-Category Ratio
Binders 1492 0.15
Paper 1338 0.14
Furnishings 931 0.10
Phones 876 0.09
Storage 832 0.08
Art 785 0.08
Accessories 756 0.08
Chairs 607 0.06
Appliances 459 0.05
Labels 357 0.04

Looking at the class frequencies of the Sub-Category variable, we see that Binders and Paper are the most frequently sold sub-categories.

############################## Unique Values Of City column ##############################
City: 529
############################## Frequencies of Class Values ##############################
City Ratio
New York City 891 0.09
Los Angeles 728 0.07
Philadelphia 532 0.05
San Francisco 500 0.05
Seattle 426 0.04
Houston 374 0.04
Chicago 308 0.03
Columbus 221 0.02
San Diego 170 0.02
Springfield 161 0.02

When we look at the City variable, we see that the class named New York City is dominant.

I utilize both histograms and box plots to visualize the distribution of sales values. When we examine the statistical characteristics of the sales values using the ‘check_df’ function, we find that the series has a mean of 231, a standard deviation of 627, and a median value of 54. Additionally, the maximum value observed is 22,638. Based on this data, it’s evident that our dataset exhibits a skewed distribution. Given that the median of the values in the dataset is smaller than the mean, we expect a right-skewed histogram. In such cases, it’s more appropriate to use the median as a measure of central tendency to represent our dataset.
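Before plotting, a quick numeric check makes the same point. The snippet below is my own addition rather than part of the original notebook; it assumes the df frame prepared above.

mean_sales = df['Sales'].mean()      # around 231, as reported by check_df
median_sales = df['Sales'].median()  # around 54
skewness = df['Sales'].skew()        # a clearly positive value indicates a right-skewed distribution
print(mean_sales, median_sales, skewness)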

import plotly.graph_objects as go
import plotly.subplots as sp

fig_histogram = go.Figure()

fig_histogram.add_trace(go.Histogram(x=df['Sales'], name='Sales', histnorm='probability', opacity=0.7))

fig_histogram.update_layout(
title='Sales Distribution (Histogram)',
xaxis_title='Sales',
yaxis_title='Probability',
barmode='overlay',
bargap=0.1,
legend=dict(x=0.7, y=0.95),
autosize=False,
width=500,
height=400
)


fig_boxplot = go.Figure()

fig_boxplot.add_trace(go.Box(y=df['Sales'], name='Sales', boxpoints='all', jitter=0.3, pointpos=-1.8))

fig_boxplot.update_layout(
title='Sales Distribution (Box Plot)',
yaxis_title='Sales',
autosize=False,
width=500,
height=400
)


mean = df['Sales'].mean()
median = df['Sales'].median()
max_value = df['Sales'].max()

fig_histogram.add_shape(
type='line',
x0=mean,
x1=mean,
y0=0,
y1=1,
line=dict(color='red', width=3, dash='dash'),
name='Mean'
)

fig_histogram.add_shape(
type='line',
x0=median,
x1=median,
y0=0,
y1=1,
line=dict(color='green', width=3, dash='dash'),
name='Median'
)

fig_histogram.add_shape(
type='line',
x0=max_value,
x1=max_value,
y0=0,
y1=1,
line=dict(color='blue', width=3, dash='dash'),
name='Max'
)


fig_dashboard = sp.make_subplots(rows=1, cols=2, subplot_titles=("Histogram", "Box Plot"))

fig_dashboard.add_trace(fig_histogram['data'][0], row=1, col=1)
fig_dashboard.add_trace(fig_boxplot['data'][0], row=1, col=2)

fig_dashboard.update_layout(height=600, width=1200, title_text="Sales Distribution Dashboard")


fig_dashboard.show()

As we predicted above, our dataset exhibits a right-skewed distribution. Another remarkable point is the maximum sales value: it is extreme compared to the overall distribution and might even be considered an outlier. However, it would not be correct to dismiss it simply as contrary to the general distribution. Consider an example scenario with the Category and Sub-Category variables: some sales within these categories may be high-priced products purchased in large quantities, which would produce exactly the values we observe. At this point, it is important to understand the relationship between the numerical and categorical variables, and we will carry out the next steps with this focus. Now I will explore the categorical variables that interact with the target variable, Sales.

from matplotlib import pyplot as plt

def desc_stats(dataframe):
    desc = dataframe.describe(percentiles=[0, 0.25, 0.50, 0.75, 0.85, 0.95, 0.99, 1]).T
    desc_df = pd.DataFrame(index=dataframe.columns,
                           columns=desc.columns,
                           data=desc)

    f, ax = plt.subplots(figsize=(20, desc_df.shape[0]*0.78))
    sns.heatmap(desc_df,
                annot=True,
                cmap="Wistia",
                fmt='.2f',
                ax=ax,
                linecolor='white',
                linewidths=1.3,
                cbar=False,
                annot_kws={"size": 12})
    plt.xticks(size=18)
    plt.yticks(size=14, rotation=0)
    plt.title("Descriptive Statistics", size=14)
    plt.show()


desc_stats(df[[col for col in df.columns if df[col].dtype in ['int64', 'float64']]])

As we observed, the maximum value of the Sales variable is much larger than the other quantiles. Sales values are generally low but rise sharply above the 85th percentile: values up to the 75th percentile stay in a moderate range, while roughly a three-fold increase occurs between the 85th and 95th percentiles. This suggests that very high sales do occur, although rarely, likely driven by customers who shop in large volumes. By examining this relationship in more detail, we will try to understand the factors that affect the sales value.
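To read the same jump directly from the numbers rather than from the heatmap, a short quantile check can be run. This is a sketch I am adding for illustration, assuming the same df as above.

# Quantiles of the Sales column; note the jump above the 85th percentile
print(df['Sales'].quantile([0.50, 0.75, 0.85, 0.95, 0.99]))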

Now I will create a bar plot to visualize the 10 highest revenue-generating customers, coloring each customer’s bar by the magnitude of its sales value. This highlights the customers with the highest sales and makes it easy to identify the most important customers in terms of revenue, helping us pinpoint the business’s main sources of income and its most valuable customers.

customer_sales = df.groupby('Customer Name').agg({'Sales':"sum"}).reset_index().sort_values(by = "Sales", ascending = False)[:10]


fig = px.bar(customer_sales, x='Customer Name', y='Sales', hover_data=['Customer Name', 'Sales'],
labels={'Customer Name': 'Customer Name', 'Sales': 'Sales'},
color = 'Sales')


fig.update_traces(hovertemplate='Customer Name: %{x}<br>Sales: $%{y}<extra></extra>')

fig.update_layout(title='Top 10 Customer',width=1200, height=600)

fig.update_traces(text=customer_sales['Sales'],
texttemplate='$%{text:.2f}',
textfont=dict(color='gray', size=14, family='Arial'),
textangle=0,textposition='outside')


fig.show()

I create a bar plot to examine the Sales variable by State. There are 49 different states in the dataset, but I will show only the 10 states with the highest sales.

state_sales = df.groupby('State').agg({'Sales':"sum", "Order ID": lambda orderid: orderid.nunique()}).reset_index().sort_values(by = "Sales", ascending = False)[:10]



fig = px.bar(state_sales, x='State', y='Sales', hover_data=['State', 'Sales'],
labels={'State': 'State', 'Sales': 'Sales'},
color = 'Sales')


fig.update_traces(hovertemplate='State: %{x}<br>Sales: $%{y}<extra></extra>')

fig.update_layout(title='Total Sales by State',width=1200, height=600)

fig.update_traces(text=state_sales['Sales'],
texttemplate='$%{text:.2f}',
textfont=dict(color='gray', size=14, family='Arial'),
textangle=0,textposition='outside')


fig.show()
city_sales = df.groupby('City').agg({'Sales':"sum", "Order ID": lambda orderid: orderid.nunique()}).reset_index().sort_values(by = "Sales", ascending = False)[:10]



fig = px.bar(city_sales, x='City', y='Sales', hover_data=['City', 'Sales'],
labels={'City': 'City', 'Sales': 'Sales'},
color = 'Sales')


fig.update_traces(hovertemplate='City: %{x}<br>Sales: $%{y}<extra></extra>')

fig.update_layout(title='Total Sales by City',width=1200, height=600)

fig.update_traces(text=city_sales['Sales'],
texttemplate='$%{text:.2f}',
textfont=dict(color='gray', size=14, family='Arial'),
textangle=0,textposition='outside')


fig.show()

When I examine the visualization, I see that the highest city-level sales occurred in New York City. However, the California cities of Los Angeles, San Francisco and San Diego also have significant sales. A notable point is that California ranks above New York at the state level: New York City leads when compared city by city, but California tops the list when sales are aggregated by state, because several of its cities each contribute a significant share of overall sales.

Alongside the sales values, I would like to analyze the number of orders received by the store for products purchased from these cities. I aim to determine whether there is a correlation between the sales value and the number of orders.
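As a rough numeric check of that relationship, the city-level totals can be correlated directly. This snippet is my own sketch rather than part of the original notebook and assumes the df frame used above.

# Correlation between city-level revenue and distinct order counts
city_stats = df.groupby('City').agg(total_sales=('Sales', 'sum'),
                                    order_count=('Order ID', 'nunique'))
print(city_stats['total_sales'].corr(city_stats['order_count']))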

We examined sales values by State and City. Finally, I create a bar plot of the number of orders received, this time at the City level.

city_sales = df.groupby('City').agg({"Order ID": lambda orderid: orderid.nunique()}).rename(columns = {'Order ID' : 'Order Count'}).reset_index().sort_values(by = "Order Count", ascending = False)[:10]



fig = px.bar(city_sales, x='City', y='Order Count', hover_data=['City', 'Order Count'],
labels={'City': 'City', 'Order Count': 'Order Count'},
color = 'Order Count')


fig.update_traces(hovertemplate='City: %{x}<br>Order Count: %{y}<extra></extra>')

fig.update_layout(title='Order Count by City', width=1200, height=600)

fig.update_traces(text=city_sales['Order Count'],
                  texttemplate='%{text}',
                  textfont=dict(color='gray', size=14, family='Arial'),
                  textangle=0, textposition='outside')


fig.show()

I will create a donut chart to visualize the total amounts of invoices representing our customer segments. There are three main customer segments in our dataset: Consumer customers (Consumer), Corporate customers (Corporate) and Home Office segment. This pie chart will be used to show the ratio of invoice amount to total revenue for each customer segment. Thus, we will be able to see more clearly which customer segment generates the bulk of the operating income. By highlighting different segments with colors, we will be able to make a visual comparison of customer segments through each slice on the chart. This pie chart will help us understand the revenue contributions of customer segments and better shape our business strategies.

segment_sales = df.groupby('Segment').agg({"Sales":"sum"}).sort_values(by = "Sales", ascending = False)
segment_sales = segment_sales[['Sales']]
segment_sales.reset_index(inplace = True)
total_revenues = segment_sales["Sales"].sum()
total_revenues = str(int(total_revenues))
total_revenues = '$' + total_revenues


plt.rcParams["figure.figsize"] = (10,6)
plt.rcParams['font.size'] = 11.0
plt.rcParams['font.weight'] = 6
colors = ['#BC243C','#FE840E','#C62168']
explode = (0.05,0.05,0.05)
fig1, ax1 = plt.subplots()
ax1.pie(segment_sales['Sales'], colors = colors, labels=segment_sales['Segment'], autopct= autopct_format(segment_sales['Sales']),startangle=90,explode=explode)
centre_circle = plt.Circle((0,0),0.85,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
ax1.axis('equal')
label = ax1.annotate('Total Revenue \n'+str(total_revenues),color = 'red', xy=(0, 0), fontsize=12, ha="center")
plt.tight_layout()
plt.show()

In our dataset, there are 3 segments: Corporate, Home Office and Consumer. These customer segments indicate the nature of the customers who shop in the store. (Another possibility may be specifying the type of invoices the store has issued.)

When we look at the distribution in the dataset, we see that individual (Consumer) customers generate the largest share of revenue, followed by the Corporate segment with $688,494.

Well, I wonder in which categories and in which sub-categories these sales took place.

topcat_subcat = df.groupby(['Category','Sub-Category']).agg({'Sales':'sum'}).sort_values(by = "Sales", ascending=False)[:10]
topcat_subcat = topcat_subcat[["Sales"]].astype(int)
topcat_subcat = topcat_subcat.sort_values("Category")
topcat_subcat.reset_index(inplace=True)
topcat_subcat_1 = topcat_subcat.groupby(['Category']).agg({'Sales':'sum'})
topcat_subcat_1.reset_index(inplace=True)


plt.rcParams["figure.figsize"] = (15,10)
fig, ax = plt.subplots()
ax.axis('equal')
width = 0.1
outer_colors = ['#FE840E','#009B77','#BC243C']
inner_colors = ['Orangered','tomato','coral',"darkturquoise","mediumturquoise","paleturquoise","lightpink","pink","hotpink","deeppink"]
pie = ax.pie(topcat_subcat_1['Sales'], radius=1, labels=topcat_subcat_1['Category'],colors=outer_colors,wedgeprops=dict(edgecolor='w'))
pie2 = ax.pie(topcat_subcat['Sales'], radius=1-width, labels=topcat_subcat['Sub-Category'],autopct= autopct_format(topcat_subcat['Sales']),labeldistance=0.7,colors=inner_colors,wedgeprops=dict(edgecolor='w'), pctdistance=0.53,rotatelabels =True)
fraction_text_list = pie2[2]
for text in fraction_text_list:
    text.set_rotation(315)
centre_circle = plt.Circle((0,0),0.6,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
ax.axis('equal')
plt.tight_layout()
plt.show()

I am creating a donut chart to look at the distribution of Sales values according to the Ship mode variable in the data set.

Top_shipping = df.groupby(["Ship Mode"]).agg({'Sales':'sum'}).sort_values(by = "Sales", ascending=False) 
Top_shipping = Top_shipping[["Sales"]]
Top_shipping.reset_index(inplace=True)
total_revenue_ship = Top_shipping["Sales"].sum()
total_revenue_ship = str(int(total_revenue_ship))
total_revenue_ship = '$' + total_revenue_ship



plt.rcParams["figure.figsize"] = (13,5)
plt.rcParams['font.size'] = 12.0
plt.rcParams['font.weight'] = 6
colors = ['#BC243C','#FE840E','#C62168',"limegreen"]
fig1, ax1 = plt.subplots()
ax1.pie(Top_shipping['Sales'], colors = colors, labels=Top_shipping['Ship Mode'], autopct= autopct_format(Top_shipping['Sales']), startangle=90)
centre_circle = plt.Circle((0,0),0.82,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

ax1.axis('equal')
label = ax1.annotate('Total Revenue \n'+str(total_revenue_ship),color = 'red', xy=(0, 0), fontsize=12, ha="center")
plt.tight_layout()
plt.show()

When we examine purchases by shipment type, most revenue comes from orders shipped Standard Class. The highest Technology spending, on the other hand, occurs in orders shipped First Class and Same Day.
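The donut chart alone does not break revenue down by product category, so a quick cross-tabulation can be used to sanity-check the observation about Technology spending. This is a sketch I am adding, assuming the same df as above.

# Revenue per Category within each Ship Mode
print(df.groupby(['Ship Mode', 'Category'])['Sales'].sum().unstack())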

We examined the distribution of the Sales variable across the various categorical variables we have. Now I want to examine the Product Name variable, which contains many unique values. I will list the products with the highest order counts and the highest revenue, and then visualize the top-revenue products with a donut chart.

df_pop_items = df[['Product Name', 'Sales', 'Order ID']].groupby(['Product Name']).agg({'Sales' : 'sum','Order ID' : 'count'}).rename(columns = {'Order ID' : 'Order Count', 'Sales' : 'Total Sales'}).reset_index().sort_values(by  = 'Order Count', ascending = False)[:10]
df_pop_items2 = df[['Product Name', 'Sales', 'Order ID']].groupby(['Product Name']).agg({'Sales' : 'sum','Order ID' : 'count'}).rename(columns = {'Order ID' : 'Order Count', 'Sales' : 'Total Sales'}).reset_index().sort_values(by = 'Total Sales', ascending = False)[:10]

print('Best Selling Products (By Order Amount)' , df_pop_items, sep = '\n', end = '\n\n\n')
print('Best Selling Products (By Sales Amount)' , df_pop_items2, sep = '\n', end = '\n\n')



# Output :

Best Selling Products (By Order Amount)
Product Name Total Sales Order Count
1492 Staple envelope 1675.65 47
1498 Staples 755.47 46
537 Easy-staple paper 2414.16 44
259 Avery Non-Stick Binders 217.32 20
1495 Staple remover 263.09 18
1499 Staples in misc. colors 459.66 18
941 KI Adjustable-Height Table 4466.66 17
1510 Storex Dura Pro Binders 278.59 17
1496 Staple-based wall hangings 422.29 16
992 Logitech 910-002974 M325 Wireless Mouse for We... 1409.53 15


Best Selling Products (By Sales Amount)
Product Name Total Sales Order Count
404 Canon imageCLASS 2200 Advanced Copier 61599.82 5
649 Fellowes PB500 Electric Punch Plastic Comb Bin... 27453.38 10
444 Cisco TelePresence System EX90 Videoconferenci... 22638.48 1
785 HON 5400 Series Task Chairs for Big and Tall 21870.58 8
685 GBC DocuBind TL300 Electric Binding System 19823.48 11
687 GBC Ibimaster 500 Manual ProClick Binding System 19024.50 9
804 Hewlett Packard LaserJet 3310 Copier 18839.69 8
786 HP Designjet T520 Inkjet Large Format Printer ... 18374.90 3
682 GBC DocuBind P400 Electric Binding System 17965.07 6
812 High Speed Automatic Electric Letter Opener 17030.31 3

In the lists above, we see the products with the highest revenue (Total Sales) and the highest number of purchases (Order Count). By order count, ‘Staple envelope’ tops the list.

When we sort by revenue instead, the effect of individual high-priced products becomes much clearer. For example, the product named ‘Cisco TelePresence System EX90 Videoconferenci…’ generated $22,638 from a single purchase; it is a very expensive item.

Top_products = df.groupby(["Product Name"]).agg({'Sales':'sum'}).sort_values("Sales",ascending=False).head(8) 
Top_products = Top_products[["Sales"]].round(2)
Top_products.reset_index(inplace=True)
total_revenue_products = '$' +str(int(Top_products["Sales"].sum()))

plt.rcParams["figure.figsize"] = (13,7)
plt.rcParams['font.size'] = 12.0
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99','#55B4B0','#E15D44','#009B77','#B565A7'] # colors are defined for the pie chart
explode = (0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05)
fig1, ax1 = plt.subplots()
ax1.pie(Top_products['Sales'], colors = colors, labels=Top_products['Product Name'], autopct= autopct_format(Top_products['Sales']), startangle=90,explode=explode)
centre_circle = plt.Circle((0,0),0.80,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

ax1.axis('equal')
label = ax1.annotate('Total Revenue \n of these products \n'+str(total_revenue_products),color = 'red', xy=(0, 0), fontsize=12, ha="center")
plt.tight_layout()
plt.show()

We have completed our exploratory data analysis of the categorical variables in the dataset. If you remember, we have two date variables, Ship Date and Order Date. At this stage, I want to analyze how our Sales variable changes over time according to the Order Date. Next, I will once again use line charts to visualize how the Sales variable changes over time across different categorical variables.

df_time = df.copy()

df_time['Order Date'] = pd.to_datetime(df['Order Date'], format='%d/%m/%Y')
df_time['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%d/%m/%Y')

df_time['Month-Year'] = df_time['Order Date'].dt.strftime('%b-%Y')

monthly_sales = df_time.groupby('Month-Year').agg({'Sales': 'sum'}).reset_index()
monthly_sales['Order'] = pd.to_datetime(monthly_sales['Month-Year'], format='%b-%Y')
monthly_sales = monthly_sales.sort_values('Order')


fig = px.line(monthly_sales, x='Month-Year', y='Sales', title='Monthly Sales',
labels={'Month-Year': 'Month-Year', 'Sales': 'Sales'})

fig.update_traces(hovertemplate='Date: %{x}<br>Sales: %{y:,.2f}<extra></extra>')


fig.update_xaxes(tickangle=90, tickmode='array', tickvals=monthly_sales['Month-Year'],
ticktext=monthly_sales['Month-Year'])


fig.update_layout(width=900, height=600)


fig.show()

When we analyze the change in sales values over time, it initially appears that there is a trend in the series. However, it would be premature to draw conclusions based solely on appearance. To provide a more accurate interpretation, we can refer to the Augmented Dickey-Fuller (ADF) test. I will proceed with our time series analysis.

def time_breakdown(dataframe=dataset, date_col='Order Date', categorical_columns=cat_cols, threshold=15):
    dataframe[f'Month_Year_{date_col}'] = dataframe[date_col].dt.strftime('%b-%Y')
    for cat_col in categorical_columns:
        if dataframe[cat_col].nunique() <= threshold:
            monthly_sales = dataframe.groupby([f'Month_Year_{date_col}', cat_col]).agg({'Sales': 'sum'}).reset_index()
            monthly_sales['Order'] = pd.to_datetime(monthly_sales[f'Month_Year_{date_col}'], format='%b-%Y')
            monthly_sales = monthly_sales.sort_values('Order')

            fig = px.line(monthly_sales, x=f'Month_Year_{date_col}', y='Sales', title='Monthly Sales',
                          labels={f'Month_Year_{date_col}': f'Month_Year_{date_col}', 'Sales': 'Sales'}, color=cat_col)

            fig.update_traces(hovertemplate='Date: %{x}<br>Sales: %{y:,.2f}<extra></extra>')

            fig.update_xaxes(tickangle=90, tickmode='array', tickvals=monthly_sales[f'Month_Year_{date_col}'],
                             ticktext=monthly_sales[f'Month_Year_{date_col}'])

            fig.update_layout(width=1000, height=600)

            fig.show()


time_breakdown(df_time, 'Order Date', categorical_columns=categorical_columns, threshold=10)

I create a function to analyze the variation in sales values over time based on the categorical variables in our dataset. For each of the specified categorical columns, the function plots monthly sales broken down by that column’s classes.

Now, I am generating new time variables by utilizing the Order Date column we possess. Subsequently, I will explore the fluctuations in product sales over time.

df_time['WeekDay'] = df_time['Order Date'].dt.day_name()

start_date = df_time['Order Date'].min()
df_time['Week'] = ((((df_time['Order Date'] - start_date).dt.days)/7) +1).astype(int)

We previously looked at how Sales changes with the Order Date, but that chart was aggregated by month rather than by day. Now it’s time to recreate it at the daily level.

df_plot = df_time.groupby('Order Date').agg({'Sales':'sum'}).rename(columns = {'Sales':'Total Sales'}).reset_index().sort_values(by = 'Order Date', ascending = True)

fig = px.line(df_plot, x="Order Date", y="Total Sales", title='Total Sales by Date')
fig.show()

As observed in the figure, there is no apparent pattern in daily sales with respect to the number of orders, although upon closer examination we can suspect some seasonality. When we create a line plot based on the total sales value, we do not discern a clear seasonality pattern, and there is no distinct daily trend in the data. We will conduct tests to ascertain the stationarity and seasonality of the data. My next objective is to investigate whether sales vary across the days of the week; in other words, I plan to group by weekday to determine whether sales peak on specific days.

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df_plot_dayofweek = df_time[['WeekDay', 'Sales', 'Order ID']].groupby(['WeekDay']).agg({'Sales': 'sum', 'Order ID': 'count'}).rename(columns = {'Sales': 'Total Sales', 'Order ID':'Order Count'}).reset_index()

df_plot_dayofweek = df_plot_dayofweek.set_index('WeekDay').loc[day_order].reset_index()

fig_dayofweek = px.bar(df_plot_dayofweek, x = 'WeekDay', y = 'Order Count',hover_data=['WeekDay', 'Order Count'],
labels={'WeekDay': 'WeekDay', 'Order Count': 'Order Count'},
color = 'Order Count')

fig_dayofweek.update_traces(hovertemplate='WeekDay: %{x}<br>Order Count: %{y}<extra></extra>')

fig_dayofweek.update_layout(title='Order Count by Day Of Week', width=1200, height=600)

fig_dayofweek.update_traces(text=df_plot_dayofweek['Order Count'],
                            texttemplate='%{text}',
                            textfont=dict(color='gray', size=14, family='Arial'),
                            textangle=0, textposition='outside')
fig_dayofweek.show()

As the chart shows, order counts peak on Tuesday, gradually decrease until Thursday, and begin to rise again toward the weekend.

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df_plot_dayofweek = df_time[['WeekDay', 'Sales', 'Order ID']].groupby(['WeekDay']).agg({'Sales': 'sum', 'Order ID': 'count'}).rename(columns = {'Sales': 'Total Sales', 'Order ID':'Order Count'}).reset_index()

df_plot_dayofweek = df_plot_dayofweek.set_index('WeekDay').loc[day_order].reset_index()

fig_dayofweek = px.bar(df_plot_dayofweek, x = 'WeekDay', y = 'Total Sales',hover_data=['WeekDay', 'Total Sales'],
labels={'WeekDay': 'WeekDay', 'Total Sales': 'Total Sales'},
color = 'Total Sales')

fig_dayofweek.update_traces(hovertemplate='WeekDay: %{x}<br>Total Sales: $%{y}<extra></extra>')

fig_dayofweek.update_layout(title='Total Sales by Day Of Week',width=1200, height=600)

fig_dayofweek.update_traces(text=df_plot_dayofweek['Total Sales'],
texttemplate='$%{text:.2f}',
textfont=dict(color='gray', size=14, family='Arial'),
textangle=0,textposition='outside')
fig_dayofweek.show()

When we examine the number of orders and the sales values separately by day, we notice that the highest order counts occur on Tuesday and Sunday, followed by Saturday and Monday. In terms of revenue, however, the strongest days are Saturday, Tuesday, Monday, and Wednesday. Interestingly, Sunday, which ranks near the top in order count, falls to a much lower rank in terms of revenue.

Based on this observation, we can conclude that items with higher monetary value are typically ordered on Saturdays. Since Tuesday ranks near the top of both charts, products sold on Tuesdays are not only numerous but also valuable. On Wednesdays, similar to Saturdays, there is a lower volume of orders, but those orders tend to be more valuable.
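One way to quantify “more valuable orders” is the average revenue per order for each weekday. The snippet below is my own sketch rather than part of the original notebook; it assumes df_time (with the WeekDay column) and day_order as defined above.

# Average order value per weekday
aov = df_time.groupby('WeekDay').agg(total_sales=('Sales', 'sum'),
                                     order_count=('Order ID', 'nunique'))
aov['avg_order_value'] = aov['total_sales'] / aov['order_count']
print(aov.loc[day_order, 'avg_order_value'])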

Now I want to narrow the lens of my time series analysis and look at the course of sales within a single year. I will continue by creating a variable named SINGLE_YEAR and, picking 2017 from the series, examining the sales made in that year.

SINGLE_YEAR = 2017

df_sales_single_year = df_time[df_time['Order Date'].dt.year == SINGLE_YEAR]
df_plot_single_year = df_sales_single_year[['Order Date','Sales','Order ID']].groupby(['Order Date']).agg({'Sales': 'sum', 'Order ID':'count'}).rename(columns = {'Sales':'Total Sales', 'Order ID': 'Order Count'}).reset_index()
fig = px.bar(df_plot_single_year, x='Order Date', y='Order Count',facet_col_spacing=0, title=f'Total Sales by Day {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600, width = 1200)
fig.show()
df_plot = df_sales_single_year[['Order Date','Category', 'Sales','Order ID']].groupby(['Order Date','Category']).agg({'Sales':'sum', 'Order ID':'count'}).rename(columns = {'Sales': 'Total Sales', 'Order ID': 'Order Count'}).reset_index()
fig = px.bar(df_plot, x='Order Date', y='Total Sales',color="Category", title=f'Item Sales by Date - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600, width = 1200, showlegend = True)
fig.show()
df_plot_weekly = df_sales_single_year[['Week','Category','Order ID','Sales']].groupby(['Week','Category']).agg({'Sales':'sum', 'Order ID': 'count'}).rename(columns = {'Sales': 'Total Sales', 'Order ID' : 'Order Count'}).reset_index()
fig = px.bar(df_plot_weekly, x='Week', y='Total Sales',color="Category", title=f'Total Sales by Week - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600 , width = 1200, showlegend = True)
fig.show()
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

df_plot_daily = df_time[['WeekDay','Category','Sales']].groupby(['WeekDay','Category']).agg({'Sales':'sum'}).rename(columns = {'Sales':'Total Sales'}).reset_index()
df_plot_daily = df_plot_daily.set_index('WeekDay').loc[day_order].reset_index()
fig = px.bar(df_plot_daily, x='WeekDay', y='Total Sales',color="Category", title=f'Total Sales by Day of Week - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600 , width = 900, showlegend = True)
fig.show()
df_pop_cats = df_time[['Sub-Category', 'Sales', 'Order ID']].groupby(['Sub-Category']).agg({'Sales' : 'sum',"Order ID": lambda orderid: orderid.nunique()}).rename(columns = {'Order ID' : 'Order Amount', 'Sales' : 'Total Sales'}).reset_index().sort_values(by  = 'Total Sales', ascending = False)
df_pop_cats[:10]
df_plot_sc = df_sales_single_year[['Order Date','Sub-Category', 'Sales']].groupby(['Order Date','Sub-Category']).agg({'Sales':'sum'}).rename(columns = {'Sales': 'Total Sales'}).reset_index()
fig = px.bar(df_plot_sc, x='Order Date', y='Total Sales',color="Sub-Category", title=f'Item Sales by Date - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600, width = 900)
fig.show()
df_plot_sc_weekly = df_sales_single_year[['Week','Sub-Category','Sales']].groupby(['Week','Sub-Category']).agg({'Sales':'sum'}).rename(columns = {'Sales': 'Total Sales'}).reset_index()
fig = px.bar(df_plot_sc_weekly, x='Week', y='Total Sales',color="Sub-Category", title=f'Total Sales by Week - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600 , width = 900)
fig.show()
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

df_plot_sc_daily = df_time[['WeekDay','Sub-Category','Sales']].groupby(['WeekDay','Sub-Category']).agg({'Sales':'sum'}).rename(columns = {'Sales':'Total Sales'}).reset_index()
df_plot_sc_daily = df_plot_sc_daily.set_index('WeekDay').loc[day_order].reset_index()
fig = px.bar(df_plot_sc_daily, x='WeekDay', y='Total Sales',color="Sub-Category", title=f'Total Sales by Day of Week - {SINGLE_YEAR}')
fig.update_layout(bargap=0.0,bargroupgap=0.0, height = 600 , width = 900)
fig.show()

When analyzing sales changes in the Phones category, which boasts the highest sales value among sub-categories, we notice that peak phone sales occurred on Monday and the following Tuesday. Thursday experienced the lowest phone sales due to a general decline in store sales, but it’s noteworthy that the phone subcategory still represented the largest share of daily sales on Thursdays.

Another distinct sales pattern emerges in the Machines subcategory, where machine sales appear to reach their peak on Wednesdays and Sundays.

Within the furniture section, the Chairs subcategory exhibits a consistent sales trend. Upon closer inspection, it becomes evident that Tuesdays and Saturdays are the days when the most chairs are sold.

Before proceeding to the modeling and preprocessing stages, I intend to evaluate the stationarity and seasonality of the time series data as part of the time series analysis. My aim is to determine whether any seasonality or stationarity exists within our data.

I completed my time series analysis studies based on the Order Date variable. Now I want to test the stationarity of the time series.

df_time['Month-Year'] = df_time['Order Date'].dt.strftime('%b-%Y')

monthly_sales = df_time.groupby('Month-Year').agg({'Sales': 'sum'}).reset_index()
monthly_sales['Order'] = pd.to_datetime(monthly_sales['Month-Year'], format='%b-%Y')
monthly_sales = monthly_sales.sort_values('Order')


fig = px.line(monthly_sales, x='Month-Year', y='Sales', title='Monthly Sales',
labels={'Month-Year': 'Month-Year', 'Sales': 'Sales'})

fig.update_traces(hovertemplate='Date: %{x}<br>Sales: $%{y:,.2f}<extra></extra>')


fig.update_xaxes(tickangle=90, tickmode='array', tickvals=monthly_sales['Month-Year'],
ticktext=monthly_sales['Month-Year'])


fig.update_layout(width=900, height=600)


fig.show()

At first glance, we can readily observe a clear trend in the time series data. When we group the sales data by Month-Year, it becomes evident that sales are consistently increasing over time.

The other tests I will conduct include assessments of seasonality, cyclical patterns, and stationarity. In the initial overview, the series appears to exhibit seasonality, reaching peak points at regular intervals. Cyclicality is a related concept: it also describes a repeating pattern, but one that is not tied to a fixed calendar period, and although it may not have a substantial impact on the forecasting process, understanding it can be valuable. Seasonality, by contrast, repeats on a fixed calendar schedule; for instance, we expect ice cream sales to rise during the summer months, and a model of ice cream sales would find the same pattern every summer.
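One simple way to probe for seasonality before running any formal test is to average total sales by calendar month across the years in the data. The snippet below is a sketch of mine, assuming the df_time frame (with its datetime Order Date column) created earlier; recurring peaks in the same months would hint at a seasonal pattern.

# Average monthly revenue across years
monthly_pattern = (df_time
                   .groupby([df_time['Order Date'].dt.year.rename('Year'),
                             df_time['Order Date'].dt.month.rename('Month')])['Sales']
                   .sum()
                   .groupby('Month')
                   .mean())
print(monthly_pattern)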

plt.figure(figsize = (15,6))
plt.style.use('seaborn-colorblind')
plt.grid(True, alpha = 0.6)

sns.kdeplot(df.sort_values(by= ['Order Date'], ascending=True).loc[(df['Order Date'] == '2015-03-24'), 'Sales'], label = '2015-March')
sns.kdeplot(df.sort_values(by= ['Order Date'], ascending=True).loc[(df['Order Date'] == '2016-03-24'), 'Sales'], label = '2016-March')
sns.kdeplot(df.sort_values(by= ['Order Date'], ascending=True).loc[(df['Order Date'] == '2017-03-24'), 'Sales'], label = '2017-March')

plt.xlim(left = 0, right = 5000)
plt.show()

As you can see, when I compare the distribution of sales on the same calendar day across 3 different years, the data is not normally distributed. That is why, when resampling to a daily series, I want to use the median instead of the mean to summarize the sales within each day.

df['Order Date'] = pd.to_datetime(df['Order Date'], format='%d/%m/%Y')
df_sorted = df.sort_values(by= ['Order Date'], ascending=True)

df_sorted = df_sorted.set_index("Order Date")


df_time = pd.DataFrame(df_sorted['Sales'])

df_time = pd.DataFrame(df_time['Sales'].resample('D').median())
df_time = df_time.interpolate(method='linear')
from statsmodels.tsa.stattools import adfuller

adf = adfuller(df_time['Sales'])

print(adf)
print('\nADF = ', str(adf[0]))
print('\np-value = ', str(adf[1]))
print('\nCritical Values: ')

for key, val in adf[4].items():  # print the 1%, 5% and 10% critical values
    print(key, ':', val)

# compare the test statistic with the 5% critical value
if adf[0] < adf[4]['5%']:
    print('Null Hypothesis Rejected. Time Series is Stationary')
else:
    print('Failed to Reject the Null Hypothesis. Time Series is not Stationary')

To test whether the series is stationary, we use the adfuller (Augmented Dickey-Fuller) test from the statsmodels library. According to this test, our daily series is stationary. This is interesting, because when we drew the monthly line graph the dataset appeared to be in an upward trend. Let’s run our other checks.

import matplotlib.pyplot as plt

def is_stationary(y):

    # "H0: Non-stationary"
    # "H1: Stationary"

    p_value = sm.tsa.stattools.adfuller(y)[1]
    if p_value < 0.05:
        print(F"Result: Stationary (H0: non-stationary, p-value: {round(p_value, 3)})")
    else:
        print(F"Result: Non-Stationary (H0: non-stationary, p-value: {round(p_value, 3)})")


def ts_decompose(y, model="additive", stationary=False):
    result = sm.tsa.seasonal_decompose(y, model=model)
    fig, axes = plt.subplots(4, 1, sharex=True, sharey=False)
    fig.set_figheight(8)
    fig.set_figwidth(10)

    axes[0].set_title("Decomposition for " + model + " model")
    axes[0].plot(y, 'k', label='Original ' + model)
    axes[0].legend(loc='upper left')

    axes[1].plot(result.trend, label='Trend')
    axes[1].legend(loc='upper left')

    axes[2].plot(result.seasonal, 'g', label='Seasonality & Mean: ' + str(round(result.seasonal.mean(), 4)))
    axes[2].legend(loc='upper left')

    axes[3].plot(result.resid, 'r', label='Residuals & Mean: ' + str(round(result.resid.mean(), 4)))
    axes[3].legend(loc='upper left')
    plt.show(block=True)

    if stationary:
        is_stationary(y)
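The decomposition chart discussed below comes from calling ts_decompose on the daily series. The exact call is not shown in the original text, so the line below is my assumption of how it would be invoked, using the daily df_time series built in the resampling step above.

# Decompose the daily median sales series and re-run the ADF-based stationarity check
ts_decompose(df_time['Sales'], model="additive", stationary=True)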

We can now agree that the series is stationary. The tests we have done show that our earlier impression that ‘the data set is in an upward trend’ does not hold for the daily series.

Let me explain the decomposition chart above briefly. The top panel shows the observed level of the series and lets us form an opinion about whether it looks stationary. The second panel shows the trend component and the third the seasonal component. Finally, the residuals panel shows what remains after removing trend and seasonality; ideally, the residuals fluctuate around zero with a mean of 0.
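For reference, the additive model assumes that each observed value is simply the sum of these components: Sales(t) = Trend(t) + Seasonality(t) + Residual(t). The multiplicative alternative replaces the sums with products, which suits series whose seasonal swings grow with the level.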

That wraps up the EDA & time series analysis. In the next part, I will build some ML or DL models. See you soon :)
