Exploratory Data Analysis on Stock Market Data

Uzair Adamjee
Mar 4, 2023

Exploratory Data Analysis (EDA) is a crucial step in data science projects. It helps in understanding the underlying patterns and relationships in the data. In this tutorial, we will perform EDA on the S&P 500 dataset using Python.

1. Load the Dataset

The first step is to load the dataset. We will be using an S&P 500 dataset, which contains historical daily stock prices for the companies in the S&P 500 index, made up of 500 of the largest publicly traded companies in the United States. The dataset can be downloaded from Kaggle.

import pandas as pd

# Load the dataset
df = pd.read_csv('SP500.csv')

2. Explore the Dataset

The next step is to explore the dataset. We can use the head() method to view the first few rows, and the info() method to check the column data types and non-null counts.

# View the first few rows of the dataset
df.head()
# View the data types and non-null values in the dataset
df.info()

From the output, we can see that the dataset has 16924 rows and 6 columns. The columns are date, open, high, low, close, and volume. All columns have non-null values.
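For a quick numerical summary, we can also call the describe() method, which reports the count, mean, standard deviation, and quantiles for each numeric column.

# Summary statistics for the numeric columns
df.describe()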

3. Data Cleaning

The next step is to clean the data.

  1. Check for missing values: Use the isnull() method to check for missing values in the dataset. If there are any missing values, decide whether to remove them or fill them with appropriate values.
  2. Check for duplicates: Use the duplicated() method to check for any duplicate rows in the dataset. If there are any duplicate rows, decide whether to remove them or keep them.
  3. Convert data types: Check if the data types of each column are appropriate for analysis. For example, the date column should be in datetime format instead of a string.
  4. Rename columns: Rename the columns if necessary for better readability and understanding.
  5. Remove irrelevant columns: Remove any columns that are not necessary for analysis or are redundant.
  6. Handle outliers: Check for outliers in the data and decide whether to remove them or keep them.
  7. Standardize data: Standardizing the data brings all the columns to the same scale, which makes them easier to compare. You can standardize the data using the StandardScaler class from the sklearn.preprocessing module.

The code below walks through these cleaning steps on the S&P 500 dataset. The column renaming, column removal, and standardization are applied to copies of the DataFrame, because the visualizations later in the tutorial rely on the original lowercase column names and unscaled values.
import pandas as pd

# Load the dataset
df = pd.read_csv('SP500.csv')

# Check for missing values
print(df.isnull().sum())

# Drop rows with missing values
df = df.dropna()

# Convert the date column to datetime format
df['date'] = pd.to_datetime(df['date'])

# Check for duplicates
print(df.duplicated().sum())

# Remove any duplicate rows
df.drop_duplicates(keep=False, inplace=True)

# Rename columns for readability (done on a copy here, since the
# visualizations below use the original lowercase column names)
df_renamed = df.rename(columns={'open': 'Open', 'high': 'High', 'low': 'Low',
                                'close': 'Close', 'volume': 'Volume'})

# Remove columns that are not needed for a particular analysis
# (also on the copy; we keep high, low and volume for the plots below)
df_renamed = df_renamed.drop(['High', 'Low', 'Volume'], axis=1)

# Handle outliers in the closing price using the IQR rule
q1 = df['close'].quantile(0.25)
q3 = df['close'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
df = df[(df['close'] >= lower_bound) & (df['close'] <= upper_bound)]

# Standardize the numeric columns (on another copy, so the charts below
# still show prices and volumes in their original units)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numeric_cols = ['open', 'high', 'low', 'close', 'volume']
df_scaled = df.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])

After performing these data cleaning steps, you can move on to the next step in EDA, which is to visualize the data.

4. Data Visualization

The next step is to visualize the data. We can use various types of plots to visualize the patterns and relationships in the data. Here, we will use the matplotlib and seaborn libraries to create plots.

# Import the plotting libraries first
import matplotlib.pyplot as plt
import seaborn as sns

Line Plot: We can create a line plot of the closing stock prices over time using seaborn and matplotlib.

# Line chart of closing stock price over time
plt.figure(figsize=(10, 6))
sns.lineplot(x='date', y='close', data=df)
plt.title('Closing Stock Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Stock Price')
plt.show()

From the plot, we can see that the closing stock prices have increased over time, with some fluctuations.

Box Plot: We can create a box plot of the closing stock prices by year using the seaborn library.


# Box plot of closing stock prices by year
df['year'] = df['date'].dt.year
plt.figure(figsize=(12, 6))
sns.boxplot(x='year', y='close', data=df)
plt.title('Closing Stock Prices by Year')
plt.xlabel('Year')
plt.ylabel('Closing Stock Price')
plt.show()

From the plot, we can see that the closing stock prices have generally increased over the years, with some outliers.

Heatmap: We can create a heatmap to visualize the correlation between the stock prices using the seaborn library.

# Create a heatmap of the correlation between stock prices
corr = df[['open', 'high', 'low', 'close']].corr()
plt.figure(figsize=(8,8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Between Stock Prices')
plt.show()

From the heatmap, we can see that the opening and closing prices have a strong positive correlation, while the low and high prices have a weaker positive correlation.

We can start by visualizing the distribution of the target variable, which in this case is the closing stock price. We can use a histogram to visualize the distribution.

plt.figure(figsize=(10, 6))
sns.histplot(df['close'], kde=True)
plt.title('Distribution of Closing Stock Price')
plt.xlabel('Closing Stock Price')
plt.ylabel('Frequency')
plt.show()

We can also visualize the stock prices for individual companies using a line chart. Assuming the dataset includes a name column holding each company's ticker symbol, we can filter on it and plot the closing stock price over time.

# Line chart of closing stock price for a specific company over time
company_name = 'MAT'
company_df = df[df['name']==company_name]
plt.figure(figsize=(10, 6))
sns.lineplot(x='date', y='close', data=company_df)
plt.title(f'Closing Stock Price of {company_name} Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Stock Price')
plt.show()

We can also visualize the stock prices for multiple companies using a line chart. We can use a loop to filter the dataset based on the company name and plot the closing stock price over time.

# Line chart of closing stock price for multiple companies over time
companies = ['MAT', 'MAA', 'CERN']
plt.figure(figsize=(10, 6))
for company in companies:
    company_df = df[df['name'] == company]
    sns.lineplot(x='date', y='close', data=company_df, label=company)
plt.title('Closing Stock Price of Multiple Companies Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Stock Price')
plt.legend()
plt.show()

To visualize the daily returns, we can create a line chart using pandas and matplotlib. The line chart will show the percentage change in price from one day to the next, i.e. the difference between today's close and yesterday's close, divided by yesterday's close.

# Daily returns as the percentage change in closing price
daily_returns = df['close'].pct_change()

# Create a line chart of the daily returns
plt.figure(figsize=(12, 6))
plt.plot(df['date'], daily_returns)
plt.title('S&P 500 Daily Returns')
plt.xlabel('Date')
plt.ylabel('Daily Return')
plt.show()

This will create a line chart showing the daily returns over time.
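Note that if your copy of the dataset stacks many tickers in a single file, pct_change over the whole frame will mix prices from different companies at the boundaries between tickers. A minimal per-company sketch, assuming the same name column used above, could look like this:

# Per-company daily returns, assuming a 'name' column holding the ticker
returns_df = df.sort_values(['name', 'date']).copy()
returns_df['daily_return'] = returns_df.groupby('name')['close'].pct_change()

# Plot the daily returns for a single ticker, e.g. MAT
mat_df = returns_df[returns_df['name'] == 'MAT']
plt.figure(figsize=(10, 6))
plt.plot(mat_df['date'], mat_df['daily_return'])
plt.title('Daily Returns of MAT')
plt.xlabel('Date')
plt.ylabel('Daily Return')
plt.show()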

We can use a combination chart with two y-axes to visualize the stock prices alongside the volume traded.

# Combination plot of closing price and volume traded on separate y-axes
# (volume is orders of magnitude larger than price, so a shared axis would
# flatten the price line)
fig, ax1 = plt.subplots(figsize=(12, 6))
sns.lineplot(x='date', y='close', data=df, color='b', ax=ax1)
ax2 = ax1.twinx()
sns.lineplot(x='date', y='volume', data=df, color='g', alpha=0.5, ax=ax2)
ax1.set_title('S&P 500 Stock Prices with Volume Traded')
ax1.set_xlabel('Date')
ax1.set_ylabel('Closing Price')
ax2.set_ylabel('Volume')
plt.show()

We can also use a histogram to visualize the daily returns.

# Create a histogram of the daily returns
plt.figure(figsize=(12,6))
sns.histplot(df['close'].pct_change().dropna(), bins=100, kde=True)
plt.title('S&P 500 Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency')
plt.show()

In conclusion, Exploratory Data Analysis (EDA) is an essential step in data science projects, and it provides a better understanding of the underlying patterns and relationships in the data. In this tutorial, we performed EDA on the S&P 500 dataset using Python.

We started by loading the dataset and exploring it to understand its structure and contents. We also performed data cleaning to ensure that the data was in the correct format and removed any missing or duplicated values.

After that, we visualized the data using various charts, such as line plots, box plots, heatmaps, and histograms. We visualized the stock prices, returns, and trading volumes of the S&P 500 companies over time.

Overall, the visualizations showed that the S&P 500 index has grown steadily over time, with some periods of volatility. The returns of individual companies have also varied, with some outperforming the index and others underperforming. We also plotted trading volume alongside closing prices to see how trading activity has changed over time.

EDA provides valuable insights into the data, and the visualizations help to communicate these insights effectively. By performing EDA, we can identify trends, patterns, and relationships that may not be immediately apparent from the data. This knowledge can then be used to inform further analysis and decision-making.

Thank you for reading!
