The dashboard I built to land my first business intelligence internship
Learn how to build an ETL process with Google Trends, Python, PostgreSQL, and Power BI.
1. INTRODUCTION
In the age of technology and changing consumer habits, the car industry in South America is at a critical point where traditional methods and modern digital marketing strategies come together. This is where exciting opportunities for growth and efficiency arise.
Our project aims to analyze the automotive industry in five South American countries (Peru, Colombia, Argentina, Bolivia, and Chile), focusing on several brands between 2021 and 2022. This analysis, presented through a dashboard such as the one in Figure 2, seeks to provide insights for digital marketing purposes.
The article is organized into five sections, as follows:
- Section 1: Brief description that acts as the motivating foundation of this article.
- Section 2: Explanation of the ETL diagram for the project.
- Section 3: The technical section for the project where Python and pgAdmin4 will be used.
- Section 4: Reporting data for the project insights.
- Section 5: Project findings emphasizing the importance of visualization tools, digital marketing and suggesting future areas for study.
You can visit my GitHub in this link for the full code of the project: AlexRoman938/brand_car_dashboard (github.com)
2. ETL ARCHITECTURE DIAGRAM
ETL stands for Extract, Transform, Load. It is a data integration process that involves extracting data from various sources, transforming it into a consistent format, and loading it into a target system. ETL ensures data quality and enables analysis and reporting.
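To make the three stages concrete before we touch real tools, here is a toy sketch of ETL as three plain Python functions. Everything in it (the function names, the sample rows, the in-memory "warehouse") is illustrative, not part of any library:

```python
# A toy sketch of the three ETL stages as plain functions.
# The data and function names are illustrative only.

def extract():
    # Pretend these rows came from an external source.
    return [{"country": "PE", "brand": "Kia", "searches": 40},
            {"country": "PE", "brand": "MG", "searches": 60}]

def transform(rows):
    # Normalize country codes into full names (a consistent format).
    names = {"PE": "Peru"}
    return [{**row, "country": names.get(row["country"], row["country"])}
            for row in rows]

def load(rows, target):
    # Here the "target system" is just an in-memory list.
    target.extend(rows)
    return target

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0]["country"])  # Peru
```

In the rest of the article, `extract` becomes pytrends, `transform` becomes pandas, and `load` becomes PostgreSQL.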
2.1. Extract Data
We will use Google Trends as a database to extract data, it is a public web-based tool that allows users to explore the popularity of search queries on Google. It can be used in Data Analytics projects to gather insights about the popularity of specific topics.
In this case, we will use it to search for car brands such as Kia, Mitsubishi, Peugeot, Fuso, Chery, MG, and GAC Motor in several South American countries (Argentina, Bolivia, Chile, Colombia, and Peru) between 2021-01-01 and 2022-12-31.
2.2. Transform Data
Once we have extracted the data, we will have one dataframe per car brand and country (e.g., a dataframe for Kia searches in Peru, or MG searches in Colombia). Next, we need to concatenate all of them into a single "final dataframe" and apply some feature engineering to it before moving on to the next stage.
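On toy data, the concatenate-and-aggregate step looks like the sketch below. The numbers are made up; only the shape (one keyword column per frame, plus Country and Brand labels) mirrors what we will build later:

```python
import pandas as pd

# Two per-brand/per-country frames like the ones Google Trends returns,
# with made-up search counts.
df_pe = pd.DataFrame({"Kia": [10, 20], "Country": "PE", "Brand": "Kia"})
df_co = pd.DataFrame({"MG": [5, 15], "Country": "CO", "Brand": "MG"})

# Stack them into one "final dataframe"; missing keyword columns become NaN.
final_df = pd.concat([df_pe, df_co], axis=0)

# Sum the keyword columns per (Country, Brand); sum() ignores NaN.
report = (final_df.groupby(["Country", "Brand"])[["Kia", "MG"]]
          .sum().reset_index())
report["#Number"] = report[["Kia", "MG"]].sum(axis=1)
print(report[["Country", "Brand", "#Number"]])
```

The real pipeline in Section 3.2 does exactly this, just with seven brands and five countries.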
2.3. Load Data
After the transform process, we will load the "final dataframe" into pgAdmin4. pgAdmin is an open-source administration and development platform for PostgreSQL.
2.4. Reporting Data
Finally, we will connect pgAdmin4 and Power BI to build an interactive dashboard.
3. CODING STAGE
In this stage we are going to code in Python 3.9 using an Anaconda environment. So, we need to run the command below in the Anaconda Prompt:
conda create -n project_car python=3.9
Where “project_car” is the name of our environment (You can change it).
Next, we have to activate the project environment:
conda activate project_car
3.1. Extract Data Code
Once we have set up our Anaconda environment, we will use the pytrends library, a Python package that provides a simple, unofficial interface for querying Google Trends.
So, we need to write the below command:
pip install pytrends
First, we have to initialize all arguments (These were shown in Section 2.1).
from pytrends.request import TrendReq

if __name__ == '__main__':
    """
    Set up a custom User-Agent header for making HTTP requests with
    the requests library in Python.
    The User-Agent header is part of the HTTP protocol and is used to
    identify the client making the request. It typically includes
    information about the client's operating system, browser, and version.
    """
    requests_args = {
        'headers': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
        }
    }
    pytrends = TrendReq(requests_args=requests_args)

    """
    ARGUMENTS THAT WE WANT TO SEARCH AND FILTER
    """
    search_terms = ["Kia", "Mitsubishi", "Peugeot", "Fuso", "Chery", "MG", "GAC Motor"]
    countries_filt = ["AR", "BO", "CL", "CO", "PE"]
    start_date = '2021-01-01'
    end_date = '2022-12-31'
Where AR is Argentina, BO is Bolivia, CL is Chile, CO is Colombia and PE is Peru.
Next, we have to create a main function that searches and filters using the previous arguments…
from pytrends.request import TrendReq
import time

def main(search_terms, countries_filt, start_date, end_date):
    """
    INPUT: ALL VARIABLES THAT WE WANT TO SEARCH AND FILTER
    TEMPORARY OUTPUT: MULTIPLE DATAFRAMES IN "df_list" (These were mentioned in Section 2.2)
    """
    df_list = []
    for search_term in search_terms:
        for country in countries_filt:
            # cat=12 because we want searches related to the automobile
            # industry (the "Autos & Vehicles" Google Trends category).
            pytrends.build_payload(kw_list=[search_term], cat=12,
                                   timeframe=f'{start_date} {end_date}',
                                   geo=country, gprop='')
            interest_over_time_df = pytrends.interest_over_time()
            interest_over_time_df = interest_over_time_df.reset_index()
            interest_over_time_df["Country"] = country
            interest_over_time_df["Brand"] = search_term
            df_list.append(interest_over_time_df)

            # Pause for 3 seconds between requests to avoid
            # overwhelming Google Trends.
            time.sleep(3)
    return df_list

if __name__ == '__main__':
    requests_args = {
        'headers': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
        }
    }
    pytrends = TrendReq(requests_args=requests_args)

    search_terms = ["Kia", "Mitsubishi", "Peugeot", "Fuso", "Chery", "MG", "GAC Motor"]
    countries_filt = ["AR", "BO", "CL", "CO", "PE"]
    start_date = '2021-01-01'
    end_date = '2022-12-31'

    main(search_terms, countries_filt, start_date, end_date)
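One practical caveat: Google Trends often throttles repeated requests (HTTP 429 errors), so in a real run the request inside the loop may fail intermittently even with the pause. Below is a small, generic retry helper I would wrap around the pytrends call; the helper and its names are my own sketch, not part of pytrends:

```python
import time

def with_retries(func, attempts=3, base_delay=1.0):
    """Call func(); on failure, wait with exponential backoff and retry.

    This is a generic helper (not part of pytrends): it retries up to
    `attempts` times, sleeping base_delay, 2*base_delay, 4*base_delay, ...
    between tries, and re-raises the last error if all attempts fail.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Inside the loop from the code above, the call would become something like:
# with_retries(lambda: pytrends.build_payload(
#     kw_list=[search_term], cat=12,
#     timeframe=f'{start_date} {end_date}', geo=country, gprop=''))
```

Exponential backoff is a common pattern for rate-limited web sources; three attempts with a short base delay is usually enough for a scrape of this size.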
3.2. Transform Data
In this section we are going to use the pandas library to concatenate all the dataframes in df_list and apply some feature engineering. So, let's move on… :D
from pytrends.request import TrendReq
import time
import pandas as pd

def main(search_terms, countries_filt, start_date, end_date):
    """
    INPUT: ALL VARIABLES THAT WE WANT TO SEARCH AND FILTER
    TEMPORARY OUTPUT: A SINGLE FINAL DATAFRAME
    WITH THE COLUMNS "COUNTRY, BRAND, #NUMBER (NUMBER OF SEARCHES)"
    """
    df_list = []
    for search_term in search_terms:
        for country in countries_filt:
            # cat=12 for the "Autos & Vehicles" Google Trends category.
            pytrends.build_payload(kw_list=[search_term], cat=12,
                                   timeframe=f'{start_date} {end_date}',
                                   geo=country, gprop='')
            interest_over_time_df = pytrends.interest_over_time()
            interest_over_time_df = interest_over_time_df.reset_index()
            interest_over_time_df["Country"] = country
            interest_over_time_df["Brand"] = search_term
            df_list.append(interest_over_time_df)
            time.sleep(3)

    # Concatenate all dataframes in df_list
    final_df = pd.concat(df_list, axis=0)

    """
    Feature engineering
    """
    # Drop the date and isPartial columns, plus the rows labeled 0
    final_df.drop(columns=['date', 'isPartial'], index=0, inplace=True)

    # Group by country and brand, summing each keyword column
    report = final_df.groupby(['Country', 'Brand']).agg({'Kia': 'sum',
                'Mitsubishi': 'sum', 'Peugeot': 'sum',
                'Fuso': 'sum', 'Chery': 'sum', 'MG': 'sum', 'GAC Motor': 'sum'})

    # Turn the group keys back into ordinary columns
    report.reset_index(inplace=True)

    # Add a total column summing across all keyword columns
    selected_columns = ['Kia', 'Mitsubishi', 'Peugeot', 'Fuso', 'Chery',
                        'MG', 'GAC Motor']
    row_sum = report.loc[:, selected_columns].sum(axis=1)
    report['#Number'] = row_sum.values

    # Drop the per-keyword columns that we won't use anymore
    report.drop(columns=selected_columns, inplace=True)

    # Replace country codes with country names
    report['Country'] = report['Country'].replace({'AR': 'Argentina', 'BO': 'Bolivia',
                'CL': 'Chile', 'CO': 'Colombia', 'PE': 'Peru'})
    return report

if __name__ == '__main__':
    requests_args = {
        'headers': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
        }
    }
    pytrends = TrendReq(requests_args=requests_args)
    search_terms = ["Kia", "Mitsubishi", "Peugeot", "Fuso", "Chery", "MG", "GAC Motor"]
    countries_filt = ["AR", "BO", "CL", "CO", "PE"]
    start_date = '2021-01-01'
    end_date = '2022-12-31'
    main(search_terms, countries_filt, start_date, end_date)
3.3. Load Data
This section could be more technical, as we need to install pgAdmin4. We can search YouTube for "how to install pgAdmin4 on [our operating system: Windows, Linux, or macOS]" and we will find many tutorials.
Once we've installed pgAdmin4 on our system, we have to create a database for the project:
Next, we have to type the database's name and save 💾.
Excellent! 😍 We've set up our database in pgAdmin4. But for the moment, our database is empty! 😮
So, it's time to load our transformed dataframe into pgAdmin4.
from pytrends.request import TrendReq
from sqlalchemy import create_engine
import time
import pandas as pd

def main(search_terms, countries_filt, start_date, end_date, engine):
    """
    INPUT: ALL VARIABLES THAT WE WANT TO SEARCH AND FILTER
    OUTPUT: THE FINAL DATAFRAME LOADED INTO THE DATABASE
    WITH THE COLUMNS "COUNTRY, BRAND, #NUMBER (NUMBER OF SEARCHES)"
    """
    df_list = []
    for search_term in search_terms:
        for country in countries_filt:
            # cat=12 for the "Autos & Vehicles" Google Trends category.
            pytrends.build_payload(kw_list=[search_term], cat=12,
                                   timeframe=f'{start_date} {end_date}',
                                   geo=country, gprop='')
            interest_over_time_df = pytrends.interest_over_time()
            interest_over_time_df = interest_over_time_df.reset_index()
            interest_over_time_df["Country"] = country
            interest_over_time_df["Brand"] = search_term
            df_list.append(interest_over_time_df)
            time.sleep(3)

    final_df = pd.concat(df_list, axis=0)
    final_df.drop(columns=['date', 'isPartial'], index=0, inplace=True)
    report = final_df.groupby(['Country', 'Brand']).agg({'Kia': 'sum',
                'Mitsubishi': 'sum', 'Peugeot': 'sum',
                'Fuso': 'sum', 'Chery': 'sum', 'MG': 'sum', 'GAC Motor': 'sum'})
    report.reset_index(inplace=True)
    selected_columns = ['Kia', 'Mitsubishi', 'Peugeot', 'Fuso', 'Chery',
                        'MG', 'GAC Motor']
    row_sum = report.loc[:, selected_columns].sum(axis=1)
    report['#Number'] = row_sum.values
    report.drop(columns=selected_columns, inplace=True)
    report['Country'] = report['Country'].replace({'AR': 'Argentina', 'BO': 'Bolivia',
                'CL': 'Chile', 'CO': 'Colombia', 'PE': 'Peru'})

    # Loading to PostgreSQL (visible in pgAdmin4)
    report.to_sql(name="2021_2022_vehicles", con=engine, if_exists="replace")

if __name__ == '__main__':
    """
    conn_string is used to establish a connection to the PostgreSQL database.
    Remember to edit the username, password, and database_name to match your setup.
    """
    username = "postgres"
    password = "snow"
    database_name = "car_project"
    conn_string = f"postgresql://{username}:{password}@localhost:5432/{database_name}"
    # Build a SQLAlchemy engine from the connection string for pandas' to_sql
    engine = create_engine(conn_string)

    requests_args = {
        'headers': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
        }
    }
    pytrends = TrendReq(requests_args=requests_args)
    search_terms = ["Kia", "Mitsubishi", "Peugeot", "Fuso", "Chery", "MG", "GAC Motor"]
    countries_filt = ["AR", "BO", "CL", "CO", "PE"]
    start_date = '2021-01-01'
    end_date = '2022-12-31'
    main(search_terms, countries_filt, start_date, end_date, engine)
NOW IT’S TIME TO RUN THE CODE…!!! 🏃🏃🏃
Congratulations!!! 🎊 We loaded the final dataframe into our database in pgAdmin4. We can view it under Tables.
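If we want to sanity-check the to_sql/read-back round trip without touching the real PostgreSQL instance, SQLite (bundled with Python) works as a stand-in, since pandas supports sqlite3 connections directly. This is only a local test sketch with made-up rows, not part of the pipeline:

```python
import sqlite3
import pandas as pd

# A stand-in for the real report dataframe, with made-up numbers.
report = pd.DataFrame({"Country": ["Peru", "Chile"],
                       "Brand": ["Kia", "MG"],
                       "#Number": [300, 450]})

# An in-memory SQLite database plays the role of PostgreSQL here.
conn = sqlite3.connect(":memory:")
report.to_sql("2021_2022_vehicles", con=conn, if_exists="replace", index=False)

# Read it back to confirm the load worked.
back = pd.read_sql('SELECT * FROM "2021_2022_vehicles"', conn)
print(len(back))  # 2
```

The table name starts with a digit, which is why it must be double-quoted in the SQL query; the same applies when querying it in PostgreSQL.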
4. Reporting Data
In this section, we will download, connect, and analyze the data in Power BI. To keep things brief, we will download the file brand_cars_dashboard.pbix from the project's GitHub repository.
4.1. Download File
Once we’re in the project’s GitHub repository, we need to click on “brand_cars_dashboard.pbix”.
Now, we have to click on the "Download" icon.
4.2. Connection pgAdmin4 and PowerBI
Once we've downloaded the file, we have to connect our pgAdmin4 database to the Power BI file:
Step 1: Open the Power BI file. In the "Home" view → click "Get Data" → click "More…"
Step 2: Type "postgre", select "PostgreSQL database", and click "Connect".
Step 3: Fill in Server and Database with "localhost" and "car_project" (our database's name). Then click "OK".
Step 4: Check the checkbox next to our table and click "Load".
Once we've clicked "Load", Power BI will connect to our PostgreSQL database. Finally, it will show us the data.
4.3. Data Analysis
It's time to think… How can we get insights from our data?
We could start with basic questions:
Q1: How many searches in total are there?
Q2: How many searches (%) by country are there?
Q3: How many searches by brand are there?
Q4: How many searches by brand and country are there?
Q5: Summarize Data
Q1: How many searches in total are there?
We have to use the Card visualization… because it displays a single value, which is ideal for showing our KPI.
We can see there are 85K searches in total by users.
Q2: How many searches (%) by country are there?
We have to use the Pie chart visualization… because we want to show percentage (%) values.
We can see the country with the most searches is Chile (35.16%).
Q3: How many searches by brand are there?
We have to use the Stacked bar chart visualization… because we want to split our data into segments.
We can see the most searched brand is MG (more than 20K searches) among users.
Q4: How many searches by brand and country are there?
We have to use the Clustered column chart visualization… because it makes for easy comparison between multiple categories and their respective values.
We can see that in Argentina and Chile there are many searches for MG and Peugeot.
Q5: Summarize Data
We have to use the Matrix visualization… because we want to summarize the data in a structured, spreadsheet-like format.
We can see a summarized table 😆.
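Before (or instead of) building visuals, the same questions can be sanity-checked in pandas. With a report table shaped like the one from Section 3 (the numbers below are made up for illustration), Q1 and Q2 reduce to a sum and a normalized groupby:

```python
import pandas as pd

# Illustrative rows; the real figures come from the loaded database table.
report = pd.DataFrame({"Country": ["Chile", "Peru", "Chile"],
                       "Brand": ["MG", "Kia", "Peugeot"],
                       "#Number": [500, 300, 200]})

total = report["#Number"].sum()                      # Q1: total searches
by_country = report.groupby("Country")["#Number"].sum()
share = (by_country / total * 100).round(2)          # Q2: % by country
print(total, share["Chile"])  # 1000 70.0
```

Cross-checking the dashboard's cards and pie slices against a quick script like this is a cheap way to catch connection or aggregation mistakes.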
If you want to know more about data analysis, you can visit my profile and read my previous stories: [TUTORIAL] A BASIC DATA ANALYSIS WITH POSITIVES COVID 19 PERU USING PYTHON | by Alexander Roman | MLearning.ai | Medium and Exploratory Data Analysis (EDA) on MyAnimeList data | by Alexander Roman | MLearning.ai | Medium
5. CONCLUSION AND RECOMMENDATION
Power BI is a powerful visualization tool, and it connects easily with PostgreSQL databases. It has an easy-to-use interface and strong capabilities for turning data into useful information. Combined with PostgreSQL, it helps businesses build dynamic visuals, improve decision-making, and succeed with data-driven strategies.
During the analysis, we aimed to determine the search trends for car brands between 2021 and 2022 in selected South American countries. Based on the outcomes, we propose the following insights:
- Utilize paid advertisements on platforms like Google Ads, Facebook Ads, or Instagram Ads to reach specific audiences interested in the MG and Peugeot car brands in Argentina and Chile, as these brands have a significant number of searches in these countries (Figure 20).
- Implement content marketing for the MG brand by sharing relevant content to engage the entire audience interested in the brand, considering it’s the most searched brand overall (Figure 19).
- Invest efficiently in digital marketing campaigns such as advertising, social media, etc., for the MG, Peugeot, KIA, Mitsubishi, and Chery brands across different South American countries due to the high user searches, exceeding 10,000 searches, compared to other brands (Figure 21).
- Utilize influencer marketing in Chile, where searches overall account for 35.16% of the total (Figure 18). By collaborating with influencers passionate about automobiles, we can harness this trend to authentically and effectively present our brands, directly reaching a highly interested and engaged audience in the automotive world.
As a direction for future work, we suggest employing alternative data extraction methods to enhance the quality and quantity of the data. Additionally, using different visualization techniques such as box plots, scatter plots, treemaps, etc., is recommended to discover improved representations and insights for subsequent marketing campaigns.
Finally, thank you for reading this article. If you would like to contact me, this is my LinkedIn: Alexander Daniel Roman Gabriel | LinkedIn 🙌
REFERENCES
- Google Trends
- PostgreSQL: The world’s most advanced open source database
- pgAdmin — PostgreSQL Tools
- Data Visualization | Microsoft Power BI