Importance of Case Studies 📚

Practicing DatScy
8 min read · Jul 25, 2023

Data Science techniques are always improving. I find that the more one practices ideas and workflows, the more repetitive they become. The ideas and workflows can be summarized using either a class/API or a case study, depending on whether the work product is code or a process. I started summarizing my projects into case studies after finishing the Google Data Analytics Professional Certificate; the specialization includes 8 courses, and the last course entails completing two case studies using semi-large datasets.

I created a short BigQuery ingestion and statistical code library [1] to complete the case studies [2]. Case studies are also a great way to show employers project organization, problem solving, and coding ability; as someone constantly looking for Data Science work opportunities, I find case studies a great summarized format for Upwork or freelance projects.

Below is a short summary of how to make a case study with results; I used an exercise case study as an example. A case study includes the following sections; the Google Data Analytics training suggests answering 6 keywords (Ask, Prepare, Process, Analyze, Share, Act) for each case study:

  1. Project Summary: a description of the work objective/problem, a phrase explaining how the problem was solved using the simplest solution (called the baseline solution), a phrase explaining what methodology was used to carry out the baseline solution, the results of the baseline solution, and suggestions for improving the project
  2. Business Objective (Ask): the business objective lists the problems that need to be solved, with a measurable objective of success for each problem. For example, a model needs to be built using key features such that prediction ability is 95% or more. Having a measurable objective is important because employers then know directly which project steps are and are not possible with respect to planning; a failed project step is not a failure, it just means that expectations were incorrect. The first problem should correspond with the last problem (Act), because this shows that the main global problem was indeed solved by the specific problems.
  3. Project Background: a description of the current project situation, giving motivation for the work objective
  4. Baseline Solution: an explanation of how the problem was solved using the simplest solution
  5. Project Deliverables (Prepare, Process, Analyze, Act): Prepare consists of describing data pre-processing steps, Process consists of describing the methodology for carrying out the analysis, Analyze describes a summary of each analysis result, and Act describes a result summary of the final business objective, which is the work objective.
  6. Future Recommendations

Case Study 2 from the Google Data Analytics Professional Certificate: Exercise

Project Summary

The objective of this work was to gain insight into how consumers are using their smart devices. The Baseline Solution is to statistically understand how each lifestyle group behaves using five numerical features (mean steps, mean total distance, mean calories, mean heart rate, sleep duration). A two-step method is used to identify two categorical features: lifestyle and type of exercise. The methodology used to implement the baseline solution is to use basic statistics to find user trends in smart watch data; kmeans was used to make distinct labels for exercise lifestyle and type of behavior performed.

Business Objective

Bellabeat needs to gain more insight into how customers use their smart watches. Measurable success for this task is identifying trends and user behavior that could be used for future marketing.

In order to understand user trends, a lifestyle label was created from existing data via kmeans, using mean total distance. The reliability of the lifestyle label derived from the collected mean-distance data was compared with the kmeans label. The most reliable label, the one that identified lifestyle groups independently, was used for group comparative statistical analysis. Measurable success for this task is to obtain a list ranking weekday and weekend lifestyle groups with respect to the numerical features, and the statistical significance per group for each feature.

The type of behavior and/or exercise that users performed was investigated via kmeans using four statistically relevant numerical features (mean steps, mean total distance, mean calories, and mean heart rate). Measurable success for this task is to obtain a list ranking type of behavior groups with respect to the numerical features.

Act: Marketing can use the weekday and weekend lifestyle and type-of-behavior categorical features to market exercise-interest-related material; for example, sedentary users could be encouraged to exercise more.

Project Background

Bellabeat is a company for women's health products; there is a need to understand what activities the users are performing in order to create future products and/or marketing strategies.

Baseline Solution

The Baseline Solution is to: 0) use mean total distance to make a lifestyle label with kmeans, 1) identify which numerical features are important with respect to the kmeans lifestyle label, and 2) use the statistically significant features to predict type of exercise.

Project Deliverables: Analyze

Join the two main tables in GCP BigQuery: the activity and sleep tables.

bq query \
--location=$location \
--destination_table $PROJECT_ID:$dataset_name.$OUTPUT_TABLE_name \
--allow_large_results \
--use_legacy_sql=false \
'SELECT
T0.Id,
T0.ActivityDate,
T0.mean_steps,
T0.mean_total_distance,
T0.mean_active_distance,
T0.mean_moderateactive_distance,
T0.mean_lightactive_distance,
T0.mean_sedentary_distance,
T0.mean_fairlyactive_distance,
T0.mean_light_distance,
T0.mean_calories,
T0.mean_hr,
T15.TotalTimeInBed AS sleep_duration
FROM `'$PROJECT_ID'.'$dataset_name'.exercise_full_clean0` AS T0
FULL JOIN `'$PROJECT_ID'.'$dataset_name'.sleepDay_merged` AS T15 ON T0.Id = T15.Id;'
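
These snippets assume a handful of shell variables are already exported. A minimal sketch with placeholder values (the project, bucket, and table names below are hypothetical; substitute your own):

# Placeholder values -- substitute your own project, dataset, and table names
export location="US"                                      # BigQuery dataset location
export PROJECT_ID="my-gcp-project"                        # hypothetical GCP project ID
export dataset_name="google_analytics_exercise"
export OUTPUT_TABLE_name="exercise_joined"                # hypothetical output table
export TRAIN_TABLE_name="kmeans_train_table"              # hypothetical training table
export PREDICTED_results_TABLE_name="kmeans_predictions"  # hypothetical results table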

Perform kmeans to obtain the lifestyle label, with the GCP_bigquery_case_study_library script [1]!

kmeans(){

# Inputs:
# $1 = location
# $2 = PROJECT_ID
# $3 = dataset_name
# $4 = TRAIN_TABLE_name
# $5 = TESTING_TABLE_name
# $6 = MODEL_name
# $7 = PREDICTED_results_TABLE_name
# $8 = NUM_CLUSTERS

# BigQuery needs numerical features and labels

# Ensure that the OUTPUT model does not already exist, to prevent saving errors
bq rm -f --model $2:$3.$6


bq query \
--location=$1 \
--allow_large_results \
--use_legacy_sql=false \
'CREATE MODEL '$3'.'$6'
OPTIONS(model_type="KMEANS", NUM_CLUSTERS='$8',
KMEANS_INIT_METHOD="KMEANS++", MAX_ITERATIONS=50,
early_stop=TRUE, MIN_REL_PROGRESS=0.001,
WARM_START=FALSE, DISTANCE_TYPE="COSINE") AS
SELECT *
FROM `'$2'.'$3'.'$4'`'

# Evaluate
evaluate_model $1 $2 $3 $4 $6

# Predict new labels
predict_with_model $1 $2 $3 $5 $6 $7

}
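
The kmeans() function calls two helpers, evaluate_model and predict_with_model, that live in the library [1]. For readers following along, here is a minimal sketch of what equivalent helpers could look like, built on BigQuery ML's ML.EVALUATE and ML.PREDICT (the actual library code may differ); the argument order matches the calls inside kmeans() above:

evaluate_model(){
# Hypothetical sketch; see [1] for the real implementation
# $1 = location, $2 = PROJECT_ID, $3 = dataset_name, $4 = TABLE_name, $5 = MODEL_name
bq query \
--location=$1 \
--use_legacy_sql=false \
'SELECT * FROM ML.EVALUATE(MODEL `'$2'.'$3'.'$5'`,
(SELECT * FROM `'$2'.'$3'.'$4'`))'
}

predict_with_model(){
# Hypothetical sketch; see [1] for the real implementation
# $1 = location, $2 = PROJECT_ID, $3 = dataset_name, $4 = TABLE_name,
# $5 = MODEL_name, $6 = PREDICTED_results_TABLE_name

# Remove any previous prediction table to prevent saving errors
bq rm -f -t $2:$3.$6

bq query \
--location=$1 \
--destination_table $2:$3.$6 \
--allow_large_results \
--use_legacy_sql=false \
'SELECT * FROM ML.PREDICT(MODEL `'$2'.'$3'.'$5'`,
(SELECT * FROM `'$2'.'$3'.'$4'`))'
}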
bq query \
--location=$location \
--destination_table $PROJECT_ID:$dataset_name.$TRAIN_TABLE_name \
--allow_large_results \
--use_legacy_sql=false \
'SELECT mean_total_distance AS feature,
CASE WHEN lifestyle = "Sedentary" THEN 1 WHEN
lifestyle = "Moderate_Active" THEN 2 WHEN
lifestyle = "Light_Active" THEN 3 WHEN
lifestyle = "Active" THEN 4 WHEN
lifestyle = "Fairly_Active" THEN 4 WHEN
lifestyle = "Light" THEN 4 END AS label
FROM `'$PROJECT_ID'.'$dataset_name'.exercise_full_clean3`'

export MODEL_name=$(echo "kmeans_model_label")

export NUM_CLUSTERS=$(echo "4") # Sedentary, Moderate, Light, Active
kmeans $location $PROJECT_ID $dataset_name $TRAIN_TABLE_name $TRAIN_TABLE_name $MODEL_name $PREDICTED_results_TABLE_name $NUM_CLUSTERS

Figure out which lifestyle group corresponds to each kmeans label. Use a JOIN to create a NEW TABLE! Then select the desired columns.

bq query \
--location=$location \
--allow_large_results \
--use_legacy_sql=false \
'WITH tabtemp AS(
SELECT lifestyle, lifesyle_NUM, kmeans_label, COUNT(*) AS count
FROM `'$PROJECT_ID'.'$dataset_name'.exercise_full_clean5`
GROUP BY
lifesyle_NUM, lifestyle, kmeans_label
ORDER BY
lifesyle_NUM
)
SELECT T0.lifestyle,
T0.lifesyle_NUM,
T0.kmeans_label,
T1.max_count FROM tabtemp T0
JOIN (SELECT lifesyle_NUM, MAX(count) AS max_count FROM tabtemp GROUP BY lifesyle_NUM) AS T1 ON T0.lifesyle_NUM = T1.lifesyle_NUM AND T0.count = T1.max_count'

# Output result in terminal:
# +-----------------+--------------+--------------+-----------+
# |    lifestyle    | lifesyle_NUM | kmeans_label | max_count |
# +-----------------+--------------+--------------+-----------+
# | Sedentary       |            1 |            3 |    724950 |
# | Light_Active    |            3 |            1 |     51435 |
# | Moderate_Active |            2 |            3 |     52650 |
# | Active          |            4 |            2 |      8100 |
# +-----------------+--------------+--------------+-----------+
# We assign kmeans_label as:
# 1 = Light_Active, 2 = Active, 3 = Sedentary, 4 = Moderate_Active
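
With that mapping in hand, the kmeans label can be translated back into a readable lifestyle column. A minimal sketch (the output table name is hypothetical; the CASE mapping follows the assignment above):

bq query \
--location=$location \
--destination_table $PROJECT_ID:$dataset_name.exercise_full_labeled \
--allow_large_results \
--use_legacy_sql=false \
'SELECT *,
CASE WHEN kmeans_label = 1 THEN "Light_Active"
WHEN kmeans_label = 2 THEN "Active"
WHEN kmeans_label = 3 THEN "Sedentary"
WHEN kmeans_label = 4 THEN "Moderate_Active" END AS kmeans_lifestyle
FROM `'$PROJECT_ID'.'$dataset_name'.exercise_full_clean5`'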

Statistical Analysis: determine the average (numerical) feature value per category over a span of time (weekend, weekday)

Perform a one-sample z-statistic with the GCP_bigquery_statistic_library [1] script function ONE_SAMPLE_TESTS_zstatistic_per_row!

ONE_SAMPLE_TESTS_zstatistic_per_row(){

# [0] z_statistic_ONE_SAMPLE : Comparing the sample population mean
# with the population mean (large sample size populations)

# Ideal for the case of a column of probabilities for an event occurrence,
# where one would like to know which probabilistic events are statistically
# significant with respect to the other events.
# Note that the sample length is set to one, thus the z-statistic is
# computed per row.

# Inputs:
# $1 = location
# $2 = prob_perc
# $3 = PROJECT_ID
# $4 = dataset_name
# $5 = TABLE_name_probcount


bq query \
--location=$1 \
--allow_large_results \
--use_legacy_sql=false \
'CREATE TEMP FUNCTION z_statistic_ONE_SAMPLE(samp1_mean FLOAT64, pop_mean FLOAT64, pop_std FLOAT64, samp1_len INT64)
AS (
(samp1_mean - pop_mean)/(pop_std/SQRT(samp1_len))
);


WITH shorttab2 AS
(
SELECT *,
(row_num*0)+(SELECT AVG(CAST('$2' AS FLOAT64)) FROM `'$3'.'$4'.'$5'`) AS pop_mean,
(row_num*0)+(SELECT STDDEV(CAST('$2' AS FLOAT64)) FROM `'$3'.'$4'.'$5'`) AS pop_std
FROM `'$3'.'$4'.'$5'`
)
SELECT *,
z_statistic_ONE_SAMPLE('$2', pop_mean, pop_std, 1) AS z_critical_onesample
FROM shorttab2'


# Clean-up and delete old table
bq rm -t $PROJECT_ID:$dataset_name.$TABLE_name_probcount
}
export TABLE_name=$(echo "exercise_full_clean6")
export TABLE_name_probcount=$(echo "TABLE_name_probcount_NUMERICAL")
declare -a NUM_FEATS=('mean_steps' 'mean_total_distance' 'mean_calories' 'mean_hr' 'sleep_duration');

for samp1_FEAT_name in "${NUM_FEATS[@]}"
do
echo "Numerical feature:"
echo $samp1_FEAT_name

# Ensure that TABLE_name_probcount does not already exist in the dataset
bq rm -t $PROJECT_ID:$dataset_name.$TABLE_name_probcount

# Calculate the percentage/probability of occurrence of a numerical feature (e.g., workout_minutes) for a bin_number [i.e., days (weekday=5, weekend=2)] across all samples
bq query \
--location=$location \
--destination_table $PROJECT_ID:$dataset_name.$TABLE_name_probcount \
--allow_large_results \
--use_legacy_sql=false \
'WITH tab2 AS
(
SELECT *,
(SELECT AVG('$samp1_FEAT_name')/AVG(bin_number) FROM `'$PROJECT_ID'.'$dataset_name'.'$TABLE_name'` WHERE wday ="weekend") AS pop_weekend
FROM `'$PROJECT_ID'.'$dataset_name'.'$TABLE_name'`
)
SELECT ROW_NUMBER() OVER(ORDER BY '$category_FEAT_name_1samptest') AS row_num, '$category_FEAT_name_1samptest', wday, (AVG('$samp1_FEAT_name')/AVG(bin_number))/AVG(pop_weekend) AS prob_perc
FROM tab2
GROUP BY wday, '$category_FEAT_name_1samptest'
ORDER BY wday, '$category_FEAT_name_1samptest';'



export prob_perc=$(echo "prob_perc") # name of numerical column to find z-statistic values per row
ONE_SAMPLE_TESTS_zstatistic_per_row $location $prob_perc $PROJECT_ID $dataset_name $TABLE_name_probcount

done
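
Each row's z-statistic is z = (x̄ − μ)/(σ/√n), with n = 1 here, so for a two-tailed test at the 95% level a row is significant when |z| > 1.96. If the function's output is saved to a table (e.g., by adding a --destination_table flag inside the function; the table name zstat_results below is hypothetical), significant rows can be flagged directly:

bq query \
--location=$location \
--use_legacy_sql=false \
'SELECT *,
ABS(z_critical_onesample) > 1.96 AS significant_95pct
FROM `'$PROJECT_ID'.'$dataset_name'.zstat_results`'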

Finally, kmeans is used again to find a type-of-exercise label.

export MODEL_name=$(echo "kmeans_model_predict_exercise_type")

export NUM_CLUSTERS=$(echo "3")

kmeans $location $PROJECT_ID $dataset_name $TRAIN_TABLE_name $TRAIN_TABLE_name $MODEL_name $PREDICTED_results_TABLE_name $NUM_CLUSTERS

bq query \
--location=$location \
--destination_table $PROJECT_ID:$dataset_name.$OUTPUT_TABLE_name \
--allow_large_results \
--use_legacy_sql=false \
'SELECT CENTROID_ID,
AVG(mean_steps_norm) AS steps,
AVG(mean_total_distance_norm) AS distance,
AVG(mean_calories_norm) AS calories,
AVG(mean_hr_norm) AS hr
FROM `northern-eon-377721.google_analytics_exercise.kmeans_predict_exercise_type`
GROUP BY CENTROID_ID
ORDER BY steps, distance, calories, hr, CENTROID_ID DESC'

# Query to match the clusters to the features, so that I can know which cluster is which:

# Run/Walk (high steps, high distance, high calories, low hr) = 2,
# Sedentary/low aerobic (low steps, low distance, low calories, high hr) = 3 and 1

# +-------------+---------------------+----------------------+---------------------+---------------------+
# | CENTROID_ID |        steps        |       distance       |      calories       |         hr          |
# +-------------+---------------------+----------------------+---------------------+---------------------+
# |           3 |  -0.429742850050325 | -0.41945225503768796 | -0.2758035817795051 | 0.08928496956032417 |
# |           1 | -0.4196294211903514 | -0.41027305043508516 | -0.2582096080946233 | 0.04740478504750922 |
# |           2 |   1.925752435154324 |   1.8809878065450987 |  1.2142717700912848 |  -0.325985199489712 |
# +-------------+---------------------+----------------------+---------------------+---------------------+
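
As with the lifestyle label, the centroid IDs can then be mapped to readable exercise-type labels. A minimal sketch following the cluster interpretation above (the label strings are hypothetical names for the two groups):

bq query \
--location=$location \
--use_legacy_sql=false \
'SELECT *,
CASE WHEN CENTROID_ID = 2 THEN "Run_Walk"
ELSE "Sedentary_low_aerobic" END AS exercise_type
FROM `northern-eon-377721.google_analytics_exercise.kmeans_predict_exercise_type`'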

bq with bash is a great way to quickly perform dataset ingestion and database queries. From a modeling and coding perspective, preparing case studies may seem time consuming and boring, but it is important to know how to convey results in a clear and concise manner.
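
For example, ingesting a CSV from Cloud Storage into a BigQuery table is a one-liner (the bucket and table names below are placeholders):

# Load a CSV from Cloud Storage into a BigQuery table, inferring the schema
bq load --autodetect --source_format=CSV \
$dataset_name.dailyActivity_merged \
gs://my-bucket/dailyActivity_merged.csv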

Happy Practicing! 👋




References

  1. GCP BigQuery ingestion and statistical code library: https://github.com/j622amilah/GCP_ingestion_analysis_tools
  2. Case study library: https://github.com/j622amilah/Case_Studies
