Data Science Project — Predictive Modeling on Biological Data

Shreya U. Patil
Published in EclecticAI · Feb 15, 2024


Part III: A step-by-step guide to designing an ML modeling pipeline with scikit-learn.


Earlier we saw how to collect the data and how to perform exploratory data analysis; you can refer to Part I and Part II of this series.

Now comes the exciting part: let's use those fancy algorithms to make predictions from our data.

Although there are many resources that go into detail about how to tackle this task, the sheer volume can be overwhelming.

Here I will explain my thought process during this project: how and why I took each step. That way you get the reasoning behind every step, which you can reuse when working with other datasets as well.

Let's start by importing the basic libraries and loading the dataset.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Load the dataset saved at the end of the EDA step (Part II)
df = pd.read_csv('after_eda_data.csv')
df.info()

Later in this article we will be using sklearn.pipeline.Pipeline. The Pipeline class lets you chain data transformers together with a model.
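As a quick illustration of the idea, here is a toy sketch (my own, not the pipeline we build below): calling fit() or predict() on a Pipeline runs each step in order.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy sketch: fit() scales the data first, then trains the model on the scaled output
toy_pipeline = Pipeline([('scale', StandardScaler()),
                         ('model', LogisticRegression())])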

To use the pipeline, we first need to separate the numerical and categorical features, since each group gets different preprocessing steps.

Get the list of numerical features:

# Numerical columns: everything that is not object-typed, minus the target
numerical_vars = df.select_dtypes(exclude="object")
numerical_vars = numerical_vars.columns.tolist()
numerical_vars.remove('Resolution_(Å)')
numerical_vars

Get the list of categorical features:

# Categorical columns: the object-typed columns
categorical_vars = df.select_dtypes(include="object")
categorical_vars = categorical_vars.columns.tolist()
categorical_vars

Splitting the dataset into training and test sets:

The standard way to split the data is an 80:20 ratio. For that, we put all the features except our target variable “Resolution_(Å)” into X, then use the train_test_split() function to split the dataset into X_train, X_test, y_train, and y_test.

from sklearn.model_selection import train_test_split

# All columns except the target variable
features = [x for x in df.columns if x != 'Resolution_(Å)']

X = df[features]
y = df['Resolution_(Å)']

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=124)
print(f'Training examples: {X_train.shape[0]:,}')
print(f'Test examples: {X_test.shape[0]:,}')

Output: the number of training and test examples.

Creating preprocessing pipelines for the numerical and categorical columns

As I said earlier, we will be using Pipeline() from scikit-learn, so we will import the classes needed to preprocess our numerical and categorical features. In the pipeline for numerical variables I am using StandardScaler(), and for categorical variables OneHotEncoder().

Why am I using StandardScaler()?

Many ML optimization routines assume that all features are on a comparable scale and centered around 0. In this example we have multiple numerical features with very different variances, so there is a high chance that a feature with a large variance will dominate the learning and drown out the other features.

StandardScaler() brings all features onto the same scale by shifting each distribution to have a mean of 0 and a standard deviation of 1. This ensures every feature contributes comparably to the model's learning.
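As a quick illustration (toy numbers, my own sketch), scaling four values shifts them to mean 0 and standard deviation 1:

from sklearn.preprocessing import StandardScaler
import numpy as np

raw = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(raw)
print(scaled.ravel())               # [-1.342 -0.447  0.447  1.342]
print(scaled.mean(), scaled.std())  # ~0.0 and 1.0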

OneHotEncoder() converts categorical values into numerical dummy (indicator) columns.
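Again a toy sketch with made-up categories: with drop='first' the first category becomes the all-zeros baseline, so a three-level feature becomes two dummy columns.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

cats = np.array([['A'], ['B'], ['C'], ['B']])
# On scikit-learn < 1.2, use sparse=False instead of sparse_output=False
enc = OneHotEncoder(handle_unknown='ignore', drop='first', sparse_output=False)
print(enc.fit_transform(cats))
# [[0. 0.]  A -> baseline
#  [1. 0.]  B
#  [0. 1.]  C
#  [1. 0.]] B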

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler


# Numerical features: standardize to mean 0 and standard deviation 1
num_pipeline = Pipeline([('scale', StandardScaler())])

# Categorical features: one-hot encode, dropping the first level as the baseline
cat_pipeline = Pipeline([('create_dummies_cats',
                          OneHotEncoder(handle_unknown='ignore', drop='first'))])

1. Logistic Regression

All the preprocessing steps are now in pipelines. Next we will use ColumnTransformer() to apply them to the feature matrix: it applies different preprocessing to different sets of columns and concatenates the results into a single feature space.

I am also adding a LogisticRegression() model to the pipeline, right after the data preprocessing step.

Why am I using LogisticRegression()?

Since our target variable has two classes, we are working on a binary classification problem. There are many algorithms that can be used for this task, ranging from logistic regression to deep learning.

To start, I am using logistic regression as my baseline model. Later we will try another algorithm as well, to see if we can improve the result further.

from sklearn.linear_model import LogisticRegression

# Apply each preprocessing pipeline to its own set of columns
processing_pipeline = ColumnTransformer(transformers=[
    ('proc_numeric', num_pipeline, numerical_vars),
    ('create_dummies', cat_pipeline, categorical_vars)])

# penalty=None disables regularization for this baseline model
# (on scikit-learn < 1.2, use penalty='none' instead)
lg_model = Pipeline([('data_processing', processing_pipeline),
                     ('logreg', LogisticRegression(penalty=None))])

I also wrote a few lines to silence those “not so important” warnings. You can skip this part if you want to.

# ignoring the warnings 
import warnings
warnings.filterwarnings("ignore")

Cross-validation trains and evaluates the model on several different train/validation splits. This way the model always predicts on data it has not seen, which will flag whether our model is overfitting.


from sklearn.model_selection import cross_validate

# 5-fold cross-validation with several scoring metrics
cv_results = cross_validate(lg_model, X_train, y_train,
                            scoring=['accuracy', 'recall', 'precision', 'f1_macro', 'roc_auc'],
                            cv=5)
cv_results
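cross_validate returns a dict with one score per fold for each metric. As an optional extra (my own sketch), pandas makes it easy to average the folds:

# Average each metric across the 5 folds
pd.DataFrame(cv_results).mean()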

The dataset we are using is fairly balanced, so we will use accuracy to optimize our hyperparameters. These cross-validation results are for the model without regularization; it performs fairly well, with a maximum accuracy of 0.84.

To further improve the results, we will use regularization and PCA.

Why am I using regularization?

Regularization helps counter overfitting when the model has many parameters relative to the complexity of the dataset, so we will use L2 regularization in our model.

For the regularization strength, the C parameter in LogisticRegression, we will start with values spaced on a logarithmic scale.
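If you prefer to generate those values rather than type them out, np.logspace does exactly this (a small optional sketch; np is already imported above):

# Four values evenly spaced on a log scale: 10**-2 up to 10**1
c_values = np.logspace(-2, 1, 4)
print(c_values)  # [ 0.01  0.1   1.  10. ]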

from sklearn.model_selection import GridSearchCV

# Candidate regularization strengths, plus an unregularized baseline
# (on scikit-learn < 1.2, use 'none' instead of None)
params = {'lg__C': [0.01, 0.1, 1, 10], 'lg__penalty': ['l2', None]}

lg_l2_model = Pipeline([('pp', processing_pipeline),
                        ('lg', LogisticRegression())])

lg_l2_gscv = GridSearchCV(lg_l2_model, param_grid=params)
lg_l2_gscv = lg_l2_gscv.fit(X_train, y_train)

lg_l2_gscv.best_estimator_
lg_l2_gscv.best_score_
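If you want to see how every parameter combination scored, not just the winner, the fitted search object exposes cv_results_ (an optional sketch):

# Mean cross-validated accuracy for each (C, penalty) combination
results = pd.DataFrame(lg_l2_gscv.cv_results_)
print(results[['param_lg__C', 'param_lg__penalty', 'mean_test_score']])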

C=10 provides the best model on the training and validation data, as it maximizes accuracy. Next we will search around 10 to see if a nearby value gives even higher accuracy.

import time

start = time.process_time()

# Secondary, finer search around C=10
params = {'lg__C': [5, 8, 10, 12, 15], 'lg__penalty': ['l2', None]}

lg_l2_model = Pipeline([('pp', processing_pipeline),
                        ('lg', LogisticRegression())])

lg_l2_gscv = GridSearchCV(lg_l2_model, param_grid=params)
lg_l2_gscv = lg_l2_gscv.fit(X_train, y_train)

lg_l2_time = time.process_time() - start

lg_l2_gscv.best_estimator_
lg_l2_gscv.best_score_

from sklearn.metrics import classification_report

lg_l2_pred_evt = lg_l2_gscv.predict(X_test)

print(classification_report(y_test, lg_l2_pred_evt))

C=10 still shows the best performance after the secondary, targeted hyperparameter search. There is a slight increase in the accuracy score, so I will go ahead with this parameter value.

# Compare against performance on the training data
lg_l2_pred_evt = lg_l2_gscv.predict(X_train)

print(classification_report(y_train, lg_l2_pred_evt))

The logistic regression model's performance is about the same on the test and training data.

Using PCA on numerical columns

PCA reduces the dimensionality of the dataset: it summarizes the original features into the n_components we pass as a parameter, while trying to retain as much of the variance, and therefore the patterns, in the data as possible.
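Before looping over candidate values, one optional diagnostic (my own sketch, not part of the original pipeline) is to check how much variance the first components retain on the scaled numerical training columns:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Cumulative share of variance explained by the first k components
X_num_scaled = StandardScaler().fit_transform(X_train[numerical_vars])
pca_check = PCA().fit(X_num_scaled)
print(np.cumsum(pca_check.explained_variance_ratio_))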

from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

n_comp_list = [5, 6, 7, 8, 9, 10]

for i in n_comp_list:
    # Add a PCA step after scaling in the numerical pipeline
    num_pipeline = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=i))])
    cat_pipeline = Pipeline([('create_dummies_cats',
                              OneHotEncoder(handle_unknown='ignore', drop='first'))])

    processing_pipeline = ColumnTransformer(transformers=[
        ('proc_numeric', num_pipeline, numerical_vars),
        ('create_dummies', cat_pipeline, categorical_vars)])

    modeling_pipeline = Pipeline([('data_processing', processing_pipeline),
                                  ('logreg', LogisticRegression(penalty=None))])

    pca_model = modeling_pipeline.fit(X_train, y_train)
    pca_y_predicted = pca_model.predict(X_test)

    print('Accuracy for %d n_components :' % i, accuracy_score(y_test, pca_y_predicted))

Output: the test accuracy for each n_components value.

I tried different n_components values for the numerical column pipeline, but none shows any improvement: every n_components value gives lower accuracy than the earlier logistic regression model with regularization.

2. Decision Tree

A decision tree builds a predictive model from simple if-else decisions. I wanted to try a slightly more complex model on the data than logistic regression.
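To see what those if-else rules look like in practice, scikit-learn can print a fitted tree as text. Here is a toy sketch on made-up data, separate from our actual model:

from sklearn.tree import DecisionTreeClassifier, export_text

# A tiny one-feature dataset where the class flips above x = 6.5
X_toy = [[1], [2], [3], [10], [11], [12]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_tree = DecisionTreeClassifier(max_depth=2).fit(X_toy, y_toy)
print(export_text(toy_tree, feature_names=['x']))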

We will repeat all the steps we did above. For hyperparameter tuning we will give GridSearchCV() a set of max_depth and min_samples_split values.

from sklearn.tree import DecisionTreeClassifier

num_pipeline = Pipeline([('scale', StandardScaler())])
cat_pipeline = Pipeline([('create_dummies_cats',
                          OneHotEncoder(handle_unknown='ignore', drop='first'))])


processing_pipeline = ColumnTransformer(transformers=[
    ('proc_numeric', num_pipeline, numerical_vars),
    ('create_dummies', cat_pipeline, categorical_vars)])


dt_modeling_pipeline = Pipeline([('data_processing', processing_pipeline),
                                 ('dt', DecisionTreeClassifier())])

# Initial coarse grid over tree depth and minimum samples per split
param_grid = [{'dt__max_depth': [2, 5, 10, 15, 20],
               'dt__min_samples_split': [0.01, 0.05, 0.10]}]


gcv_results = GridSearchCV(estimator=dt_modeling_pipeline, param_grid=param_grid)

gcv_results = gcv_results.fit(X_train, y_train)

gcv_results.best_estimator_
gcv_results.best_score_

dt_pred_evt = gcv_results.predict(X_test)

print(classification_report(y_test, dt_pred_evt))

The decision tree classifier gives its best performance at max_depth=10 (or possibly a value near 10), with a minimum samples split of 1%. Currently the model gives 0.84 accuracy, with 85% precision and 86% recall.

To further improve the results of the classifier, we will do a secondary search for better values near this first choice of hyperparameters.

from sklearn.tree import DecisionTreeClassifier

start = time.process_time()

# Secondary, finer grid around the first search's best values
params = {'dt__max_depth': [8, 9, 10, 11, 12, 13],
          'dt__min_samples_split': [0.005, 0.007, 0.01, 0.02, 0.03]}

dt_model = Pipeline([('data_processing', processing_pipeline),
                     ('dt', DecisionTreeClassifier())])

dt = GridSearchCV(dt_model, param_grid=params)
dt = dt.fit(X_train, y_train)

dt_time = time.process_time() - start

dt.best_estimator_

from sklearn.metrics import classification_report

dt_pred_evt = dt.predict(X_test)

print(classification_report(y_test, dt_pred_evt))

The secondary search also picks max_depth=10, but with min_samples_split=0.005, half the value from the first parameter search (1%). This change increased the accuracy by 1% and the precision by 2%.

So far, the decision tree classifier with max_depth=10 and min_samples_split=0.005 has given the best result.

# Compare against performance on the training data
dt_pred_evt = dt.predict(X_train)

print(classification_report(y_train, dt_pred_evt))

If we compare the model's performance on the test and training data, we don't see much overfitting: performance on the training data is only 1% higher than on the test data, which is good.

Model Evaluation

print(f'Logistic Regression, Training: {lg_l2_gscv.score(X_train, y_train)}')
print(f'Logistic Regression, Test: {lg_l2_gscv.score(X_test, y_test)}')

print(f'Decision Tree, Training: {dt.score(X_train, y_train)}')
print(f'Decision Tree, Test: {dt.score(X_test, y_test)}')

Output: accuracy scores for both models on the training and test data.

The scores above show each model's performance on the test and training data. There is not much of a difference for either model, which suggests that neither model is overfitting the data.
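For a fuller picture than a single accuracy score, we can also look at where each model's errors land. A small optional sketch using ConfusionMatrixDisplay (available in scikit-learn >= 1.0) with the fitted search objects from above:

from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrices on the test set for both tuned models
ConfusionMatrixDisplay.from_estimator(lg_l2_gscv, X_test, y_test)
plt.title('Logistic Regression')
plt.show()

ConfusionMatrixDisplay.from_estimator(dt, X_test, y_test)
plt.title('Decision Tree')
plt.show()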

Comparing the ROC curves

Here I am plotting the ROC curves for the best-performing Logistic Regression and Decision Tree models. This will give us a better idea of how the two models perform relative to each other.

from sklearn.metrics import roc_curve

# Predicted probabilities for the positive class
lg_pred = lg_l2_gscv.predict_proba(X_test)
dt_pred = dt.predict_proba(X_test)

lg_fpr, lg_tpr, lg_thr = roc_curve(y_test, lg_pred[:, 1])
dt_fpr, dt_tpr, dt_thr = roc_curve(y_test, dt_pred[:, 1])


plt.plot(lg_fpr, lg_tpr)
plt.plot(dt_fpr, dt_tpr)
plt.plot([0, 1], [0, 1], linestyle='dashed')  # diagonal = random guessing
plt.title('ROC Curve', loc='left')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(['Logistic Regression', 'Decision Tree', 'Random'], loc='lower right')
plt.show()

The ROC curve for the decision tree is slightly better than for the logistic regression model, and both models show similar trade-offs between TPR and FPR. Overall, both curves sit well above the diagonal, which shows that both models do a fair job at predicting the target.
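To attach a single number to each curve, we can compute the area under it: 0.5 means random guessing and 1.0 is a perfect classifier. A small sketch reusing the probabilities computed above:

from sklearn.metrics import roc_auc_score

print('Logistic Regression AUC:', roc_auc_score(y_test, lg_pred[:, 1]))
print('Decision Tree AUC:', roc_auc_score(y_test, dt_pred[:, 1]))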

print("Time taken by Logistic Regression Model:",lg_l2_time)

print("Time taken by Decision Tree Model:",df_time)

The decision tree model takes longer to fit than the logistic regression model.

Considering all the factors, we can say the decision tree classifier is the best-performing model so far. On a much larger dataset, however, the decision tree would take more time and be more computationally expensive than logistic regression, all for a 1% increase in accuracy.

Thanks for reading!

I hope you enjoyed this article. Make sure to Follow for more!

