Exploratory Data Analysis (EDA) on Biological Data: A Hands-On Guide

Unraveling the Structural Data of Proteins, Part II — Exploratory Data Analysis

Shreya U. Patil
EclecticAI


In a previous post, I covered the background of this protein structure resolution data set, including an explanation of key data terminology and details on how to acquire the data. Check out that post here: Explore unique dataset for your upcoming data science project.

This article provides an intuitive guide to exploratory data analysis (EDA) on a real-world protein structure data set, aimed at beginners looking for hands-on experience with a practical data analysis project. This type of data is used to predict the resolution of a protein structure in drug design workflows.

EDA is one of the most crucial first steps in any data science project. However, many beginners make the mistake of jumping directly into modeling and prediction without taking the time to deeply understand their data.

EDA provides invaluable insights that allow you to design your machine learning pipeline more effectively. By getting to know your data inside and out, you can make better decisions about data cleaning, feature engineering, model selection, and evaluation metrics.

For example, in a binary classification task where only 10% of the examples are positive, a naive model that predicts every example as negative achieves 90% accuracy. By overlooking class imbalance and choosing an inappropriate metric through lack of EDA, you end up with misleading results.
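To make this concrete, here is a minimal sketch with made-up numbers: a 90/10 class split and a model that always predicts the negative class.

# Accuracy trap on a hypothetical 90/10 imbalanced label
import numpy as np

y_true = np.array([0] * 900 + [1] * 100)   # only 10% positives
y_pred = np.zeros_like(y_true)             # naive model: always predict negative

accuracy = (y_pred == y_true).mean()       # 0.90, looks great
recall = y_pred[y_true == 1].mean()        # 0.00, catches no positives at all
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")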

The distribution and relationships within your data dictate the preprocessing and modeling choices required. EDA exposes these patterns through summary statistics, visualizations, and exploratory modeling. There are no shortcuts — you should invest substantial time upfront thoroughly exploring and comprehending your data. Solid EDA is the foundation on which you build interpretable, unbiased, and optimized machine learning systems.

Loading the Data

#importing required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# reading the data in pandas dataframe
df = pd.read_csv('protein/pdb_data.csv',low_memory=False)

1. Feature selection

# Changing the size of the output cell so that visualization fits in the cell. 

from IPython.display import display, HTML
display(HTML("<style>div.output_scroll { height: 44em; }</style>"))

#visualizing the missing data
import missingno as msno
msno.matrix(df)
plt.show()
Missing data in each feature

Using all 42 features would make for a very complex model. Since the goal of the project is to find the relationship between these features and the resolution value, it is reasonable to keep only the features that do not have a significant amount of missing data. Therefore, we drop the columns with a large number of missing values.
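Before hardcoding the drop list, it helps to rank the columns by their fraction of missing values; a quick sketch (the 0.5 cutoff below is an arbitrary illustration, not the exact rule used here):

# fraction of missing values per column, worst first
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac.head(15))

# columns missing more than half their values are natural drop candidates
print(missing_frac[missing_frac > 0.5].index.tolist())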

2. Unique value features

Dropping the Entry ID and PDB ID columns also makes sense: they contain a unique value in every row, so they act as identifiers rather than informative features. Identifier columns add no predictive value in EDA, so we drop them.
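Identifier-like columns can also be detected programmatically; a small sketch:

# a column whose values are all distinct is effectively a row identifier
id_like = [col for col in df.columns if df[col].nunique() == len(df)]
print(id_like)  # expect the ID columns to appear here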

3. Date feature

Date and time variables are very useful in time series data, where time is correlated with other variables. In this example we are not doing any time-related prediction, so we can drop the Collection Date column.

df = df.drop(['Entry ID', 'R All', 'B Wilson Estimate', 'pH', 'Average B Factor',
              'Overall Redundancy', 'Diffraction Source Synchrotron Site',
              'R Value for Merging Intensities (Observed)',
              'Diffraction Source Synchrotron Beamline', 'Unnamed: 41', 'Space Group',
              'Z-number', 'Collection Date', 'Crystallization Method',
              'Crystal Growth Procedure', 'PDB ID', 'Percentage of Possible Reflections',
              'Diffraction Source Make, Model or Name'], axis=1)

Structure of input data after feature selection.

# Get the transpose of the top 5 rows of the dataframe

df.head().T

# Replacing spaces in column names with underscores

df.columns = df.columns.str.replace(' ', '_')

4. Handling missing values

Our dataset contains several features like Detector, Temp, and R Observed that have a large number of missing values — around 10,000 each. Since each observation in the dataset represents the data for a different protein, simply replacing the missing values with the mean or median would not be appropriate. The proteins have very different properties, so imputing summary statistics would distort the data.

Instead, we will take a listwise deletion approach and only use variables that have no missing values. This reduces our sample size but retains the integrity of each protein's data.

While listwise deletion has drawbacks, like reduced statistical power, it avoids improperly imputing values that could introduce bias.

For features with smaller amounts of missing data, we could consider approaches like multiple imputation, which estimates missing values from correlations with other features. But with such a large portion of data missing, imputed observations could lead to misleading models and results. Filtering to complete cases ensures we build models using only actual measured values for each protein.
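For reference, this is roughly what model-based imputation could look like with scikit-learn's IterativeImputer; a sketch only, assuming scikit-learn is installed, and not something we apply in this project:

# sketch: model-based imputation of numeric columns (not used here)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

num_cols = df.select_dtypes(exclude='object').columns
df_imputed = df.copy()
df_imputed[num_cols] = IterativeImputer(random_state=0).fit_transform(df[num_cols])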

# dropping the rows which have missing data

df = df.dropna()
df.shape

5. Data distribution

The describe() function produces descriptive statistics, such as the mean, standard deviation, and quartiles, that summarize the central tendency, dispersion, and shape of a dataset's distribution. With include='object', it instead reports the count, number of unique values, and most frequent value for each categorical column.


df.describe(include='object').T
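For the numeric summaries mentioned above, call it without the filter:

# descriptive statistics for the numeric columns
df.describe().T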

Refine_ID has only two unique values; let's see how those values are distributed.

df['Refine_ID'].hist()
plt.show()

It looks like there are only a few Neutron Diffraction entries, so this column is not very informative and we drop it.

df= df.drop(['Refine_ID'], axis=1)

6. Data validation

a. Changing datatype

Resolution, which is our target variable, holds numeric values but is stored as the object data type.

While converting the data type, we observed that the resolution column contains two values for some proteins: the first from X-ray diffraction and the second from neutron diffraction. Since this study focuses on X-ray diffraction properties, we keep only the first value.

df[df['Resolution_(Å)']=='1.752, 2.194']
# Converting the data type of the variable to float
df['Resolution_(Å)'] = df['Resolution_(Å)'].str.split(',').str[0]
df['Resolution_(Å)'] = df['Resolution_(Å)'].astype(float, errors='raise')
df['Resolution_(Å)'].dtypes
df['Total_Number_of_Polymer_Instances_(Chains)'].value_counts()[:5]

Total Number of Polymer Instances is an integer value stored as a float, so we convert it to the integer datatype.

df['Total_Number_of_Polymer_Instances_(Chains)']= df['Total_Number_of_Polymer_Instances_(Chains)'].astype(int, errors = 'raise')

b. Conversion of Target Variable

Resolution is a measure of the level of detail observed in a diffraction pattern. High-resolution structures, with resolution values around 1 Å, are highly ordered, and it is easy to see every atom in the pattern.

If the proteins in the crystal are aligned so that diffraction reveals the fine details of the structure, it is a perfect crystal for the study.

In contrast, if the proteins are not aligned properly, due to local flexibility or motion, we won't be able to see much detail.

For this project we will convert these resolution values into two groups, good (Yes) or bad (No), using 2 Å as the threshold resolution.

Reference: Resolution

# Creating a list with the new class labels

re_list = []
for ind in df.index:
    if df['Resolution_(Å)'][ind] <= 2:
        re_list.append('Yes')
    else:
        re_list.append('No')

# Visualizing the class distribution in Resolution

df['Resolution_(Å)'] = re_list
df['Resolution_(Å)'].value_counts().plot.barh()
plt.title('Resolution')
plt.xlabel('Number of Protein Structures')
plt.show()

The target Resolution has a roughly equal number of 'Yes' and 'No' values, so there is no class imbalance for this target.
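As an aside, the labeling loop above can also be written as a single vectorized expression with NumPy; this would run instead of the loop, while the column is still numeric:

# vectorized equivalent of the labeling loop
df['Resolution_(Å)'] = np.where(df['Resolution_(Å)'] <= 2, 'Yes', 'No')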

7. Converting the length features

Length_a_(Å), Length_b_(Å), and Length_c_(Å) are the lengths of the sides of the crystal unit cell. Instead of using these values individually, it is more effective to compute the cell volume from them, so we can look for a correlation between the size of the crystal cell and the resolution.

# calculating the crystal cell volume
df["Crystal_cell_volume"] = df["Length_a_(Å)"] * df["Length_b_(Å)"] * df["Length_c_(Å)"]
df = df.drop(['Length_a_(Å)', 'Length_b_(Å)', 'Length_c_(Å)'], axis=1)
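One caveat: a * b * c is the exact volume only for rectangular (orthogonal) unit cells; in the general triclinic case the cell angles enter the formula too. A sketch, assuming the columns Angle_alpha_(°), Angle_beta_(°), and Angle_gamma_(°) are available, and run before the length columns are dropped:

# sketch: exact volume of a general triclinic cell
# V = abc * sqrt(1 - cos^2(a) - cos^2(b) - cos^2(g) + 2*cos(a)*cos(b)*cos(g))
ca = np.cos(np.radians(df['Angle_alpha_(°)']))
cb = np.cos(np.radians(df['Angle_beta_(°)']))
cg = np.cos(np.radians(df['Angle_gamma_(°)']))
df['Crystal_cell_volume'] = (df['Length_a_(Å)'] * df['Length_b_(Å)'] * df['Length_c_(Å)']
                             * np.sqrt(1 - ca**2 - cb**2 - cg**2 + 2*ca*cb*cg))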

8. Creating the list of numerical and categorical columns

# list of numerical variables
numerical_vars = df.select_dtypes(exclude='object').columns.tolist()
numerical_vars

# list of categorical variables
categorical_vars = df.select_dtypes(include='object').columns.tolist()
categorical_vars

# Removing the resolution variable since it is the target variable.
categorical_vars.remove('Resolution_(Å)')
categorical_vars

9. Checking for Multicollinearity

We have 17 numerical feature variables, so there is a high risk of running into multicollinearity.

Multicollinearity makes it difficult to determine the individual effect of each independent variable on the dependent variable.

To identify it, we use the Variance Inflation Factor (VIF), which measures how well each variable is explained by the other independent variables: for feature i, VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the other features.

# getting VIF values for the numerical variables

from statsmodels.stats.outliers_influence import variance_inflation_factor

X_variables = df[numerical_vars]
vif_data = pd.DataFrame()
vif_data["feature"] = X_variables.columns
vif_data["VIF"] = [variance_inflation_factor(X_variables.values, i)
                   for i in range(len(X_variables.columns))]
vif_data

The results show that Angle_alpha_(°), Temp_(K), R_Work, R_Observed, High_Resolution_Limit, and Data_Collection_Resolution have very high VIF scores, so we remove those variables.

Generally, variables with VIF values above 10 are removed as well. Since we are still exploring relationships in this data set, I am keeping the variables whose VIF values fall between 10 and 300 at this stage.

df = df.drop(['Angle_alpha_(°)', 'Temp_(K)', 'R_Work', 'R_Observed',
              'High_Resolution_Limit', 'Data_Collection_Resolution'], axis=1)
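If you prefer an automated rule over a manual drop list, here is a minimal sketch of iterative VIF pruning: drop the feature with the highest VIF, recompute, and repeat until everything falls below a chosen threshold (10 below, purely for illustration):

# sketch: iteratively drop the highest-VIF feature until all VIFs < threshold
def prune_by_vif(frame, threshold=10.0):
    cols = list(frame.columns)
    while len(cols) > 1:
        vifs = [variance_inflation_factor(frame[cols].values, i)
                for i in range(len(cols))]
        worst = max(range(len(cols)), key=lambda i: vifs[i])
        if vifs[worst] < threshold:
            break
        cols.pop(worst)
    return cols

# kept = prune_by_vif(df.select_dtypes(exclude='object'))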

10. Bivariate analysis

# Scatter plot matrix of all the features

axes = pd.plotting.scatter_matrix(df, figsize=(20, 20))
for ax in axes.flatten():
    ax.xaxis.label.set_rotation(15)
    ax.yaxis.label.set_rotation(75)
    ax.yaxis.label.set_ha('right')

plt.gcf().subplots_adjust(wspace=0, hspace=0)
plt.show()

From the scatter matrix above, we don't see any relationships between columns except between Matthews_Coefficient and Percent_Solvent_Content.
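To back up the visual impression with numbers, a quick sketch that ranks feature pairs by absolute Pearson correlation (each pair appears twice in the output):

# rank feature pairs by absolute correlation
corr = df.select_dtypes(exclude='object').corr().abs()
pairs = corr.unstack().sort_values(ascending=False)
print(pairs[pairs < 1.0].head(10))  # drop the self-correlations of exactly 1.0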

a. Matthews_Coefficient Vs Percent_Solvent_Content plot

# Matthews_Coefficient Vs Percent_Solvent_Content plot

plt.plot( df['Percent_Solvent_Content'],df['Matthews_Coefficient'], 'bo', alpha=0.1)
plt.ylabel('Matthews_Coefficient')
plt.xlabel('Percent_Solvent_Content')
plt.title('Correlation between Matthews_Coefficient and Percent_Solvent_Content')
plt.show()

The Matthews Coefficient represents the density of the crystal, and Percent Solvent Content represents how much of the crystal is made up of material other than protein. The plot shows that as the solvent content increases, the density changes very little up to about 60%; beyond that, it increases significantly.

b. Matthews_Coefficient Vs Collection_Temperature plot

# Matthews_Coefficient Vs Collection_Temperature plot

plt.plot(df['Collection_Temperature'], df['Matthews_Coefficient'], 'bo', alpha=0.1)
plt.ylabel('Matthews_Coefficient')
plt.xlabel('Collection_Temperature')
plt.show()

We don't see any correlation between the recorded collection temperature of the crystal and the density.

c. Percent_Solvent_Content distribution plot

# Percent_Solvent_Content distribution plot 
plt.hist(df['Percent_Solvent_Content'], bins=100)
plt.title('Percent_Solvent_Content Distribution')
plt.show()

The percent solvent content of the crystal cells is approximately normally distributed, with a mean around 50%.

11. Categorical Data Cleaning

The categorical values appear in inconsistent letter cases, so we convert them to uppercase.

# Get the most common value in Diffraction_Source_General_Class variable

df['Diffraction_Source_General_Class'].value_counts()
# Converting Diffraction_Source_General_Class values to uppercase
df['Diffraction_Source_General_Class']=df['Diffraction_Source_General_Class'].str.upper()
# Get the most common value in Detector variable
df['Detector'].value_counts()
# Get the top 10 values in Structure_Determination_Method
df['Structure_Determination_Method'].value_counts()[:10]
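Rather than uppercasing one column at a time, the same normalization can be applied to every remaining categorical column in one pass; a small sketch using the categorical_vars list built earlier (it already excludes the Yes/No target):

# normalize letter case across all categorical feature columns
for col in categorical_vars:
    df[col] = df[col].str.upper()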

In this blog post, we dug into EDA steps such as handling missing values, feature conversion, and multicollinearity, and we are now familiar with our data and the trends within it. The exciting part comes next: building prediction models!
