Introduction

The COVID-19 outbreak has presented an unprecedented worldwide health emergency, with more than 704.75 million confirmed cases and over 7.01 million mortalities globally as of 06 June 2024, according to https://www.worldometers.info/coronavirus/. At present, there is no certain drug available to treat COVID-19, and the development of effective cures has become a priority for researchers globally1. COVID-19 is triggered by SARS-CoV-2, a positive-sense single-stranded RNA virus that mainly infects the respiratory tract of humans2. When the spike protein attaches to the ACE2 receptor on the surface of human cells, the virus enters the cell, and then it utilizes the host’s cellular machinery to replicate and spread throughout the body. Figure 1 depicts the life cycle of a coronavirus.

Fig. 1
figure 1

Life cycle of a coronavirus.

To support viral replication, SARS-CoV-2 uses various viral proteins, among which the main protease 3CLpro (also called main protease Mpro) plays a crucial role in cleaving the viral polyproteins into functional non-structural proteins necessary for viral replication. As a result of its significance in the viral life cycle, the main protease 3CLpro is central to viral replication and has been a priority target for drug development because of the highly conserved structure and lack of human homolog, decreasing the likelihood of off-target toxicity. The catalytic dyad of His41 and Cys145 in the homodimeric cysteine protease 3CL protease makes it an attractive candidate for the development of protease inhibitors3. Several studies have reported the successful identification of small molecules and peptides that can effectively inhibit the activity of the 3CL protease in vitro. However, the development of specific and potent inhibitors for the 3CL protease remains a challenge.

In recent years, computational approaches have become increasingly important in drug discovery, especially in the primary stages of drug development. In silico approaches such as molecular docking and machine learning have the potential to speed up drug discovery by screening large numbers of compounds and predicting their potential binding affinities with target proteins. Molecular docking is a commonly used computational method for calculating the binding of small molecules to target proteins. Recently, machine learning algorithms have been progressively employed to improve the accuracy of molecular docking predictions. Regression models have been used specifically to predict binding affinities, which are essential for detecting potential drug candidates. However, they too have their caveats. It is possible that in silico models fail to completely reproduce biological system complexity, such as protein flexibility of conformation, allosteric control, and candidate drug pharmacodynamics/pharmacokinetics4. Additionally, deficiencies in real-time biological contextual and off-target effect predictions have the potential to restrict the translation potential of computed results. In a recent study5, a combination of docking, machine learning (ML), and molecular dynamics (MD) calculations was used, and seven compounds were recognized as having the potential to inhibit COVID-19 Mpro. In another study6, researchers employed the techniques of molecular docking and MD simulation and identified the four drug candidates DB07299, DB01871, DB04653, and DB08732 to combat 3CLpro of COVID-19.

Drug repurposing has become a critical strategy in the face of public health crises, providing a quicker and less expensive option compared to de novo drug discovery. Its success in previous viral epidemics offers a strong precedent for its use in the current COVID-19 pandemic. In the 2014–2016 Ebola epidemic, for example, favipiravir7, which was initially developed for influenza, was repurposed and tested in clinical environments, exhibiting antiviral activity and contributing to future pandemic preparedness. In the same way, amidst the Zika virus outbreak (2015–2016), sofosbuvir8, a nucleotide analog drug for hepatitis C, was searched for its ability to suppress Zika virus replication. These instances highlight the flexibility of repurposing methods for various RNA viruses. Again, in the 2009 H1 N1 influenza pandemic, extended use of antivirals such as oseltamivir, including off-label use of drugs with previous antiviral approvals, followed. Past SARS (2002–2003) and MERS (2012) outbreaks also spurred drug repurposing initiatives, such as the exploration of protease inhibitors and RNA polymerase inhibitors, which now guide therapeutic screening pipelines for SARS-CoV-2.

The repurposing of drugs for SARS-COV-2 comprises determining novel therapeutic applications for the existing drugs. It has become a good strategy owing to the persuasive need for effective treatments. The approach involves screening the existing drugs against SARS-CoV-2 targets, with the purpose of identifying compounds that can prevent viral replication or attenuate the host immune response to the virus. Several studies have used computational methods to identify potential drugs for COVID-19, including 3 CL protease inhibitors. For example, in a study9, through the application of molecular docking and MD simulation methods, the authors identified two hit candidates, namely CMP4 and CMP2, that exhibit a solid interaction with the 3 CLpro. In a recent study10, we employed a hybrid approach of QSAR, ADMET analysis, and molecular docking to detect the potential inhibitors against the 3 CL protease of COVID-19. As an outcome, six bioactive molecules were identified, each with a unique ChEMBL ID: 187460, 222769, 225515, 358279, 363535, and 365134. These compounds demonstrate potential as effective inhibitors of the 3CL protease and could serve as promising candidates for further study and development as treatments for COVID-19. Another study11, utilized virtual screening techniques to repurpose existing drugs for the cure of COVID-19. The authors identified two drugs, lurasidone and talampicillin, as potential candidates. Additionally, they recognized two drug-like molecules using the Zinc database. MD simulation and ADMET analysis were conducted to assess the stability and pharmacokinetic properties of the identified compounds. In a separate study12, molecular descriptors were computed using random forest, logistic regression, and SVM through a deep learning method. This information was then used in QSAR modeling to estimate the binding affinities of selected drugs with target protease. To develop effective COVID-19 treatments, ML-based computational techniques that successfully identify compounds with strong binding affinity have the potential to reduce the cost and time-consuming experiments.

Several recent studies13 have reported the effectiveness of repurposed drugs against SARS-CoV-2, such as remdesivir, which has received emergency use authorization from the FDA for COVID-19 treatment, as well as some other drugs that have shown promising results in preclinical studies. Conversely, the process of identifying effective repurposed drugs against COVID-19 remains challenging due to the complex and quickly evolving nature of the disease. Computational approaches14, such as molecular docking, have become valuable tools in the identification of potential drug candidates. These approaches allow for the fast screening of large numbers of compounds against specific targets, providing valuable insights into the binding interactions and potential efficacy of the compounds. These studies15 demonstrate the potential of computational methods in identifying potential drugs for COVID-19, particularly those targeting the 3CL protease. Nevertheless, an ML-based framework is necessary to accelerate the identification of potential therapeutic candidates.

While several in silico studies have indicated potential 3CLpro inhibitors, few of them have utilized comprehensive ML-based QSAR modeling along with large, well-curated panels of approved drugs. Most available methods are based on traditional docking or small sets of compounds, potentially missing lead drug candidates. This research proposes to bridge this gap by employing cutting-edge QSAR modeling strategies to discover high-affinity 3CLpro inhibitors from a broad panel of approved medicines. Through the integration of ligand-based molecular descriptors with solid machine learning algorithms, this research attempts to enhance prediction accuracy and pertinence towards ultimately yielding repurposable drugs for the treatment of COVID-19.

In this study, we proposed an ML-based framework for exploring the potential drug candidates against COVID-19. This framework would help to screen the FDA-approved with other world-approved drugs for repurposing as potential COVID-19 cures, particularly targeting 3CLpro. Initially, we extracted 5903 drug candidates from the Zinc database. We accomplished molecular docking to calculate the binding affinities of the drugs towards the target protease 3CLpro using well-known AutoDock-Vina software16. To improve the efficiency of the drug repurposing approach, we modeled the binding affinities of the drugs towards the target protease 3CLpro using several ML methodologies. Our research highlighted the latent benefits of associating molecular docking with machine learning methodologies. This association provides a powerful approach for the prompt screening and identification of potential drug candidates.

In this proposed work, we selected several ML regression models, such as Decision Tree regression (DTR), Extra Trees regression (ETR), Multi-Layer Perceptron regression (MLPR), Gradient Boosting regression (GBR), XGBoost regression (XGBR), and K-Nearest Neighbor regression (KNNR). We used the Zinc database to extract the world-approved, including FDA-approved, drugs. We performed molecular docking using well-known AutoDock-Vina software (version 1.2.0 (https://vina.scripps.edu/downloads/) and estimated the binding affinities of the drugs towards the target protease. Their H-bonding and hydrophobic interaction are comprehensively studied. In the next step, 12 diverse types of molecular descriptors are calculated using PaDEL descriptor software17. Various regression models are built and trained on these diverse types of feature descriptors. The input dataset is divided into two parts, in which 80% of the internal dataset is used for 5-fold cross-validation to improve the performance of regression models. The remaining 20% of the data is used as an external dataset for testing these models. The simulated results of regression models are assessed using statistical measures of R² and RMSE. We observed that our proposed DTR model has improved R² and RMSE values as compared to other regression models. Further, the DTR model has performed better on ten feature descriptors and outperformed on the other two feature descriptors of CDK fingerprint and MACCS fingerprint. This highlights that the DTR model is most suitable in predicting the binding affinities. Further, we analyzed the physiochemical properties of the shortlisted compounds to understand the behavior of drug molecules in biological systems and determine their efficacy and safety.

The succeeding sections of this paper are arranged as follows: Sect. "Material and methods" explains the material and the proposed computational framework employed in this study. The results and the corresponding discussions are presented in Sect. "Results and discussion". At the end, Sect. “Conclusion” outlines the concluding remarks of our investigation.

Materials and methods

The computational framework proposed in this study is comprised of three modules, as depicted in Fig. 2. Module A encompasses various steps involved in preparing the input dataset. On the other hand, module B illustrates the process of molecular docking, which is used to compute the binding affinities of the drugs with the target protease 3CL. Lastly, module C describes the building of the QSAR model and its performance comparison with different state-of-the-art ML models.

Fig. 2
figure 2

Three main modules (A to C) in the proposed computational framework.

Module A: dataset preparation

Module A outlines various stages of data preprocessing, which are as follows:

Targeting the viral enzyme

The main protease (Mpro), also known as the 3CLpro, is a critical drug target among the proteins of coronaviruses due to its unique enzymatic properties18. This protease, along with the papain-like protease (PLpro), plays a central role in the transcription and replication process of the viral RNA. This makes it an essential enzyme for the survival and replication of the virus. The high conservation and replication of 3CLpro make it a promising drug target to discover inhibitors that may strongly bind to the target protein and potentially inhibit viral replication.

Dataset

The Zinc database is used to extract the world-approved, including FDA-approved, drugs19. Initially, a dataset of 5903 drugs was obtained from https://zinc20.docking.org/on10/02/2023. The Zinc is a publically available database having more than 1.4 billion compounds. Every week data is downloaded from this site in terabytes. More than 90% available compounds are verified.

Data preprocessing

The Zinc dataset consists of 5903 approved drugs that are available in SMILES format. First, these SMILES are converted into SDF format using OpenBabel-2.4.1 software (https://open-babel.soft112.com/)20. Then, SDF files are transformed into PDBQT format so that these drugs can be used to calculate binding affinities towards the target protease. Those files that could not be converted are dropped. After this preprocessing step, the input dataset is reduced to 5537 drugs.

Module B: molecular docking

The 3 CLpro crystal structure (PDB ID: 7JSU) is retrieved from the RCSB Protein Data Bank on 10/02/2023. The ligands were removed from the structure to purify it. The water molecules and alternative side chains were also eliminated. Further, the polar hydrogen atoms were added, and Kollman charges were distributed to prepare this macromolecule in a charged form. A GridBox of dimensions 30 × 30 × 30 with a spacing of 01 is fixed to cover the active site of the 7JSU protease, with centers of x, y, and z coordinates adjusted at − 11.046, 12.826, and 67.749, respectively. For molecular docking, AutoDock VINA, version 1.2.0 (https://vina.scripps.edu/downloads/), is employed with default parameters and exhaustiveness parameter was set to 8. Ligands in PDBQT format are prepared using OpenBabel software. The binding affinities are calculated in kcal/mol. The orientation of ligand with target protease interaction having the lowest binding energy is considered the best pose. The crystal structure at the resolution of 1.83 Å of the SARS-CoV-2 3CL protease 7JSU is depicted in Fig. 3.

Fig. 3
figure 3

Crystal structure of SARS 3CL protease 7JSU.

AutoDock Vina was selected for its efficiency and power of docking, especially in the case of small-molecule ligands binding to macromolecular targets like 3CL protease. Although it yields useful initial estimations regarding binding affinity and pose prediction, certain limitations need to be pointed out. One main limitation is the receptor rigidity assumption, which does not properly consider the conformational adaptability of proteins like 3CLpro21. Even though Vina does allow some degree of flexibility in chosen side chains, it cannot simulate large-scale protein dynamics or solvent interactions22. Moreover, accuracy in docking could be compromised in the case of very large ligands. Which may need to be broken into smaller, tractable substructures. Notwithstanding these issues, molecular docking continues to be an accepted and potent method of virtual screening and hypothesis generation at the initial phases of drug discovery.

Module C: QSAR modeling

For QSAR modeling, we have selected several ML-based regression models, such as DTR, ETR, GBR, MLPR, and KNNR. These models predict the quantitative structure-activity relationship (QSAR) between the structural properties of chemical compounds with unknown biological activities. These models have successfully established a correlation between the structural characteristics of known chemical compounds and biological activities. The structural properties denote the physicochemical properties that define the compound’s structure, while biological activities represent their pharmacokinetic properties. The molecular descriptors of compounds enable the prediction of how changes in structural characteristics affect biological activity23.

Data cleaning

To begin the data preparation process for ML models, the dataset undergoes a thorough cleaning procedure to eliminate any instances of data duplication. Additionally, drugs that lack binding affinity values are removed from the dataset to ensure a high quality of data. Following this rigorous cleaning process, the final dataset comprises a total of 4639 drugs that are deemed suitable for subsequent analysis and modeling. The removal of duplicated data and the exclusion of drugs without binding affinity values ensure that the dataset is accurate, reliable, and fit for purpose.

Feature extraction

The molecular constituents of drug molecules were represented by a vector of fingerprint descriptors. Before computing the descriptors, the PaDEL-Descriptor software’s built-in function was used to standardize tautomers and eliminate salts. In this study, we investigated the effectiveness of 12 diverse fingerprint descriptors to predict the binding affinities of drug molecules. Table 1 provides a summary of the utilized fingerprints, including their respective size and description,

Table 1 Summary of twelve fingerprint descriptor sets.

We used 12 different molecular fingerprint descriptors for QSAR modeling in this study to cover a wide range of structural and physicochemical properties of the drug molecules. CDK and MACCS fingerprints were selected because they are widely used and have been shown to be effective in virtual screening and QSAR applications24. CDK fingerprints have been recognized to encode a broad spectrum of substructural features, whereas MACCS keys reflect a compact and readable representation of functional group presence and are thus especially beneficial in pattern recognition tasks. Nevertheless, we recognize some inherent limitations of these descriptors. For example, both CDK and MACCS fingerprints are 2D-based and may not possess the capability to encode complex three-dimensional interactions. Moreover, these fingerprints could be sensitive to molecular size and may not truly capture conformational flexibility and electronic effects important in biological activity. Despite limitations, their fusion provides a pragmatically beneficial tradeoff of computational cost against informative molecular coding for machine-learning-based prediction tasks.

Decision tree regression (DTR) model

The aim of this research is to develop regression models capable of accurately predicting the continuous response variable, specifically binding affinity, by utilizing a range of predictor variables, such as fingerprint descriptors. To achieve this objective, multiple ML algorithms are developed for QSAR modeling. Among these models, the DTR approach is selected due to its superior prediction performance. In machine learning, a DTR25 is a predictive model that uses a decision tree to make predictions. The decision tree is a type of graphical model, consisting of nodes, branches, and leaves, that resembles a flowchart. Each internal node in the decision tree represents a test on an attribute, and the outcome of the test is represented by the branch, and each leaf node denotes a prediction or class label. In a DTR26, the leaf node shows a continuous value, such as the average or the median of the target values in the training samples that belong to the same leaf node. In DTR, feature space is recursively partitioned into subsets based on the values of the features in such a manner that maximizes the reduction of the variance of the target variable. This process continues till the stopping criterion is met. It has several advantages27, such as its interpretability, its ability to handle non-linear relationships between the features and the target variable, and its resistance to over-fitting.

This study shows that the Decision Tree Regression (DTR) model outperformed others in predicting how well ligands bind to the SARS-CoV-2 main protease (3 CLpro). We chose DTR because it’s accurate and works well for modeling structure-activity relationships in drug discovery. DTR makes it easy to understand how specific molecular descriptors affect binding strength. This clarity helps a lot in the early stages of drug screening, where knowing which features matter most can guide smart design choices. Also, DTR is good at capturing the complex non-linear connections between a molecule’s properties and its biological activity. This is key when dealing with the intricate ways proteins and ligands interact. The model can handle both category-based and continuous variables without needing much data scaling, which makes it useful for cheminformatics. These qualities make DTR a top pick to predict potential 3CLpro inhibitors.

The total number of drug molecules in the input dataset, as described in Sect. 2.3.1, is 4639. This dataset is divided into internal and external datasets with an 80 to 20 ratio. The internal dataset is used to train and robust the model performance by employing 5-fold cross-validation. For this purpose, 12 diverse types of molecular descriptors, described in Sect. 2.3.2, are used as the feature sets. The external dataset is used to test the performance of the model.

To assess the usefulness of the developed regression models, three statistical variables, namely coefficient of determination (R²), predictive squared correlation coefficient (Q2), and root mean square error (RMSE), are utilized. The R2 value is a measure of the fraction of variance in the dependent variable that can be described by the independent variables. A value of 0 states a poor fit, while a value of 1 indicates a perfect fit. How well a model predicts on test data is indicated by the Q2 metric. Better generalizability of the model on unknown data is indicated by a higher Q2 value. R2 and Q2 metrics are helpful for evaluating the model’s fitting and predictive capabilities in regression modeling. The rule of thumb is the difference should be less than a 0.3 between the R2 and Q2 readings. With a Q2 score of 0.5, a model does well in regression; a value of 0.9 or higher denotes exceptional performance. On the other hand, RMSE provides a measure of the relative error of the predictive model. To compare the performance of different regression models, a comparative analysis is conducted. For comparative analysis, we utilized two types of fingerprints as the feature sets. One is CDK Fingerprint, and the other one is MACCS fingerprint.

Results and discussion

In this work, we developed several ML-based QSAR models for drug repurposing against COVID-19 and predicted their binding affinities for approved drugs towards the target protease 3CLpro. First, we will assess the performance of our proposed model, DTR, in predicting binding affinities using twelve distinct feature sets. Then compare its performance with several QSAR models using important statistical measures of R2 and RMSE. Then we will explain the results of molecular docking conducted on the world-approved drugs and their interactions with the target protease 3CLpro. Finally, we will conduct the physiochemical analysis of shortlisted drug compounds with respect to the efficacy of the drugs towards the target protease 3CLpro.

Evaluation of QSAR model

The current research proposes a methodology to construct a QSAR model based on the Decision Tree Regression (DTR) algorithm. The model is developed using a dataset of 4639 drug molecules, which is explained in detail in Sect. 2.3.1. To evaluate the performance of the proposed model, 12 distinct types of feature sets are used, as discussed in Sect. 2.3.2. To construct the data matrices, fingerprint features are placed in the X matrix, while the Y matrix consists of their corresponding binding affinities. The dataset is divided into an internal dataset (80%) and an external dataset (20%). The internal dataset is utilized for training and to validate the model’s robustness by employing 5-fold cross-validation, which acts to reduce over-fitting and gives a more generalized estimate of the model performance. The external dataset is kept for independent testing to evaluate the predictive ability of the final trained model.

The two well-known statistical measures, R2 and RMSE, are used to assess the performance of our proposed QSAR model. Q2 is also used to validate the model performance by employing 5-fold cross-validation. R2 is used to evaluate the model’s fitness and to quantify how much variation in the dependent variable (binding affinity) is explained by the independent variables (features). It ranges between 0 and 1, with higher values indicating better model performance. Conversely, RMSE measures the relative error between the predicted and actual values of binding affinity. To demonstrate the effectiveness of the DTR model, Fig. 4 displays the actual and predicted binding affinity for 12 distinct feature sets.

Fig. 4
figure 4

Scatter plot of 12 feature descriptors for DTR model.

In this investigation, the effectiveness of the DTR model is analyzed using 12 various feature descriptor sets. The evaluation outcomes, including Q2, R2, Mean Squared Error (MSE), and RMSE values, are illustrated in Table 2.

Table 2 Performance of DTR model for twelve distinct fingerprints.

Table 2 presents the performance evaluation of 12 different feature descriptors in predicting the binding affinity of a set of drug compounds. Four performance metrics are used to evaluate the model’s performance, namely Q2, R2, MSE, and RMSE. The Q2 is used to measure the consistency of the model. An R2 metric measures the goodness of fit of the model to the data, where higher values indicate a better fit. The MSE metric calculates the average of the squared differences between the predicted and actual values, while the RMSE metric is the square root of the MSE, and it estimates the error in the same units as the target variable. According to Table 2, there is a small difference between the values of Q2 and R2 within the range of 0.03–0.23. This shows that the DTR model is effective in predicting the binding affinity of drug molecules with 3CLpro.

The results show that CDK fingerprint and MACCS fingerprint outperformed other fingerprints with a higher R2 value of 0.97 and a lower RMSE of 1.68 and 1.57, respectively. Achieving an R2 score of 0.93 and an RMSE of 1.87, the PubChem fingerprint displayed strong performance. Other fingerprint descriptors, such as the extended CDK fingerprint, the 2D atom pair count fingerprint, the E-state fingerprint, the graph-only fingerprint, the Klekota-Roth count fingerprint, the substructure count fingerprint, and the 2D atom pair fingerprint, displayed varying degrees of prediction accuracy.

With regard to numerous substructures, functional groups, and molecular features, the CDK fingerprint and MACCS fingerprints provide a thorough description of molecular structure and attributes. They are well suited for exploring a variety of chemical spaces and finding possible candidates for drug repurposing because of their capacity to capture a variety of chemical properties. Taking into account these characteristics, we have chosen the CDK fingerprint and MACCS fingerprints for additional comparison with different ML models while utilizing the proposed DTR model. The data in the table is useful for determining the feature descriptors that will best predict the binding affinity of a group of drug molecules.

Comparative analysis

A comparison is carried out to assess the predictive performance and accuracy of our proposed QSAR model with other regression models using CDK fingerprint and MACCS fingerprint feature descriptors. The statistical measures R2 and RMSE are used as evaluators.

Using the CDK fingerprint feature descriptor for the DTR model, Fig. 5 shows a graphical comparison between the actual and predicted binding affinity values. The outcomes demonstrate that the proposed model DTR outperforms the other models with the greatest R2 value of 0.97 for the external dataset. This suggests that the model fits the data well and that the independent variables have a strong correlation with the dependent variable.

Fig. 5
figure 5

Regression plot of DTR-QSAR model with CDK fingerprint.

The proposed DTR model is also compared with other regression models, using MACCS fingerprint descriptors, to evaluate their performances on the external dataset using R2 and RMSE measures. Figure 6 presents a graphical description of the DTR model using MACCS fingerprint feature descriptors. The results indicated that the proposed DTR model outperformed the other regression models, with the highest R2 value of 0.97 for the external dataset. This signifies that our model explained a large quantity of the variance in the dependent variable based on the given set of independent variables, even though the relationship between the variables may not be strictly linear.

Fig. 6
figure 6

Regression plots of R2 using MACCS fingerprint features for external dataset.

Table 3 highlights the performance comparison of our proposed DTR model with other regression models on two different feature sets, CDK Fingerprint and MACCS Fingerprint, in terms of Q2, R2, and RMSE values. The DTR achieved an R2 value of 0.97 with an RMSE value of 1.68 for the CDK Fingerprint feature set. For the MACCS Fingerprint feature set, the DTR model has obtained the R2 value of 0.97 and the RMSE value of 1.57. However, ETR and KNNR models have obtained relatively lower performance for both feature sets. MLPR and XGBR showed poor fit with lower R2 and higher RMSE values for both feature sets. Overall, quantitative results suggest that our proposed DTR model is a more suitable choice for predicting the binding affinity for both CDK Fingerprint and MACCS Fingerprint feature sets.

Table 3 Performance comparison of QSAR regression models.

Molecular docking

Our main objective was to determine the efficacy of the selected drug molecules in interacting with the target protease. For this purpose, the molecular docking technique was used to predict and analyze the interactions between the ligand and the target protease. It helps in understanding the binding affinity and the orientation of ligands within the protease active site. In our study, we used a ligand-based docking approach to estimate the binding affinities of drug molecules extracted from the Zinc database. The drug molecules were converted into PDBQT format, and their binding affinities with the target protein 7JSU were evaluated in kcal/mol units.

Figure 7a displays the binding pocket of the target protease 7JSU. On the other hand, Fig. 7b shows the 3D interaction view of the complex of 7JSU with ligand 2297 bound to it. This figure depicts the interacting residues of 7 JSU with ligand 2297 atoms along with intermolecular distances. In this interaction, hydrogen bonds are represented by dotted lines shown in green color. A hydrogen bond appears when a hydrogen atom from the protein interacts with an electronegative atom (such as oxygen or nitrogen) from the ligand or vice versa. The distance between the hydrogen donor and the acceptor atom is around 2–3 Å. The distance shows values in the range of 2.36–2.51 Å for hydrogen bonds. These shorter distances propose that the hydrogen bond interactions between the protein and ligand are relatively strong and stable. The detailed numerical description about the hydrophobic interaction and the H-bonding is provided in the next section. These tables emphasize the protein-ligand structural context of hydrophobic contacts, hydrogen bonds, and atomic coordination. This useful information illustrates the structural and energetic aspects of the protein-ligand complex.

Fig. 7
figure 7

(a) The binding pocket of the target protease 7JSU. (b) 3D interaction view of protein- ligand ID 2297 complex.

Figure 8 demonstrates that the 2D view of optimal poses, generated by Discovery Studio Visualizer 2021(Version 21.1) (https://discover.3ds.com/discovery-studio-visualizer-download), of six top-ranked drug compounds interacting with the target protease corresponds to the best binding affinity. These drug molecules have the best binding affinities, ranging from − 15.1 to − 13.6 kcal/mol. The optimal pose refers to the specific orientation and conformation of the ligand molecule that achieves the lowest binding affinity with the target protease. The most negative binding energy value represents the best ligand pose towards the target. 2D views show the interaction of protein-ligand in terms of Van Der Waals, conventional hydrogen bonds, carbon-hydrogen bonds, Pi-cation, Pi-sulfur, Pi-Pi T-shaped, unfavorable donor-donor, alkyl, and Pi-alkyl. These different types of interactions play important roles in the overall stability of protein-ligand binding.

Fig. 8
figure 8

The optimal poses (2D view) of six top-ranked drug compounds interacting with target protease 7JSU.

On the other hand, docking accuracy is evaluated by measuring the root mean square deviation (RMSD) of the ligand from its original position in the protease complex. A lower RMSD value indicates a superior docking geometry of the ligand molecule. Interestingly, our analysis demonstrates that all six ligands achieved an RMSD value of zero at their optimal poses, implying a high level of accuracy in the docking geometry. Table 4 shows the binding affinities (BA) values of the six top-ranked drug molecules and the corresponding amino acid (AA) residues involved in the hydrophobic interaction and H-bonding. The ligands 2297, 4434, 5278, 4440, 3172, and 5471 have obtained the best BA values: −15.1, −14.4, −14.4, −13.9, −13.6, and − 13.6 kcal/mol, respectively. This table shows that ligand 2297, with a minimum of −15.1 kcal/mol, is the most promising drug compound.

The investigation of ligand poses and their corresponding binding energies helps establish a relationship between the molecular structure of the ligands and their binding affinity. It designates the strong interaction and potential efficacy of the ligand molecule as a drug candidate. The identified amino acids involved in hydrophobic interactions and hydrogen bonds offer valuable intuitions into the molecular mechanisms underlying ligand-protein interactions. Further exploration and analysis of these interactions would contribute to the development of novel therapeutic strategies targeting the specific residues and improving the efficacy of drug candidates.

Table 4 Top ranked six ligands with target protease 7JSU.

The molecular docking study indicated that some of the highest-ranked compounds established robust and reproducible interactions with important residues in the 3CLpro active site, most notably Glu166, His163, and Asn142. These residues are important for ligand binding and enzymatic activity. Glu166 is essential to the preservation of the structural conformation of the S1 subsite and has been found to form stable hydrogen bonds with different inhibitors, enhancing binding specificity and complex stability28. His163, an essential part of the S1 pocket, often involves hydrogen bonding and polar interactions, playing a crucial role in ligand orientation and binding affinity29. Asn142, situated close to the S1 pocket entrance, forms polar contacts and assists in regulating the binding site flexibility, impacting ligand accommodation and binding energetics. These interactions present in our docked complexes agree with the known resolved co-crystal structures of 3CLpro-inhibitor complexes to further support the validity of our docking results and the viability of these compounds as repurposable inhibitors.

Hydrophobic interactions

Table 5 provides the hydrophobic interactions between C-H bonds of six ligands with chain A of the 3 CL target protein. Due to their non-polar nature, the residues MET, PHE, GLU, GLN, and LEU revealed the hydrophobic interactions. It is worth noting that these hydrophobic residues are concealed within the protein core. The large side chain-based hydrophobic residues contribute significantly to the establishment of the protein’s hydrophobic core, which plays a key role in maintaining the stability of the ligand-protein structure.

In this table, ligand 2297 exhibits a hydrophobic interaction with residue GLU166 A, where a relatively smaller intermolecular distance of 3.22 Å is found between the protein atom at position 1578 and the ligand atom at position 2889. However, ligand 3172 gives interactions with two amino acid residues of PHE140 and GLU166 with intermolecular distances of 3.62 Å and 3.50 Å, respectively. Furthermore, this table reveals hydrophobic interactions with other ligands 4434, 4440, 5278, and 5471 with residues MET, PHE, GLU, GLN, and LEU. These hydrophobic interactions are characterized by varying intermolecular distances ranging from 3.42 Å to 3.95 Å.

Table 5 Hydrophobic interactions of top-ranked six ligands with the target protease.

H-bonding interaction

An H-bond is a type of intermolecular bond that occurs between an H-atom bonded to electronegative N or O atoms. The acceptor N atom within a protein possesses a lone pair of electrons, which interact with the H atom from the ligand and vice versa. The donor atom within the molecule donates the H-bond. The H-bonds are weaker than covalent bonds. The H-bond interaction analysis helps to understand the structural and geometrical stability of protein-ligand interaction and to improve the physiochemical and drug biological processes. The polar and charged residues in their side chains at different positions such as ASN142, GLU166, THR26, HIS163, and GLY143, etc. The polar residues ASN, GLU, GLN, and THR donate or accept H-bonds. The residue HIS has two -NH groups in the side chains and, depending on the environment and pH level, can be polar. The residues ARG, LYS, and TRP possess N donor atoms in their side chains. On the other hand, the residues ASP and GLU are H-bond acceptor (O) atoms in the side chain.

In the ligand molecules and AA residue combination, the distances between the H-acceptor and donor-acceptor play a significant role in determining the strength of hydrogen bonds. The smaller distances indicate the stronger H-bonds, as the electrostatic interaction between the partially positive hydrogen and the partially negative acceptor atom is stronger when they are closer together. We compared the values of these distances and analyzed them to understand the strength of H-bonding interactions. The optimal (higher) donor angle of each residue is crucial in assessing the strength or weakness of hydrogen bonding. The donor angle represents the spatial orientation of the donor atom involved in the hydrogen bond. The angle influences the alignment and stability of the hydrogen bond, impacting its strength. More favorable and optimal values of the donor angles result in stronger H-bonds, while deviations from the ideal angle (180) can weaken the interaction. Moreover, the protein donor and acceptor atoms also contribute to the strength of the H-bond. These donor and acceptor atoms affect the overall stability and specificity of the hydrogen bonds formed in ligand-protein interaction.

The table provided in Supplementary file S1 highlights the useful values of the H-acceptor distances and donor-acceptor distance, the donor angle, the donor atom, and the acceptor atom. These values are focusing on H-A and D-A distances, donor angle, and protein donor/acceptor atoms. The numerical distance values between specific atoms involved in the hydrogen bonding interactions are given in angstroms (Å). The “Distance H-A” and “Distance D-A” measures represent the “H-Acceptor” distance and “Donor-Acceptor” distance, respectively. These distances provide valuable information about the proximity of the atoms in the hydrogen bond. Their numerical values highlight the strength and geometry of H-bonding interactions in protein-ligand complexes. The specific values of these distances vary depending on the nature of donor and acceptor molecules. The smaller distances indicate the stronger H bonds, which improve the stability of the ligand-protein interaction.

For ligand ID 2297, the residue HIS163 forms an H-bond with a donor angle of 136.68 degrees using the acceptor ligand. The bond is formed between the N + atoms of the donor protein at the 1548 position with the N atom of the acceptor at the 2887 position. The proximity of the H-acceptor distance (2.36 Å) and donor-acceptor distance (3.18 Å) is significant, as the smaller distances indicate the existence of stronger hydrogen bonds for ligand ID 2297. However, for ligand 3172, the ASN142 participates in hydrogen bonding with a donor angle of 120.04 degrees. The donor angle is favorable for the hydrogen bond, enhancing its stability. The H-bond is formed between the protein/donor N atom at position 1359 and the ligand/acceptor oxygen atom at 2893. This introduces more interaction with potential variations in the bonding patterns. For ligand 3172, the optimal smaller values of H-acceptor (3.17 Å) and donor-acceptor (3.78 Å) distances indicate a stronger hydrogen bond.

Similarly, for ligand ID 4434, the residue ASN142 forms a donor angle of 122.86 degrees with the ligand acceptor. The H-bond is formed between the donor N atom at position 1359 and the ligand/acceptor oxygen atom at 2893, potentially leading to a more diverse hydrogen bonding network. The distance values (2.56 Å and 3.23 Å) of H-acceptor and donor-acceptor signify a hydrogen bond. The donor angle of 122.86 degrees, although different from the ideal 180-degree angle, still contributes to the stability of the hydrogen bond.

Continuing to ligand ID 4440, the H bond of the residue ASN142 is more significant as compared to residues GLU166, GLN189, and THR26. This residue has a smaller H-acceptor distance of 2.51 Å with a larger donor-acceptor distance of 4.04 Å. This donor angle contributes to the stability and strength of the hydrogen bond, as it aligns the donor and acceptor atoms optimally. The increased donor angle of up to 155.73 degrees contributes more to the geometrical stability of the H bond. An H-bond is formed between the donor atom at 1359 [Nam] and the ligand/acceptor atom at 2898 [O₂]. Finally, for ligand ID 5278, the residue THR26 has an optimal donor angle of 147.66 degrees. The H-bond is formed between the protein donor nitrogen atom at position 224 and the acceptor oxygen atom at position 2889. For this ligand, relatively larger distances (value 4.07 Å) of donor-acceptor help to decrease donor angle up to 147.66 degrees, although deviating from the ideal angle (180 degrees) still contributes to the stability of the hydrogen bond.

Physiochemical analysis of drug candidates

On the basis of our analysis, we shortlisted six drug compounds having more efficacy and strong interaction with specific target protease 3 CLpro. Table 6 provides a detailed description of these six drug compounds having the lowest binding energies. The description includes the Zinc ID, molecular formula, SMILES, and 2D structure of the drug compounds.

Table 6 Description of top-ranked six drug compounds.

Table 7 provides a few selected physiochemical properties of six potential drug candidates represented by their Zinc ID, molecular weight, logP, hydrogen bond donors and acceptors, number of rings, heavy atoms, heteroatoms, and fraction sp3. These properties play a vital role in the efficacy and bioactivity of the drug compounds. As these properties provide insight into the potential efficacy and pharmacokinetic profile of the drug, these are considered important in drug discovery. To understand the behavior of drug molecules in biological systems and determine their efficacy and safety, the analysis of physicochemical properties is crucial. The binding affinity of drug candidates to certain target proteases is greatly influenced by these properties, including molecular weight (Mol.Wt), LogP, rings, hydrogen bond donors (HBD), hydrogen bond acceptors (HBA), heavy atoms, heteroatoms, and the sp3 fraction.

Molecular weight, for instance, is a crucial consideration when assessing the pharmacokinetics and pharmacodynamics of drugs, as well as their capacity to pass across cell membranes and interact with target proteases. Compounds with optimal molecular weights will have greater potential to achieve beneficial absorption and tissue distribution profiles, whereas overly bulky molecules will risk membrane permeability constraints, consequently lowering bioavailability. Another important determinant is the octanol–water partition coefficient (logP), which is a surrogate for lipophilicity. A greater logP value generally corresponds to greater affinity for hydrophobic sites of target proteins, which can increase binding interactions. Too much lipophilicity, however, can also result in poor aqueous solubility, reduced oral absorption, and greater risk of toxicity through accumulation in lipid-rich tissues. Previous work30 has proven that there exists a robust correlation between logP and the small molecule binding affinity and highlighted the need for optimizing this property in lead development.

Hydrogen bond donors (HBD) and hydrogen bond acceptors (HBA) are also necessary to mediate specific molecular interactions with target residues. Although these characteristics enable strong and selective binding, too many hydrogen bond interactions could compromise membrane permeability and thus impact in vivo efficacy. Likewise, the presence and orientation of ring structure elements in a molecule control target selectivity and oral bioavailability. Structures with two or three rings tend to have a balance of rigidity and flexibility that favors productive target engagement. Conversely, extremely complex or bulky ring systems have the possibility of introducing steric hindrance, which will diminish binding affinity or enhance metabolic liability31. The presence of heavy atoms and heteroatoms, particularly nitrogen and oxygen, can pointedly affect drug-target interactions by forming hydrogen bonds or other electrostatic interactions with target residues. However, an imbalance of either too many or too few heteroatoms might impair the solubility of a medicine, membrane permeability, and other vital aspects of bioavailability.

The crucial property the sp3 fraction of a drug compound has been shown to influence its physicochemical properties and target affinity32. The sp3 fraction signifies the percentage of carbon atoms with three or more single bonds. The higher sp3 fractions are allied with increased water solubility, lower toxicity, and improved pharmacokinetic properties, while extremely low values can lead to poor bioavailability or reduced target selectivity.

Table 7 Physiochemical properties of selected drug compounds.

In the table, the molecular weight of these drugs ranges from 328.5 g/mol to 900.1 g/mol, indicating that the drugs vary broadly in size. We observed that the drugs with higher molecular weight tended to have a higher number of heavy atoms and heteroatoms. This proposes that larger molecules may be more effective in binding to specific targets. However, we also noted that some of the drugs with lower molecular weight have a higher fraction of sp3, which may specify a greater degree of three-dimensional complexity and potentially better binding interactions. Further, the logP values range from 2.732 to 5.527, with most of the drugs having a logP value between 3 and 5. Lipophilicity is key in drug development as it can affect drug absorption, distribution, metabolism, and excretion (ADME) properties. Furthermore, most of the values for HBD are ≥ 4 and HBA are ≥ 8 in this table. Higher HBD and HBA in a drug molecule raise its ability to form multiple hydrogen bonds with the protein’s binding site. This can improve the interactions and contribute to stronger binding affinity.

This table shows that the number of rings in the molecules varies from 4 to 9 and the heavy atom count ranges from 24 to 64. The stability, potency, and selectivity of a molecule can be impacted by the number of rings and heavy atoms present. Higher complexity and potential for interaction with the target receptor are generally found in molecules with multiple rings and heavy atoms.

According to this table, the range in heteroatom count is from 3 to 15, with most of them having between 12 and 15 heteroatoms. These numbers are significant because they can shed light on possible binding interactions between the drug molecules and their specific target. Another key property, the fraction sp3, which denotes the proportion of carbon atoms in the drug that are sp3 hybridized, ranges from 0.42 to 0.90. Most of the drugs have a fraction sp3 value greater than 0.50. A higher fraction sp3 value specifies that a drug has a higher degree of 3D character that can be useful for binding to certain targets.

Overall, these physicochemical properties of the selected six drugs play a crucial role in drug discovery and development and their understanding. These properties are vital for designing drugs with improved efficacy and pharmacokinetic properties. These drug compounds have the potential for repurposing for the treatment of various diseases. These physicochemical properties would be useful in further in vitro and in vivo studies that are necessary to determine the efficacy, safety, and dosage of these drugs for the treatment of specific diseases. This useful analysis of these drugs offers an initial point for drug repurposing research. This highlights the worth of considering the physicochemical properties of the drugs for repurposing purposes. Table 8 provides the brief description of our suggested drug candidates to repurpose against COVID-19 3 CLpro with their generic name and original medication.

Table 8 Proposed drugs for repurposing in this study.

In addition to their high binding affinities for 3CLpro observed in silico, the repurposing potential of the target compounds identified here—stanozolol, vinblastine, vinorelbine tartrate, vindesine, paritaprevir, and 41-O-demethyl rapamycin—has to be considered in the context of their pharmacological profiles and clinical safety. Stanozolol (https://www.drugs.com/stanozolol.html) is an anabolic steroid with anti-inflammatory activity and could potentially be used to mitigate bradykinin-induced complications in COVID-19. But its recognized toxicity, such as hepatotoxicity and cardiovascular disorders, makes it a less desirable choice, especially for individuals with pre-existing conditions. Vinblastine, vinorelbine, and vindesine, which are vinca alkaloids utilized in chemotherapy, block microtubule assembly and may have the potential to interfere with viral replication mechanisms. Although this theoretical benefit33, their intense immunosuppressive and myelosuppressive side effects are a serious issue when it comes to their application in COVID-19 patients, who tend to become already immune-dysregulated. Paritaprevir34, a protease inhibitor approved by the FDA for hepatitis C, fits with antiviral targets of SARS-CoV-2 proteases and is complemented by a good safety profile, and thus it is a more feasible candidate to be repurposed. 41-O-demethyl rapamycin, a rapamycin derivative, is an mTOR inhibitor and has been proposed to prevent cytokine storms and overactive immune responses in critical COVID-1935. However, its immunosuppressive effect can compromise viral clearance and risk sensitivity to secondary infections, and therefore, careful assessment is warranted. In conclusion, although these compounds are computationally promising, their clinical translation requires careful scrutiny of both therapeutic effects and adverse effects.

Since the compounds identified are approved or possess known clinical safety profiles, they are good candidates for quick advancement toward preclinical validation, e.g., in vitro antiviral activity assays, and may qualify to enter Phase II clinical trials aimed at assessing efficacy against SARS-CoV-2 in mild to moderate disease.

While our in silico predictions for the highest-performing drugs indicate high binding affinities and favorable inhibition mechanisms against 3 CLpro, a few discrepancies between these in silico findings and follow-up in vitro results may be anticipated. These discrepancies are due to several factors inherent in computational models and the complexity of biological systems. First, pharmacokinetic processes like absorption, bioavailability, and metabolism are not entirely considered in docking studies, which may lead to overestimation or underestimation of drug activity in biological settings. Second, off-target activities, where the drug interacts with unintended proteins, may disrupt the desired inhibitory activities of the compound. The static character of protein structures employed in molecular docking also differs from the dynamic conformational fluctuations of proteins in living biological systems, resulting in possible differences in binding affinity. In addition, protein-ligand interactions in silico do not always mimic the complexity of interactions in a cellular environment, where protein folding, co-factors, and solvent effects can influence the binding dynamics. These differences emphasize the need for experimental confirmation to validate the computational results, especially in determining drug effectiveness and their cytotoxicity, effective concentration (EC50), and viral load reduction in cell-based assays.

The results of this research identify several promising candidates for drugs with high binding capacity to the SARS-CoV-2 3CLpro protein, and their potential clinical repurposing is supported. Compared to drugs that are already approved for use against COVID-19—e.g., nirmatrelvir (a component of Paxlovid) and remdesivir—some of the compounds identified have higher in silico binding capacity, indicating that they could provide alternative or complementary therapeutic solutions. While drug repurposing presents great benefits concerning safety profiles and shortened development periods, it comes with challenges. Side effects like organ toxicity or unwanted immune reactions need to be thoroughly considered, especially given that these drugs were not originally intended for the inhibition of viruses. Additionally, in combination therapy, repurposed drugs have the potential to produce unforeseen drug-drug interactions, which may impact efficacy or enhance toxicity. Moreover, there are significant regulatory barriers that need to be overcome before repurposed drugs can be licensed for new indications. Even though these drugs have established safety data, regulatory bodies like the FDA still need proof of efficacy for the new indication through well-conducted clinical trials, which can be time- and resource-consuming. Notably, the disconnect between preclinical success and clinical utility remains a key obstacle. Chemicals that work well in silico or even in vitro can flop in the clinic because of variability in pharmacokinetics, patient variability, or complexity in disease. Hence, although the results here are encouraging and form a good basis for further research, translational research, and clinical testing will be crucial to achieve their therapeutic potential in the real world.

To summarize our study in a non-technical manner, we applied computer-based strategies in this research to screen for known drugs that could be effective against COVID-19 by inhibiting a critical viral protein named 3 CLpro, which is vital for the replication of the virus. By integrating molecular docking (to estimate drug-protein binding affinity) with machine learning algorithms (to refine prediction performance), we tested a filtered list of approved and clinically investigated drugs. Among the leading contenders were drugs such as paritaprevir (an antiviral drug), stanozolol (a steroid drug), and vinblastine derivatives (employed in cancer chemotherapy). These medicines demonstrated high prospects of inhibiting the activity of the viral protein. Since these drugs are already familiar in the clinical environment, they might be investigated faster in laboratory experiments and potentially progressed to clinical trials for treating COVID-19.

This research has important public health consequences, particularly for low-resource environments. By repurposing established drugs with well-characterized safety profiles, therapies can be expedited, cutting the time and expenses usually required for new drug development. This strategy could facilitate more rapid, more equitable access to effective treatments, aiding in enhancing global responses to ongoing and emerging pandemics.

Conclusion

Our study successfully identified the approved drugs to repurpose for COVID-19 treatment by targeting the SARS-CoV-2 main protease 3CLpro. Using a computational framework combining molecular docking and machine learning, we identified six drugs with high binding affinities from − 15 kcal/mol to −13 kcal/mol and favorable binding energies, indicating their potential as COVID-19 therapeutics. Our proposed Decision Tree Regression (DTR) model outperformed other regression models in predicting binding affinities. Our model has the values of R²=0.97 and RMSE = 1.68 with CDK fingerprints and the values of R²=0.97 and RMSE = 1.57 with MACCS fingerprints. Further, we analyzed the physiochemical properties of the shortlisted compounds to understand the behavior of drug molecules in biological systems and determine their efficacy and safety. Our analysis also demonstrated that the selected compounds effectively inhibit the viral enzyme 3CLpro. These findings highlight the potential of in-silico approaches for drug repurposing and provide insights into the development of potential therapeutic candidates for COVID-19.

Even with the encouraging outcomes, one key limitation of this research is its use of in silico techniques exclusively, which do not account for the full richness of biological systems, such as protein mobility, pharmacokinetics, and off-target activity. Our current study focuses on molecular docking to predict binding affinity, however, in our future work, we will include MD simulations for capture the dynamic nature of protein-ligand interactions. This will help to dynamically validating our current docking simulations and machine learning outputs. Experimental validation by in vitro and in vivo studies is also necessary to affirm the efficacy and safety of these candidates. Exploring synergistic interactions between these compounds in combination with conventional or new drugs may further amplify therapeutic benefits. Our approach presents a useful and versatile strategy extendable to other viral and non-viral illnesses where drug repurposing represents a feasible alternative. This work emphasizes the paramount importance of computer-aided tools in streamlining drug discovery, particularly with the advent of emerging pandemics and global health crises. To bridge this gap between preclinical forecasting and clinical use, in vitro and in vivo testing, given the highest priority, followed by adequate clinical trials, is necessary to establish therapeutic efficacy and guarantee patient safety.