Introduction

Although tracheostomy is crucial for severe central nervous system (CNS) injuries, it can also lead to a range of complications, including bleeding, unstable oxygen saturation, infections, and granulation tissue formation1. Therefore, extubation assessment is an important step in reducing complications and promoting functional recovery in patients. Factors such as age, swallowing ability, cough reflex, and infection status are associated with successful extubation in patients with CNS injuries2. Additionally, some studies have shown that swallowing function is a reliable indicator of extubation outcomes in stroke patients3. Most current studies on extubation predictors rely on traditional statistical analyses4,5,6, which have provided some clinical insights for extubation decisions. However, the extubation failure rate remains above 10%7. In order to improve the accuracy of extubation predictions, further research on new methods is needed.

Kunming lies on the Yunnan-Guizhou Plateau at an altitude of approximately 1900 meters8. High-altitude environments are characterized by low oxygen and low air pressure, which directly affect patients’ respiratory and cardiovascular systems. Richalet et al. concluded that long-term hypoxia in high-altitude areas can cause pulmonary vasoconstriction and remodeling, increase the risk of pulmonary hypertension, worsen right ventricular burden, and impair the function of the circulatory and respiratory systems9. One common reason for extubation failure in the ICU is cardiovascular system dysfunction10. Moreover, high-altitude CNS injury patients are more likely to develop severe brain edema, resulting in higher disability and mortality rates than in lowland areas11. Overall, these patients with poor prognosis may have a lower extubation success rate and an increased risk of reintubation. However, it is currently unclear which factors affect the extubation outcomes of CNS injury patients in high-altitude regions. Therefore, one of the aims of this study is to explore the factors influencing extubation outcomes in such patients.

The tracheostomy decannulation processes in both the rehabilitation department and other departments emphasize formalized protocols and interprofessional collaboration12,13. In acute care environments, decannulation decisions are typically driven by factors such as the patient’s ability to tolerate tracheostomy tube capping, the effectiveness of cough, and the control of secretions14. Rehabilitation departments, by contrast, incorporate comprehensive assessments that also evaluate functional status, swallowing ability, and the patient’s readiness15. Interdisciplinary tracheostomy teams, comprising rehabilitation specialists, respiratory therapists, speech-language pathologists, and nurses, improve decision-making and customize decannulation protocols to meet each patient’s specific needs16. Despite the comprehensive and personalized approach of the rehabilitation department, there is currently a lack of research using machine learning (ML) methods to predict extubation outcomes.

Traditional statistical methods for predicting extubation typically use a small set of predetermined variables and a simple model structure based on linear relationships, which may overlook complex nonlinear relationships in clinical data. Modern ML techniques can handle high-dimensional data without strict assumptions and can capture complex interactions hidden in the data, which can substantially improve prediction accuracy and sensitivity17. ML methods can also make full use of data from mechanical ventilators, multi-parameter monitoring devices, and physiological signals, providing a more personalized and dynamic basis for prediction. Another major advantage of ML models is the availability of interpretability tools. SHAP (Shapley Additive exPlanations) can unravel complex models, quantify the impact of each input feature, and help clinicians understand the model’s decision logic18. This approach not only helps identify key factors influencing extubation outcomes but also provides a theoretical basis for developing clinical interventions based on these factors.

However, research on predictive models for extubation outcomes in high-altitude regions is limited. Therefore, the aim of this retrospective study was to determine the optimal ML models for this patient population and to identify factors associated with successful extubation in tracheostomized patients on the Plateau. Eight conventional models were developed, as well as a Weighted Posterior Vote (WPV) ensemble model. The predictive performance of the models was then compared using evaluation metrics including accuracy (ACC), area under the receiver operating characteristic curve (AUC), sensitivity, specificity, Youden’s index, positive predictive value (PPV), and negative predictive value (NPV). Finally, SHAP analysis was used to evaluate the influence of features on the models and on the extubation success rate.

Materials and methods

Study design

The study design followed the STROBE guidelines for a retrospective, nonsynchronous study19. The data were derived from 501 adult tracheostomy patients (aged over 18) at the Department of Rehabilitation, Second Affiliated Hospital of Kunming Medical University, between August 2016 and November 2022, of whom 196 were successfully extubated. Upon admission, patients underwent a functional assessment by a skilled clinical assessor. Inclusion criteria: (1) tracheostomy patients; (2) patients undergoing extubation in our department; (3) patients with CNS injuries; (4) age 18 to 85 years. Exclusion criteria: (1) patients with unstable conditions or who needed to be transferred to other departments for treatment; (2) patients with incomplete clinical data.

Informed consent statement

Informed consent to participate was waived by the Ethics Committee of the Second Affiliated Hospital of Kunming Medical University (IRB). In cases where participants cannot be located and the research project does not involve personal privacy or commercial interests, informed consent may be waived with approval from the ethics committee, as outlined in the Ethical Review Procedures for Biomedical Research Involving Human Participants by the Chinese National Health Commission.

Ethics approval

The study design was approved by the Ethics Committee of the Second Affiliated Hospital of Kunming Medical University (Ethics approval number: Shen-PJ-ke-2023-28). The study has been performed in accordance with the Declaration of Helsinki.

Data collection

The study collected demographic, behavioral, and clinical data from tracheostomy patients: (1) demographic data, such as sex and age; (2) behavioral factors, such as smoking history, primary disease, paralysis, consciousness and cognitive status, and medical history (epilepsy, hypertension, diabetes, heart disease); (3) clinical indicators, including wet rales, dysphagia, pulmonary infection, sputum culture, sputum viscosity, upper respiratory tract structural abnormalities (URSA), and Glasgow Coma Scale (GCS); and (4) blood physiology, including procalcitonin, interleukin 6 (IL-6), high-sensitivity C-reactive protein (hs-CRP), white blood cells, neutrophils, lymphocytes, monocytes, hemoglobin, and albumin. All categorical variables were encoded using label encoding. The specific numerical codes are presented in the Supplementary Material (Table S1). All the data were managed and stored centrally by the hospital’s medical records department. The diagnostic criteria for some clinical indicators are detailed in the Supplementary Materials. The data of all included patients were complete, so the dataset contained no missing values.

Feature screening

To address potential data imbalance, we performed feature selection and standardization. Logistic regression was used to identify the factors related to extubation outcomes. The analysis was conducted using the Statsmodels package in Python 3.12.4, with a significance level of 0.05. To avoid overlooking hidden information, variables of borderline significance (P close to 0.05) were also included in the model construction20. Next, the selected features were standardized. The mean and standard deviation of each feature were calculated on the training set, and the training data were transformed by z-score (zero-mean) normalization to have a mean of 0 and a standard deviation of 1. To prevent information leakage from the test set into the training phase, the test set was normalized using the mean and standard deviation calculated from the training set.
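The leakage-free standardization described above can be illustrated with a minimal, dependency-free sketch. It is not the study's code: the function names and the toy values (a GCS score and an albumin level) are hypothetical, and the population standard deviation is used here as an assumption.

```python
from statistics import mean, pstdev

def fit_standardizer(train_rows):
    """Return per-feature (mean, std) estimated from the training rows only."""
    columns = list(zip(*train_rows))
    # A constant column would have std 0; fall back to 1.0 to avoid dividing by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def standardize(rows, params):
    """Apply z-score scaling with previously fitted training statistics."""
    return [[(x - mu) / sd for x, (mu, sd) in zip(row, params)] for row in rows]

# Hypothetical toy values (GCS score, albumin in g/L); not study data.
train = [[8.0, 30.0], [10.0, 34.0], [12.0, 38.0], [14.0, 42.0]]
test = [[11.0, 36.0], [15.0, 44.0]]

params = fit_standardizer(train)
train_z = standardize(train, params)  # mean 0 and std 1 per feature
test_z = standardize(test, params)    # reuses the training mean/std: no leakage
```

The key design point is that `fit_standardizer` never sees the test rows, so the test set is scaled with statistics it did not contribute to.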

Model selection and construction

A total of eight models were constructed: k-nearest neighbors (K-NN), support vector machine (SVM)21, Gaussian naive Bayes (GNB), decision tree (DT), random forest (RF)22, extra tree classifier (ET), gradient boosting (GB), and logistic regression (LR). Modeling parameters were selected via cross-validation and hyperparameter optimization. The same data split was used for all models to ensure comparability. An 80%/20% training/test split was chosen to balance model capacity with evaluation reliability, restrain overfitting, and make efficient use of computational resources20,23. Data preprocessing and dataset splitting were carried out in Python 3.12.4 using the NumPy and scikit-learn (Sklearn) packages24. The details of hyperparameter optimization are provided in the Supplementary Material (Table S2). The workflow of the model construction is shown in Fig. 1.
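As an illustration of a consistent 80%/20% split, the sketch below fixes a random seed so that every model would see the same partition. Stratification by outcome, the seed value, and the function name are assumptions of this sketch; the text does not state how the split was drawn. The label counts mirror the cohort size reported in the Study design section (196 successes among 501 patients).

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) preserving the class ratio in both parts."""
    rng = random.Random(seed)  # fixed seed -> identical split on every call
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 1 = successful extubation, 0 = failure (counts taken from the cohort description).
labels = [1] * 196 + [0] * 305
train_idx, test_idx = stratified_split(labels)
```

With these counts the sketch yields 401 training and 100 test patients, with roughly 39% successes in each part.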

Fig. 1

Feature selection and model optimization and development. (A) Among all the clinical data collected in this study, the features marked with red stars were those associated with extubation outcomes and were utilized for subsequent modeling. On the left is the distribution plot of the model’s test and training sets, along with the proportion of extubation-success patients in each subset. (B) Flowchart of feature selection and model development. (C) Comparison of the ACC and AUC before and after optimization for each model. Accuracy (ACC), area under the receiver operating characteristic curve (AUC), K-nearest neighbors (K-NN), support vector machine (SVM), Gaussian naive Bayes (GNB), decision tree (DT), random forest (RF), extra tree classifier (ET), gradient boosting (GB), and logistic regression (LR).

Feature analysis

SHAP values were used to assess the significance of each feature. The analysis, implemented in Python 3.12.4 via the SHAP package25, employs a game-theoretic approach to evaluate the importance of each feature. SHAP values not only rank features by their importance to the model but also reveal the impact of each feature on patient extubation outcomes23. Only models with an AUC greater than 0.85 and an ACC greater than 80% are considered in the Discussion section.
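To make the game-theoretic idea concrete, the following sketch computes exact Shapley values for a single instance by enumerating all feature coalitions. It is only an illustration: features outside a coalition are replaced by a fixed reference value (an assumption; the SHAP package typically averages over a background dataset and uses faster estimators), and the toy scoring function, its coefficients, and the patient values are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical toy scoring function, not the study's model.
# Inputs: [dysphagia (0/1), GCS score, elevated IL-6 (0/1)].
def toy_score(x):
    dysphagia, gcs, il6_high = x
    return 0.2 + 0.04 * gcs - 0.3 * dysphagia - 0.1 * il6_high * dysphagia

patient = [1, 9, 1]      # dysphagia present, GCS 9, IL-6 elevated
reference = [0, 15, 0]   # no dysphagia, GCS 15, IL-6 normal
phi = shapley_values(toy_score, patient, reference)
```

The contributions in `phi` sum to the difference between the patient's score and the reference score, which is the additivity property that SHAP summary plots rely on. A negative contribution pushes the prediction toward extubation failure.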

Ensemble model construction

An ensemble model was constructed using the Sklearn package in Python 3.12.4. Emerging methods such as WPV offer a way to incorporate model confidence levels into the final decision, addressing issues related to class imbalance and overfitting in noisy environments26. Specifically, we used a hard-voting ensemble. Hard voting is suited to predicting distinct class labels, whereas soft voting combines the class probabilities predicted by the base models27. Each base classifier produces a binary output (1 or 0), and WPV votes on the basis of the frequency of each prediction. For example, if RF predicts 1, GNB predicts 1, and K-NN predicts 0, the final prediction is 1 because it is the most common outcome. The diversity and complementarity of the models were taken into account when assembling the ensemble28.
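The voting example above (RF predicts 1, GNB predicts 1, K-NN predicts 0) can be sketched as a weighted hard vote. This is a minimal illustration rather than the study's WPV implementation: the function name, the weight values, and the tie-breaking rule are assumptions.

```python
def weighted_hard_vote(predictions, weights):
    """Return the label with the largest total weight among the base classifiers."""
    totals = {}
    for name, label in predictions.items():
        totals[label] = totals.get(label, 0.0) + weights[name]
    # Ties are broken toward the smaller label (an assumption of this sketch).
    return min(totals, key=lambda label: (-totals[label], label))

predictions = {"RF": 1, "GNB": 1, "K-NN": 0}
equal_weights = {"RF": 1.0, "GNB": 1.0, "K-NN": 1.0}
final_label = weighted_hard_vote(predictions, equal_weights)  # 1
```

With equal weights the rule reduces to a simple majority vote. Giving one classifier a weight larger than the other two combined would let it override them, which is why the choice of weights matters.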

Results

Statistical analysis

Logistic regression was performed with the extubation outcome as the dependent variable and the other factors as independent variables. The extubation outcome was associated with several factors: disease duration (P = 0.002), underlying condition (P < 0.001), quadriplegia (P < 0.001), GCS score (P = 0.040), swallowing function (P < 0.001), sputum culture pathogens (P = 0.013), interleukin-6 levels (P = 0.023), lymphocyte count (P = 0.046), and airway structural abnormalities (P < 0.001). To avoid omitting relevant information, features with P-values close to 0.05 were also included in the model: mental disorders (P = 0.05), pulmonary embolism (P = 0.087), and albumin levels (P = 0.081). The statistical analysis results are presented in Table 1.
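The screening rule can be expressed as a short filter over the P-values reported in this paragraph. The significance level of 0.05 comes from the text; the upper cutoff of 0.10 for "close to 0.05" is a hypothetical value chosen for this sketch, since the text does not state one.

```python
# P-values as reported above; "P < 0.001" is stored as its upper bound 0.001.
p_values = {
    "disease duration": 0.002,
    "underlying condition": 0.001,
    "quadriplegia": 0.001,
    "GCS score": 0.040,
    "swallowing function": 0.001,
    "sputum culture pathogens": 0.013,
    "interleukin-6": 0.023,
    "lymphocyte count": 0.046,
    "airway structural abnormalities": 0.001,
    "mental disorders": 0.05,
    "pulmonary embolism": 0.087,
    "albumin": 0.081,
}

ALPHA = 0.05           # significance level stated in the text
BORDERLINE_MAX = 0.10  # hypothetical cutoff for "close to 0.05"

significant = [name for name, p in p_values.items() if p < ALPHA]
borderline = [name for name, p in p_values.items() if ALPHA <= p <= BORDERLINE_MAX]
selected_features = significant + borderline
```

Under these assumptions, nine features pass the 0.05 threshold and three more are retained as borderline, giving twelve modeling features.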

Table 1 Logistic regression results for each variable.

Model performance

Following feature selection, all models were developed. Subsequently, the hyperparameters of each model were tuned by maximizing classification accuracy. To mitigate overfitting, model parameters were further optimized through tenfold cross-validation. The ACC and AUC of each model before and after optimization are shown in Fig. 1C. The thresholds that maximized the Youden index in ROC curve analysis were: RF (0.5), ET (0.51), K-NN (0.58), and GB (0.53)29. Model performance is presented in Fig. 2 and the Supplementary Materials (Figs. S1–S3, Table S3).
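Threshold selection by the Youden index can be sketched as a scan over candidate cutoffs, keeping the one that maximizes J = sensitivity + specificity - 1. The labels and predicted probabilities below are hypothetical toy values, and treating a score at or above the threshold as a positive prediction is an assumption of this sketch.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        # A score at or above the threshold counts as a positive prediction.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j

# Hypothetical predicted probabilities of successful extubation (not study data).
y_true = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.58, 0.70, 0.75, 0.80, 0.90]
threshold, j = youden_threshold(y_true, scores)
```

For this toy data the scan selects a threshold of 0.58, where sensitivity and specificity are both 0.8.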

Fig. 2

The performance of the four models with an AUC greater than 0.85 and an ACC greater than 80%, including the confusion matrix, AUC curve, and learning curve for each model. Accuracy (ACC), area under the receiver operating characteristic curve (AUC), sensitivity (SEN), specificity (SPE), Youden’s index (YDI), positive predictive value (PPV), negative predictive value (NPV), and K-nearest neighbors (K-NN) were used.
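For reference, the evaluation metrics listed in the caption follow directly from the four cells of a confusion matrix. The sketch below uses hypothetical counts for a 100-patient test set, not the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, SEN, SPE, YDI, PPV and NPV from confusion-matrix counts."""
    sen = tp / (tp + fn)  # sensitivity: true positive rate
    spe = tn / (tn + fp)  # specificity: true negative rate
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "SEN": sen,
        "SPE": spe,
        "YDI": sen + spe - 1.0,  # Youden's index
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

# Hypothetical counts (positive = successful extubation); not study results.
metrics = classification_metrics(tp=32, fp=6, tn=55, fn=7)
```

Reporting PPV and NPV alongside sensitivity and specificity is useful here because the two outcome classes are not equally frequent.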

The WPV method was employed for model integration to achieve optimal performance across all evaluation metrics. The RF, GNB, and k-NN models were integrated through this WPV approach. The ensemble model performed best when the weights of the three models were set to a 1:1:1 ratio. The model performance of the WPV is shown in Fig. 3.

Fig. 3

Confusion matrices and AUCs of the WPV ensemble model and its submodels. Three confusion matrices of the independent models that make up the ensemble model (top), their AUC curves (bottom), evaluation metrics of the ensemble model (left middle), and the confusion matrix and AUC curve of the ensemble model (right middle). Weighted Posterior Vote (WPV), Accuracy (ACC), Area Under the Receiver Operating Characteristic Curve (AUC), Sensitivity (SEN), Specificity (SPE), Youden’s Index (YDI), Positive Predictive Value (PPV), Negative Predictive Value (NPV), and K-Nearest Neighbors (K-NN).

Features contribution analysis

The contribution of each feature to the models and its impact on the extubation success rate were determined by calculating SHAP values. Swallowing dysfunction was consistently the most influential factor in all models, as patients with swallowing difficulties were often classified as negative (i.e., extubation failure). Lower GCS scores also played a significant role, leading the models to lean towards negative classifications. The SHAP values for each model are shown in Fig. 4 and the Supplementary Materials (Fig. S4).

Fig. 4

Bar summary plots and SHAP summary plots of the four models with AUCs greater than 0.85 and ACCs greater than 80%. In the bar chart, the top three features contributing to the model are highlighted in red. In the SHAP plot, points to the left of the zero coordinate represent failed extubation, while points to the right represent successful extubation. Shapley additive explanations (SHAP), Glasgow Coma Scale (GCS), Upper Respiratory Tract Structural Abnormalities (URSA), Interleukin 6 (IL-6), Albumin (ALB), Pulmonary Embolism (PE), K-Nearest Neighbors (K-NN).

Discussion

In this study, logistic regression was initially conducted on all variables related to extubation outcomes. Subsequently, eight ML models were developed based on these variables, and their predictive performance for extubation success was evaluated using multiple metrics. The SHAP analysis revealed that among all variables, disease duration, primary disease, quadriplegia, GCS score, swallowing function, pathogens in sputum culture, interleukin-6 levels, lymphocyte count, and abnormal respiratory structures were associated with successful extubation. Under optimal parameter selection, random forest, extra trees, k-NN, gradient boosting, and WPV all performed well.

In recent years, the use of ML models in medical prediction has been increasing, but performance in predicting extubation differs considerably among models. For instance, LR is widely used because of its strong interpretability, but it has limitations in modeling complex nonlinear relationships. Wang et al. developed a predictive model using multivariable logistic regression, but its AUC was only 0.793, indicating limitations in handling high-dimensional data30. Another study compared the performance of RF (AUC = 0.787), linear regression (AUC = 0.762), artificial neural network (AUC = 0.763), and SVM (AUC = 0.740). With all AUC values below 0.8, none of these models achieved satisfactory results31. In contrast, Huang et al. achieved an AUC of 0.976 with an RF model built using time-series respiratory parameters, which was significantly better than SVM and GNB32. Although lower than that of Huang et al., the AUC values of the four well-performing models in our study were all above 0.85. Given the unique physiological conditions in high-altitude areas, we still believe that their performance is clinically significant. Moreover, the study by Huang et al. used time-series ventilator-derived parameters as features, which may be more advantageous for improving the predictive power of a model.

Mixed types of feature values can affect model performance. In this study, the modeling features include both continuous and categorical variables. Mixed-type data reflect clinical reality better than single-type data. Different ML algorithms have their own advantages and limitations when dealing with mixed features. DT and tree-based models such as random forests can handle categorical variables without extensive preprocessing and perform reliably on high-dimensional, heterogeneous datasets33. Continuous and categorical variables differ in scale, distribution, and information representation. Feeding them directly into scale-sensitive models such as SVM and K-NN may make the model excessively sensitive to one type of feature while ignoring crucial information from the other. Therefore, it is often necessary to normalize or standardize continuous variables. Techniques such as one-hot encoding, label encoding, and neural embeddings convert categorical variables into a unified numerical representation, addressing scale inconsistency during model training34,35.

Reducing the risk of overfitting and enhancing the generalization of models is crucial. Because features differ in range and distribution, data bias and model overfitting can easily occur, especially when the sample size is small36. Therefore, hyperparameter optimization and cross-validation are particularly important to ensure that the model not only performs well but also generalizes well37,38. Additionally, the feature selection process is especially critical for mixed data. This process focuses on retaining the most informative features while eliminating redundant and noisy ones, providing a key pathway to enhance the model’s learning efficiency39. Hence, the aforementioned methods were adopted to further strengthen the model’s performance and learning depth. Furthermore, mitigating overfitting also somewhat enhanced the generalization ability of each model40.

In the present study, a WPV ensemble model was developed. Bagging, boosting, and stacking are conventional ensemble methods that have been widely researched. Bagging trains base models on bootstrap samples, thereby decreasing variance and improving model stability41,42. Bagging’s simplicity and ease of implementation make it a robust baseline method, although its uniform voting mechanism may not fully exploit the differing confidence levels of individual models42. Although boosting can achieve high accuracy on training data, its performance may degrade on unseen data when noise is present. Consequently, the benefits of boosting are often offset by decreased robustness in noisy environments43,44. Finally, stacking uses a meta-learner to combine the outputs of different models, thereby capturing more intricate relationships in the data45. While stacking often provides better performance in diverse settings, its reliance on an additional meta-learning layer can increase the complexity of model selection and parameter tuning45.

In contrast to the methods mentioned above, emerging methods such as WPV offer a new way to incorporate model confidence levels into the final decision, addressing issues related to class imbalance and overfitting in noisy environments26,46. This mechanism complements the bagging and boosting frameworks and could improve ensemble performance in multiple medical domains. In the present study, the WPV ensemble model showed superior performance in all evaluation dimensions compared with the individual models. Because our dataset was derived from a single center, our model may have reduced generalizability. The WPV method could partially alleviate this issue. First, by combining the outputs of complementary models, ensemble models can better capture the underlying distribution of the data, alleviating the risk of individual models learning inadequately or underfitting26. Second, the voting mechanism is highly scalable, making it easy to add or remove models from the ensemble. This is extremely beneficial for managing extensive data flows and intricate tasks, enabling the model’s scale and structure to be adjusted flexibly according to specific requirements46.

Finally, an analysis using SHAP was conducted to evaluate the impact of each feature on the extubation outcomes. The SHAP values suggested that patients with swallowing dysfunction had a significantly lower extubation success rate. Previous retrospective studies also reported a significant correlation between swallowing function and the extubation success rate5,47,48. Damage to the glossopharyngeal (IX), vagus (X), and hypoglossal (XII) nerves could impact the muscles used in swallowing, leading to dysphagia in individuals with CNS injury49. Precise coordination of breathing and swallowing is crucial for airway protection50. Dysfunction of the swallowing musculature increases the risk of aspiration and pulmonary infection, which are common causes of extubation failure and prolonged intubation in critically ill patients51,52. Many studies have emphasized the need for clinicians to prioritize swallowing training after extubation53,54. However, the results of our study suggest that swallowing function training before extubation should also be emphasized, as it might help reduce postextubation dysfunction and decrease the likelihood of extubation failure. Training and assessing swallowing are crucial abilities in rehabilitation medicine; early intervention in rehabilitation could enhance the extubation success rate by enhancing swallowing function prior to extubation. Nevertheless, this process must highlight individualization, and specific strategies also require additional investigation. Swallowing difficulties are present in 40% of patients with quadriplegia49. They might also have problems like diaphragm imbalance and respiratory issues55,56. Given the low oxygen and low pressure environment of the plateau, patients with quadriplegia may need to focus more on their respiratory training. In severe cases, diaphragm pacemakers may be considered as interventions57.

The influence of consciousness level on the extubation success rate remains controversial58. Nevertheless, the GCS score consistently ranked among the top three contributors to the models’ classifications. Patients with higher GCS scores showed significantly greater extubation success rates. This finding reinforces previous studies demonstrating that consciousness level strongly correlates with extubation outcomes59,60. Furthermore, cognitive impairment contributes to swallowing disorders, aspiration risk, and subsequent pneumonia, potentially delaying extubation in patients with tracheostomies61. Impaired consciousness leads to prolonged bed rest, thereby increasing the risk of extubation failure62. If consciousness disorders are truly related to extubation outcomes, then this issue deserves particular attention in high-altitude areas. The low pressure and insufficient oxygen supply in high-altitude regions can lead to cerebral tissue hypoxia, resulting in vasodilation, disruption of the blood–brain barrier, and cerebral edema11,63. In extreme situations, these alterations may cause a patient to progress quickly from blurred consciousness to coma. Therefore, the consciousness level of patients with CNS injuries in high-altitude areas should receive greater emphasis.

The results showed that patients with higher ALB levels were more likely to be classified as extubation failures in both models. Following CNS damage, various forms of cell death, including apoptosis, ferroptosis, and mitochondrial dysfunction, can deplete a substantial amount of protein resources64,65. Previous research revealed that patients with a protein intake greater than 1.2 g/kg/day had increased serum albumin levels, and their median duration of mechanical ventilation was significantly shorter than that of patients with lower protein intake66. The authors emphasized that higher protein intake is beneficial for the successful extubation of patients with prolonged endotracheal intubation66,67. These findings suggest that increasing serum albumin levels through enhanced nutritional support or albumin supplementation prior to extubation may be an effective strategy to improve extubation success rates.

In our study, lymphocyte and IL-6 levels influenced extubation outcomes. The model is more likely to classify samples with lymphocytopenia and elevated IL-6 levels as having a higher risk of extubation failure. A pathway analysis study highlighted IL-6 as a key biomarker linked to inflammatory processes, with elevated levels serving as a reliable indicator of active inflammation in patients68. Moreover, another study has suggested that lymphopenia could indicate a state of chronic inflammation and prolonged immune depression in patients69. Although the exact mechanism by which lymphocytopenia impacts extubation outcomes remains unclear, patients with lymphocytopenia and immunosuppression exhibit significantly reduced extubation success rates and elevated in-hospital mortality70,71. Patients with lymphopenia were also found to present with thrombocytopenia and hypoalbuminemia, suggesting a state of systemic compromise72. This finding suggested that preextubation anti-inflammatory treatment combined with protein intake might be a potential strategy to improve extubation success rates.

This study has several limitations. First, although our models outperformed those of several previous studies, they are not yet suitable for clinical use, and more clinical research is necessary in this area. Second, the features included in this study may have been limited, and the model’s reliability could be enhanced by incorporating a more comprehensive set of features during the modeling process73. Third, the data collected in this study lack a time dimension, so the results cannot explain the dynamics of the disease course in tracheostomy patients. Moreover, this study did not find any distinctive features specifically associated with high-altitude pulmonary edema outcomes. In future studies, additional indicators such as alveolar oxygen partial pressure (or blood oxygen saturation), plateau pressure, blood urea nitrogen, heart rate, positive end-expiratory pressure, and creatinine will be included74,75,76. Finally, data collection from a single center may reduce the generalization ability of the model. Therefore, future research should adopt multicenter data collection, which can make the results more convincing and increase generalizability.

Conclusion

In summary, the models developed in this study, including the random forest, extra tree classifier, k-nearest neighbor, gradient boosting, and WPV ensemble models, demonstrated strong performance in terms of AUC and accuracy, indicating good ability to predict extubation outcomes. According to SHAP analysis, dysphagia, quadriplegia, altered consciousness, low white blood cell levels, and abnormal levels of interleukins and lymphocytes may have a negative impact on extubation outcomes. However, the data were collected from a single center, which may limit generalizability. Future work should supplement the dataset with data from multiple centers. Furthermore, future studies should incorporate a wider range of clinical indicators, particularly those associated with the respiratory system.