Machine learning based predictive model of the risk of Tourette syndrome with SHAP value interpretation: a retrospective observational study

Li, Aimin; Liu, Yueying; Luo, Yufan; Xiao, Xue; Xiao, Wei; Xie, Ruijin; Deng, Xianhui; Chen, Zhe; Zhou, Qian; Gong, Yue; Chen, Zhen; Xu, Hua

doi:10.1038/s41598-025-01874-7

Article
Open access
Published: 26 May 2025

Aimin Li¹^na1,
Yueying Liu²^na1,
Yufan Luo²^na1,
Xue Xiao²,
Wei Xiao²,
Ruijin Xie^1,2,
Xianhui Deng³,
Zhe Chen¹,
Qian Zhou¹,
Yue Gong¹,
Zhen Chen¹ &
…
Hua Xu²

Scientific Reports volume 15, Article number: 18336 (2025)

Abstract

Tourette syndrome is a relatively prevalent neurological condition, particularly among children, characterized by sudden, involuntary, repetitive movements or vocalizations. Contemporary diagnostic approaches for Tourette syndrome (TS) primarily rely on behavioral assessments, which pose challenges due to symptom overlap with other psychiatric disorders and significant inter-individual variability. Establishing a machine learning–based predictive model for predicting the risk of TS could potentially enhance diagnostic precision and treatment effectiveness. The investigation was conducted at the Department of Pediatrics, Affiliated Hospital of Jiangnan University, spanning the period from January 2022 to October 2024. Clinical data, encompassing complete blood counts, liver and kidney function assessments, blood glucose levels, and serum electrolyte analyses, were collected. Feature selection was conducted using Boruta and multivariable logistic regression analyses, and model construction was undertaken employing 9 distinct machine learning algorithms. 10 distinct features were selected for machine learning algorithm development, and our results indicated that the Gradient Boosting Machine algorithm is the optimal model. Our study successfully established a predictive model for the risk of Tourette syndrome using Gradient Boosting Machine, and the SHAP method highlighted the key roles of β2-microglobulin and serum 25-hydroxyvitamin D₃ levels in predicting TS risk.

Introduction

Tourette syndrome (TS) is the most common chronic tic disorder affecting children and is characterized by multiple motor tics and at least one vocal tic¹. Tic disorders typically begin in childhood, usually between the ages of 5 and 10, affecting both males and females. However, they are more prevalent among males, with research indicating that males are 3–5 times more likely to be diagnosed with Tourette syndrome compared to females². Tourette syndrome may interfere with concentration and learning, leading to academic challenges ^1,3. Despite the noticeable increase in Tourette syndrome and tic-like behaviors, particularly since the onset of the COVID-19 pandemic, underdiagnosis remains common. This underdiagnosis may result in serious consequences, including chronic stress, low self-esteem, and difficulties in personal and professional life⁴. Compared to other tic disorders, TS is often considered to affect children more significantly because TS can be more disruptive and noticeable than having only motor or vocal tics, as in other tic disorders⁵. Additionally, while other tic disorders can involve complex tics, TS is more commonly associated with them, and the combination and complexity of tics in TS can have a greater impact on a child’s social interactions, school performance, and overall quality of life⁶. Therefore, early diagnosis of tic disorders, particularly Tourette syndrome, is crucial for timely intervention and prevention of the development of more severe symptoms and complications⁷.

Currently, the diagnosis of TS is primarily based on behavioral assessments and criteria outlined in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5)⁵. Early diagnosis of TS in clinical settings remains challenging due to several factors. Firstly, there is no blood test, brain scan, or other objective method to diagnose TS. Diagnosis relies solely on the observation and documentation of the patient’s history of tics. Secondly, certain tics, such as eye blinking or throat clearing, can be mistaken for other issues, including allergies, vision problems, or nervous habits. This can lead to misdiagnosis or delayed diagnosis⁸. Thirdly, the presence of comorbidities can complicate the diagnostic process, as symptoms may overlap or mask one another. Additionally, individuals with TS and their families may feel embarrassed or reluctant to discuss tics, especially if they are mild or infrequent, which can result in underreporting and delays in diagnosis. Therefore, this study aims to establish a machine learning–based predictive model for predicting the risk of TS. For this purpose, we collected data including complete blood counts, liver function assessments, kidney function evaluations, blood glucose levels, serum electrolyte analyses, and vitamin D level. These indicators were selected not only because they are widely accessible across all healthcare facility levels in China but also due to their high cost-effectiveness, making them particularly suitable for large-scale screening of TS⁹. This study will not only enhance our understanding of the biological basis of TS but also improve treatment strategies for timely intervention and management.

Methods

Study design

As depicted in Fig. 1, this preliminary retrospective observational pilot cohort study was conducted at the Department of Pediatrics, Affiliated Hospital of Jiangnan University, spanning the period from January 2022 to October 2024, building upon our previous research^10,11. The TS group encompassed children aged 8–10 years diagnosed with TS, while the control group consisted of healthy children admitted for routine check-ups. Table 1 outlines the fundamental clinical characteristics of both cohorts. Exclusion criteria for the study were children not diagnosed with TS, those with concurrent psychiatric disorders, and those with incomplete clinical or laboratory data. To facilitate practical clinical application of the predictive model, we focused on disparities in complete blood count, liver function assessments, kidney function evaluations, blood glucose levels, and serum electrolyte analyses. These data were gathered within the initial 24-h period following admission to the department. Initially, a total of 50 features were investigated (Table 1).

Table 1 The basic clinical characteristics of participants involved in this study.

Ethical statement

This pilot observational retrospective study adhered to the tenets of the Declaration of Helsinki. The study protocol was developed based on our previous research and received ethical approval from the Institutional Review Board of the Affiliated Hospital of Jiangnan University (ID: LS2024KY721)¹². Participants were thoroughly informed about the study’s objectives, procedures, potential risks, and benefits, and all participants provided informed consent. To uphold confidentiality, all collected data were anonymized and securely stored, ensuring the privacy of participant information.

Clinical diagnostic criteria for Tourette syndrome

The clinical diagnostic criteria for Tourette syndrome are based on the DSM-5¹³. The detailed information on the clinical diagnostic criteria for Tourette syndrome, including specific symptom lists and exclusion criteria, is shown in Supplemental File 1.

Feature selection

As illustrated in Fig. 1, we performed feature selection using a combination of the Boruta algorithm and multivariable logistic regression analyses, drawing upon methodologies from our prior investigations¹⁴. As illustrated in Fig. 2, we selected important features from the Boruta algorithm by identifying confirmed important variables. Multivariable logistic regression was subsequently performed to further refine feature selection and to determine the odds ratios (OR) with their respective confidence intervals for each significant predictor. An OR > 1 indicates an increased likelihood of Tourette syndrome, while an OR < 1 indicates a decreased likelihood^10,15. To mitigate bias arising from multicollinearity, we conducted both Spearman correlation and variance inflation factor (VIF) analyses and eliminated variables exhibiting strong correlations (|R| > 0.8) or VIF > 10, adhering to the methodology delineated in a previous study¹⁶. The raw R scripts, along with detailed information regarding the feature selection analyses, are openly available in our GitHub repository (https://github.com/PediatricLab-Jiangnan/TS-Syndrome-Brouta-ML-Web).
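To make the odds-ratio reading concrete, the following minimal Python sketch converts a logistic regression coefficient and its standard error into an OR with a Wald 95% confidence interval. The coefficient and standard error are made-up illustrative numbers, not values from this study, and the function name `odds_ratio_ci` is hypothetical.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic coefficient using a Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Illustrative numbers only (not taken from the study).
or_, lo, hi = odds_ratio_ci(beta=0.40, se=0.10)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")

# A coefficient of zero corresponds to OR = 1, i.e. no change in the odds.
null_or, _, _ = odds_ratio_ci(beta=0.0, se=0.10)
```

A positive coefficient gives OR > 1 and a negative coefficient gives OR < 1; an interval that excludes 1 is the usual indication that the predictor is statistically significant at the 5% level.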

Model construction and evaluation

As illustrated in Fig. 3, we combined the Boruta algorithm with multivariable logistic regression analysis to select relevant features and thereby mitigate overfitting¹⁷. Consequently, a total of 10 features were identified for the development of predictive models. Before applying advanced machine learning (ML) techniques, we established a traditional predictive nomogram model and compared the Receiver Operating Characteristic (ROC) curves for the combination of 10 features as well as individual features. Subsequently, we employed 9 distinct ML techniques to predict the risk probability of Tourette syndrome using the selected 10 features: 1. Adaptive Boosting (AdaBoost), 2. Lasso Regression (Lasso), 3. Random Forest (RF), 4. Gradient Boosting Machine (GBM), 5. Extreme Gradient Boosting (XGBoost), 6. Support Vector Machine (SVM), 7. K-Nearest Neighbors (KNN), 8. Multilayer Perceptron (MLP), and 9. Decision Tree (DT). The use of multiple models facilitated a comparison of their performance and the identification of the most effective technique for predicting the risk of Tourette syndrome. We implemented tenfold cross-validation to rigorously assess model performance, wherein the entire dataset was partitioned into 10 subsets, with each subset iteratively serving as the validation set while training occurred on the remaining nine subsets. This methodology ensured thorough evaluation without necessitating a separate dataset split. Model reliability was assessed using four primary metrics: the area under the receiver operating characteristic curve (AUC), Precision-Recall Curve (PRC), Decision Curve Analysis (DCA), and calibration curve¹⁶. Additionally, specificity, sensitivity, accuracy and F1-score were presented using a radar chart. The raw Python scripts, along with detailed information regarding the feature selection analyses, are openly available in our GitHub repository (https://github.com/PediatricLab-Jiangnan/TS-Syndrome-Brouta-ML-Web).
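The tenfold partitioning described above can be sketched in plain Python as follows. This is only an illustration of the splitting logic, not the authors' pipeline: the helper names `k_fold_indices` and `cross_validate` are hypothetical, the split is a simple shuffle without stratification (an assumption), and the scorer is a placeholder where a real model would be trained and evaluated.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Call fit_and_score(train_idx, val_idx) once per fold and return the scores."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(fit_and_score(train_idx, val_idx))
    return scores

n = 671  # cohort size reported in the Results
folds = k_fold_indices(n)

# Placeholder scorer: returns the fraction of samples held out in each fold.
scores = cross_validate(n, lambda train, val: len(val) / (len(train) + len(val)))
print(len(folds), sorted(len(f) for f in folds))
```

Each sample appears in exactly one validation fold, so every observation contributes once to the validation estimate and nine times to training.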

Model interpretation

Interpreting machine learning models has consistently been a challenge. To further elucidate how each feature influences and contributes to the final model, we employed the SHapley Additive exPlanation (SHAP) technique to interpret the highest-performing black-box model, as established in prior studies and illustrated in Fig. 4. In this study, we evaluated the importance of each feature by computing the mean absolute SHAP values. Additionally, we plotted the SHAP values for each feature across samples to better understand the overall patterns and the extent of each feature’s impact on Tourette syndrome risk. We also utilized SHAP dependence plots to assess the influence of each feature. Furthermore, we provided one illustrative example of SHAP predictions for demonstration purposes.
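The idea behind SHAP values can be illustrated with an exact Shapley computation on a toy model. The sketch below is not the `shap` library and not the study's GBM: it enumerates all feature coalitions directly, replaces "missing" features with a single background value (a simplification), and uses a hypothetical three-feature risk score with made-up coefficients.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values for one sample x.

    Features outside a coalition take their background value, a
    simplification of how SHAP treats "missing" features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else background[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical risk score with an interaction term (made-up coefficients).
def toy_model(f):
    b2m, vit_d, uric = f
    return 0.8 * b2m - 0.05 * vit_d + 0.001 * uric - 0.01 * b2m * vit_d

background = [1.5, 30.0, 300.0]   # stand-in "average" patient
sample = [1.94, 18.0, 320.0]      # stand-in patient to explain
phi = shapley_values(toy_model, sample, background)
base = toy_model(background)
print([round(p, 4) for p in phi], round(base + sum(phi), 4), round(toy_model(sample), 4))
```

The contributions sum exactly to the difference between the sample's prediction and the baseline prediction, which is the additivity property that waterfall plots visualize. Averaging the absolute contributions of a feature over many samples gives the mean absolute SHAP value used for feature ranking.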

Web application development

To facilitate the utility of the model in clinical settings, the final prediction model was implemented into a web application developed using the Streamlit Python framework and available at https://touretterisk.streamlit.app/.

Mendelian randomization analysis

Mendelian Randomization (MR) analysis is a robust method that leverages genome-wide association study (GWAS) data and genetic variants to investigate potential causal relationships between modifiable exposures and various diseases or outcomes¹⁸. GWAS identifies genetic variants associated with specific traits, diseases, or outcomes, and these genetic variants—particularly single-nucleotide polymorphisms (SNPs)—serve as instrumental variables (IVs) in MR studies¹⁹. MR analysis has been extensively utilized to evaluate the potential effects of diverse exposures on disease risk¹⁹. In this study, we conducted Mendelian Randomization analysis based on previously established methodologies to explore the potential associations between Tourette syndrome and candidate biomarkers identified through machine learning analyses^20,21,22. Detailed criteria for the MR analysis are provided in Supplemental File 2, and comprehensive information regarding the GWAS datasets employed in this study can be found in Table 2. Specifically, GWAS data for TS were sourced from the Psychiatric Genomics Consortium (PGC) (https://pgc.unc.edu/), while GWAS data for the candidate biomarkers selected by ML were obtained from the IEU OpenGWAS project (https://gwas.mrcieu.ac.uk/).
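As an illustration of how SNP-level summary statistics are combined in MR, the sketch below computes a fixed-effect inverse-variance weighted (IVW) estimate. IVW is a commonly used MR estimator, but the text above does not state which estimator was applied (the criteria are in Supplemental File 2), so its use here is an assumption. The summary statistics are made up and the function name `ivw_estimate` is hypothetical.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate and its standard error."""
    rows = list(zip(beta_exposure, beta_outcome, se_outcome))
    num = sum(bx * by / se ** 2 for bx, by, se in rows)
    den = sum(bx ** 2 / se ** 2 for bx, _, se in rows)
    return num / den, math.sqrt(1.0 / den)

# Made-up summary statistics for three SNPs (illustrative only).
bx = [0.10, 0.20, 0.15]      # SNP effect on the exposure (e.g. a biomarker)
by = [0.05, 0.10, 0.075]     # SNP effect on the outcome
se = [0.02, 0.03, 0.025]     # standard error of each SNP-outcome effect
estimate, std_err = ivw_estimate(bx, by, se)
print(round(estimate, 3), round(std_err, 3))
```

Each SNP contributes a ratio of its outcome effect to its exposure effect, and the IVW estimate is the precision-weighted average of those ratios. In this toy data every ratio equals 0.5, so the pooled estimate is also 0.5.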

Table 2 Detailed information of GWAS data used in the study.

Statistical analysis

This study presents data as mean ± standard deviation (SD). The Shapiro–Wilk test evaluated the normality of the data distribution. For normally distributed data, unpaired t-tests, one-way ANOVA, or two-way ANOVA were employed for statistical comparisons, followed by Tukey’s post-hoc tests for multiple comparisons. Non-normally distributed data were analyzed using the Mann–Whitney U test for two groups or the Kruskal–Wallis test for multiple groups. All statistical analyses were conducted utilizing Python (https://www.python.org) and R (https://cran.r-project.org/). A P-value less than 0.05 was deemed statistically significant.

Results

Data processing and feature selection

As depicted in Fig. 1 and Table 1, after selection, a total of 671 volunteer participants were involved, with no significant differences in age (P > 0.05), while significant differences were observed in sex and education between the groups (P < 0.05). Although educational attainment differed significantly between the groups (P < 0.05), it does not directly influence tic disorder risk. Among the 50 clinical features we collected, we observed higher levels of β2-microglobulin and lower levels of 25-hydroxyvitamin D in Tourette syndrome patients compared to healthy controls (Table 1). Spearman correlation and variance inflation factor (VIF) analyses were performed to identify and remove features contributing to multicollinearity. In this study, features with Spearman correlations exceeding an absolute value of 0.8 or VIF values above 10 were excluded to minimize redundant information (Fig. 2A–B)²³.
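The correlation filter can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the feature names and values are synthetic, the rule for which member of a correlated pair to drop (keep the first encountered) is an assumption, and the VIF step is reduced to the identity VIF = 1 / (1 − R²), under which VIF > 10 corresponds to R² > 0.9.

```python
def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def drop_correlated(features, threshold=0.8):
    """Keep features in order, skipping any with |rho| above threshold to a kept one."""
    kept = []
    for name in features:
        if all(abs(spearman(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def vif_from_r2(r2):
    """VIF of a feature given the R^2 from regressing it on the other features."""
    return 1.0 / (1.0 - r2)

data = {  # synthetic values for illustration
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feature_b": [1.1, 4.2, 9.5, 15.8, 26.0, 35.0],  # monotone in feature_a
    "feature_c": [3.0, 1.0, 6.0, 2.0, 5.0, 4.0],
}
kept = drop_correlated(data)
print(kept)
```

Because Spearman correlation works on ranks, `feature_b` is flagged as redundant with `feature_a` even though their relationship is nonlinear.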

Subsequently, the Boruta algorithm was employed to identify significant features by comparing them to shadow features. Features consistently deemed more important than their shadow counterparts are depicted in green (Fig. 2C). The iterative variable importance scores generated by the Boruta algorithm further illustrate the selection process, demonstrating the consistency of feature selection and confirming the stability of feature importance rankings (Fig. 2D). It should be noted that although the Boruta algorithm indicated that early symptoms such as eye blinking, involuntary movements, and unintentionally uttered sounds were the most important features in predicting the risk probability of TS, a complete collinearity was observed between the early symptoms and the diagnostic status. All controls were asymptomatic, whereas all TS patients exhibited these symptoms. To avoid the confounding effect of multicollinearity due to early symptoms and to better assess the independent predictive value of other biomarkers, the early symptoms variable was removed prior to machine learning modeling. A forest plot quantifies the impact of the selected features on the outcome (Fig. 2E). Notably, the multivariable logistic regression analysis revealed a high OR for β2-microglobulin and a low OR for 25-hydroxyvitamin D₃, underscoring their potential roles in predicting the risk probability of Tourette syndrome. Finally, as shown in Fig. 2F, a Venn diagram was created to illustrate the overlap of features between the Boruta algorithm and multivariable logistic regression analysis. 10 features were chosen for machine learning model development.
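The shadow-feature comparison at the heart of Boruta can be illustrated with a deliberately simplified sketch. It is not the Boruta algorithm itself: the importance measure is swapped for absolute Pearson correlation with the outcome instead of random forest importance, and the statistical test that Boruta applies to the hit counts is omitted. The data are synthetic and the function names are hypothetical.

```python
import random

def abs_corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return abs(cov) / (va * vb) ** 0.5

def shadow_hits(features, target, n_rounds=50, seed=0):
    """Count, per feature, the rounds in which it beats the best shuffled (shadow) copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        best_shadow = 0.0
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)  # shuffling breaks any link to the target
            best_shadow = max(best_shadow, abs_corr(shadow, target))
        for name, values in features.items():
            if abs_corr(values, target) > best_shadow:
                hits[name] += 1
    return hits

rng = random.Random(1)
target = [i % 2 for i in range(200)]  # synthetic 0/1 outcome
features = {
    "informative": [t + rng.gauss(0, 0.3) for t in target],  # tracks the outcome
    "noise": [rng.gauss(0, 1) for _ in target],              # unrelated to the outcome
}
hits = shadow_hits(features, target)
print(hits)
```

A feature that beats the best shadow in nearly every round would be a candidate for confirmation, whereas a feature that rarely does so behaves like random noise. The same logic explains why a variable that is perfectly collinear with the diagnosis dominates the ranking and has to be removed before modeling.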

Machine learning algorithm development and interpretation

Following feature selection using the Boruta algorithm and multivariable logistic regression analysis, 10 features were chosen for model development. As shown in Supplemental File 2 and Fig. 3A, prior to implementing advanced machine learning (ML) techniques, we established a conventional predictive nomogram and compared Receiver Operating Characteristic (ROC) curves for both the composite 10-feature model and individual features independently. The analysis demonstrated that the combined 10-feature model exhibited superior predictive performance compared to any single feature in isolation. In this study, we employed nine widely used machine learning algorithms, including Adaptive Boosting (AdaBoost), Lasso Regression (Lasso), Random Forest (RF), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), and Decision Tree (DT), to develop predictive models and select the optimal model. Among the nine machine learning algorithms evaluated, the Gradient Boosting Machine (GBM) demonstrated exceptional performance, achieving the highest area under the curve (AUC) in ROC analysis (0.94), PRC analysis (0.97), and superior performance in DCA analyses. Additionally, GBM ranked among the top two algorithms in calibration curve assessment. As evidenced in Fig. 3F–I, further evaluation through radar chart analysis revealed that GBM attained the highest accuracy (0.872), F1-score (0.907), sensitivity (0.952), and specificity (0.719) among all tested algorithms. Based on this comprehensive performance assessment, GBM was determined to be the optimal predictive model for this study.
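The radar-chart metrics are all derived from a confusion matrix, and the short sketch below shows the relationships. The confusion-matrix counts are hypothetical, chosen only for illustration. The last step solves the F1 formula for precision using the GBM values reported above; since the reported metrics may be averaged across folds, the implied precision should be read as approximate.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive (TS) class
    specificity = tn / (tn + fp)   # recall on the negative (control) class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }

def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)

# Hypothetical counts, not taken from the study.
m = classification_metrics(tp=80, fp=10, tn=40, fn=5)

# Precision implied by the F1-score and sensitivity reported for GBM.
p = implied_precision(f1=0.907, recall=0.952)
print({k: round(v, 3) for k, v in m.items()}, round(p, 3))
```

A high sensitivity combined with a lower specificity, as reported for GBM, means the model misses few TS cases but flags a larger share of healthy controls, which matters when such a model is considered for screening.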

To enhance the interpretability of the GBM prediction model, we applied the Shapley Additive Explanations (SHAP) method, as described in a prior study¹⁶. This method elucidates how variable features influence the model’s output regarding Tourette syndrome risk. As depicted in Fig. 4A, the summary dot plot provides a comprehensive overview of each feature’s impact on the model’s output. The x-axis represents the SHAP value, indicating the magnitude and direction of impact on the prediction, where positive values increase the predicted Tourette syndrome risk and negative values decrease it. Features on the y-axis are ranked by importance based on their average absolute SHAP values. The color gradient illustrates how a feature’s value affects its impact: red represents higher values and blue represents lower values. In our analysis, we observed distinct patterns that reveal important clinical insights. For 25-hydroxyvitamin D, high values (red dots) predominantly appear on the left side (negative SHAP value area), indicating that higher levels of this vitamin decrease the predicted risk of Tourette syndrome. Conversely, for β2-microglobulin, high values (red dots) appear on the right side (positive SHAP value area), suggesting that higher levels of this protein increase the predicted risk of Tourette syndrome, consistent with the dependence plot described below. The summary bar plot (Fig. 4B) illustrates the average absolute SHAP value for each feature, with longer bars representing more influential features. The feature importance donut chart (Fig. 4C) displays the relative importance of each feature as a percentage of the total importance, with larger slices indicating more influential features. These results emphasize the critical roles of serum 25-hydroxyvitamin D levels and β2-microglobulin in predicting Tourette syndrome risk.

The waterfall chart illustrates how each feature contributes to the shift in prediction from the baseline value (average prediction, E[f(x)]) to the final prediction (f(x)) for a given instance. The contributions of each feature are represented by horizontal bars, where blue indicates a negative effect, and red signifies a positive effect. In this study, the visualization of a high-risk TS sample shows how each feature value moves the prediction from the base value (1.369) to the final predicted value (2.91) (see Fig. 4D). The numbers accompanying the variables represent the actual values for the case under analysis; for example, a blood β2-microglobulin value of 1.94 mg/L shifts the model output for this case upward by 0.9. The SHAP dependence plots for blood β2-microglobulin (Fig. 4E) and 25-hydroxyvitamin D (Fig. 4F) provided detailed insights into the relationship between each feature’s value and its SHAP value, with SHAP values above zero indicating an increased risk of Tourette syndrome²⁴. In this study, our results indicated that higher levels of blood β2-microglobulin generally have a positive impact (increased Tourette syndrome risk), whereas higher levels of 25-hydroxyvitamin D have a negative impact (decreased Tourette syndrome risk). Finally, we performed a Mendelian Randomization analysis to investigate potential relationships between the selected biomarkers and Tourette syndrome symptom severity (Fig. 5A). The results indicated a positive correlation between serum β2-microglobulin levels and TS symptoms, suggesting that elevated β2-microglobulin may exacerbate symptom severity (Fig. 5B). Conversely, we observed negative correlations between both vitamin D level and albumin and TS symptom severity (Fig. 5C and D). Furthermore, we established a web application to enhance the clinical value of our study (Fig. 6). In summary, our study successfully established a predictive model for Tourette syndrome risk using GBM. The SHAP method further emphasized the key roles of serum 25-hydroxyvitamin D₃ and β2-microglobulin in predicting Tourette syndrome risk.
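The arithmetic behind the waterfall plot can be checked in a few lines. By SHAP additivity, the per-feature contributions must sum to the difference between the final and base values. The conversion to probabilities below assumes the model output is on the log-odds scale, which is common for gradient-boosted classifiers explained with SHAP but is not stated in the text, so the probabilities are illustrative only.

```python
import math

base_value = 1.369   # E[f(x)] reported for the waterfall plot
final_value = 2.91   # f(x) reported for the explained high-risk sample

# Additivity: all per-feature SHAP values together must add up to this amount.
total_contribution = final_value - base_value

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# ASSUMPTION: outputs are log-odds; the source does not state the scale.
prob_base = sigmoid(base_value)
prob_final = sigmoid(final_value)
print(round(total_contribution, 3), round(prob_base, 3), round(prob_final, 3))
```

Under this assumption, the β2-microglobulin contribution of 0.9 would account for more than half of the total shift of about 1.54 for this patient.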

Discussion

Tic disorders are neurological conditions characterized by sudden, repetitive, non-rhythmic movements or vocalizations known as tics. These tics typically emerge during childhood, most commonly between the ages of 5 and 10, and are more prevalent in boys than in girls²⁵. Individuals with tics often experience a premonitory urge or sensation that is temporarily alleviated by performing the tic²⁶. The severity of these symptoms may increase during periods of stress, anxiety, excitement, or life transitions, while they often decrease when the individual is distracted or engaged in focused activities.

Tic disorders are classified into several main types based on the nature of the tics and their duration. Among these, Tourette syndrome is a relatively common chronic tic disorder characterized by the presence of both motor and vocal tics. Compared to other tic disorders, Tourette syndrome tends to have a more pronounced impact on children, largely due to the combined effect of motor and vocal tics, the frequent coexistence of associated conditions, and the potential for significant social, emotional, and academic challenges⁵. However, the early diagnosis of Tourette syndrome in clinical settings remains a significant challenge. The diagnosis depends entirely on the observation and detailed documentation of the patient’s history of tics, as there are currently no blood tests or other objective methods available to confirm the condition. Given these diagnostic difficulties, the identification of reliable diagnostic models is critical to enhance the accuracy and timeliness of TS detection. Although some children, including those with TS, may still exhibit resistance to any form of blood collection, blood-based biomarkers are increasingly being investigated for their potential to predict the risk probability of TS. These biomarkers offer promise for facilitating early detection, enhancing our understanding of the disorder, and supporting the development of novel therapeutic strategies²⁷.

Recently, machine learning methods using blood-based biomarkers offer a non-invasive, cost-effective approach for the early diagnosis and classification of neurological disorders including TS. For instance, Brügge et al. developed algorithms for automatic tic detection using video recordings, achieving F1 scores of 82.0% with a Random Forest classifier and 79.5% with a deep neural network²⁸. Conelea et al. applied computer vision and deep learning techniques to detect eye tics in video recordings, achieving an area under the receiver operating characteristic curve of 0.89 for random split regimens²⁹. Cernera et al. demonstrated the use of wearable sensors to capture physiological differences between tic and voluntary movements, achieving high accuracy (96.69%) in classifying movements³⁰. While these approaches provide valuable quantitative insights, their integration into routine clinical practice and intervention studies is hindered by the requirement for specialized equipment and controlled environments, posing challenges for implementation in everyday clinical settings. Furthermore, the acquisition and annotation of real clinical data demand significant expertise and time investment, which limits the creation of large-scale datasets necessary for robust model development and validation.

In contrast to prior research on Tourette Syndrome (TS), which has predominantly concentrated on brain function and neurotransmitter differences due to its classification as a neurodevelopmental disorder, this study seeks to develop a predictive model for assessing TS risk using routine, nonspecific, low-cost, and widely accessible testing parameters. Financial considerations are a critical factor in patients’ medical decision-making processes. Research indicates that out-of-pocket expenses significantly influence choices regarding healthcare services and treatments, particularly for guardians of children with neurological disorders in China^31,32. The American College of Cardiology and the American Heart Association have recognized that economic implications are inherent in medical recommendations, and disregarding costs can impose financial burdens on patients, violating the fundamental principle of "do no harm."³³ Moreover, unlike specialized and costly neurotransmitter-related marker testing for TS, routine blood tests are predominantly covered by China’s two primary insurance schemes—the Urban Resident Basic Medical Insurance (URBMI) and the New Rural Cooperative Medical Scheme (NRCMS)—which provide coverage for children in urban and rural areas³⁴. This approach has the potential to bridge coverage gaps, especially for children from migrant families and those with lower socioeconomic status. Consequently, it aligns well with the healthcare context in China, offering the dual benefit of enhancing early identification rates and diagnostic accuracy for TS while reducing diagnostic barriers and associated costs³⁵.

In this investigation, we initially addressed multicollinearity using Spearman correlation and variance inflation factor analyses, followed by the employment of the Boruta algorithm and multivariable logistic regression analyses to select a total of 10 features, including serum 25-hydroxyvitamin D and β2-microglobulin, as final predictors of Tourette syndrome risk. Subsequently, we utilized nine distinct machine learning algorithms to develop the optimal predictive model based on the comparison of the ROC, PRC, DCA and calibration curves. Additionally, radar chart analysis further indicated GBM as the optimal predictive model. The Shapley Additive Explanations (SHAP) method, recognized as one of the most effective approaches for identifying attributions in machine learning models, was utilized to elucidate the model. Kim et al. have explored SHAP applications in neuroimaging, highlighting its promising potential in neuroscience research³⁶. In our study, the SHAP summary dot plot and summary bar plot highlighted the pivotal roles of 25-hydroxyvitamin D and β2-microglobulin. The SHAP dependence plots illustrated the relationship between the selected feature values and their impact on tic disorder risk. Our findings indicated that deficiencies in 25-hydroxyvitamin D or elevated levels of β2-microglobulin may increase the risk of TS.

Among the potential biomarkers selected by ML in this study, recent evidence has underscored the crucial roles of β2-microglobulin and vitamin D in various neurological disorders. Research indicates that β2-microglobulin is a component of the major histocompatibility complex class I (MHC I), which plays a key role in the immune system’s ability to distinguish "self" from "non-self"³⁷. Additionally, β2-microglobulin is a marker of inflammation, and increased cell turnover and immune system activation during inflammation lead to elevated β2-microglobulin levels. There is growing evidence suggesting that inflammation, particularly neuroinflammation, may play a role in the development or exacerbation of Tourette syndrome³⁸. Regarding vitamin D, research indicates that it plays a crucial role in regulating the immune system, exhibits anti-inflammatory properties, and can influence both innate and adaptive immune responses³⁹. Vitamin D is also involved in synthesizing neurotransmitters such as dopamine, which is implicated in TS. Several studies have reported lower vitamin D levels in children with TS compared to healthy controls⁴⁰. For example, Li et al. observed significantly lower serum 25-hydroxyvitamin D levels in children with tic disorders, including TS, compared to healthy controls, and noted a negative correlation between serum vitamin D levels and tic severity. In contrast, Bond et al. reported that lower vitamin D levels were not directly associated with tic presence or severity but were related to the presence and severity of comorbid ADHD in children and adolescents with chronic tic disorders³⁵.

It is important to note that there is currently no established relationship between other biomarkers, such as uric acid and serum phosphorus levels, and TS. Uric acid is a metabolic byproduct produced during purine breakdown, and elevated levels are primarily associated with conditions such as gout, kidney disease, metabolic syndrome, cardiovascular disease, and hypertension⁴¹. Mean corpuscular volume is a measure of the average volume of red blood cells, commonly assessed through complete blood count tests⁴². Serum phosphorus is essential for various physiological functions, including bone health, energy production, and cellular membrane integrity; however, direct evidence linking phosphorus levels to neurological processes underlying TS is lacking⁴³. Platelet distribution width, a measure reflecting platelet size variability, may indicate platelet activation and overall platelet health⁴⁴. Previous research suggests that platelet function abnormalities could reflect broader neurochemical dysregulations in TS. Thyroid hormones, including thyroxine, play critical roles in brain development and function, and alterations in thyroid hormone levels can impact various mental disorders. Prealbumin, also known as transthyretin, primarily serves as a marker of nutritional status and liver function. Eosinophils, a type of white blood cell, are primarily involved in allergic reactions and parasitic infections. Although previous studies have indicated a higher prevalence of allergic conditions among TS patients, there remains insufficient evidence to establish eosinophil levels as a significant factor in TS pathophysiology. Thus, it is essential to emphasize that the etiology of TS is complex, involving genetic factors, brain structural variations, and neurotransmitter imbalances. Relying on speculative associations without robust evidence can be misleading. Therefore, the predictive model developed in this study should serve as a supportive tool alongside comprehensive clinical assessments, and further validation studies are warranted.

Conclusion

In conclusion, we successfully developed a machine learning (ML) model to predict the risk of Tourette syndrome based on clinical data readily accessible from real-world clinical practice. Among the 9 machine learning algorithms evaluated in this study, the Gradient Boosting Machine (GBM) model exhibited superior performance. Additionally, the SHAP method was employed to interpret the ML model. This approach not only quantified the importance of each feature in the model but also showed how each feature influenced the model’s predictions. SHAP analysis identified 25-hydroxyvitamin D and β2-microglobulin as the two most important features for predicting Tourette syndrome risk. This comprehensive approach provides valuable insights into the biological factors potentially underlying this neurodevelopmental disorder.
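
To illustrate what a SHAP-style attribution computes, the following minimal Python sketch calculates exact Shapley values for a single sample of a toy scoring function. It is not the authors' code and does not use the `shap` library: the function `toy_score`, its coefficients, the feature values, and the single-reference baseline are all hypothetical, chosen only to show how per-feature contributions add up to the difference between the sample's prediction and the baseline prediction. Practical SHAP implementations average over a background dataset and use faster, model-specific algorithms for tree ensembles such as GBM.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Features outside a coalition are set to their baseline value
    (a simplification of the expectation used by SHAP).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_score(features):
    """Hypothetical score; coefficients are invented for illustration."""
    vitamin_d, b2m, uric_acid = features
    return 0.2 - 0.005 * vitamin_d + 0.1 * b2m + 0.0002 * uric_acid * b2m


# Hypothetical reference values and one hypothetical patient
baseline = [30.0, 1.5, 300.0]
patient = [15.0, 2.5, 350.0]

phi = shapley_values(toy_score, patient, baseline)
for name, value in zip(["vitamin_d", "b2m", "uric_acid"], phi):
    print(f"{name}: {value:+.4f}")
print("sum of contributions:", round(sum(phi), 4))
print("score gap:", round(toy_score(patient) - toy_score(baseline), 4))
```

In this sketch the three contributions sum to the gap between the patient's score and the baseline score, which is the additivity property that SHAP summary and force plots rely on.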

Limitations

There are several limitations to this study. First, this study was conducted at a single center in China and involved only Chinese participants, so the findings may not be directly applicable to other populations or ethnicities. Second, distinguishing persistent tics characteristic of Tourette syndrome from transient developmental phenomena or other neurodevelopmental disorders can be challenging in young children; therefore, potential biases due to diagnostic uncertainty or misclassification cannot be completely ruled out. Third, although the machine learning-based prediction model was developed using a relatively large sample size of 694 participants, all features in the model are non-specific laboratory biomarkers. This could lead to false positives when applied to the general population, and external validation studies are necessary to confirm the model’s generalizability and robustness. Additionally, it must be pointed out that our study did not specifically investigate the participants’ history of streptococcal infection. The observed correlation between elevated β2-microglobulin levels and increased Tourette syndrome risk is reminiscent of findings in Pediatric Autoimmune Neuropsychiatric Disorders Associated with Streptococcal infections (PANDAS), which also present with tic-like movements similar to Tourette syndrome⁴⁵. Furthermore, the pilot prospective observational cohort study had a relatively short follow-up period, and we did not explore the effects of different hyperparameters on model performance or adopt ensemble modeling techniques, potentially overlooking optimal configurations that could enhance the model’s accuracy and efficiency. Additional longitudinal data and extended follow-up assessments are required to provide stronger evidence supporting the predictive model’s utility in assessing Tourette syndrome risk.
Fourth, Tourette syndrome is a complex neurodevelopmental condition, and building machine learning models without neurotransmitter and functional neuroimaging markers may overlook critical biological mechanisms underlying the disorder. Further research exploring additional biomarkers could help obtain a more comprehensive understanding of its pathogenesis. Finally, before the predictive model can be broadly implemented in clinical practice, further evaluation of its clinical utility and cost-effectiveness in real-world settings is essential.
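
The concern about false positives raised in the third limitation can be made concrete with Bayes' rule: for fixed sensitivity and specificity, the positive predictive value (PPV) falls as the condition becomes rarer in the tested population. The sketch below uses hypothetical sensitivity, specificity, and prevalence values (none are taken from this study) purely to show the arithmetic.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive prediction is a true case (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point of a classifier (not values from the study)
sensitivity = 0.85
specificity = 0.85

# Hypothetical prevalences: a balanced case-control sample versus
# progressively less enriched screening populations
for prevalence in (0.50, 0.10, 0.01):
    ppv = positive_predictive_value(sensitivity, specificity, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
```

Under these assumed numbers, a model that looks reliable in a balanced case-control sample would flag mostly unaffected children in a low-prevalence setting, which is why external validation in the intended population matters.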

Data availability

The raw Python and R scripts, along with detailed information regarding the feature selection analyses, are openly available in our GitHub repository (https://github.com/PediatricLab-Jiangnan/TS-Syndrome-Brouta-ML-Web).

References

Andrén, P. et al. European clinical guidelines for Tourette syndrome and other tic disorders-version 2.0. Part II: Psychological interventions. Eur. Child Adolesc. Psychiatry 31(3), 403–423 (2022).
Müller-Vahl, K. R. et al. European clinical guidelines for Tourette syndrome and other tic disorders: Summary statement. Eur. Child Adolesc. Psychiatry 31(3), 377–382 (2022).
Vermilion, J. & Mink, J. W. Tic disorders. Pediatr. Rev. 44(5), 294–296 (2023).
Heyman, I., Liang, H. & Hedderly, T. COVID-19 related increase in childhood tics and tic-like attacks. Arch. Dis. Child 106(5), 420–421 (2021).
Johnson, K. A. et al. Tourette syndrome: clinical features, pathophysiology, and treatment. Lancet Neurol. 22(2), 147–158 (2023).
Baizabal-Carvallo, J. F., Cavanna, A. E. & Jankovic, J. Tics emergencies and malignant Tourette syndrome: Assessment and management. Neurosci. Biobehav. Rev. 159, 105609 (2024).
Ruhrman, D., Mikulincer, M., Apter, A., Benaroya-Milshtein, N. & Steinberg, T. Emotion regulation and tic disorders in children. Eur. Child Adolesc. Psychiatry 32(5), 893–902 (2023).
Pringsheim, T. et al. European Society for the Study of Tourette Syndrome 2022 criteria for clinical diagnosis of functional tic-like behaviours: International consensus from experts in tic disorders. Eur. J. Neurol. 30(4), 902–910 (2023).
Yoon, S. H., Kim, G. Y., Choi, G. T. & Do, J. T. Organ abnormalities caused by Turner syndrome. Cells 12(10), 1365 (2023).
Chang, Y. et al. Investigating the potential risk of cadmium exposure on seizure severity and anxiety-like behaviors through the ferroptosis pathway in epileptic mice: An integrated multi-omics approach. J. Hazard Mater. 480, 135814 (2024).
Xie, R. et al. Quercetin alleviates kainic acid-induced seizure by inhibiting the Nrf2-mediated ferroptosis pathway. Free Radic. Biol. Med. 191, 212–226 (2022).
Mei, H. et al. Effect of atomoxetine on behavioral difficulties and growth development of primary school children with attention-deficit/hyperactivity disorder: A prospective study. Children (Basel) 9(2), 212 (2022).
Koutsoklenis, A. & Honkasilta, J. ADHD in the DSM-5-TR: What has changed and what has not. Front Psychiatry 13, 1064141 (2022).
Guo, X. et al. Exploratory analysis of the association between organophosphate ester mixtures with high blood pressure of children and adolescents aged 8–17 years: Cross-sectional findings from the national health and nutrition examination survey. Environ. Sci. Pollut. Res. Int. 30(9), 22900–22912 (2023).
Mei, H. et al. PM(2.5) exposure exacerbates seizure symptoms and cognitive dysfunction by disrupting iron metabolism and the Nrf2-mediated ferroptosis pathway. Sci. Total Environ. 910, 168578 (2024).
Hu, J. et al. Identification and validation of an explainable prediction model of acute kidney injury with prognostic implications in critically ill children: a prospective multicenter cohort study. EClinicalMedicine 68, 102409 (2024).
Li, X. H., Yang, X. L., Dong, B. B. & Liu, Q. Predicting 28-day all-cause mortality in patients admitted to intensive care units with pre-existing chronic heart failure using the stress hyperglycemia ratio: a machine learning-driven retrospective cohort analysis. Cardiovasc. Diabetol. 24(1), 10 (2025).
Sanderson, E. et al. Mendelian randomization. Nat. Rev. Methods Prim. https://doi.org/10.1038/s43586-021-00092-5 (2022).
Lin, L. et al. Mendelian randomization with refined instrumental variables from genetic score improves accuracy and reduces bias. Front Genet. 12, 618829 (2021).
Liu, Y. et al. Analysis of the role of PANoptosis in seizures via integrated bioinformatics analysis and experimental validation. Heliyon 10(4), e26219 (2024).
Xie, Q. & Hu, B. Effects of gut microbiota on prostatic cancer: A two-sample Mendelian randomization study. Front Microbiol. 14, 1250369 (2023).
Cao, Y. et al. Effects of chronic low-level lead (Pb) exposure on cognitive function and hippocampal neuronal ferroptosis: An integrative approach using bioinformatics analysis, machine learning, and experimental validation. Sci. Total Environ. 917, 170317 (2024).
Lukman, A. F., Mohammed, S., Olaluwoye, O. & Farghali, R. A. Handling multicollinearity and outliers in logistic regression using the robust Kibria-Lukman estimator. Axioms 14(1), 19 (2025).
Yi, F. et al. XGBoost-SHAP-based interpretable diagnostic framework for Alzheimer’s disease. BMC Med. Inform. Decis. Mak. 23(1), 137 (2023).
Liu, Z. S. et al. Current status, diagnosis, and treatment recommendation for tic disorders in China. Front Psychiatry 11, 774 (2020).
Ludlow, A. K., Anderson, S., Hedderly, T., Black, K. J. & Conelea, C. A. Editorial: Recent advances in understanding Tourette syndrome, tic disorders and functional tics. Front Psychiatry 14, 1185489 (2023).
Alcolea, D., Beeri, M. S., Rojas, J. C., Gardner, R. C. & Lleó, A. Blood biomarkers in neurodegenerative diseases: Implications for the clinical neurologist. Neurology 101(4), 172–180 (2023).
Brügge, N. S. et al. Automated motor tic detection: A machine learning approach. Mov. Disord. 38(7), 1327–1335 (2023).
Conelea, C. et al. Automated quantification of eye tics using computer vision and deep learning techniques. Mov. Disord. 39(1), 183–191 (2024).
Cernera, S. et al. The human tic detector: An automatic approach to tic characterization using wearable sensors. Clin. Neurophysiol. 134, 102–110 (2022).
Xu, W. & Huang, C. The impact of health insurance integration on patient choice of providers in China. Appl. Econ. https://doi.org/10.1080/00036846.2024.2423899 (2024).
Li, H. Y. & Zhang, R. X. Analysis of the structure and trend prediction of China’s total health expenditure. Front Public Health 12, 1425716 (2024).
Larkin, J. et al. How do people with multimorbidity prioritise healthcare when faced with tighter financial constraints? A national survey with a choice experiment component. BMC Prim. Care 26(1), 57 (2025).
Dong, W., Zwi, A. B., Bai, R., Shen, C. & Gao, J. Benefit of China’s social health insurance schemes: Trend analysis and associated factors since health reform. Int. J. Environ. Res. Public Health 18(11), 5672 (2021).
Jiang, Y. et al. Biomarkers and Tourette syndrome: A systematic review and meta-analysis. Front Neurol. 15, 1262057 (2024).
Kim, J. H., De Asis-Cruz, J. & Limperopoulos, C. Separating group- and individual-level brain signatures in the newborn functional connectome: A deep learning approach. Neuroimage 299, 120806 (2024).
Rutkowski, D. et al. Epidermal growth factor receptor/mitogen-activated kinase inhibitor treatment induces a distinct inflammatory hair follicle response that includes collapse of immune privilege. Br. J. Dermatol. 191(5), 791–804 (2024).
Martino, D., Dale, R. C., Gilbert, D. L., Giovannoni, G. & Leckman, J. F. Immunopathogenic mechanisms in Tourette syndrome: A critical review. Mov. Disord. 24(9), 1267–1279 (2009).
Dey, S. K. et al. Implications of vitamin D deficiency in systemic inflammation and cardiovascular health. Crit. Rev. Food Sci. Nutr. 64(28), 10438–10455 (2024).
Mirarchi, A., Albi, E., Beccari, T. & Arcuri, C. Microglia and brain disorders: The role of vitamin D and its receptor. Int. J. Mol. Sci. 24(15), 11892 (2023).
Wen, S., Arakawa, H. & Tamai, I. Uric acid in health and disease: From physiological functions to pathogenic mechanisms. Pharmacol. Ther. 256, 108615 (2024).
Choi, H. et al. Mean corpuscular volume as a prognostic factor for 30-day mortality in major trauma patients: A retrospective cohort study. Sci. Rep. 14(1), 3951 (2024).
Sasaki, S. et al. Association between disturbed serum phosphorus levels and QT interval prolongation. Kidney Int. Rep. 9(6), 1792–1801 (2024).
Ligi, D. et al. Platelet distribution width (PDW) as a significant correlate of COVID-19 infection severity and mortality. Clin. Chem. Lab. Med. 62(3), 385–395 (2024).
La Bella, S. et al. Pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS): Myth or reality? The state of the art on a controversial disease. Microorganisms 11(10), 2549 (2023).

Acknowledgements

We would like to express our sincere gratitude to all the children with Tourette syndrome who generously contributed their time and effort to our pilot cohort. Their participation and dedication have been crucial to the completion of this research. We also extend our heartfelt thanks to their parents and guardians for their trust and support throughout the study. We are deeply appreciative of the medical staff and research assistants who played vital roles in the execution of this project. Their expertise and commitment were invaluable to our research process.

Funding

This study was supported by a grant from the National Natural Science Foundation of China awarded to Yueying Liu (No. 82371462) and sponsored by the Qing Lan Project of Jiangsu to Ruijin Xie (JS2023-27). Further support was provided by the Jiangsu Province Education Science Planning Project to Aimin Li (2023JSJG207 and B2023-02-62). We gratefully acknowledge this financial support, which made our research possible. We would like to thank the School of Medicine, Jiangnan University for providing the facilities and resources necessary for conducting this study.

Author information

Aimin Li, Yueying Liu and Yufan Luo: Contributed equally to this work.

Authors and Affiliations

Yangzhou Polytechnic College, Yangzhou, China
Aimin Li, Ruijin Xie, Zhe Chen, Qian Zhou, Yue Gong & Zhen Chen
Department of Pediatrics, Affiliated Hospital of Jiangnan University, Wuxi, China
Yueying Liu, Yufan Luo, Xue Xiao, Wei Xiao, Ruijin Xie & Hua Xu
Department of Neonatology, Jiangyin People’s Hospital of Nantong University, Wuxi, China
Xianhui Deng

Contributions

Conceptualization, H.X.; Data curation, X.D.; Formal analysis, Y.L. and H.X.; Funding acquisition, Y.L. and R.X.; Investigation, A.L. and H.X.; Methodology, A.L., W.X., R.X., X.D., Q.Z., Z.C. and H.X.; Resources, Y.L., Y.L., W.X. and X.D.; Software, X.X., W.X., Z.C., Q.Z. and Z.C.; Validation, Y.L. and X.X.; Visualization, A.L., Y.L., W.X., R.X., Y.G. and Z.C.; Writing—original draft, A.L., Y.L., X.X., Z.C. and Q.Z.; Writing—review & editing, Y.G.

Corresponding author

Correspondence to Hua Xu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Aimin Li, Yueying Liu and Yufan Luo: Co-first authors.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, A., Liu, Y., Luo, Y. et al. Machine learning based predictive model of the risk of Tourette syndrome with SHAP value interpretation: a retrospective observational study. Sci Rep 15, 18336 (2025). https://doi.org/10.1038/s41598-025-01874-7

Received: 10 January 2025
Accepted: 08 May 2025
Published: 26 May 2025
DOI: https://doi.org/10.1038/s41598-025-01874-7