Introduction

Although radiation therapy is an efficient and primary treatment for head and neck cancer (HNC), it can cause different side effects and toxicities that have a significant impact on quality of life, such as xerostomia or dry mouth1. To overcome these issues, advanced techniques like Intensity Modulated Radiation Therapy (IMRT) used aim to maximize dose to tumors while minimizing radiation dose to surrounding tissues2. However, these techniques have limited effectiveness because normal organs are directly adjacent to tumor volumes. For example, at least 50% of HNC patients are likely to develop some level of xerostomia following radiotherapy3. Patients who suffer from xerostomia experience difficulty talking, chewing and swallowing, oral health including dental caries, tooth demineralization, and opportunistic candidiasis. Parotid doses as low as 10–20 Gy given in 2 Gy fractions can result in an acute reduction in saliva production of 50–60%4. During a protracted course of radiotherapy, saliva production can drop to 20% after 7 weeks of treatments5. Radiobiological indices such as tumor control probability (TCP) and normal tissue complication probability (NTCP) are used to evaluate the efficacy of radiotherapy. Various NTCP models have been developed that rely on the use of dose-volume data obtained from radiotherapy planning6,7. These approaches do not consider the nature of patient specific tissue characteristics. Recently, more sophisticated models such as radiomics have been developed that utilize patient-specific information.

Radiomics based image features have been utilized with machine learning techniques to predict treatment outcomes including the onset of complications8,9,10,11,12. It aims to find robust statistical connections between imaging and clinical results13,14 and is a promising method for data-driven selection of patients for radiotherapy treatment15,16. There are several studies that demonstrated how image features of different modalities predict clinical outcomes of interest17,18. A sub-set of these has shown that radiomics features extracted from the parotid gland predict radiation-induced malfunction in patients with HNC8,19,20,21,22,23,24. A complementary method was explored by Liu et al.25 and van Dijk et al.23,26who analyzed differences in image characteristics of the parotid glands calculated by Computed Tomography (CT) performed at different times during the radiation therapy and found them to be associated with xerostomia. van Dijk and colleagues23,26 have identified surrogate radiomics-based biomarkers utilized in receiver operating characteristic (ROC) analyses that are predictive of late-onset xerostomia. Liu et al.25 proposed a new prediction system for xerostomia based on salivary amount and radiomics changes in CT images of patients with nasopharyngeal cancer treated with radiotherapy. They concluded periodic CT scans along with evaluation of salivary volume could be used for the prediction of acute xerostomia. In a retrospective study, Sheikh et al.20 introduced Magnetic Resonance (MR) based image features of salivary glands for prediction of xerostomia at post-irradiation in patients completing radiotherapy for HNC. Their results showed that the CT/MR-based image features could reflect baseline function of salivary gland and the potential risk for radiation toxicity due to radiotherapy. van Dijk et al.8 investigated whether the changes in the parotid glands seen in weekly CT imaging during treatment could improve the prediction of moderate to severe xerostomia using delta-radiomics features. In another study, van Dijk et al.22 investigated whether biomarkers from MR images were related to the finding of xerostomia 12 months after radiotherapy. Their results indicated a positive correlation of these characteristics to the onset of radiation-induced late xerostomia. In addition, the inclusion of these biomarkers showed improvement compared to a reference model.

These and other studies have shown that the selected machine learning algorithms can greatly impact the quality of the predictive model11,27. In the same direction, another approach showing promise is using ensemble learning. In this approach, different classifiers are combined to produce a composite model. While ensemble learning has been mainly used for diagnosis and prediction of treatment response, this approach has also demonstrated usefulness in the prediction of radiation-induced complications. Brunese et al.28used ensemble learning for diagnosis of brain cancer with high accuracy. In a study by Shayesteh and co-workers29, ensemble learning methods were compared to individual models to predict neo-adjuvant chemo-radiotherapy response in rectal cancer patients. The ensemble approaches were shown to improve the predictive power. Ensemble learning methods have also shown promise in predicting non-cancerous treatment response including the response of prolactinoma patients to dopamine agonists30and in intervention for depression31.

The aim of the present study is to evaluate the use of CT, \(T_1\), and \(T_2\) weighted MR imaging data combined with dosimetric, clinical, and non-clinical features to predict early radiotherapy-induced xerostomia for HNC patients using ensemble machine learning based algorithms. Our contributions here are:

  1. (1)

    to assess ensemble learning approaches in a machine learning based methodology for the prediction of early-onset xerostomia in patients receiving radiotherapy for HNC.

  2. (2)

    to extend the application of radiomics features to both \(T_1\) and \(T_2\)-weighted MRI combined with CT imaging, to provide a comprehensive analysis of the distinct information. This method enriches the characterization of different imaging modalities, allowing for a more nuanced understanding of underlying tissue properties and potentially improving the accuracy of predictive models.

  3. (3)

    to integrate Dosimetric Volume Histogram (DVH) and demographic information with radiomics features in the machine learning-based prediction model. By combining these data sources, we have an enhanced predictive model, offering a more holistic approach to outcome prediction and potentially improving the precision of treatment planning.

Methods

Patients demographics and dosimetrics data

Between 2020 and 2021, 187 HNC patients were treated who received radiotherapy. Among these, 80 patients were selected and evaluated prospectively for the study, all of whom had healthy salivary and parotid glands. We excluded patients who had undergone parotid removal or had parotid tumors, as the study aimed to examine the parotid glands as an organ at risk (OAR). Specifically, 35 patients were excluded due to tumors in the salivary or parotid glands removal, 39 patients were excluded because dental artifacts distorted the CT images in the parotid area, 15 patients were excluded due to complications during treatment, and 18 patients did not agree to complete the follow-up questionnaires. Patients were included based on the following criteria: (1) Histological diagnosis of hypopharynx, oropharynx or nasopharynx carcinoma and lymphoma. (2) Primary treatment with radiotherapy, either alone or in combination with chemotherapy. (3) Weekly follow-up questionnaires during treatments and 3 months after the end of treatment. Patients undergoing chemotherapy were treated according to standard institutional guidelines, primarily with cisplatin or carboplatin. Detailed chemotherapy protocols, including drug types, dosages, and schedules, are provided in the Supplementary Materials.

All individuals were treated by IMRT with a prescribed dose of 50–70 Gy. The patients aged between 16 and 85 years, each underwent CT scans acquired using a CT simulator before the treatment. The delineation of parotid glands on each CT was performed by an expert radiation oncologist and independently verified by another oncologist. The process began with registering CT and MRI images using the Eclipse treatment planning system (Rigid registration). The oncologist then delineated the contours on registered CT and MRI images. At the end of course of radiation therapy, and after 3 months post-treatment each patient underwent another CT scans for follow up.

Patients’ demographics included age, family history of cancer, surgery, smoking status, and history of chemotherapy. Moreover, a questionnaire was completed weekly by a radiation oncologist comparing the original CT and cone beam CT, and the impression of salivary function recorded. The questionnaire was included all information about any changes in primary tumor site, dry month, eating, speaking, and swallowing functions. After the end of the radiotherapy sessions and after three months, the same questionnaire follow up was performed. During this period, the highest severity of the complication was considered as the endpoint.

From the dosimetric data (DVHs), we extracted the dose details for both parotids, and the volume receiving x Gy of radiation (\(V_{xGy}\)). The dose-volume table related to parotid glands is shown in Table 1. The DVHs for parotid and other organs were extracted from the treatment planning system (TPS) and used for modeling. Parotid volumes receiving 5 Gy (\(V_{5Gy}\)) to 70 Gy (\(V_{70Gy}\)) were extracted from DVH curves. Other parameters including volume, maximum dose, minimum dose, modal dose, median dose, mean dose, and diameter of equivalent sphere were obtained from the TPS for each patient. A total of 14 different DVH metrics were extracted for each patient.

All patients provided written informed consent before starting therapy that their data could be used for research purposes. The university Medical School is not applicable to data collection as part of routine clinical practice and therefore, this work was approved by the Ethics Committee of the university Medical School for the conduct of studies based on these data. All patients received standard clinical care of radiotherapy. All ethical issues relating to the patients are approved by the ethical committee of the university. All procedures performed in studies involving human participants were in accordance with the 1964 Helsinki Declaration and its later amendments.

Table 1 Mean values for dosimetric data obtained from treatment planning.

Radiomic feature extraction

The main aim of this work was to predict xerostomia after radiotherapy in HNC patients. As schematically illustrated in Figure 1, the workflow was developed in three steps: (i) feature selection was performed on each dataset, (ii) different dosimetric features extracted from the TPS (iii) finally, machine learning models were trained for each feature subset, and the classifiers were trained using the selected subset together.

Fig. 1
figure 1

Workflow of the proposed approach. Two sets of features were considered: radiomic features extracted from parotid images, and dose features extracted from DVH. Feature selection and reduction techniques were applied to the sets of features to identify subsets of significant features. Different models, Random Tree (RT), Neural Network (NN), Linear Support Vector Machine (LSVM) and Bayesian Network (BN), were trained both on the individual feature subsets and using all the feature kinds jointly.

Acquisition of CT and MR images was performed before the radiotherapy sessions as part of the treatment planning process. CT scans were obtained using a GE Hi Speed CT scanner (GE Healthcare, Milwaukee, USA) with slice thickness of 1.5 mm, field of view 500 mm, matrix size 512\(\times\)512, 120 kVp and 340 mAs. MR images were acquired from three different centers and the protocols and imagers used are shown in Table 2. There are various studies on the effect of different imaging protocols and imaging devices on the radiomics features of images32,33,34,35. Due to potential effects on the MR image acquisition from different devices, image harmonization was performed using the COMBAT technique36,37 to control for multi-center acquisition of these images. Gross tumor volume (GTV) and OARs including parotid and other salivary glands were contoured by an experienced radiotherapy professional. Treatment planning was performed using a Varian Eclipse (version 15, Varian Medical Systems, Palo Alto, Ca, USA) TPS.

Table 2 Protocols used in MRI images.

Prior to feature extraction, we applied intensity normalization to the MR and CT images to ensure consistency across different imaging modalities and centers. Image feature extraction for the contoured structures was performed using the Radiomics module in the freely available 3D Slicer software (version 4.10.0 Harvard University, National Institutes of Health). A total of 642 radiomic features were extracted from both parotids for CT, \(T_1\) and \(T_2\) weighted MR images. The features type included shape, gray level dependence matrix (GLDM), gray level co-occurrence matrix (GLCM), first order, gray level run length matrix (GLRLM), gray level size zone matrix (GLSZM) and neighborhood gray tone difference matrix (NGTDM). The number of features extracted from the images was very large, given that many of these features were redundant and should be removed.

Feature selection and model evaluation

Before any analysis, the dataset was divided into two partitions: 70% for training and validation, and 30% for testing. The division was performed using a stratified random sampling approach to ensure proportional representation of xerostomia cases across both subsets. This stratification ensured that the distributions of demographic, dosimetric, and radiomic features, as well as the prevalence of xerostomia, were consistent between the training and testing sets. Additionally, we have included a table (Table 2S) in the Supplementary Materials that provides anonymized statistics and summaries of patient allocation to the training and testing subsets. All features were normalized using the z-score technique, with the mean and standard deviation calculated from the training data and subsequently applied to the testing data to prevent data leakage. Moreover, a 5-fold cross-validation and random search were conducted on the training data to optimize the hyperparameters of the classifiers. The test dataset, kept completely independent, was used solely for final model evaluation.

One of the important stages in modeling was feature selection. For modeling, there are different algorithms for feature selection. In this study, Pearson linear regression was used for selection of the features. In the selection process, all features were ranked from 0 to 1 based on their relevance to the prediction. We did not set a specific number of features to select; instead, our criterion was the importance of each feature. A feature was selected for modeling if its importance exceeded 0.95 and above. Features selected in the modeling are shown in Table 3. Here, the models used in the evaluation were Random Tree (RT), Neural Network (NN), Linear Support Vector Machine (LSVM) and Bayesian Network (BN) approaches. Different algorithms were combined by using the ensemble learning method as illustrated in Table 4. Specifically, the ensemble learning was conducted using the MATLAB Machine Learning toolbox, which includes an ’Ensemble’ operator for combining multiple models. The parameters for each algorithm were optimized during the model training process:

  • RT: The number of trees was set to 100, the maximum tree depth limited to 10 to prevent overfitting. A minimum of 5 samples per leaf was required.

  • NN: A feed-forward architecture with 2 hidden layers was utilized. The activation function chosen was ReLU, with a learning rate of 0.01 to support gradient-based learning.

  • LSVM: A linear kernel was used with a regularization parameter of 1 to balance the trade-off between model complexity and error.

  • BN: Maximum likelihood estimation was employed, using a smoothing parameter of \(1e^{-5}\) to account for small probabilities during structure learning.

The ensemble models were built by combining these individual algorithms in double, triplet, and quadruplet combinations (e.g., RT-NN, RT-BN, RT-NN-LSVM, etc.). The majority voting technique was employed to aggregate the predictions from the individual models. Moreover, due to the population imbalance which poses a great challenge in training any learning model, we used synthetic minority oversampling technique (SMOTE), to oversample the radiomics features and leverage the latent features38,39. The SMOTE was implemented using the Python library imblearn, which is specifically designed for oversampling minority classes in machine learning datasets. Performance of the constructed models was evaluated utilizing the ROC curve approach using accuracy, sensitivity, specificity and area under curve (AUC) metrics.

Table 3 The selected features and their importance weightings used in models studied.
Table 4 The ensemble models used in this study.

Results

Among all patients, 60% of them were determined to exhibit characteristics of early-onset xerostomia during radiotherapy. The findings in this study are reported under two major categories, the evaluation based on (i) single model, and (ii) multi model algorithms.

Single model algorithm comparison

According to Tables 5, 6, 7 single model AUC values varied from 0.64 to 0.92 depending on features and image types used. Table 5 and Figure 2 summarize results for individual learning models using \(T_1\) weighted MR image features. The RT model showed better predictive performance having AUC, sensitivity, specificity, and accuracy values 0.90, 0.81, 0.88 and 0.80, respectively. Individual model performance using \(T_2\) weighted MR image features is indicated in Table 6 and Figure 3. Noticeably lower performance can be seen in the \(T_2\) weighted only group in general with the better performing model, RT, in this cohort having AUC, sensitivity, specificity, and accuracy values 0.79, 0.77, 0.69 and 0.80, respectively. Results from more complex modeling consisting of combined features from patient demographics, DVH metrics, and \(T_1\) weighted and CT imaging for individual models are summarized in Table 7 and Figure 4. These results are somewhat mixed when compared to the \(T_1\) weighted features only group, but generally overall improvements are indicated. The BN model shows improvement in all categories while the other individual models improve in some categories but decline in others.

Table 5 Values obtained for Demographics plus \(T_1\) weighted MRI features dataset.
Fig. 2
figure 2

ROC curves for different single model algorithms for \(T_1\) weighted images and demographics features.

Table 6 Values obtained for Demographics plus \(T_2\) weighted MRI features dataset.
Fig. 3
figure 3

ROC curves for different single model algorithms for \(T_2\) weighted images and demographics features.

Table 7 Values obtained for Demographics, DVH, \(T_1\) weighted MRI, and CT image features dataset.
Fig. 4
figure 4

ROC curves for different single model algorithms for demographics + DVH + \(T_1\) weighted MRI Features+ CT Features.

Multi model algorithm comparison

Results for multi ensemble combined models using patient demographics, DVH, \(T_1\) weighted, and CT image features have been shown below. Additionally, in order to assess the impact of sample size on the final results, we have included the outcomes for the best model across different numbers of patients.

Generally, the double ensemble models show improvements over single models and are summarized in Table 8. The RT-BN combined model as illustrated in Figure 5 shows excellent predictability for early-onset xerostomia having, AUC, sensitivity, specificity, and accuracy values 0.97, 0.96, 0.92, and 0.91, respectively. In Triple and quadruple cases, the ensemble model results are shown in Table 9. ROC curves for the best triple RT-LSVM-BN and the quadruple RT-NN-LSVM-BN combined models are illustrated in Figures 6 and 7, respectively. These higher-level ensembles do not show any significant improvement over the RT-BN combined model. In general, the triple and quadruple models show mildly degraded performance when compared to the better performing double ensemble model, RT-BN.

Table 8 Values obtained from different double combined models for Demographics, DVH, \(T_1\) weighted MRI, and CT image features datasets.
Fig. 5
figure 5

ROC curves for the RT-BN combined model with different number of patients.

Table 9 Values obtained from different triple and quadruplet combined models for Demographics, DVH, \(T_1\) weighted MRI, and CT image features datasets.
Fig. 6
figure 6

ROC curves for the RT-LSVM-BN combined model with different number of patients.

Fig. 7
figure 7

ROC curves for the RT-NN-LSVM-BN combined model with different number of patients.

Discussion

Radiation therapy usually is associated with significant acute and late toxicities, including xerostomia40. Xerostomia is a common side effect of treating HNC and is caused by irradiated salivary gland damage. The radiotherapy methods have increased the incidence of acute adverse events among long-term survivors41, but better identification of patients at higher risk of toxicity is still needed. To minimize the burden of toxicity in HNC patients, individual toxicity risk assessment is necessary so that radiotherapy and supportive care can be appropriately planned. Recently, computer models based on quantitative analysis of biomedical images such as radiomics analysis have been effectively introduced to address neglected clinical needs, mainly within the area of oncology imaging42,43,44. The two main goals for this study were to firstly investigate the use of \(T_1\) and \(T_2\) weighted MR images combined with CT imaging, dosimetric, and demographic features to supplement the feature cohort and secondly to test ensemble learning approaches as a means to establish an effective learning model for the prediction of early-onset xerostomia in patients receiving radiation therapy for the HNC.

We have successfully utilized MR image features in the development of learning models for the prediction of early-onset xerostomia in head and neck radiotherapy. The use of \(T_1\) weighted image characteristics show some improvement over those models incorporating \(T_2\) weighted image features. The radiation therapy approach proved to be the better performing single model in both the \(T_1\) and \(T_2\) weighted testing groups. For the RT model, the AUC values obtained for the \(T_1\) and \(T_2\) weighted image feature groups were 0.90 and 0.75, respectively. Trends in other single models using \(T_1\) vs \(T_2\) weighted image features showed similarities to those obtained for the RT approach. In the \(T_1\) weighted image feature group, the single LSVM and NN models tended to underperform the RT and BN approaches.

When utilizing a more comprehensive set of features including demographics, dosimetric, \(T_1\) weighted MR and CT radiomic image features as shown in Table 7, the RT model again shows the better overall performance with the BN model also showing very good performance metrics. The LSVM and NN classifiers tend to broadly underperform the RT and BN approaches in this testing group. These results indicate that MR image features can be used in combination with other more common characteristics to produce learning models that perform well in the prediction of early-onset xerostomia. Our findings indicate that the radiomics model for \(T_1\) was more accurate than for \(T_2\). We believe this discrepancy may be due to several factors. \(T_1\)-weighted images generally provide better contrast for certain anatomical structures and may be more sensitive to the features relevant to predicting early-onset xerostomia. Additionally, \(T_1\)-weighted images often exhibit clearer differentiation of soft tissues, which can enhance the model’s ability to extract meaningful radiomics features. In contrast, \(T_2\)-weighted images might have more variability in tissue contrast or less distinct boundaries for the features of interest, potentially leading to lower accuracy in the model predictions.

One of the capabilities in the modeler software is the “Ensemble” operator. This operator enables the user to combine different algorithms to construct a combined model. A comparison of the single model results in Table 7 to the double model results in Table 8 indicates general improvement in model predictability for the double model groups. More specifically, the RT-BN dual model shows improvement over either the RT or BN single model groups alone. This result should not be surprising given the RT and BN single models were the better performers among the four single models evaluated. The RT-BN dual model shows good predictive power having Accuracy, Sensitivity, Specificity, and AUC values 0.91, 0.96, 0.92, and 0.97, respectively. Triplet and quadruplet models were also evaluated which yielded mixed results. A few scattered metrics show slight improvement over some dual models, but the majority of metrics show no improvement and some even indicated a loss in predictive power. As indicated in a comparison of Tables 8 and 9, there are no higher order model combinations that outperform the RT-BN dual model. These results indicate that ensemble approaches can improve model predictability, but this usefulness may be limited to lower multiplicities.

Based on our findings, we can offer suggestions including Pre-imaging, particularly the initial CT and MRI scans, proved highly valuable for early xerostomia prediction in our study. These images allow for a baseline assessment of the parotid glands and other OARs before radiation exposure. Radiomic features extracted from these images provide predictive biomarkers that can help identify patients at higher risk of developing xerostomia, enabling earlier intervention. The ability to predict xerostomia early, before treatment begins, can guide personalized treatment planning and adaptive strategies, such as sparing the salivary glands when possible. While we did not use during-CT imaging in this study, incorporating mid-treatment imaging could potentially enhance prediction models by capturing dynamic changes in the parotid glands as radiation therapy progresses. If available, during-imaging may allow for more adaptive treatment approaches based on the evolving tissue response. For future studies, imaging acquired during the course of radiotherapy (e.g., weekly or bi-weekly scans) may offer additional predictive power by identifying early tissue damage, allowing for timely adjustments to minimize toxicity. Lastly, the post-imaging, particularly scans taken immediately after therapy and during follow-up, can provide valuable information for understanding long-term radiation effects and correlating clinical outcomes with structural changes in the salivary glands. Although post-images are less useful for “early” prediction, they may contribute to models assessing chronic xerostomia risk. Integrating these scans into predictive models may help in refining the accuracy of long-term xerostomia predictions.

While this study primarily focused on the impact of parotid gland irradiation on early-onset xerostomia, it is important to note that mandibular gland irradiation has also been identified as a contributing factor to xerostomia symptoms in head and neck cancer patients. Previous studies have shown that irradiation of the submandibular glands, which play a critical role in saliva production, can significantly exacerbate xerostomia symptoms, particularly in cases where high-dose radiation is delivered to these glands45. Future research could explore the integration of mandibular gland dosimetric parameters, such as dose-volume metrics and glandular volume, into radiomics-based prediction models. Including these factors may provide a more comprehensive understanding of salivary gland function post-radiotherapy and improve the accuracy of predictive models for xerostomia. Additionally, advanced imaging modalities, such as functional MRI or dynamic salivary gland imaging, could be used to capture the interplay between parotid and mandibular gland function. By incorporating these considerations, future studies could further refine personalized treatment planning and enhance the clinical utility of predictive models for xerostomia. Moreover, it is essential to note that we have considered the influence of sample size on the ultimate performance of the multi-ensemble model. We have presented the results for the optimal model across various patient numbers, revealing that the model’s performance tends to improve with an increasing number of patients. However, it is noteworthy that after reaching a certain sample size, the observed improvement becomes less significant (See Figures 5, 6, and 7).

An important goal of this study was to evaluate the ability to predict early-onset xerostomia by introducing the use of MR image features and ensemble leaning approaches. The use of \(T_1\) weighted MR image features and the dual RT-BN ensemble provided good outcomes. These results open the possibility of adaptive intervention to improve the severity of xerostomia in patients receiving radiotherapy for HNC. However, important limitation of this study is the lack of an independent data set to externally validate the model. The need for other studies to test this approach is warranted.

Conclusion

The results of the present study show that radiomics features from MR images prior to treatment can be used as personalized and unique biomarkers for prediction of early-onset xerostomia. Furthermore, the results from the ensemble classifiers showed enhancement of the model’s performance in the prediction of early xerostomia and the use of ensemble learning can improve the efficacy of prediction.