Introduction

The worldwide shift towards cleaner energy sources has positioned hydrogen (H2) as a versatile and sustainable fuel alternative1,2. The significance of hydrogen in energy distribution and storage has increased as global initiatives aim to reduce reliance on fossil fuels. Underground Hydrogen Storage (UHS) within geological formations, particularly saline aquifers, represents one of the most promising methodologies for large-scale hydrogen storage2,3,4,5,6. This methodology capitalizes on the distinctive properties of these formations for efficient hydrogen storage, thereby achieving a balance between the supply and demand of renewable energy. UHS in saline aquifers has attracted considerable attention owing to its technical feasibility and economic viability. The effectiveness of UHS is significantly contingent upon a thorough understanding and management of fluid–fluid and fluid–rock interactions within the reservoir. The interfacial tension (IFT) between hydrogen and brine plays a critical role in these interactions.

IFT significantly impacts how H2 behaves within the reservoir, influencing injectivity, capillary trapping, and overall storage efficiency. The IFT between H2 and brine influences the capillary forces that govern brine displacement by H2, which is critical for increasing storage capacity and ensuring long-term stability. Higher IFT values can improve capillary trapping, boosting H2 retention in the reservoir while potentially limiting recoverability during production. In contrast, lower IFT can improve H2 mobility and injectivity but may impair storage security by reducing capillary trapping and raising the likelihood of H2 migration7,8,9.

The interaction between H2 and brine generates complicated flow patterns that must be precisely modeled to maximize storage efficiency. Temperature, pressure, salinity, and gas impurities have a substantial influence on H2–brine IFT values and storage capacity10,11,12,13,14,15. Chow et al.10 conducted IFT measurements for the deionized water/H2 system under varying pressures (0.5–45 MPa) and temperatures (298.15–448.15 K), also incorporating the effect of CO2 addition10. The study found that IFT decreased with increasing pressure and temperature and that adding CO2 reduced IFT between the gas and water phases even further. Janjua et al.15 extended these observations to H2–brine systems, finding a notable reduction in IFT with rising temperature and a slight decrease with increasing pressure, while higher salinity caused IFT to increase15. Alanazi et al.14 investigated the impact of CH4 on the IFT between H2 and brine, revealing that the IFT values for the H2–brine system exceeded those of the CH4–brine system. This suggests that using CH4 as a cushion gas may enhance the potential for hydrogen geo-storage by reducing the IFT14. Al-Mukainah et al.11 noted a marginal pressure-dependent decrease in IFT, while Mirchi et al.13 demonstrated that an increase in temperature or CH4 concentration within the gas phase led to a further reduction in the H2–CH4/brine IFT11,13. Higgs et al.12 examined the effects of salinity and pressure, observing a decreasing trend in IFT with increasing pressure; however, they found no discernible correlation between salinity, particularly sodium chloride (NaCl), and IFT12.

IFT has conventionally been evaluated in laboratory settings using the pendant drop method. However, the extreme volatility of hydrogen and the complexities of replicating reservoir conditions make direct experimental testing exceptionally challenging. Laboratory measurements are time-consuming and costly, especially in high-pressure, high-temperature (HPHT) environments, and are also prone to inconsistencies due to experimental sensitivity. Molecular dynamics (MD) simulations offer molecular-scale insights into interfacial behavior, yet they are computationally intensive and constrained to limited system sizes and short timescales. These limitations underscore the growing need for alternative predictive frameworks. Machine learning (ML) models provide a data-driven solution that overcomes many of these challenges by learning complex relationships from existing data and offering fast, scalable, and accurate predictions across various conditions. Therefore, developing robust ML models is essential for enabling efficient IFT estimation in H2–brine systems, which directly impacts hydrogen storage design by influencing key factors such as injectivity, capillary trapping, and long-term migration risk. By enabling more accurate and efficient modeling of hydrogen–brine interfacial behavior, this work advances underground hydrogen storage technologies, which are essential for balancing seasonal energy supply and demand, integrating renewable energy sources, and accelerating the global shift toward a low-carbon hydrogen economy.

Recent advancements in machine learning, particularly hybrid and ensemble methods, have significantly improved predictive accuracy in applications such as rock petrophysical property prediction, carbon capture, and CO2 sequestration in saline aquifers. These contributions highlight the growing role of ML in enhancing predictive accuracy and optimizing operational strategies across different applications16,17,18,19. Similarly, sophisticated modeling techniques have emerged as effective instruments for predicting IFT and enhancing UHS performance. Hosseini and Leonenko20 established four intelligent models, namely Decision Trees (DT), Random Forests (RF), Support Vector Machines (SVM), and Multi-Layer Perceptron (MLP), designed to forecast IFT from input parameters such as pressure, temperature, and molality20. Their analysis highlighted the superior performance of the RF model, which achieved an R2 of 0.96, RMSE of 1.50, and AARD of 1.84%. The ensemble nature of RF, which excels at capturing complex data relationships, was credited for its enhanced predictive accuracy. Temperature was the most significant variable affecting IFT, followed by molality and pressure. Gbadamosi et al.21 proposed ML models, including Gaussian Process Regression (GPR), Elman Neural Network (ENN), and Logistic Regression, to predict IFT in H2–brine systems21. GPR outperformed the others, yielding a mean absolute percentage error of 0.0512, making it the most reliable model for IFT prediction. Ng et al.22 employed ML methods, including genetic programming (GP), gradient boosting regressor (GBR), and a multi-layer perceptron (MLP) optimized with the Levenberg–Marquardt and Adam algorithms, to model IFT22. The study's findings demonstrated that the MLP-LMA model exhibited the highest level of accuracy, achieving a coefficient of determination (R2) of 0.9997 and an Average Absolute Percentage Relative Error (AAPRE) of 0.1907%. Kohzadvand et al.23 investigated the prediction of contact angles in H2/mineral/brine systems, employing models such as Multi-Layer Perceptron (MLP), Radial Basis Function (RBF), and Least Square Support Vector Machine (LSSVM)23. The LSSVM model outperformed the others, achieving an R2 of 0.994, RMSE of 1.61, and AARD of 1.69%.

Despite significant advances in experimental and computational methods, accurately predicting IFT in H2–brine systems under realistic subsurface conditions remains a complex challenge, especially due to hydrogen's volatility and the variability in reservoir conditions such as pressure, temperature, salinity, and gas composition. Most existing models are either limited in scope or do not generalize well across different chemical environments and operational settings. To address this gap, this study develops a set of different ML models that incorporate a broad range of influencing parameters, including gas impurities (e.g., CH4, CO2) and diverse salt types. A key novelty of this work lies in introducing physically meaningful input transformations, such as salt equivalency and gas property generalization, which reduce input dimensionality while maintaining high model accuracy. This approach enhances model interpretability and applicability and supports robust IFT prediction across various hydrogen storage scenarios in saline aquifers. The remainder of the paper is organized as follows: the “Methodology” section describes the dataset and model development workflow; the “Results and discussion” section presents model performance evaluation; the “Sensitivity analysis” section discusses the sensitivity and interpretability analyses; the “Generalize the input parameters” section introduces the input parameter generalization strategy; and the “Conclusions” section presents the key conclusions and implications of the study.

Methodology

Data collection and preparation

A diverse dataset was compiled from various literature sources, encompassing IFT measurements under varying pressure, temperature, salinity, and gas composition conditions10,11,12,13,14,15,24,25. A thorough data processing procedure ensured only relevant IFT measurements were included. To ensure data quality, all collected measurements underwent a QA/QC process, which included unit consistency checks, outlier removal, and duplicate filtering. Visual inspection and statistical diagnostics (e.g., skewness) were also employed to validate the integrity and distribution of the dataset (a sketch of this QA/QC step follows Table 1). Table 1 provides a univariate statistical analysis of the input and output parameters from a dataset comprising around 500 data points. The temperature (T) ranges from 71.6 to 348.3°F, while pressure (P) varies between 14.5 psi and 6554 psi. The concentration of gas impurities, such as CO2 and CH4, spans from 0% (pure H2 gas) to 100.0% for each. The total salinity is reflected by the molar concentrations of salts, including Na2SO4, NaCl, CaCl2, MgCl2, and KCl, with values ranging from 0.0 to 3.0 M for each salt type. The IFT ranges between 24.2 and 89.0 mN/m. The standard deviations for pressure and temperature are notably high, at 1724 psi and 51.6°F, respectively, indicating significant variability in the pressure and temperature data. The skewness coefficients reveal that temperature and most salts exhibit a positive skew, suggesting a longer tail on the right side of the distribution. In contrast, the IFT has a slightly negative skew, indicating a distribution with a longer tail on the left.

Table 1 Statistical analysis of input and output parameters using univariate methods.
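As an illustration, the QA/QC step described above can be sketched in a few lines of pandas; the file and column names below (e.g., an "IFT" column in mN/m) are hypothetical placeholders, and the 3-sigma rule is one simple screening choice among several possible:

```python
import pandas as pd

# Hypothetical file/column names; "IFT" in mN/m, "T" in °F, "P" in psi.
df = pd.read_csv("ift_dataset.csv")

# Duplicate filtering: drop measurements repeated across source papers.
df = df.drop_duplicates()

# Outlier removal: discard IFT readings more than 3 standard deviations
# from the mean.
z = (df["IFT"] - df["IFT"].mean()) / df["IFT"].std()
df = df[z.abs() <= 3]

# Statistical diagnostics: univariate summary and skewness (cf. Table 1).
print(df.describe())
print(df.skew(numeric_only=True))
```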

Figure 1 illustrates a seaborn.pairplot visualization, which depicts cross plots of the parameters relative to one another, with the diagonal representing the individual data distributions. The last row emphasizes the correlation between IFT values and the various properties. A pronounced negative correlation is identified between IFT and operating pressure, temperature, and impurities such as CH4 and CO2. Conversely, elevated salt concentrations correlate positively with IFT, as increased salt levels correspondingly enhance the IFT values. The diagonal plots delineate the distribution of each parameter, revealing that IFT approximates a normal distribution. In contrast, other parameters demonstrate slightly right-skewed distributions, as their positive skewness coefficients indicate.

Fig. 1
figure 1

The Seaborn pair plot illustrates the cross plots of the parameters in relation to one another, while the diagonal displays the individual distributions of the respective data.

The Pearson correlation coefficient (R) was calculated to analyze how the IFT value depends on the input parameters. R-values were derived using Eq. 1.

$$R = \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)\left( y_{i} - \bar{y} \right)}{\sqrt{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2} \sum_{i=1}^{n} \left( y_{i} - \bar{y} \right)^{2}}}$$
(1)

where R is the correlation coefficient, \(x_{i}\) the sample input values, \(\bar{x}\) the mean of the input values, \(y_{i}\) the sample output values, and \(\bar{y}\) the mean of the output values.
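For reference, Eq. 1 transcribes directly into NumPy; the sketch below cross-checks the hand-rolled formula against numpy.corrcoef on synthetic data (the pressure range mirrors Table 1, and the linear response is illustrative only):

```python
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. 1: covariance of x and y over the product of their spreads."""
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

rng = np.random.default_rng(0)
x = rng.uniform(14.5, 6554.0, 500)              # pressure-like input, psi
y = 89.0 - 0.002 * x + rng.normal(0, 2.0, 500)  # illustrative IFT response

assert np.isclose(pearson_r(x, y), np.corrcoef(x, y)[0, 1])
print(round(pearson_r(x, y), 3))
```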

Figure 2a illustrates a heat map displaying the correlation coefficients among all parameters, while Fig. 2b summarizes the correlation coefficients between IFT and the input parameters, revealing relationships of distinct strength and direction. Among the salts, Na2SO4 (0.32), CaCl2 (0.33), and MgCl2 (0.32) exhibit the strongest positive correlations, indicating that an increase in the concentration of these salts leads to an increase in IFT. NaCl (0.10) and KCl (0.11) have weaker positive correlations, indicating minimal IFT effects. Temperature (T) shows a strong negative correlation (− 0.60) with IFT, making it the most influential factor in reducing IFT, likely due to molecular agitation that disrupts cohesive forces. CO2 (− 0.47) and CH4 (− 0.27) also have negative correlations, with CO2 being more effective at reducing IFT, possibly due to its higher solubility. Pressure (P) exhibits a weaker negative correlation (− 0.24), suggesting only a slight reduction in IFT with increasing pressure. Overall, temperature and CO2 are the key factors in reducing IFT, while salts such as Na2SO4 and CaCl2 markedly increase it. All input features were scaled using Min–Max normalization to ensure consistent input ranges and enhance training stability, particularly for the neural network models (a sketch of this scaling step follows Fig. 2).

Fig. 2
figure 2

The correlation coefficient results are presented as follows: (a) a heat map illustrating the Pearson correlation coefficient among all parameters in relation to one another, and (b) the correlation coefficient between the IFT and various input parameters.
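The scaling step referenced above can be sketched with scikit-learn's MinMaxScaler; the two-column toy matrix below stands in for the actual nine-feature input set, and the scaler is fitted on training data only to avoid leakage:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[ 71.6,   14.5],
                    [348.3, 6554.0]])   # toy [T, P] rows spanning Table 1
X_test = np.array([[200.0, 3000.0]])

# Fit the scaler on training data only, then apply it to unseen data.
scaler = MinMaxScaler().fit(X_train)
print(scaler.transform(X_test))          # values mapped into [0, 1]
```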

Model development

Figure 3 outlines the process of developing the ML models. It starts with Data Preparation, which entails gathering and processing relevant data to provide high-quality input for model development. This study employed a random subsampling cross-validation method, in which the data is split randomly into testing and training sets at different ratios, ranging from 40/60 to 20/80, to assess model performance across various splits (a sketch of this step follows Fig. 3). The next phase is Model Training, where hyperparameters are fine-tuned to achieve optimal model accuracy. Subsequently, model testing occurs, allowing adjustments to the split ratio for further performance refinement. The Model Validation step uses 100 unseen data points to evaluate the model's generalizability, ensuring it performs well on data not used during training. Additionally, to further confirm the effectiveness of the developed ML models, a tenfold cross-validation approach was utilized26,27,28. After validation, Sensitivity Analysis is conducted to understand the impact of different input variables on the model's predictions. Finally, the process concludes with a step that generalizes the input parameters, simplifying the model so it can be applied across a wide range of input conditions.

Fig. 3
figure 3

The workflow used to develop the different ML models.
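The random subsampling scheme in Fig. 3 can be sketched as follows; X and y are placeholders for the actual feature matrix and IFT vector, and the 100-point hold-out mirrors the validation set described later:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X, y = rng.random((500, 9)), rng.random(500)   # placeholder data

# Reserve ~100 unseen points for final validation first.
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=100,
                                              random_state=42)

# Random subsampling across testing fractions from 40% down to 20%.
for test_frac in (0.4, 0.3, 0.2):
    X_tr, X_te, y_tr, y_te = train_test_split(X_dev, y_dev,
                                              test_size=test_frac,
                                              random_state=42)
    print(f"{1 - test_frac:.0%}/{test_frac:.0%} split: "
          f"{len(X_tr)} train, {len(X_te)} test")
```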

The evaluation metrics for the models comprise the coefficient of determination (R2), Mean Absolute Percentage Error (MAPE), Average Percentage Relative Error (APRE), and the root mean square error (RMSE), calculated using Eqs. 2 through 7.

$$R^{2} = 1 - \frac{SS_{E}}{SS_{YY}}$$
(2)
$$SS_{E} = \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}$$
(3)
$$SS_{YY} = \sum_{i=1}^{n} \left( y_{i} - \bar{y} \right)^{2}$$
(4)
$$RMSE = \sqrt{\frac{SS_{E}}{n}}$$
(5)
$$MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|$$
(6)
$$APRE = \frac{100}{n} \sum_{i=1}^{n} \left( \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right)$$
(7)

where SSE is the sum of squared residuals, n is the number of data points, \(y_{i}\) are the actual values, \(\hat{y}_{i}\) are the predicted values, and \(SS_{YY}\) is the sum of squared deviations of the data from the mean.
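The four metrics map directly onto code; the following NumPy sketch implements Eqs. 2 through 7 as defined above:

```python
import numpy as np

def evaluate(y: np.ndarray, yhat: np.ndarray) -> dict:
    sse = ((y - yhat) ** 2).sum()                  # Eq. 3
    ss_yy = ((y - y.mean()) ** 2).sum()            # Eq. 4
    rel = (y - yhat) / y                           # signed relative errors
    return {
        "R2": 1 - sse / ss_yy,                     # Eq. 2
        "RMSE": float(np.sqrt(sse / len(y))),      # Eq. 5
        "MAPE": 100 / len(y) * np.abs(rel).sum(),  # Eq. 6
        "APRE": 100 / len(y) * rel.sum(),          # Eq. 7
    }

y = np.array([60.0, 70.0, 55.0])
print(evaluate(y, np.array([60.5, 69.0, 55.2])))
```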

In pursuit of high-accuracy prediction models for H2–brine IFT, a diverse set of ML algorithms was selected based on their proven efficacy in handling complex, nonlinear, and multivariate datasets. These models include Linear Regression (LR), DT, RF, GBR, Extreme Gradient Boosting (XGBoost), and Artificial Neural Networks (ANN). This selection balances interpretability and predictive power, providing a comprehensive evaluation of both simple and advanced modeling techniques.

LR is a baseline model that offers a straightforward method for evaluating linear dependencies. Despite its limitations in modeling nonlinear relationships, it is useful for benchmarking and understanding the minimum complexity required to capture trends in the dataset.

DTs are rule-based models that recursively partition the data according to feature values. They provide transparency and interpretability, making them attractive for model-explainability applications. However, single trees are prone to overfitting, especially when the feature space is large or the interactions are complex. Ensemble methods were included to overcome these limitations. RF, introduced by Breiman29, builds multiple DTs on bootstrapped samples and averages their outputs, reducing variance and improving robustness. RF is known for its stability, resistance to overfitting, ability to capture nonlinear interactions, and feature importance rankings.

GBR and XGBoost further enhance prediction by sequentially training trees on the residuals of previous ones. GBR is known for its performance in many structured data applications, while XGBoost incorporates regularization and parallel computing, making it both accurate and computationally efficient30. These models are well-suited for regression problems involving high-dimensional or correlated variables, such as brine salinity and gas composition.

ANNs, on the other hand, offer a powerful paradigm for capturing complex, highly nonlinear relationships through their layered architecture and activation functions31. ANNs have been widely adopted in subsurface modeling, including wettability, porosity prediction, and CO2 storage potential. Their flexibility allows them to learn intricate dependencies among variables like pressure, temperature, and impurity concentrations in fluid systems.

The combination of tree-based and neural models ensures a holistic comparison across model families, capturing both interpretable structure (in RF, DT) and deep pattern learning (in ANN). This methodological diversity increases the robustness of conclusions. It facilitates the identification of models best suited for real-world UHS scenarios, where physical interactions are often nonlinear and multivariate in nature.
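For concreteness, the six model families compared here can be instantiated as below; this is a sketch with scikit-learn and xgboost defaults, where MLPRegressor stands in for the Keras-style ANN described in the next section and hyperparameters are tuned subsequently:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

models = {
    "LR": LinearRegression(),
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
    # Stand-in for the Keras ANN: three hidden layers of 50 neurons.
    "ANN": MLPRegressor(hidden_layer_sizes=(50, 50, 50), max_iter=2000,
                        random_state=0),
}
```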

Results and discussion

Linear regression model results (LR)

The LR model was applied to examine the linear relationship between the input parameters and IFT. In Fig. 4a, a cross plot compares predicted IFT values from the LR model against the actual IFT values for both the training and testing datasets. Ideally, if the model were accurate, the points would align along the 45-degree line, representing perfect prediction. However, the plot shows a noticeable deviation from this line in both sets, with an R2 value of 0.76, indicating that the LR model explains only 76% of the variance in IFT and leaves substantial error. This deviation suggests that the linear model fails to capture the complexity of the relationship between IFT, the operating conditions, and the gas composition.

Fig. 4
figure 4

The results of the LR model (a) crossplots for the actual versus the predicted IFT values in both training and testing data sets, and (b) the residual analysis versus the IFT values.

Additionally, a residual analysis (shown in Fig. 4b) was performed by plotting the difference between actual and predicted IFT values. The residuals, ideally hovering around zero if the model is accurate, display substantial variation. The errors in predicted IFT reach up to 20 mN/m, with residual values frequently deviating from zero across different IFT values. This further confirms that the LR model struggles to accurately predict IFT, particularly due to the model’s oversimplification of non-linear interactions between the input variables and IFT. Consequently, these findings point toward the need for a more complex model that can better represent the underlying physics or non-linear patterns in the data.

Machine learning models results

Several ML models were developed to predict the IFT between brine and H2, including RF, DT, GBR, XGBoost, and ANN. Each model's hyperparameters were tuned to optimize performance, and the optimized values are summarized in Table 2. The grid search method was utilized to optimize the hyperparameters for each ML model: it systematically explored a range of values for each hyperparameter and searched for the combination that gave the highest performance. For the RF model, the optimal hyperparameters included "sqrt" as the maximum-features setting, a maximum depth of 25, and 125 estimators. The GBR model was optimized with a learning rate of 0.2, 50 estimators, and a subsample size of 0.5. The DT model achieved optimal performance with a maximum depth of 7 and "sqrt" as the maximum-features setting. The ANN architecture consisted of a Sequential model with an explicit input layer, followed by three dense layers, each containing 50 neurons. The 'relu' activation function was used for the first and third layers, while 'tanh' was applied to the second layer. The output layer, with a single neuron, was designed to predict IFT values. The data was split with a 70/30 training-to-testing ratio, and approximately 100 data points were set aside for validation purposes (a code sketch of the tuning and architecture follows Fig. 5).

Figure 5 shows the convergence plots for the RF, GBR, XGBoost, and ANN models, illustrating the training progression in terms of RMSE over iterations or epochs. For RF, the RMSE decreases sharply as the number of trees initially increases and then stabilizes, indicating that adding more trees beyond a certain point provides diminishing returns in error reduction. The GBR plot shows a similar trend, where early iterations significantly improve RMSE, with improvements that plateau as the number of boosting stages increases. The XGBoost convergence graph reveals a consistent decrease in training RMSE, showcasing effective learning and generalization over the iterations. The ANN plot displays a dramatic drop in training RMSE within the initial epochs, quickly converging to a stable low error rate, suggesting that the model rapidly captures the underlying patterns in the data.

Table 2 Tested options chosen to enhance the developed ML.
Fig. 5
figure 5

The convergence plots for the RF, GBR, XGBoost, and ANN for the training progression in terms of RMSE over iterations or epochs.
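The tuning and architecture described above can be sketched as follows; the RF grid is an illustrative subset centered on the reported optima, and the ANN block mirrors the stated Sequential layout (nine inputs assumed, matching the parameter count in Table 1):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from tensorflow import keras

# Grid search around the reported RF optimum (illustrative value ranges).
rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_features": ["sqrt", "log2", None],
                "max_depth": [15, 25, 35],
                "n_estimators": [75, 125, 175]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
# rf_grid.fit(X_train, y_train); rf_grid.best_params_

# ANN: Sequential model, three 50-neuron dense layers (relu/tanh/relu),
# single-neuron output predicting IFT in mN/m.
ann = keras.Sequential([
    keras.layers.Input(shape=(9,)),   # nine inputs assumed (Table 1)
    keras.layers.Dense(50, activation="relu"),
    keras.layers.Dense(50, activation="tanh"),
    keras.layers.Dense(50, activation="relu"),
    keras.layers.Dense(1),
])
ann.compile(optimizer="adam", loss="mse")
```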

Figure 6 presents cross plots of actual versus predicted IFT values for the developed ML models on the training and testing datasets. RF, XGBoost, and GBR outperform the other models, achieving R2 values close to 0.996 during training and remaining high on the testing dataset at 0.970, 0.975, and 0.972, respectively. This indicates that these models are highly effective in capturing the underlying trends in the data, with minimal overfitting or loss of predictive power when exposed to new data. The ANN model shows moderate performance, with an R2 of 0.975 during training and a slightly lower 0.960 on the testing data. While the drop is small, it suggests that the ANN is reasonably effective but performs slightly below RF, XGBoost, and GBR. The DT model, however, shows clear signs of overfitting. While it achieves an R2 value of 0.991 on the training set, its performance drops significantly to 0.853 on the testing set. This discrepancy indicates that the DT model fits too closely to the training data and fails to generalize to new data, a typical sign of overfitting.

Fig. 6
figure 6figure 6

Cross plots comparing the actual and predicted IFT values from various ML models, displayed for both training (left column) and testing (right column) datasets.

The findings from both R2 and the cross plots indicate that non-linear models, such as RF, GBR, and ANN, significantly outperformed the linear model (LR). RF, XGBoost, and GBR showed the strongest predictive capabilities, with tightly clustered data points along the 45-degree line in the cross plots, reflecting their ability to accurately predict the IFT as a function of the input parameters. The ANN's architecture of dense layers with 'relu' and 'tanh' activations allowed it to capture complex non-linear relationships in the IFT data, but its cross-plot trends suggest that further tuning could reduce the observed residuals and improve prediction accuracy.

Models validation

The evaluation of various ML models on the validation dataset, which comprises 100 previously unseen data points, offers a more comprehensive insight into their generalization capabilities. Figure 7 compares each model's predicted IFT values (represented by solid lines) against the actual IFT values (depicted as dots). Among the models, the RF, XGBoost, and GBR models demonstrated superior predictive power, closely aligning with the actual IFT values throughout the validation set. These models maintained strong performance, as indicated by R2 values above 0.975, RMSE values below 1.3, and MAPE values below 1.5%, confirming their ability to generalize accurately beyond the training and testing datasets.

Fig. 7
figure 7

A comparison of the predicted IFT values from various ML models (solid lines) against the actual IFT values (dots) in the validation dataset.

In contrast, the DT model, although relatively accurate, exhibited signs of overfitting, achieving an R2 value of 0.870 and an RMSE of 2.732 (MAPE of 2.7%) on the validation set. This is evident in its misprediction of IFT values for several data points: having focused excessively on the training data patterns, the model struggles to capture the full variability of the unseen data, a direct consequence of overfitting.

While capturing the general trend of the data, the LR model failed to account for the variations in IFT values, resulting in lower accuracy. With an R2 of 0.760 and RMSE of 3.250 (MAPE of 3.4%), the LR model shows its limitations in handling complex, non-linear relationships present in the data.

The ANN demonstrated robust performance, with an R2 of 0.980 and RMSE of 2.074 (MAPE of 2%) in the validation dataset. While not as precise as RF and GBR, the ANN model still accurately captured the IFT behavior, offering good predictive capability, though it slightly underperformed compared to the ensemble methods.

Figure 8 summarizes the R2, RMSE, and MAPE values across the training, testing, and validation datasets, highlighting the differences in model performance. The RF, XGBoost, and GBR models were the most reliable, accurately predicting the IFT values across all datasets, including the unseen validation set. While following the general trend, the DT model struggled with overfitting, and the LR model missed much of the data's variation. The ANN model provided solid performance, slightly trailing the RF and GBR models but still offering reliable IFT predictions.

Fig. 8
figure 8

R2, RMSE, and MAPE values across the training, testing, and validation datasets, highlighting the differences in model performance.

The superior performance of ensemble-based models such as RF, GBR, and XGBoost can be attributed to their ability to capture complex, nonlinear interactions among input variables and their robustness against overfitting. These models inherently leverage multiple decision trees and either bagging (RF) or boosting (GBR and XGBoost) strategies to reduce variance and bias, resulting in more stable and accurate predictions. Given the high-dimensional and heterogeneous nature of the dataset—characterized by variables such as pressure, temperature, multiple gas components, and varying salt types—the ability of these models to handle feature interactions and nonlinear relationships is critical. In contrast, simpler models like LR and DT lack the flexibility to model these complexities effectively, which explains their comparatively lower accuracy and wider error distributions. This reinforces the suitability of ensemble learning techniques for interfacial tension prediction in multiphase, multicomponent subsurface systems.

Furthermore, to evaluate the distribution of the residuals (actual minus predicted values), Fig. 9 presents the APRE (Fig. 9a) and the residual frequency distribution (Fig. 9b) for the different ML models. APRE measures the average signed percentage difference between predicted and actual values, providing insight into model bias, i.e., whether a model tends to overestimate or underestimate the target variable. As shown in Fig. 9a, all models exhibit APRE values very close to zero (between approximately −0.4% and +0.1%), indicating that none of the models show significant bias and reflecting a well-behaved regression process; the exception is LR, whose slightly negative APRE value reflects mild overestimation. Figure 9b shows the normalized residual frequency distributions of all developed ML models. Ideally, an unbiased model should have a histogram centered around zero with a roughly normal distribution, reflecting balanced prediction errors and consistent performance. The RF, GBR, and XGBoost models exhibit sharply peaked, lightly tailed distributions centered at zero, indicating low bias and minimal error spread, consistent with their high accuracy and low variance reported in Fig. 8. The ANN residuals are also fairly centered but show a slightly broader spread, suggesting moderate variance. In contrast, LR and DT display flatter, more heavily tailed distributions, highlighting their greater sensitivity to outliers and lower predictive accuracy. These error patterns confirm that the ensemble models reduce mean error and control prediction variance more effectively, supporting their robustness for IFT prediction.

Fig. 9
figure 9

APRE (a) and the residual frequency distribution (b) for the different ML models.

Identifying outlier data is a critical step in model validation, as it directly influences the reliability and generalization of any machine learning model. Outliers are defined as data points that deviate substantially from most observations. In large experimental datasets, such anomalies are common and, if left unchecked, can skew model performance evaluations. In the present study, the leverage method, a well-established statistical tool, was employed to identify potential outliers and define the applicability domain of each developed model. This method involves computing the Hat matrix to measure leverage and evaluating standardized residuals to detect deviations from the experimental data. Points with leverage values greater than a critical threshold (H* = 3(k + 1)/n, where k is the number of input variables and n the number of data points) are considered influential, while standardized residuals beyond ± 3 denote statistically significant prediction errors32.
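A minimal sketch of the leverage computation behind the Williams plot, assuming X is the n × k matrix of model inputs and y, yhat the actual and predicted IFT vectors:

```python
import numpy as np

def williams_coordinates(X: np.ndarray, y: np.ndarray, yhat: np.ndarray):
    """Leverage and standardized residuals for a Williams plot."""
    Xa = np.column_stack([np.ones(len(X)), X])   # intercept column
    H = Xa @ np.linalg.inv(Xa.T @ Xa) @ Xa.T     # Hat matrix
    leverage = np.diag(H)
    resid = y - yhat
    std_resid = resid / resid.std(ddof=Xa.shape[1])
    h_star = 3 * Xa.shape[1] / len(X)            # critical leverage 3(k+1)/n
    return leverage, std_resid, h_star

# Points with leverage > h_star or |std_resid| > 3 fall outside the
# applicability domain.
```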

As shown in the combined Williams plot for the test set (Fig. 10), the majority of the model predictions fall within the defined applicability domain (leverage ≤ H* = 0.10 and |R| ≤ 3), reflecting strong predictive reliability. Ensemble-based models such as RF, GBR, and XGBoost demonstrate particularly tight clustering of residuals near zero and minimal high-leverage outliers. This indicates both high generalization capacity and statistical robustness of the developed models. The ANN also performs well, although with a slightly broader residual spread. In contrast, LR and DT show more frequent deviations, suggesting these simpler models may not adequately capture complex, nonlinear patterns in the IFT dataset. Only a few points (relative to the full test set) exceeded either the residual or leverage thresholds, which is consistent with experimental noise and expected variance. Thus, the plot confirms that the bulk of the predictions are statistically valid and fall within the model's domain of applicability, reinforcing both the quality of the modeling and the consistency of the underlying experimental data.

Fig. 10
figure 10

Williams plot based on the test set for all machine learning models.

To assess the model’s robustness, a tenfold cross-validation technique was implemented, advancing beyond the random subsampling method. This process divides the dataset into ten subsets (folds), using each as a validation set once, while the remaining nine train the model. Repeating this cycle ensures comprehensive evaluation across the dataset, reducing bias and enhancing insights into performance26,27,28.
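A sketch of this procedure with scikit-learn, using placeholder data in place of the actual IFT dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.random((500, 9)), rng.random(500)   # placeholder data

scores = cross_val_score(
    RandomForestRegressor(random_state=0), X, y,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="r2",
)
print(scores.mean(), scores.std())   # cf. reported mean 0.975, std 0.013
```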

The results of this rigorous cross-validation process are depicted in Fig. 11a, which presents a cross-plot of the predicted versus actual values for the validation set. As observed in the plot, the close alignment of all data points along the 45-degree line indicates excellent model performance. This alignment suggests that the predictions are accurate and consistent across different subsets of data, confirming the model’s strong generalization capabilities across the entire dataset. Figure 11b summarizes the R2 values obtained from each fold, consistently indicating high model performance. The R2 values ranged from 0.94 to 0.988, with an average of 0.975 and a standard deviation of 0.013, demonstrating the model’s effectiveness without any signs of overfitting. This meticulous approach to testing ensures the model’s robustness and ability to generalize across varied datasets.

Fig. 11
figure 11

Tenfold cross-validation results: (a) cross plot of predicted versus actual IFT values, (b) R2 values for the different folds during the cross-validation process.

Sensitivity analysis

The RF model was used for the sensitivity analysis of input parameters on IFT values. It predicted IFT between brine and H2 using random values for T, P, CO2, CH4, Na2SO4, NaCl, CaCl2, MgCl2, and KCl. Sobol sensitivity analysis assessed the influence of each parameter on predicted IFT, effectively quantifying first-order and total-order sensitivities. First-order indices (S1) indicate the direct effect of a single parameter on the output, while total-order indices (ST) measure the combined effect of a parameter, including its interactions with other parameters. This sensitivity analysis helps identify which input factors have the most significant impact on IFT predictions, guiding future experimental and modeling efforts.

Random values for the input parameters were generated within the minimum and maximum bounds of each parameter, as indicated in Table 1, and 10,000 realizations of IFT were estimated from the different input combinations. The results indicate that temperature has the most significant influence on IFT, with a first-order Sobol index (S1) of approximately 0.5618, meaning that around 56% of the variance in IFT can be attributed to temperature alone (Fig. 12). Pressure and CO2 concentration also exhibit moderate effects, with S1 values of 0.0888 and 0.1659, respectively. Smaller contributions come from CH4, Na2SO4, NaCl, CaCl2, and MgCl2, while KCl shows a negligible first-order effect, with a slightly negative S1 value (a numerical artifact of the sampling). Among the different salts, CaCl2 and MgCl2 showed the highest indices compared to the monovalent salts.

Fig. 12
figure 12

First-order indices (S1), and total-order indices (ST) for the Sobol sensitivity analysis.

The total-order indices (ST) confirm these findings, showing that temperature’s total effect is 0.5965, and both CO2 and pressure contribute significantly with ST values of 0.2152 and 0.1280, respectively. The relatively small differences between S1 and ST for most parameters suggest minimal interactions between variables, except for CH4 and some of the salts, where the total-order effects are slightly larger, indicating some interaction with other variables. Overall, temperature, pressure, and CO2 concentration are the dominant factors influencing IFT in this system.
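The Sobol workflow above can be sketched with the SALib package as below; the bounds restate the Table 1 ranges, the toy RF stands in for the trained model from the previous section, and a base sample of 512 yields roughly the 10,000 realizations used here:

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol
from sklearn.ensemble import RandomForestRegressor

problem = {
    "num_vars": 9,
    "names": ["T", "P", "CO2", "CH4",
              "Na2SO4", "NaCl", "CaCl2", "MgCl2", "KCl"],
    "bounds": [[71.6, 348.3], [14.5, 6554.0], [0, 100], [0, 100],
               [0, 3], [0, 3], [0, 3], [0, 3], [0, 3]],
}

# Toy RF in place of the trained model (placeholder training data).
rng = np.random.default_rng(0)
rf_model = RandomForestRegressor(random_state=0).fit(
    rng.random((200, 9)), rng.random(200))

param_values = saltelli.sample(problem, 512)  # 512*(2*9+2) ≈ 10,000 runs
Y = rf_model.predict(param_values)
Si = sobol.analyze(problem, Y)
print(dict(zip(problem["names"], np.round(Si["S1"], 4))))
print(dict(zip(problem["names"], np.round(Si["ST"], 4))))
```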

The correlation coefficient shown in Fig. 13 illustrates the linear relationship between IFT and various input parameters. Operating pressure and temperature, along with gas impurities such as CO2 or CH4, are associated with lower IFT values between brine and H2. Conversely, different salts generally increase IFT values, with divalent salts (MgCl2 and CaCl2) having a more pronounced effect compared to monovalent salts.

Fig. 13
figure 13

Correlation coefficient for the input parameters versus IFT reflecting the importance of each based on the sensitivity analysis.

The cumulative frequency plot illustrates a clear correlation between IFT values and temperature, emphasizing the significant impact temperature has on IFT, as depicted in Fig. 14. When the temperature reaches 350°F, all potential IFT measurements remain below 60 mN/m, irrespective of other parameter variations. This indicates that IFT consistently decreases at elevated temperatures. Conversely, at temperatures under 100°F, over 20% of the IFT readings surpass 60 mN/m, reflecting a wider spectrum of higher IFT values at lower temperatures. This trend underscores the essential role temperature plays in influencing IFT behavior, with lower temperatures leading to markedly higher IFT values, highlighting its crucial role in regulating interfacial interactions in systems with brine and H2.

Fig. 14
figure 14

Cumulative frequency for different temperature values at random values of the rest of other input parameters.

SHAP (SHapley Additive exPlanations) is a model-agnostic interpretability technique rooted in cooperative game theory. It quantifies the individual contribution of each input parameter to a model's predictions by calculating the marginal effect of each feature across all possible combinations of feature values. This comprehensive approach ensures a fair attribution of impact to each feature, effectively capturing both linear and nonlinear relationships and interactions between features.

SHAP values were calculated for each feature using the trained ML model and used to assess the global importance of features, as shown in the bar chart in Fig. 15 (a code sketch of the SHAP computation follows Fig. 16). This chart indicates that temperature exhibits the highest average SHAP value, suggesting a significant influence on the model's predictions. CO2 and CaCl2 also have substantial impacts. CH4 and MgCl2 contribute notably, though their effects are less pronounced than those of the top contributors. In contrast, Na2SO4, NaCl, and KCl demonstrate minimal influence on the model's output.

Fig. 15
figure 15

SHapley Additive exPlanations results: (a) input feature importance bar plot, (b) SHAP force plot.

The SHAP force plots in Fig. 16 elaborate on how various parameters impact the model's predictions across a specified range. In this visualization, parameters in blue reduce the predicted IFT, indicating a negative impact, whereas parameters in red increase it, suggesting a positive effect. Case a features high salinity, pressure, temperature, and CO2: CaCl2, MgCl2, and Na2SO4 push the IFT up, yet higher pressure and temperature alongside the gas impurities push it down, balancing near the base case. Conversely, case b, with lower input values, results in a prediction close to the base. Case c, with high CO2 and pressure and no salts, lowers the IFT to 52.4 mN/m, below the base value of 66.3 mN/m. Case d adds high temperature to case c, pushing the IFT even lower, to 37.6 mN/m. In cases e and f, increased salinity drives the IFT up to 70.24 mN/m in the absence of gas impurities, elevated pressure, and high temperature.

Fig. 16
figure 16

SHAP force plots for local interpretability: (a) high input parameter values, (b) low input parameter values, (c) gas impurities and pressure effects, (d) temperature, pressure, and gas impurity effects on IFT, (e) increasing salinity in the presence of CO2, and (f) increasing salinity in the absence of CO2.
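A minimal sketch of the SHAP computation behind Figs. 15 and 16, using the shap package's TreeExplainer (exact for tree ensembles); the toy RF and random features are placeholders for the trained model and dataset:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 9))                 # placeholder feature matrix
rf_model = RandomForestRegressor(random_state=0).fit(X, rng.random(200))

explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X)

# Global importance bar chart (cf. Fig. 15a).
shap.summary_plot(shap_values, X, plot_type="bar")

# Local force plot for a single prediction (cf. Fig. 16).
shap.force_plot(explainer.expected_value, shap_values[0], X[0],
                matplotlib=True)
```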

The sensitivity analysis using the RF model, based on Sobol indices, was conducted to quantify the individual and interactive effects of the input parameters on IFT prediction. Temperature emerged as the most influential factor, followed by CO2 concentration and pressure, with both first-order and total-order indices confirming their dominant roles. These findings were further supported by the correlation analysis and SHAP values, which consistently highlighted temperature as the leading contributor to model predictions. Beyond identifying key features, these results have direct implications for UHS operations and experimental design. For instance, higher temperatures and elevated CO2 levels reduce IFT, enhancing hydrogen injectivity but potentially decreasing capillary trapping. Pressure also plays a crucial role in modulating fluid behavior and storage dynamics. From an operational standpoint, this insight can guide the optimization of injection strategies and cushion gas composition to balance efficiency and containment. In laboratory settings, the results suggest that experimental efforts can be streamlined by focusing on these dominant variables, thus reducing testing complexity while capturing the most impactful conditions for UHS performance.

Generalize the input parameters

To simplify the model, the input parameters in this section were combined and replaced with equivalent parameters that reflect their collective impact. First, all salts were replaced with an NaCl equivalent, calculated using the following equation.

$$\text{NaCl Eq.} = \sum_{i=1}^{n} C_{i} \times \frac{Mw_{salt}}{Mw_{NaCl}}$$
(8)

where \(C_{i}\) is the molar concentration of salt i, and \(Mw_{salt}\) and \(Mw_{NaCl}\) are the molecular weights of the respective salt type and of NaCl. Using the salt equivalent instead of the individual salts captures the impact of salt type and total salinity simultaneously. As a result, the dependence of IFT on salinity improved markedly, and the correlation coefficient between the NaCl equivalent and IFT increased to 0.6, compared with the low R values (0.1–0.33) for the individual salts, as shown in Fig. 17. For the gas composition, the overall specific gas gravity (\(\gamma_{g}\)), together with the pseudo-critical pressure and temperature (Ppc, Tpc), was used to reflect the composition of the gas stream. Similarly, the correlation coefficients between \(\gamma_{g}\), Ppc, Tpc, and IFT improved to − 0.56, − 0.49, and − 0.48, respectively, compared with R values below 0.47 for the individual gas compositions (a code sketch of these transformations follows Fig. 17).

Fig. 17
figure 17

Correlation coefficients between IFT and input parameters: (a) salt equivalents vs. individual compositions, (b) gas equivalents versus gas composition.
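The two transformations can be sketched as follows; the molecular weights are standard values, and the pseudo-critical properties use Kay's mole-fraction mixing rule with tabulated critical constants, which is an assumption about the exact procedure used here:

```python
# Molecular weights (g/mol) and gas constants (Mw, g/mol; Pc, psia; Tc, °R),
# approximate values from standard property tables.
MW_SALT = {"NaCl": 58.44, "KCl": 74.55, "CaCl2": 110.98,
           "MgCl2": 95.21, "Na2SO4": 142.04}
GAS = {"H2": (2.016, 188.1, 59.4),
       "CH4": (16.04, 666.4, 343.0),
       "CO2": (44.01, 1069.5, 547.6)}

def nacl_equivalent(salts_molar: dict) -> float:
    """Eq. 8: sum of concentration * Mw_salt / Mw_NaCl."""
    return sum(c * MW_SALT[s] / MW_SALT["NaCl"]
               for s, c in salts_molar.items())

def gas_equivalents(mole_frac: dict):
    """Specific gravity plus Kay's-rule pseudo-critical Ppc and Tpc."""
    gamma_g = sum(y * GAS[g][0] for g, y in mole_frac.items()) / 28.97
    ppc = sum(y * GAS[g][1] for g, y in mole_frac.items())
    tpc = sum(y * GAS[g][2] for g, y in mole_frac.items())
    return gamma_g, ppc, tpc

print(nacl_equivalent({"NaCl": 1.0, "CaCl2": 0.5}))   # ≈ 1.95 M NaCl Eq.
print(gas_equivalents({"H2": 0.9, "CO2": 0.1}))
```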

An RF model was developed using the newly derived input parameters. The results of this generalized model are illustrated in Fig. 18, which shows its performance across the different datasets. The cross plots for the training and testing datasets reveal a near-perfect correlation between the actual and predicted IFT values, with R2 values of 0.99 for the training data and 0.98 for the testing data.

Additionally, Fig. 18 demonstrates the RF model’s ability to predict IFT values as a function of all parameters listed in Table 1 and the equivalent parameters used in this study. Both trends exhibit complete alignment, indicating that the equivalent parameters related to salinity and gas composition successfully capture the effects of varying salt types, concentrations, and gas compositions. This alignment confirms that these generalized parameters are reliable for representing the impact of different salts and gases on IFT. While simplifying the model inputs, the equivalent salt concentration metric may lose granularity when applied to other types of salts, which could impact the accuracy and generalizability of the results. As such, while the metric has proven effective for the salts tested, its application to other salts might require additional validation.

Fig. 18
figure 18

RF model results for IFT prediction based on the equivalent parameters, (a) training, (b) testing, and (c) validation data sets.

This study introduced ML models capable of predicting IFT values with high accuracy and speed. While experimental measurements also yield accurate results, they are constrained by the experimental limitations associated with high-pressure, high-temperature (HPHT) handling of H2, as well as potential human interaction effects. Furthermore, molecular dynamics (MD) simulations offer detailed insights at a molecular level. Still, they are restricted by substantial computational demands, limiting their use to smaller systems and shorter simulation times33,34. In contrast, ML models—particularly ensemble methods such as RF and Gradient Boosting—have shown great promise in managing large datasets and delivering rapid predictions across a wide range of reservoir conditions with significantly lower computational requirements. However, a primary limitation of these models is their potential lack of accuracy when extrapolating to new or untested conditions outside the range of the training data. Despite this, the models developed in this study offer significant potential for integration with reservoir simulation platforms and real-time monitoring systems. Their ability to generate fast, reliable IFT predictions can support dynamic scenario analysis, improve uncertainty quantification, and inform operational decisions in underground hydrogen storage (UHS) workflows. Moreover, when coupled with real-time field data, these models could enable adaptive control strategies, such as adjusting injection pressures or gas compositions, enhancing storage efficiency and safety.

Conclusions

This work uses machine learning (ML) techniques to improve the prediction of interfacial tension (IFT) in hydrogen–brine systems, addressing the complexities and limitations associated with existing experimental methods. The study created a generalized machine-learning model to estimate IFT under various conditions, such as temperature, pressure, brine salinity, and gas composition. The following are the main conclusions.

  • The Random Forests (RF), Gradient Boosting Regressor (GBR), and Extreme Gradient Boosting (XGBoost) models achieved superior predictive performance, with R2 values above 0.97, outperforming the linear regression and decision tree models.

  • Residual frequency distribution plots and Average Percentage Relative Error (APRE) analysis confirmed that the ensemble models produced unbiased, low-variance predictions, with residuals tightly centered around zero.

  • Sensitivity and SHAP analyses identified temperature as the most influential factor, significantly reducing IFT at higher values, while CO2 concentration and pressure also played critical roles.

  • Divalent salts (CaCl2, MgCl2) exhibited a more pronounced effect on increasing IFT than monovalent salts (NaCl, KCl), highlighting the importance of considering salt type variability in predictive models.

  • Introducing an equivalent salt concentration metric simplified the input parameters while retaining predictive accuracy, demonstrating its effectiveness in modeling the combined impact of different salt types on IFT.

The application of ML to forecast IFT in H2–brine systems is a noteworthy development in underground hydrogen storage (UHS) optimization. It provides a novel approach to boosting storage effectiveness and advancing sustainable energy solutions.