dynamite in R

Carolyn Saplicki
IBM Data Science in Practice
Oct 18, 2023

By Carolyn Saplicki, Data Scientist

I recently worked on a project that used panel data. Many R packages already handle panel data, including plm, fixest, and brms. However, dynamite stands out by offering features that make it well suited to complex panel data.

As dynamite is relatively new, there are few examples of its use outside of the documentation. In this blog post, I’ll walk through a basic example of dynamite using PhysioNet Sepsis data and highlight dynamite’s easy-to-use formulas.

dynamite

dynamite introduces the concept of Dynamic Multivariate Panel Models (DMPMs). These models, as implemented in the package, offer several advantages:

  • Flexible Response Distributions: dynamite supports a variety of response variable distributions, such as Gaussian, Poisson, Bernoulli, and categorical.
  • Smooth Time-Varying Effects: dynamite allows for the estimation of effects that vary over time, using Bayesian P-splines with random walk priors.
  • Multiple Simultaneous Measurements: dynamite can handle different numbers of measurements per individual, providing a solution for complex data.

Sepsis Data

I began my analysis with sepsis data from Kaggle, which consists of records for different patients over varying lengths of time (Hours). Like the data in my project, it is complex: patients are observed for different lengths of time, and many values are missing.

To handle missing data, I used Last Observation Carried Forward (LOCF). This technique replaces each missing value with the most recent observed value for the same individual. Any patient who still had missing values after imputation was excluded from the analysis.
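The imputation step described above can be sketched in base R. This is a minimal illustration on made-up values, not the code used in the project: the toy data frame and the locf() helper are hypothetical, and only the column names Patient_ID, Hour, and WBC come from the post.

```r
# Toy panel: two patients, hourly WBC with gaps (values are made up)
toy <- data.frame(
  Patient_ID = c(1, 1, 1, 1, 2, 2, 2),
  Hour       = c(0, 1, 2, 3, 0, 1, 2),
  WBC        = c(NA, 7.1, NA, 8.4, 9.0, NA, NA)
)

# Hypothetical helper: carry the last observed value forward in one series
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the most recent observation
  c(NA, x[!is.na(x)])[idx + 1]    # leading NAs have nothing to carry, so stay NA
}

# Apply LOCF within each patient, after sorting by time
toy <- toy[order(toy$Patient_ID, toy$Hour), ]
toy$WBC <- ave(toy$WBC, toy$Patient_ID, FUN = locf)

# Drop every patient who still has a missing value after imputation
has_na   <- tapply(is.na(toy$WBC), toy$Patient_ID, any)
complete <- toy[!toy$Patient_ID %in% names(has_na)[has_na], ]
```

Patient 1 starts with a missing value, so LOCF has nothing to carry forward and the whole patient is dropped. Patient 2 is kept with the first reading repeated.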

One of the challenges in medical data analysis is dealing with varying lengths of patient stays. Padding is often used to give all patients the same number of time points. dynamite makes this unnecessary, as it can handle data with varying time points. This is a real advantage when working with real-world medical data, where the length of stay differs widely between patients.

Lastly, I shifted the dependent variable up one time step within each patient, so that each row pairs the predictors at one time step with the sepsis label from the following time step. This removes the need for lagged variables in the model formulas. To model lagged variables directly instead, see the dynamite documentation.
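One way to read the shift described above is as a per-patient lead: each row receives the label from the next hour, and the final row of each patient, which has no next hour, is dropped. The sketch below assumes that reading. The toy data and the lead1() helper are hypothetical.

```r
# Toy panel with a binary label (values are made up)
toy <- data.frame(
  Patient_ID  = c(1, 1, 1, 2, 2),
  Hour        = c(0, 1, 2, 0, 1),
  SepsisLabel = c(0, 0, 1, 0, 0)
)

# Hypothetical helper: move a series up by one position, padding with NA
lead1 <- function(x) c(x[-1], NA)

# Shift the label within each patient, after sorting by time
toy <- toy[order(toy$Patient_ID, toy$Hour), ]
toy$SepsisLabel <- ave(toy$SepsisLabel, toy$Patient_ID, FUN = lead1)

# The last hour of each patient has no following label, so it is removed
shifted <- toy[!is.na(toy$SepsisLabel), ]
```

After the shift, the row for patient 1 at hour 1 carries the label originally recorded at hour 2.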

Define and Estimate Models

For this blog, I will focus on the White Blood Cell (WBC) variable to predict sepsis.

Basic Model

Here, I create a formula that models the binary sepsis status (SepsisLabel) as a function of White Blood Cell count (WBC) within panel data. The formula uses a Bernoulli distribution to handle the binary outcome. In the call to dynamite(), the time argument identifies the time index (Hour) and the group argument identifies patients by their unique identifiers (Patient_ID). The model is then estimated with Bayesian inference via Markov Chain Monte Carlo (MCMC) sampling.

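library(dynamite)

# Bernoulli channel: binary sepsis label regressed on WBC
wbc_formula1 <- obs(SepsisLabel ~ WBC, family = "bernoulli")
print(wbc_formula1)

# Fit with MCMC; `test` is the prepared panel data frame
wbc_fit1 <- dynamite(
  dformula = wbc_formula1,
  data = test,
  time = "Hour",
  group = "Patient_ID",
  iter = 2000,
  chains = 4, cores = 4, refresh = 0
)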
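To make the Bernoulli formula above concrete: with the logit link, the model maps an intercept (alpha) and a WBC coefficient (beta) to a probability of sepsis. The sketch below uses made-up parameter values, not estimates from the fitted model, purely to show the mapping.

```r
# Hypothetical parameter values, for illustration only
alpha <- -3.0   # intercept on the log-odds scale
beta  <- 0.05   # change in log-odds per unit of WBC

# Example WBC readings (made up)
wbc <- c(4, 11, 25)

# Inverse logit turns the linear predictor into a probability
p_sepsis <- plogis(alpha + beta * wbc)
print(round(p_sepsis, 3))
```

With a positive beta, higher WBC readings map to higher predicted probabilities. A negative beta would reverse the direction.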

Time-Varying Model

This formula looks very similar to the first, with two key differences. First, WBC is wrapped in varying(), which gives it a time-varying effect. Second, the “-1” removes the time-invariant intercept; because the formula inside varying() carries its own implicit intercept, the model ends up with a time-varying intercept instead. This is especially helpful when the label variable is not consistent across all time frames.

To use WBC as a time-varying effect, we need a spline. The spline models how the effect of WBC changes over time, capturing non-linear patterns. In this case, the spline has 10 degrees of freedom (df = 10). The argument noncentered = TRUE requests a non-centered parameterization for the spline coefficients, which can improve convergence and sampling stability.

# Time-varying intercept and WBC effect, smoothed with a spline
wbc_formula2 <- obs(SepsisLabel ~ -1 + varying(~ WBC), family = "bernoulli") +
  splines(df = 10, noncentered = TRUE)
print(wbc_formula2)

# Store the fit under a new name so it does not overwrite wbc_fit1
wbc_fit2 <- dynamite(
  dformula = wbc_formula2,
  data = test,
  time = "Hour",
  group = "Patient_ID",
  iter = 2000,
  chains = 4, cores = 4, refresh = 0
)
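The idea behind a spline-based time-varying coefficient can be sketched with the splines package that ships with R. A B-spline basis over the time points is multiplied by a short vector of coefficients, and a random walk on those coefficients keeps the resulting path smooth. This is an illustration of the general technique under my own assumptions (48 hourly time points, a cubic basis, made-up coefficients); it is not dynamite's internal code.

```r
library(splines)

set.seed(1)
hours <- 0:47   # assumed time grid, for illustration

# Cubic B-spline basis with 10 columns, one per spline coefficient
B <- bs(hours, df = 10, intercept = TRUE)

# Random-walk coefficients: each one is the previous plus a small step
omega <- cumsum(rnorm(10, sd = 0.3))

# Time-varying coefficient: one value per hour, smooth across hours
delta <- as.vector(B %*% omega)
```

Only 10 coefficients are estimated, yet the effect can take a different value at each of the 48 time points.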

Results

Note: For these results, I used fitted() with a single posterior draw (n_draws = 1).

Basic Model

The ROC score (area under the ROC curve) is 0.63 for the training dataset and 0.63 for the validation dataset. The model discriminates between sepsis and non-sepsis cases better than random chance (0.5) but has clear room for improvement. It has some predictive power, but it may not capture the full complexity of the relationship between WBC and sepsis status.
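The post does not say how the ROC score was computed. One self-contained way to compute the area under the ROC curve from observed labels and predicted probabilities is the rank-based formula below. The auc() helper and the example vectors are hypothetical.

```r
# Hypothetical helper: area under the ROC curve from ranks
auc <- function(labels, scores) {
  r     <- rank(scores)            # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and predicted probabilities
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

auc(labels, scores)   # 0.75
```

The result is the share of positive-negative pairs in which the positive case receives the higher score, which is why 0.5 corresponds to random guessing.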

Below, we can see the alpha (intercept) and beta (WBC coefficient) for this model.

Varying Model

In this model, the ROC score improved substantially, to 0.76 for the training dataset and 0.76 for the validation dataset, compared with 0.63 for the basic model. With the time-varying parameters included, the model is more effective at distinguishing sepsis from non-sepsis cases.

Below, we can see the alphas and deltas for this model. The deltas are the time-varying WBC coefficients, which capture how the effect of the predictor changes over time. The alphas are the time-varying intercept, which increases over time.

Runtime Warnings

When adding more variables to these models, I often ran into runtime warnings related to convergence. Increasing the number of iterations (iter) and warmup iterations (warmup) in the model fit helped produce more stable models. Many parameters can also be tuned in each of the formula-building pieces. For instance, in splines with varying predictors, I added noncentered = TRUE. This uses a noncentered parameterization for the spline coefficients and is worth trying when divergences are encountered.

More Tools & Resources

Tools

The package has many more capabilities, such as group-specific random effects, latent factors, lagged effects, customized priors, summarized predictions, visualization tools, and support for various probability distributions. I suggest reading the documentation.

Conclusion

dynamite is a game-changer for investigating panel data. Its support for complex data, smoothly varying effects, and diverse response distributions, combined with fully Bayesian estimation, empowers researchers to unlock deeper insights from panel data. This R package is a valuable addition to the toolkit of anyone working with panel data in various fields, from social sciences to economics and beyond.
