This week, I had the opportunity to participate in a health data hackathon in Haifa, Israel with a team from the Regenstrief Institute. We were invited to compete by colleagues at MDClone, a health data company based in Beer-Sheva, Israel. Each of the 22 teams at the hackathon selected one of three “challenges,” which generally focused on predicting disease risks. We focused on predicting congestive heart failure (CHF) among patients with diabetes. We were given synthetic datasets based on real clinical data from a health system in Israel and asked to create something useful, generalizable, and clinically important. We built a prediction algorithm using a random forest approach, then visualized the results in an RShiny app. Incredibly, we ended up taking second place in the competition!
We opted to first build a prediction model and then focus our energy on creating a visualization tool for clinicians. The question guiding this thinking is the one we hear from clinicians ad nauseam: “Why?” A major challenge with clinical risk prediction algorithms is that a person’s risk score by itself is rarely clinically useful. Telling a provider that a person is “at high risk” for something may raise awareness of that disease or issue, but risk status rarely informs care directly. Rather, providers want to know why that person is at high risk, so they can intervene on that particular risk factor. However, person-level risk factors aren’t always readily available from risk prediction tools: population-level risk factors are easily identified, but those specific to a person are not.
Risk prediction models that use sophisticated machine learning methods are typically described as “black boxes” that spit out a predicted probability or a weighted/ranked classification, and they often do not explain what is driving that probability or classification for a particular person or observation. Understanding this is increasingly important as access to big data and machine learning is democratized and these methods are used in more settings. For example, a garden variety random forest (like the one we used) may include hundreds of variables across its trees, with thousands of ways to combine those variables in pursuit of strong predictive performance. But the same subset of variables does not drive risk for every “high risk” person, even within the same model. Two people who both have a 75% predicted probability of developing CHF may have entirely different sets of factors driving that prediction, and almost certainly differ in the primary driver of their risk.
These drivers are what clinicians really want to know. Factors like specific medications (and details like number of fills), other comorbidities (and details like time since diagnosis), and laboratory results (and details like how much they fluctuate) matter differently for different people, and making those person-level differences easy to see has so far proven difficult in clinical risk prediction models. Our team set out to create something to address this challenge, which I describe in detail below.
Using an example dataset of houses in Boston, I built a demo version of our application with all of the same functionality, for you to explore if interested. We can’t publish the clinical risk prediction app publicly, but the demo is useful for showing how we visualized observation-level risk factors.
Our Team’s Approach
We started with a clinical data set of 65,000 control patients and 1,800 case patients. All of them had diabetes, but only the cases developed CHF. Each person had 160 clinical variables documented in the dataset, which was arranged at the individual level. The variables included age, gender, smoking status, time since diabetes diagnosis, time since diagnosis for a number of other diseases, body mass index (BMI), number of packages dispensed for a host of relevant medications, and the most recent value, median value, and standard deviation of a number of laboratory tests. We first generated features to simplify some of the clinical variables. For example, instead of “time since diagnosis of chronic kidney disease” we flagged patients as “having CKD” or “not having CKD.” We did this for all diagnoses, then created a diagnosis count variable, a sort of simplified Charlson Comorbidity Index. We also classified the medications into categories like “beta blockers” or “hypertension medications” and created flag variables for whether or not a patient was on any medications of that type. We created a medication count variable similar to the diagnosis count variable, and finally classified BMI into categories based on standard clinical cutoff points.
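To make the feature generation step concrete, here is a minimal sketch in Python (our actual work was done in R, and this is not our code). The column names, the "years_since_" prefix, and the rule that a missing time-since-diagnosis means "no diagnosis" are all assumptions for illustration, as is the choice of the WHO adult BMI cutoffs.

```python
from typing import Optional


def bmi_category(bmi: Optional[float]) -> Optional[str]:
    """Bucket BMI using WHO adult cutoffs (an assumed choice of cutoffs)."""
    if bmi is None:
        return None
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def engineer_features(patient: dict, dx_time_cols: list[str]) -> dict:
    """Add diagnosis flags, a diagnosis count, and a BMI category."""
    out = dict(patient)
    dx_count = 0
    for col in dx_time_cols:
        # Assumption: a missing time-since-diagnosis means no diagnosis.
        has_dx = patient.get(col) is not None
        out["has_" + col.removeprefix("years_since_")] = int(has_dx)
        dx_count += int(has_dx)
    out["dx_count"] = dx_count
    out["bmi_category"] = bmi_category(patient.get("bmi"))
    return out


# Hypothetical patient record with two diagnosis timing columns.
patient = {"years_since_ckd": 3.2, "years_since_copd": None, "bmi": 31.4}
features = engineer_features(patient, ["years_since_ckd", "years_since_copd"])
print(features)
```

The medication flags and medication count would follow the same pattern, with a lookup from each drug to its category.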
Once this final dataset was created, we reduced the feature set based on missingness. Prediction models generally require “complete cases”: if any variable is missing for a person, their whole record is dropped from the set. We therefore wanted to remove variables with high missingness so as to preserve as many records as possible, largely because there were so few CHF cases to begin with. We considered oversampling from the case set, but ultimately decided against it. We removed all variables with more than 5% missing values, which brought our final analytic data file to 77 features (plus one outcome feature: CHF).
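The missingness filter can be sketched in a few lines of Python. This is an illustration rather than our code: the toy rows and column names are made up, and missing values are assumed to be represented as None.

```python
def drop_sparse_columns(rows, max_missing=0.05):
    """Keep only columns whose share of missing (None) values is <= max_missing."""
    n = len(rows)
    columns = list(rows[0].keys())  # assumes every row has the same keys
    keep = [
        c for c in columns
        if sum(r[c] is None for r in rows) / n <= max_missing
    ]
    return [{c: r[c] for c in keep} for r in rows], keep


# Toy data: 40 records; "hba1c" is missing once (2.5%), "ldl" four times (10%).
rows = [
    {
        "hba1c": None if i == 0 else 7.0,
        "ldl": None if i < 4 else 100.0,
        "age": 60,
    }
    for i in range(40)
]

filtered, kept = drop_sparse_columns(rows)

# Complete-case filter: drop any record that still has a missing value.
complete = [r for r in filtered if all(v is not None for v in r.values())]
print(kept, len(complete))
```

Dropping the sparse column first means only one record is lost to the complete-case filter here, instead of four.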
We implemented a random forest classification model to predict CHF using all features. We tuned the model with 10% holdout validation, using 200 trees and trying three different values for the number of features randomly selected at each split. The model using the square root of the number of features (the default for classification in the R package randomForest) performed best, in that it had the lowest mean squared error, so that became our final prediction model. As I noted before, our focus was on building a tool that could visualize the results of any prediction model, so the random forest approach we used was straightforward and did not exhaust the parameter tuning options available in R.
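For a sense of the numbers involved, here is a small Python sketch of a 10% holdout split and the square-root rule for features per split. The random seed and the simple unstratified shuffle are assumptions for illustration; the total of 66,800 comes from the 65,000 controls plus 1,800 cases described above.

```python
import math
import random


def holdout_split(ids, holdout_frac=0.10, seed=0):
    """Shuffle ids and set aside a fraction as a holdout set."""
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]


n_features = 77
# randomForest's classification default is floor(sqrt(number of features)).
default_mtry = math.isqrt(n_features)

train_ids, holdout_ids = holdout_split(range(65_000 + 1_800))
print(default_mtry, len(train_ids), len(holdout_ids))
```

With 77 features, the square-root rule means each split considers 8 randomly chosen candidates.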
Once we had our model, we executed the random forest classification on the entire dataset, generating predicted probabilities of CHF for each person. Importantly, random forest models allow for generation of what are termed “local importance factors,” which essentially score the variables in the model with respect to how much they influenced the predicted probability of CHF for each person or observation. Overall model variable importance levels are often reported, but these reflect population-level variable importance, and do not necessarily reflect the specific risk factors that are contributing to a given individual’s predicted risk. Local importance factors allow for exactly this level of insight. With the model predictions and importance factors for each person in our data, we could build the visualization application.
We built an interactive visualization app using RShiny that allows the user to simultaneously view population- and person-level risk of CHF development (screenshot of demo app below).
When the app is launched, the user sees a scatterplot of the predicted risk level of each patient, with higher risk patients higher up on the plot. This plot illustrates the population distribution of risk: who is high risk? who is low risk? how many of each risk group are there? how is risk distributed? To show this distribution, there is an overlaid violin plot that captures the probability density function of the risk scores. Each point represents an individual, with color corresponding to whether or not the model predicted correctly (blue if correct, orange if incorrect). The points also have different shapes based on risk level: low-risk patients are open circles, “rising risk” patients are circles with an X, and high-risk patients are filled-in circles. For this challenge, we defined high risk as anyone with a risk score (predicted probability of CHF) at or above the 95th percentile, and rising risk as patients between the 80th and 95th percentiles. All others were classified as “low risk.”
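The tiering rule can be sketched in Python as follows. This is an illustration, not our app code: the function name is hypothetical, the scores are made up, and the percentile interpolation method ("inclusive") is an assumption, since different tools compute percentiles slightly differently.

```python
import statistics
from collections import Counter


def risk_tiers(scores):
    """Label each score as high (>= 95th pct), rising (>= 80th pct), or low."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    p80, p95 = cuts[79], cuts[94]
    return [
        "high" if s >= p95 else "rising" if s >= p80 else "low"
        for s in scores
    ]


# Made-up predicted probabilities: 0.00, 0.01, ..., 0.99.
scores = [i / 100 for i in range(100)]
tiers = risk_tiers(scores)
print(Counter(tiers))
```

Because the cutoffs are percentiles of the predicted scores rather than fixed probabilities, the share of patients in each tier stays roughly constant regardless of how the model is calibrated.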
As users review the population distribution on the left, they can double-click any of the points to generate a person-level plot of variables influencing that person’s predicted risk. The app dynamically takes the individual ID from the population plot and retrieves the local variable importance measures for that person. Then, it identifies the top 10 most important features by absolute value and plots their importance values relative to one another. Importantly, this allows for the x-axis to change dynamically for the person-level plot, as different variables matter for different people. One person’s smoking status may be the major driver of their risk prediction, while another may be driven largely by the number of dispensed packages of a particular diabetes drug.
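The selection step behind the person-level plot is simple enough to sketch in Python. The feature names and importance values below are invented for illustration, and the function name is hypothetical.

```python
def top_drivers(local_importance, k=10):
    """Return the k features with the largest absolute local importance."""
    ranked = sorted(
        local_importance.items(),
        key=lambda item: abs(item[1]),
        reverse=True,
    )
    return ranked[:k]


# Invented local importance values for one person.
person = {
    "smoking_status": 0.12,
    "metformin_packages": -0.31,
    "has_ckd": 0.27,
    "bmi_category": 0.02,
}
drivers = top_drivers(person, k=3)
print(drivers)
```

Ranking by absolute value keeps strongly negative contributions in view alongside positive ones, while the original sign is preserved for plotting.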
As you can imagine, this app could be laid on top of any risk prediction model, as long as that model is capable of generating person-level factor weightings. We hope that this can better inform clinicians and illustrate specific risks that should be prioritized for intervention.
A note: a few of you R/machine learning nerds will note that for random forest examples, the mushrooms dataset is far more popular. You’re right. But for these purposes, random forest actually performs *too well* on the mushrooms data, and I needed something more representative of the uncertainty in clinical prediction algorithms. You’ll also undoubtedly note that the outcome variable in the demo app, “chas,” is a flag for whether or not a house in Boston is on the Charles River. Kind of a silly thing to predict, I know. But for the purposes of the example, I needed a 0/1 outcome and “chas” won the day.