The opioid epidemic regularly garners national headlines, and numerous new and ongoing efforts are attempting to curb the overprescription of opioids in this country. For example, here in Indiana, IU has already invested in 16 projects as part of the Addictions Grand Challenge. The university-government partnership is currently preparing for the second phase of proposals, soliciting novel solutions and interventions to address the issue in Indiana. Because this is a nationwide problem, the CDC has also put together numerous resources for anyone wanting to learn about the crisis, including a page dedicated to prescribing rate maps by state and by county, with data from 2006 to 2016. More information on the data source (the QuintilesIMS Transactional Data Warehouse) is available on the CDC maps site; I won’t go into that here.
The CDC maps are interesting but regrettably one-dimensional. There is a page for each year, with a cross-sectional look at that year’s prescribing rate data. The maps are interactive in the sense that you can zoom in and out and click to see county names and rates, but they are hard to compare across time because you have to go to a separate page for each year’s data. Additionally, the data isn’t directly downloadable, so you’re sort of stuck with those maps. Luckily, this is a great opportunity to do some creative data scraping and plotting! In this post I’ll go through how to make this map of the percent change in prescribing rates in Indiana by county over the 10-year period.
It is important to note when interpreting this map that it does not account for baseline rates (the data are not normalized), nor does it show the trend over time for each county. In general, opioid prescribing rates rose from 2006 to 2012 and have since fallen, creating an inverted-U-shaped trend. However, some counties have not declined back to their 2006 rates (those in red below).
The full R code to build this is available here, and I go through the process for scraping and calculating the bins below. Try it with other states!
Code walkthrough of building the map
The first thing we have to do is get the data… which isn’t downloadable as a CSV from the CDC website. It is, however, published directly below each map, which is great. Pretty much any website with a table on it is HTML scrape-friendly (as long as you’re allowed to scrape it) with some pretty straightforward packages in R.
# load the libraries we will need
library(rvest)           # for HTML scraping
library(tidyverse)       # for cleaning up our data and getting the stats we need, etc.
library(choroplethr)     # for the plotting capabilities
library(choroplethrMaps) # for the map data underlying the visualization
library(RColorBrewer)    # for pretty colors
“Scraping” sounds like a big scary technical thing, but it’s not nearly as difficult as people make it seem. Some fantastic developers have created tools in R to make this really easy, especially for HTML tables. This can give us researchers access to some interesting and nontraditional data sources fairly easily. It all rests on the idea of an XPath, which is basically a pointer in the HTML code of a website that says “hey, here is this stuff on this page.” We can find the XPath of the HTML tables holding all of the data for the CDC maps by using the Chrome Developer Tools and exploring the “Elements” tab. Then we just copy that XPath and paste it into our R scraping function and voila – it finds our tables! This R-Bloggers post has screenshots that take you through how to find the XPath for the table you want to scrape.
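To see the idea in miniature before we point at the CDC page, here is a toy example: a little HTML string with a table in it (made-up counties and rates), scraped the same way we will scrape the real thing.

```r
library(rvest)

# a toy "page": any HTML table can be pulled straight into a data frame
html <- '<table>
  <tr><th>County</th><th>Rate</th></tr>
  <tr><td>Marion, IN</td><td>93.5</td></tr>
  <tr><td>Monroe, IN</td><td>60.1</td></tr>
</table>'

tables <- read_html(html) %>%
  html_nodes(xpath = '//table') %>%   # the XPath finds every table on the page
  html_table()                        # ...and parses each one into a data frame

tables[[1]]  # a two-row data frame with County and Rate columns
```

The only difference with a real site is that the XPath gets longer and more specific, so it points at exactly the one table we want.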
Here’s an example for our 2016 CDC data.
##### DATA SCRAPING #####
# 2016 data
url <- "https://www.cdc.gov/drugoverdose/maps/rxcounty2016.html"
cdc2016 <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="contentArea"]/div/div/div/div/div/table') %>%
  html_table()
# html_table() returns a list of tables; we want the first (and only) one
cdc2016 <- cdc2016[[1]] %>% mutate(year=2016)
colnames(cdc2016) <- c("county","state","fips","rx_rate","year")
So we do this for all 10 years of data, only having to change the years and the URL it points to. See the full R file for all the iterations. Note to more advanced R users: yes, I should have made this into a function or for loop. ¯\_(ツ)_/¯ Maybe later.
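For the curious, here is a sketch of what that function might look like. It assumes the CDC URLs follow the same rxcounty<year>.html pattern for every year from 2006 to 2016 (worth spot-checking each page before relying on it), and it uses the libraries we loaded at the top.

```r
# build the URL for a given year -- ASSUMPTION: the rxcounty<year>.html
# pattern from the 2016 page holds for the other years too
cdc_url <- function(year) {
  sprintf("https://www.cdc.gov/drugoverdose/maps/rxcounty%d.html", year)
}

# scrape one year's table and tag it with the year
scrape_cdc_year <- function(year) {
  tbl <- cdc_url(year) %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="contentArea"]/div/div/div/div/div/table') %>%
    html_table()
  tbl <- tbl[[1]] %>% mutate(year = year)
  colnames(tbl) <- c("county", "state", "fips", "rx_rate", "year")
  tbl
}

# uncomment to scrape and stack all eleven years in one go
# (this hits the CDC site eleven times, so be polite):
# opioid_rx_06_16 <- bind_rows(lapply(2006:2016, scrape_cdc_year))
```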
Data Management Part 1: Constructing the data frame
Once we get all the data (for the whole country, 2006-2016 at the county level), there is some minor cleanup work to do to get the county names formatted correctly and rename a few things.
##### ORGANIZING AND CLEANING THE NATIONWIDE DATA #####
# bind the datasets
opioid_rx_06_16 <- bind_rows(cdc2006, cdc2007, cdc2008, cdc2009, cdc2010, cdc2011,
                             cdc2012, cdc2013, cdc2014, cdc2015, cdc2016)

# filter to Indiana
# please try this with other states!! you don't even have to change the data frame names!
# just swap in another state abbreviation :-)
# if you do, the cut() function below will need to be re-done with the conditional quartiles for that state
in_opioid_rx_06_16 <- filter(opioid_rx_06_16, state == "IN")

# get county names by themselves
# the CDC data has "County, ST" in the county field, which needs to be just the county name for labeling purposes
# note: this is NOT for matching purposes; it is only for labeling that we keep this field
# we match on FIPS codes, not on county name strings, which are problematic for a number of reasons
in_opioid_rx_06_16$county2 <- substr(in_opioid_rx_06_16$county, 1, nchar(in_opioid_rx_06_16$county) - 4)

# rename the county2 column to "name"
names(in_opioid_rx_06_16)[names(in_opioid_rx_06_16) == "county2"] <- "name"

# convert the rx rate to a numeric variable for further processing
in_opioid_rx_06_16$rx_rate <- as.numeric(in_opioid_rx_06_16$rx_rate)
Data Management Part 2: Percentage changes
Once we have our Indiana data from 2006-2016, we need to create a dataset with our percentage changes. We really only need 2006, 2016, and the delta, so let’s create that.
##### COMPUTING THE PERCENTAGE CHANGES #####
# net change 06-16
delta_opioid <- in_opioid_rx_06_16 %>%
  filter(year %in% c(2006, 2016)) %>%
  group_by(name) %>%
  spread(year, rx_rate) %>%
  select(c(3:6))

# rename the columns; R doesn't like field names that begin with numbers
colnames(delta_opioid) <- c("fips", "name", "rate06", "rate16")

# create our calculated fields to get the percent change
delta_opioid$rawchange <- delta_opioid$rate16 - delta_opioid$rate06
delta_opioid$pctchange <- (delta_opioid$rawchange / delta_opioid$rate06) * 100
delta_opioid$decline_flag <- if_else(delta_opioid$pctchange < 0, 1, 0)
Note: Yes, tidyr/dplyr pros, there is a way to include these calculations in the first piping function. Go for it 🙂
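One way that might look, as a sketch (wrapped in a function here so it is easy to test; the call at the bottom would replace Part 2 above):

```r
# the Part 2 calculations folded into a single pipe
# (uses the tidyverse we loaded at the top)
compute_delta <- function(df) {
  df %>%
    filter(year %in% c(2006, 2016)) %>%
    select(fips, name, year, rx_rate) %>%
    spread(year, rx_rate) %>%
    rename(rate06 = `2006`, rate16 = `2016`) %>%
    mutate(rawchange    = rate16 - rate06,
           pctchange    = (rawchange / rate06) * 100,
           decline_flag = if_else(pctchange < 0, 1, 0))
}

# delta_opioid <- compute_delta(in_opioid_rx_06_16)
```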
Data Management Part 3: Binning our data
This is the part I spent the most time trying to figure out how best to approach. We’ve got a scenario with our percent changes where some are negative, some are positive, and a few are zero. The distribution is slightly skewed to the negative (which in this case is good, as it represents a decrease in the opioid prescribing rate). We need to account for that and make sure to compare apples to apples. For example, if we just did deciles or quintiles or something like that, we’d run the risk of binning together slightly negative and slightly positive counties. But those should be in different groups, so we need to divide our percent change data around zero and then bin separately among increasers and among decliners. To do this, I divided both groups into quartiles within the group, so there are four groups of decliners and four groups of increasers, plus a zero group. The way to do this is with the cut() function, where you can define your own cut points.
# find the cut points for bucketing our data
# I wanted to make sure we had different color schemes for increases vs. decreases,
# which requires us to cut the data into buckets around 0
# the easiest way to do this is to just find the conditional quartiles
# so, for those that decreased, what are the cut points for their quartiles?
# then we use those values in a cut function to create the bins and categorize counties into quartiles
summary(delta_opioid[delta_opioid$decline_flag == 1, ]$pctchange)
# cut points are -22.543, -13.347, -7.231 for those that declined
summary(delta_opioid[delta_opioid$decline_flag == 0, ]$pctchange)
# cut points are 3.709, 9.865, 21.299 for those that rose
# also 0 for those that didn't change

# create the cuts
delta_opioid$cuts <- cut(delta_opioid$pctchange,
                         breaks = c(-1000, -22.543, -13.347, -7.231, -0.0001,
                                    0.0001, 3.709, 9.865, 21.299, 1000),
                         labels = c("More than 22.5% Decline",
                                    "13.3% to 22.5% Decline",
                                    "7.2% to 13.3% Decline",
                                    "Less than 7.2% Decline",
                                    "no net change",
                                    "Less than 3.7% Increase",
                                    "3.7% to 9.9% Increase",
                                    "9.9% to 21.3% Increase",
                                    "More than 21.3% Increase"))
Sweet! Now we have our four categories of counties that rose, and four categories of counties that declined, with labels that will help make them interpretable on the map.
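If you want the map to work for another state without re-typing the quartiles by hand, the conditional cut points can also be computed with base R's quantile(). A sketch with toy percent-change values standing in for the real data (swap in delta_opioid$pctchange to reproduce the actual bins):

```r
# toy percent-change values standing in for delta_opioid$pctchange
pctchange <- c(-40, -25, -18, -10, -5, 0, 2, 6, 15, 30)

# conditional quartiles: quartiles among the decliners, and among the risers
decline_q <- unname(quantile(pctchange[pctchange < 0], probs = c(.25, .5, .75)))
rise_q    <- unname(quantile(pctchange[pctchange > 0], probs = c(.25, .5, .75)))

# same shape as the hand-typed breaks above:
# four decline bins, a narrow zero bin, and four increase bins
breaks <- c(-Inf, decline_q, -1e-4, 1e-4, rise_q, Inf)
cuts   <- cut(pctchange, breaks = breaks)
```

From there you would build the labels from the computed break values the same way as above.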
Last step: Choroplethr data frame & plotting!
Now that we have our data as we want it to appear on the plot, we can create the data frame that the choroplethr package requires and do our plotting! Note again that we are matching on FIPS, not on the county name. FIPS is much preferred, and we thank the CDC for so graciously including those in the tables on the website so we could scrape them.
# create the data frame for our choropleth
chordf <- delta_opioid %>%
  group_by(fips) %>%
  select(c(1, 8)) %>%
  ungroup() %>%
  mutate(fips = as.numeric(fips)) %>%
  filter(is.na(cuts) == FALSE)

# change the column names to what the choroplethr map wants them to be
colnames(chordf) <- c("region", "value")

# create the choropleth!
county_choropleth(chordf, state_zoom = "indiana") +
  scale_fill_manual(name = "Change since 2006",
                    values = c("#2166ac", "#4393c3", "#92c5de", "#d1e5f0", "#D3D3D3",
                               "#fddbc7", "#f4a582", "#d6604d", "#b2182b"),
                    guide = guide_legend(reverse = TRUE)) +
  labs(title = "Change in Opioid Prescribing Rate by County, 2006-2016",
       caption = "Data from CDC.gov (http://bit.ly/cdc-opioid-data)\nNA counties have no data and appear white on the map") +
  theme(plot.title = element_text(color = "#666666", face = "bold", size = 16, hjust = .2, vjust = -3),
        legend.title = element_text(color = "#666666", face = "bold", size = 14),
        legend.text = element_text(size = 12),
        plot.caption = element_text(vjust = 5))