The current tax reform debate, and congressional procedure in general, have had me watching a lot of C-SPAN lately. As I followed the debate, I got curious about something that became more and more apparent: I knew Congress was old… but exactly how old? This seemed like a fun, easy data visualization task, and here we are. I found a website that maintains a nice, simple HTML table of names, current ages, term lengths, party affiliations, and some other information about the Senate, and I started there by scraping the data into R. If you want to run this yourself, the R code is below the graphic. Note: you may need to install some packages first, using the code commented out at the top.
I’m not going to walk through the scraping and data cleaning process in detail; it takes a little bit of wrangling to get the table data into the shape we need for the visualization. There are some nasty extra characters in the table that, for whatever reason, the readHTMLTable() function doesn’t handle well. I’m open to suggestions for other HTML scraping packages or functions that would bring the data in more cleanly. The upside of using this website in particular is that the table appears to be updated daily, so the ages will be current if you re-run the code, say, a year from now.
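As a rough illustration of the kind of wrangling involved, here is a minimal base R sketch that pulls the number of years out of a messy cell with a regular expression instead of relying on fixed character positions. The sample strings are made up for illustration (I am assuming the cells look roughly like a sort key followed by "NN years, NNN days"), and extract_years() is a hypothetical helper name, not something from the XML package.

```r
# Hypothetical helper: return the number that sits right before " year(s)".
# Cells without such a pattern come back as NA instead of raising an error.
extract_years <- function(x) {
  x <- as.character(x)                      # also handles factor columns
  m <- regexpr("[0-9]+(?= years?)", x, perl = TRUE)
  hit <- !is.na(m) & m > 0                  # which cells actually matched
  out <- rep(NA_real_, length(x))
  out[hit] <- as.numeric(regmatches(x, m))  # regmatches() returns matches only
  out
}

# Made-up sample cells (assumed format, not copied from the real table)
samples <- c(
  "7004305610000000000\u{2660}84 years, 123 days",
  "7003100000000000000\u{2660}5 years, 12 days",
  "n/a"
)

extract_years(samples)
```

Because the pattern keys on the word "years" rather than on character positions 21 and 22, single-digit values need no extra whitespace trimming, and the result would not depend on the length of whatever prefix precedes the number.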
Right now, the graphic shown here is static rather than interactive, which is less than ideal. I’ve got an interactive one in the works. In the meantime, if you run the code below, it will generate an interactive plotly version of the graphic, with tooltips showing the age, term length, party, and the senator’s name.
# get our packages loaded... you may have to install some of these
# install.packages("XML")
# install.packages("RCurl")
# install.packages("rlist")
# install.packages("ggplot2")
# install.packages("stringr")
# install.packages("plotly")
library(XML)
library(RCurl)
library(rlist)
library(ggplot2)
library(stringr)
library(plotly)

# download the page (note: SSL peer verification is switched off here)
theurl <- getURL("https://infogalactic.com/info/List_of_current_United_States_Senators_by_age",
                 .opts = list(ssl.verifypeer = FALSE))
sen_ages <- readHTMLTable(theurl)
sen_ages <- sen_ages$`NULL`

# the last row is erroneously included, trim it off
sen_ages <- sen_ages[1:100, ]

# convert the current age field to character for parsing
sen_ages$`Current age` <- as.character(sen_ages$`Current age`)
# grab the age in years out of that string
sen_ages$Age <- as.numeric(substr(sen_ages$`Current age`, 21, 22))

# char conversion for party
sen_ages$Party <- as.character(sen_ages$Party)
# independent has some stray characters after it, remove them
sen_ages$Party <- str_replace_all(sen_ages$Party, regex("^Indep.*$"), "Independent")
# make it a factor to use in the colorization
sen_ages$Party <- as.factor(sen_ages$Party)

# clean up the senator names
sen_ages$lastname <- as.character(sen_ages$Senator)
sen_ages$lastname <- gsub(",.*$", "", sen_ages$lastname)

# grab the term length variable (column 7) as text
sen_ages$TermLength <- substr(as.character(sen_ages[, 7]), 21, 22)
# some are single digits, so just trim off that white space
sen_ages$TermLength <- trimws(sen_ages$TermLength, which = "right")
# make sure they are numeric
sen_ages$TermLength <- as.numeric(sen_ages$TermLength)

# define our colors for the parties
plotcols <- c('Democratic' = 'blue', 'Republican' = 'red', 'Independent' = 'green')

# plot it: histogram of ages, with one point per senator at their term length
plot <- ggplot(sen_ages, aes(x = Age)) +
  geom_histogram(aes(Age), alpha = 0.3, bins = 10) +
  scale_fill_manual(values = plotcols) +
  scale_color_manual(values = plotcols) +
  geom_vline(xintercept = 81) +
  geom_vline(xintercept = 65) +
  geom_point(aes(y = TermLength, color = Party, text = c(paste("Senator:", lastname))),
             alpha = 0.6, size = 4) +
  labs(x = "Current Age in Years", y = "Years in Office",
       title = "Age Distribution and Term Length of Current Senate") +
  annotate("text", x = 50, y = 19,
           label = "Histogram shows overall distribution of ages in Senate") +
  annotate("text", x = 70, y = 0.5,
           label = "Vertical lines at 65y (retirement age) and 81y (female life expectancy)") +
  theme(legend.position = "none")

# static version
plot
# interactive version
ggplotly(plot)
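If you want tighter control over what the tooltips say, one option is to build the tooltip string yourself and pass tooltip = "text" to ggplotly(). The sketch below is only an illustration of that idea: the three rows are placeholders rather than real Senate data, and the demo and tip names are my own assumptions, not part of the scraped table.

```r
library(ggplot2)
library(plotly)

# Placeholder rows for illustration only; not real Senate data
demo <- data.frame(
  lastname   = c("Senator A", "Senator B", "Senator C"),
  Age        = c(48, 66, 82),
  TermLength = c(2, 11, 30),
  Party      = c("Democratic", "Republican", "Independent")
)

# Build the full tooltip string up front; <br> starts a new line in plotly
demo$tip <- paste0(
  "Senator: ", demo$lastname,
  "<br>Age: ", demo$Age,
  "<br>Years in office: ", demo$TermLength,
  "<br>Party: ", demo$Party
)

p <- ggplot(demo, aes(x = Age, y = TermLength, color = Party, text = tip)) +
  geom_point(size = 4, alpha = 0.6)

# tooltip = "text" tells ggplotly() to show only the custom string
widget <- ggplotly(p, tooltip = "text")
# print `widget` at the console to view the interactive plot
```

The same pattern should carry over to the Senate data by building the tip column on sen_ages before plotting, though I have not tried it against the live table.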