The Background. As a rule, I don’t pick March Madness brackets. At least not for real. My favorite part of the tournament season is coming up with the most convoluted and esoteric criteria I can for picking games and then putting together my bracket based on those criteria. For example, a few years back I did a bracket based entirely on the middle initial of each university’s president. These approaches are not your everyday “fighting-mascots” arbitrary bracket selections. I take pains to ensure the methodology is unrelated to basketball and university stature. I mostly use the tournament as an opportunity to learn a little something about each of the teams and universities by looking up these details. This year, I’ve taken a more programmatic approach to the picking, mainly to brush up on some R skills. The best part is that it’s all done in R, which gives me a reason to post about using R to scrape Wikipedia pages and analyze text, so those of you at home can follow along. And rest assured, there are interesting plots at the end. If you want to see the actual bracket predictions, the completed bracket can be found here. R script file available here.
The Method. This year, I’m computing a score for each team and using it, together with chance, to decide who wins each game. The score is computed by taking 1) the number of vowels in the name of the arena in which the team plays and dividing it by 2) the number of consonants. This ratio is then multiplied by 3) the number of years the arena has been open. So each team gets a score. But how to choose the games? Obviously we can’t simply choose the team with the highest or lowest score in each game, because that would be boring and would be biased towards teams with very old or brand-new arenas, respectively. Instead, we’ll just flip a coin for each game in the bracket, with heads meaning the higher-scoring team wins, and tails meaning the lower-scoring team wins. You may be thinking that this coin flip basically removes any significance the score has in prediction. Yep. ¯\_(ツ)_/¯ My main goal in using these methods is to learn something while protecting myself from doing worse than simple chance (at least through the first round… after that it’ll get dicey). Luckily for us, coin flips are easy to simulate in R for the 67 games of the tournament (this includes the 4 play-in games), so it doesn’t take too long.
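To make the decision rule concrete, here is a minimal sketch of how a single game could be settled from two scores and one flip. The helper name `pick_winner`, the team names, and the scores are all made up for illustration, and the tie-break (first-listed team advances) is my own assumption, since the method above doesn’t say what happens when two scores are equal.

```r
# Hypothetical helper: settle one game from two scores and one coin flip.
pick_winner <- function(team_a, score_a, team_b, score_b, flip) {
  stopifnot(flip %in% c("Higher", "Lower"))
  # Assumption: on an exact tie, the first-listed team advances.
  if (score_a == score_b) return(team_a)
  a_is_higher <- score_a > score_b
  if (flip == "Higher") {
    # heads: the higher-scoring team wins
    if (a_is_higher) team_a else team_b
  } else {
    # tails: the lower-scoring team wins
    if (a_is_higher) team_b else team_a
  }
}

# Made-up teams and scores, for illustration only.
pick_winner("Team A", 40.2, "Team B", 12.7, flip = "Higher")  # "Team A"
pick_winner("Team A", 40.2, "Team B", 12.7, flip = "Lower")   # "Team B"
```

Because the flip is independent of the scores, each team still has a 50% chance in any single game; the score only determines which side of the coin a team is on.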
The Process. Lucky for me, someone has compiled a list on Wikipedia of NCAA Men’s Division I basketball arenas. We can get this data into R for some text analysis relatively easily by using the XPath of the table in the page’s HTML. For info on how to get the XPath, see this blog post.
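# Packages: pacman installs (if needed) and loads the rest
library("pacman")
p_load(tidyverse, rvest, tm)

url <- "https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_basketball_arenas"
arenas <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table()
# html_table() returns a list of data frames; keep the first one
# (assumed here to be the arena list)
arenas <- arenas[[1]]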
Now we have to look through the table and figure out what we need to clean up in the variables we are going to use. First off, we see that there are some bracketed annotations in the Arena, Opened, Capacity, and Conference fields. These are references, and we want to remove them because they will mess up our calculations if they stay in the data. Capacity also contains thousands separators, which have to go before the values can be converted to integers.
# remove citations from Arena, Opened, Capacity, & Conference fields
arenas$Arena <- gsub("\\[.*?\\]", "", arenas$Arena)
arenas$Opened <- gsub("\\[.*?\\]", "", arenas$Opened)
arenas$Capacity <- gsub("\\[.*?\\]", "", arenas$Capacity)
arenas$Capacity <- gsub("\\,", "", arenas$Capacity) # remove commas
arenas$Conference <- gsub("\\[.*?\\]", "", arenas$Conference)
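If the regular expression looks cryptic, here is a small sketch of what it does, run on made-up strings rather than the real table. The pattern matches an opening bracket, the shortest possible run of characters, and a closing bracket; the `?` after `.*` is what keeps it from swallowing everything between the first `[` and the last `]`.

```r
# Made-up values that mimic footnote markers in a scraped table
raw <- c("Example Arena[1]", "1999[a][12]", "Old Gym[1] Annex[2]", "12,345[3]")

# Non-greedy match: each [..] marker is removed on its own
no_refs <- gsub("\\[.*?\\]", "", raw)
no_refs

# For comparison, a greedy match also eats the text between two markers
greedy <- gsub("\\[.*\\]", "", raw)
greedy[3]

# Dropping the thousands separator lets the capacity convert cleanly
capacity <- as.integer(gsub(",", "", no_refs[4], fixed = TRUE))
capacity
```

The third string shows the difference: the non-greedy version keeps "Annex", while the greedy version drops it along with the markers.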
Now we can do our actual calculations of the score.
# Calculate years the arena has been open
arenas$yrsopen <- 2018 - as.integer(arenas$Opened)

# Calculate the Vowel:Consonant Ratio
arenas$name <- tolower(arenas$Arena) # lowercase
arenas$name <- removePunctuation(arenas$name) # remove punctuation
arenas$name <- str_replace_all(arenas$name, fixed(" "), "") # remove spaces
arenas$namechars <- nchar(arenas$name) # count total letters in name

# add up vowels
arenas$vowels <- str_count(arenas$name, "a") +
  str_count(arenas$name, "e") +
  str_count(arenas$name, "i") +
  str_count(arenas$name, "o") +
  str_count(arenas$name, "u")

# calculate consonants
arenas$consonants <- arenas$namechars - arenas$vowels

# calculate ratio
arenas$vtocratio <- round(arenas$vowels / arenas$consonants, 3)

# multiply by the years
arenas$vcXyrs <- round(arenas$vtocratio * arenas$yrsopen, 2)
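As a sanity check, the same arithmetic can be worked through for a single arena. The function below is a base-R sketch of the steps above, not the code used for the bracket: the name `arena_score` and the `ref_year` argument are my own, and it strips punctuation and whitespace with a character class instead of `removePunctuation()`.

```r
# Hypothetical helper mirroring the column-by-column calculation above
arena_score <- function(arena_name, opened, ref_year = 2018) {
  name <- tolower(arena_name)
  # drop punctuation and whitespace, keep everything else
  name <- gsub("[[:punct:][:space:]]", "", name)
  n_chars <- nchar(name)
  n_vowels <- nchar(gsub("[^aeiou]", "", name))  # keep only a, e, i, o, u
  n_consonants <- n_chars - n_vowels             # everything that is not a vowel
  ratio <- round(n_vowels / n_consonants, 3)
  round(ratio * (ref_year - opened), 2)
}

# "cameronindoorstadium": 20 characters, 9 vowels, 11 consonants
# ratio = round(9 / 11, 3) = 0.818; years open = 2018 - 1940 = 78
arena_score("Cameron Indoor Stadium", 1940)  # 63.8
```

Note that anything left in the name that isn’t a vowel, including a digit, is counted as a consonant, just as in the column-based version.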
At this point, we have a data frame with 352 observations (one for every NCAA Division I basketball team) and 15 variables per observation. We will want to trim it down to just the fields we care about to make it easier to find the scores for the teams actually in the tournament.
# trim it down for using with the bracket
# (columns are picked by position, so this depends on the table layout)
bracketdf <- select(arenas, c(5, 15, 2, 7, 14)) %>%
  arrange(Team)
I said before that we needed to flip a coin for each game, and that R lets us simulate this quite easily. So let’s go ahead and simulate 67 coin flips and see what we get. Notice I set a seed at the beginning that is the average capacity across all arenas. This is so that I can get replicable results for the coin flip. If you set the seed differently, you’ll get a different set of coin flip results, but if you set it like I have below, you’ll get the exact same results.
# game flips to see if higher or lower score wins
set.seed(as.integer(mean(as.integer(arenas$Capacity), na.rm = TRUE)))
games <- sample(c("Lower", "Higher"), size = 67, replace = TRUE, prob = c(0.5, 0.5))
games

# console output:
> games
 [1] "Lower"  "Lower"  "Lower"  "Higher" "Lower"  "Lower"  "Higher" "Higher" "Higher" "Higher" "Lower"
[12] "Higher" "Lower"  "Lower"  "Lower"  "Higher" "Lower"  "Lower"  "Lower"  "Lower"  "Higher" "Lower"
[23] "Lower"  "Higher" "Higher" "Higher" "Lower"  "Higher" "Lower"  "Lower"  "Lower"  "Lower"  "Lower"
[34] "Higher" "Higher" "Lower"  "Lower"  "Higher" "Lower"  "Lower"  "Higher" "Higher" "Lower"  "Lower"
[45] "Lower"  "Higher" "Higher" "Lower"  "Lower"  "Lower"  "Higher" "Higher" "Higher" "Higher" "Lower"
[56] "Higher" "Higher" "Lower"  "Lower"  "Lower"  "Higher" "Lower"  "Higher" "Lower"  "Higher" "Higher"
[67] "Lower"
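If you want to convince yourself that the seed really does pin down the flips, here is a small self-contained sketch. The wrapper `flip_games` is a hypothetical name, and the seeds 1234 and 4321 are arbitrary placeholders, not the average-capacity seed used above.

```r
# Hypothetical wrapper: reset the seed, then flip one coin per game
flip_games <- function(seed, n_games = 67) {
  set.seed(seed)
  sample(c("Lower", "Higher"), size = n_games, replace = TRUE)
}

first_run  <- flip_games(seed = 1234)  # arbitrary placeholder seed
second_run <- flip_games(seed = 1234)  # same seed again
other_run  <- flip_games(seed = 4321)  # a different seed

identical(first_run, second_run)  # TRUE: same seed, same flips
length(other_run)                 # 67 flips, drawn from a different stream
table(first_run)                  # tally of "Higher" vs "Lower"
```

Keep in mind that the seed in this post is derived from the scraped Capacity column, so if the Wikipedia table changes, the seed, and with it the flips, may change too.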
Finally, as promised, some visualization of this data. One interesting thing about these arenas is that there might be some trends in capacity as newer arenas are built. So here’s a quick ggplot for that.
# plot the capacity by year opened and shade by conference
# (columns are referenced by name inside aes() rather than via arenas$...)
ggplot(arenas, aes(x = as.integer(Opened), y = as.integer(Capacity))) +
  geom_point(aes(color = Conference), alpha = 0.4, size = 5) +
  theme(legend.position = "bottom") +
  labs(title = "Capacity by Year Arena Opened & Conference",
       x = "Year Opened",
       y = "Capacity")