For the past year or so, I spent whatever spare time I had outside of work learning a new set of skills in the field of Data Science and working on a capstone project that I am now proud to show below.
I wanted to focus on a project in an area that I have a passion for, and one that could lead to further applications. I knew I needed a good set of data to begin with, and then to be able to glean information from it that would be useful for predicting some outcome or variable. I ultimately decided to go for data surrounding ultramarathons, specifically 100-mile races.
There are a few different sites that I thought to pull data from (UltraSignup, Strava, RealEndurance, or Trail Runner Project), but I wasn't able to properly scrape the data. Instead, I was able to take data from the results pages of race websites, and focused on what is the gold standard for 100-mile races - the Western States Endurance Run (WSER).
The race has some interesting lore behind it, and it is consistently one of the most competitive races year after year (ultrarunning.com ranked it #1 for both 2015 and 2016). It was also the very first 100-mile ultramarathon, and it is most often the metric that other 100-mile races are compared to. Therefore, I figured it would be a good race to evaluate and analyze for my project.
While most may be familiar with a traditional marathon of 26.2 miles, an ultramarathon is any foot race over 26.2 miles. More often than not, these races take place on the trails of state and federal lands (sometimes looped, other times point-to-point), and often involve lots of vertical climbing and descending to go with the horizontal distance (known as "vert" for vertical gain).
In terms of physiology, for a 100-mile race, the body does not contain enough readily available resources for the mind and muscles to last the duration of these events. Depending on the terrain, these races can take anywhere from 14 hours for the superhumans, to 24 and 36 hours and beyond. Aid stations are set up along the course to fuel and hydrate runners, and serve as a chance for rest if needed. The aid stations can also double as checkpoints for timing and to verify a runner is on course, or offer them a chance to drop if they deem it necessary.
Lastly, while the point of a race is to go as fast as possible, 100 miles is a long distance and requires smart planning and execution, otherwise finishing may not even be an option. For some runners that may mean running up the hills and spending little time at aid stations, while for others it is wiser to power-hike up a hill and conserve energy for flatter and downhill sections.
One disclaimer: this project attempts to break down the different factors available and use those to best predict whether or not a runner will reach the finish line. I personally have experience running ultras and understand there is a lot more that can determine whether a runner will finish aside from what I outline below. Keep in mind, I am working with a limited data set.
The WSER site, where I pulled the data from, not only includes the final times and overall places for each race, but also has a detailed results page that includes all splits into the aid stations for each year.
The format of the file was not always consistent and required a fair amount of wrangling.
One of the first steps was determining which aid stations were used in a given year, since they varied. I was able to normalize each aid station based on its name, distance from the start, and whether the measurement was for a runner coming into an aid station, or leaving. Below is an example of what the final data looked like in Excel before I exported it to a *.csv file and imported it into R, the programming language I used to analyze the data and create a predictive model.
Once in R, it was necessary to wrangle the data into a format that would be most conducive to analysis. Information pertaining to the runner's overall stats (name, age, gender, overall time, place, etc.) was formatted in a style that was easy to work with. As described above, the aid station data was not always consistent, but it was critical to analyzing the race. Below is an example of my methodology for how I categorized the aid stations and the information extracted.
|AS Name||Variable Name||Distance (miles)||In or Out||Type of Measurement|
|Red Star Ridge||
|Red Star Ridge||
Even after having the data in a standard and manageable form, there were still lots of steps to go through to analyze everything. Different challenges included accounting for the way the time was recorded at aid stations in different years (i.e. time of day vs. elapsed time since the start of the race), removing missing data, writing some fun REGEX, and doing a bit of feature engineering (adding variables or calculating different measures).
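To make the time-of-day vs. elapsed-time issue concrete, here is a minimal sketch, written in Python for illustration rather than the R used for the project. It assumes a 5:00 a.m. start and that a runner's splits are listed in course order; the function names `to_minutes` and `clock_to_elapsed` are hypothetical, not taken from the project code.

```python
import re

START_CLOCK = "05:00:00"  # assumed start time of day; change it if a year differs


def to_minutes(hms):
    """Parse 'H:MM' or 'H:MM:SS' into minutes (also works for elapsed times)."""
    match = re.fullmatch(r"(\d{1,2}):(\d{2})(?::(\d{2}))?", hms.strip())
    if match is None:
        raise ValueError(f"unrecognised time: {hms!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes + seconds / 60


def clock_to_elapsed(clock_times, start=START_CLOCK):
    """Convert one runner's time-of-day splits, in course order, to elapsed minutes.

    Whenever a clock time appears to go backwards, the race has crossed
    midnight, so a further 24 hours is added from that split onwards.
    """
    start_min = to_minutes(start)
    day_shift = 0.0
    previous = start_min
    elapsed = []
    for clock in clock_times:
        minutes = to_minutes(clock)
        if minutes + day_shift < previous:  # clock wrapped past midnight
            day_shift += 24 * 60
        previous = minutes + day_shift
        elapsed.append(previous - start_min)
    return elapsed


print(clock_to_elapsed(["07:30:00", "23:15:00", "01:45:00", "06:10:00"]))
```

Years that already record elapsed time only need `to_minutes`, so both formats end up in the same unit (minutes since the start).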
Ultimately, I was able to get data back into a format where each line represents a single racer per year with all relevant information for the racer for that particular race year.
I had to make a few choices on things to cut out. One variable I excluded was time out of an aid station. These entries proved to be inconsistent throughout the years, and even amongst aid stations. Therefore, the final data frame only contains the time a racer came into an aid station.
A lot of the legwork involved with this project required writing code in R to wrangle the data and ultimately perform all the analysis. I have left the code out of this post, but you can view all of it here.
Below is an outline of the variables I was able to use for analysis. I am excluding the list of all the aid stations, but I have that list and will link to it in the future.
||chr||Unique ID of year and overall place (e.g. 1995-3)|
||num||Year of race (1986-2016)|
||factor||Year of race as a factor|
||int||Overall finish place|
||chr||Last name of runner|
||chr||First name of runner|
||chr||Bib number of runner (mostly numbers, some alpha numeric)|
||chr||Male or Female (M or F)|
||int||Age of runner for given year|
||num||Index number for 7 bins - (0,20,30,40,50,60,70,200)|
||num||Overall time, in minutes, for race (NA if DNF)|
||num||Finish status - "Finished" = 0, "DNF" = 1|
||int||Decile rank based on overall time vs all years|
||num||Overall pace for duration of race (assumes distance of 100 miles)|
||num||Cumulative time / distance of aid station from start|
||int||Decile rank based on overall time vs given year|
|** AID STATIONS **||num||Time elapsed (in minutes) to reach given aid station|
||num||Average pace between each aid station|
||num||Standard deviation of the paces between each aid station|
||num||Maximum number of aid stations given runner checked in to|
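As an illustration of the age-group index in the table above, here is a small Python sketch (the project itself was done in R). It assumes each bin includes its lower break and excludes its upper one; the original binning may have treated the edges differently (R's `cut`, for example, defaults to right-closed intervals). The helper name `age_group_index` is hypothetical.

```python
from bisect import bisect_right

# Break points taken from the variable table: 8 breaks define 7 bins.
AGE_BREAKS = [0, 20, 30, 40, 50, 60, 70, 200]


def age_group_index(age):
    """Return the 1-based bin index for an age, treating bins as [low, high)."""
    if not AGE_BREAKS[0] <= age < AGE_BREAKS[-1]:
        raise ValueError(f"age out of range: {age}")
    return bisect_right(AGE_BREAKS, age)


for age in (19, 20, 35, 70):
    print(age, age_group_index(age))
```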
From here, I was able to start playing with the data and graphing different variables to see what's really going on over the years and with the runners.
I went into some detail on specifics of the exploration of data in a separate post, but have included some of the graphs here as well.
To get a feel for what the aid station data looked like, I selected one aid station and pulled out all the times runners took from every year. The graph below is all times from the 2nd Aid Station. The years that are abnormally faster are years in which the course had changed. Typically, the 2nd aid station is Red Star at mile 16, but for snow years, it was Talbot Creek at mile 15.2.
If we build upon the above, but this time collapse all the years and show the range of times into each aid station based on its distance from the start, then we can represent how the cumulative pace changed at each aid station. The black dot represents the average for all years.
At the outset of this project, the goal was to capture finish times and evaluate them. That goal evolved over time as I came to understand what I had and what would be useful. I think there are good data with some interesting trends with regard to total finishing times. Below are a few of the explorations into the different times, categorized a few different ways.
Below is a general plot of all finishing times, colored by years.
The first thing to notice is how there is a change in the shape of the curve about halfway up. The change in shape is the 24-hour mark (1440 minutes). My thought here is runners notice they are within reach of breaking the 24-hour time, and adjust their speed accordingly.
By ranking each runner per year based on their overall finish time compared to all finish times recorded, I was able to create a set of 10 quantile groups (deciles). Coloring by a runner's ranked group is below.
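A minimal sketch of one way to assign such decile groups, written in Python for illustration rather than the R used for the project. It is an assumption about the bucketing rule (position in the sorted finish times, scaled to ten groups), not necessarily the exact method used here; `decile_ranks` and `decile_ranks_by_year` are hypothetical names, and DNFs are represented as `None`.

```python
from bisect import bisect_left
from collections import defaultdict


def decile_ranks(times):
    """Return a 1-10 decile for each finish time (1 = fastest tenth).

    Entries of None (no finish time, i.e. a DNF) stay None.
    """
    finished = sorted(t for t in times if t is not None)
    n = len(finished)
    ranks = []
    for t in times:
        if t is None:
            ranks.append(None)
            continue
        position = bisect_left(finished, t)  # number of strictly faster finishers
        ranks.append(position * 10 // n + 1)
    return ranks


def decile_ranks_by_year(rows):
    """rows: (year, time) pairs; ranks each time only against its own year."""
    by_year = defaultdict(list)
    for year, t in rows:
        by_year[year].append(t)
    ranked = {year: iter(decile_ranks(ts)) for year, ts in by_year.items()}
    return [next(ranked[year]) for year, _ in rows]


all_time = decile_ranks([1000 + 10 * i for i in range(20)])
print(all_time)
```

Ranking against all years uses `decile_ranks` on the full list of times, while `decile_ranks_by_year` gives the per-year version listed in the variable table.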
To understand what type of pace is required to get into each of the decile rankings, a separate plot of the mean finishing times of each group is useful. The graph is shown below and is an example of a take-away that a potential entrant can evaluate and perhaps apply to their training if they wish to make a certain group (without taking anything else into consideration, that is). I think the graph helps to show what sort of speeds are achieved at this higher-level breakout. The pace for the first group could be broken out even further since the elites seem to go so fast!
This graph below is a bit convoluted... so stay with me. I created two different rankings. One ranking as shown above - where a runner's time is compared to all times from all years. The other ranking compares a runner's time with only the year they raced. By comparing these two, I thought it might shed some light on whether a runner was faster for their respective year but slower overall, or maybe fast for both their year and all years. This ultimately can tell us if a year was particularly faster or slower when compared to all time.
I wanted to incorporate gender as a predictor variable in the end, but wanted to see what the breakdown of its distribution was.
The split between men and women is quite one-sided (in aggregate, 18% of entrants are female). Because of this heavy skew towards males, I opted to exclude gender from the final model, even though there are enough data points to come up with a similar model for females.
The finish times of both genders show the same sort of shape overall.
Another variable to consider with each runner is age. I broke out each age into age groups usually used in races (shown below - box plot seemed the most apt here). What is interesting is the number of runners in each category - second image down. The cause for a small number in the younger (and faster) age groups, I suspect, has to do with the average age of ultra runners, and lottery process to get into the race.
All the plots above are essentially an exploration of those that finished the race. While there is still valuable information in the plots and associated data above, what really interested me was what distinguished a runner who DNF'd (Did Not Finish) from one who finished.
First, a histogram grouped by those that finished and DNF'd.
Overall - the DNF rate is 32%. To continue exploring this further, it was necessary to look at how fast runners went into the aid stations and how their rank may have changed over time.
I was able to break out each runner's times into aid stations and extract some valuable information from that. I could evaluate a runner's decile rank at each aid station (`qRank.as`), overall pace at each aid station (cumulative time elapsed / distance to the aid station = `pace.as`), and the differential pace to reach one aid station from the previous (`diff.pace`). From there, I was able to produce summary statistics showing the average pace into each aid station (`pace.avg`), the maximum number of aid stations a runner entered (`as.max`), and the standard deviation of a runner's pace through all the aid stations (`pace.sd`).
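To show how these per-runner features relate to one another, here is a minimal Python sketch (the project's own code was in R). It assumes pace is measured in minutes per mile, that the first segment starts at mile 0 and minute 0, and that `pace.sd` is a sample standard deviation; the function name `pace_features` and the example distances are made up for illustration.

```python
from statistics import mean, stdev


def pace_features(distances, elapsed):
    """Summarise one runner's splits.

    distances: miles from the start for each aid station reached, in order.
    elapsed:   minutes since the start on arrival at each of those stations.
    """
    if not distances or len(distances) != len(elapsed):
        raise ValueError("need one elapsed time per aid station reached")
    # Cumulative pace into each aid station (minutes per mile).
    pace_as = [t / d for d, t in zip(distances, elapsed)]
    # Pace over each segment between consecutive stations.
    points = [(0.0, 0.0)] + list(zip(distances, elapsed))
    diff_pace = [
        (t1 - t0) / (d1 - d0)
        for (d0, t0), (d1, t1) in zip(points, points[1:])
    ]
    return {
        "pace.as": pace_as,
        "diff.pace": diff_pace,
        "pace.avg": mean(diff_pace),
        # A standard deviation needs at least two segments.
        "pace.sd": stdev(diff_pace) if len(diff_pace) > 1 else None,
        "as.max": len(distances),
    }


# Hypothetical runner: three stations at miles 10, 20 and 30.
features = pace_features([10.0, 20.0, 30.0], [100.0, 240.0, 420.0])
print(features)
```

In this example the runner slows from 10 to 14 to 18 min/mile over the three segments, giving a `pace.avg` of 14 and a `pace.sd` of 4.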
By plotting each runner's (x-axis) pace into each aid station, the plot below gives a good idea of the speed at which each place (finisher or not) ran, and how that changed throughout the race. Keep in mind, for every tick on the x-axis, there are multiple paces for each aid station that person hit, along with multiple runners at each place for the different years.
By then breaking out each runner into their respective decile ranking, it becomes easier to see how the pace over time changes with each group. I find this one of the most fascinating graphs, especially for how the slope of the average time changes for each group.
Adding a color group to the data for a runner's rank into each aid station helps (or convolutes) to show how top runners stayed in the top rankings throughout the race (same color along a vertical section), while those in the DNF section seem to have more color change throughout their grouping. Keep in mind, again, that the x-axis represents the finishing place, while the color represents a runner's decile rank into multiple aid stations for a single year's race.
When I extracted out the average pace between aid stations, I was then able to compare how the pace of those that finished differed from those that did not. Note - this pace between each aid station is the average of the differential pace between aid stations, while the pace above is the cumulative pace up to certain aid stations.
The reason there are so many different pace ranges for those that DNF'd is the number of aid stations that they may have gone through. Sometimes a runner will only go through the 1st or 2nd aid station before calling it quits, and may have gone out too fast. Those that finished, on the other hand, seem to have a fairly defined curve from making it through each aid station.
The standard deviation became of more interest to me, as that feature may help reveal how consistent a runner was and whether they were about to DNF or able to make it through. This graph explores a bit more of the graph above, with trend lines for each finishing group.
For those that finished, the standard deviation of a runner's pace follows a similar curve, while for those that DNF'd it shows no pattern.
The last few graphs on aid station timing and pacing became the focus for my final project. I wanted to be able to find the pattern and factors that are most significant, and to know what was the major cause for a person finishing vs not finishing the race.
By establishing a baseline for what the data can do, there are a few different things that can be predicted.
The main goal with my project is to predict whether or not a runner will finish. We'll make the model based on the pre-race information (age, sex, bib) along with information from along the course to help determine if the runner will make it or not. While the other two aspects (items 2 and 3 listed above) for prediction in the race would be extremely helpful, they will need to be addressed at a later time.
Taking in the different variables and understanding what the exploration above showed, it will be important to select the most impactful features and ones that minimize multicollinearity.
Below is a graph of variable importance using the function `Boruta` from the package `Boruta` (more information can be found here). I won't get into the details, but it goes through a Random Forest model to parse out the importance of the different factors and the effect they have on the output variable.
Now, a correlation plot to show how those features are related, so as to better select the proper variables.
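For readers who want to see what sits behind a correlation plot, here is a minimal Python sketch of pairwise Pearson correlation (the plot itself came from R). The helper names `pearson` and `highly_correlated`, the 0.8 cut-off, and the example numbers are all assumptions for illustration, not values from the project.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


def highly_correlated(columns, threshold=0.8):
    """Return (name_a, name_b, r) for each pair of columns with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up columns purely to exercise the functions.
example = {
    "Age": [25, 32, 41, 47, 58],
    "ageGroup": [2, 3, 4, 4, 5],
    "pace.sd": [3.1, 5.0, 2.2, 4.8, 3.0],
}
print(highly_correlated(example))
```

Pairs flagged this way are candidates for keeping only one of the two variables, which is the kind of decision discussed next.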
In the variable importance plot above, `as.max` is one of the two most impactful features. It can be easy to just pick the top features to use in the model, but we'll need to understand what they mean and if we should actually use them.
`as.max` represents the total number of aid stations a runner hits, so it makes sense that if a runner made it to more aid stations, they are more likely to finish. This variable, while highly correlated with DNF, basically shows that if a runner makes it to, let's say, aid station #8, they have a better chance of finishing the race.
While this is useful and interesting, the number of aid stations varied between the years for various reasons; the graph below shows the different aid stations over the years. Because the number of aid stations changes, the variable `as.max` would not accurately model whether or not a runner will finish. This also sheds light on why `as.max` and `YearInt` are highly correlated and why the year variables (`YearInt` and `Year`) will be excluded as well.
`Place` is the runner's final finishing placement, and therefore is similar to a runner's DNF status, since a DNFer's `Place` is given after all finishers' `Place` is assigned.
`ageGroup` is just a larger sub-category of `Age` and only one of those variables will be included. Races typically categorize based on age group, but since `Age` seems to have more importance, that will be the variable we use there.
`bibNo` is a runner's bib number. While there really isn't much of a correlation here, there are certain bib numbers that are reserved for elites or those that get into the race a certain way. Therefore, `bibNo` offers a chance of adding some predictive power for those with the fast bibs.
The last variables to be used here have to do with pace. Both `pace.avg` and `pace.sd` are taken from a runner's time between each of the aid stations they reached. Calculating `pace.avg` just takes the distance between each of the aid stations and gets a pace for each section, then takes the mean for each runner. `pace.sd` is the standard deviation of those same section paces.
From here, the goal will be to come up with a model that uses the data above, and best predicts whether or not a runner will finish the Western States Endurance Run. For my final project I did go through all the different models, and if you're interested, feel free to reach out; otherwise, below is a summary of my model.
To start, a baseline prediction for anyone starting the race. This way, any predictive models should be able to improve upon the accuracy of this baseline.
Tabulating all the results by those that finished (`0`) and those that did not (`1`) gives the split between the two outcomes.
That means the base model will predict a runner will finish with an accuracy of 67.8%.
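The baseline amounts to always predicting the most common outcome. A minimal Python sketch of that idea follows (the project used R); the function name `baseline` is hypothetical, and the 1,000 synthetic flags are constructed only to mirror the 67.8% figure quoted above, not taken from the real data.

```python
from collections import Counter


def baseline(dnf_flags):
    """Return (majority_class, accuracy) for always predicting the majority class.

    dnf_flags uses the coding from the variable table: 0 = finished, 1 = DNF.
    """
    if not dnf_flags:
        raise ValueError("no results to tabulate")
    counts = Counter(dnf_flags)
    majority, hits = counts.most_common(1)[0]
    return majority, hits / len(dnf_flags)


# Synthetic example only: 678 finishers and 322 DNFs.
flags = [0] * 678 + [1] * 322
print(baseline(flags))
```

Any model worth keeping has to beat the accuracy returned here on the same data.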
I went through a series of different models, not only to practice and understand what they meant, but also understand which types of models would give the best results.
The method of testing takes a random set of the runners, and based on the variables I'm using for the model, predicts whether or not a runner will finish. The predicted outcome is then compared to the actual result to establish the accuracy of the method. I used a split of training and testing data, to first train the model (training), and test the model's accuracy (testing).
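The split-then-score procedure can be sketched as below, in Python for illustration rather than the project's R. The 70/30 split, the fixed seed, and the names `train_test_split` and `accuracy` are assumptions; the post does not state the proportions actually used.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Share of predictions that match the actual outcome."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


train, test = train_test_split(range(10))
print(len(train), len(test))
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```

Each model is fitted on the training rows only, and the accuracy reported for comparison is the one computed on the held-out test rows.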
Methods that I used:
The linear model is a first shot at determining a decent regression model for the data here. There are limitations to a linear model, and it is not the best at a binary classification. The accuracy on the test data is what will be used to compare all the different models.
Logistic regression may offer a better fit than linear regression as it has the capability of classifying an output into 1 of 2 groups.
With the SVM model, we're able to see much different results. This is potentially because the final classification we're seeking (Finish or DNF) has some heavy outliers that a linear model won't necessarily capture, and SVM has the ability to classify a bit better.
Regression trees went through a series of iterations with each variable and determined certain cut-off points on where certain outcomes could be decided. Following the output from this model was what I found the most helpful. The accuracy of it is shown and discussed below.
It may not be completely applicable to use a neural network for this model with the limited variables and binary classification, but nonetheless, it is worthwhile to go through this model and see what results we can find.
After running all the different models, and fine-tuning them a little, I was able to get a nice tabulated series of results, along with a nice output graphic comparing the different models.
|Model Name||Accuracy Results|
|Support Vector Machine||76.3%|
Looking at the results of all the models and the comparison graphs, the final results show us that, based on pure numbers, the random forest model gets us the best results for predicting a finish at Western States, with an accuracy of 77%. While the accuracies are not all that different, the gap is still noticeable, especially from the GLM model or the baseline of 68%.
Based on the ease of reading the decision tree (CART model), and only a 2% difference in accuracy between the Forest and CART, I would recommend using the decision tree for athletes curious about what their training might lead to and how to plan a race.
For instance, if a runner were to have a training program put together where by the end of it, they knew they could maintain a 15 min/mile pace, while maybe fluctuating by 4 min/mile depending on the vert or descent of a particular section, they could predict they would have an 80% chance of a solid finish.
However, if the training has been slacking and maintaining even a 19 min/mile pace is tough, it is better to stay as consistent with the slow pace as possible (57%).
Going forward, I think there is still a lot of work that can be done with this data and lead to further insights.
A few other things can be done with the data scraping aspect of the project. While I was able to extract what I deemed the most useful data, more feature engineering can take place with some more of the fields (e.g., `% of total aid stations`, `time spent at aid station`), along with capturing a few more of the fields I had to exclude due to the length of time needed to mine them.
Another way to split and perform regression on the data would be to use clustering and understand the difference between the elites and the normal runners, or whether they differ at all aside from their sheer speed.
Further data to gather, not present in this dataset, would be the addition of elevation data along the course and training data. Both of these would be extremely helpful, and would take this model to a new step.
Training data is probably the biggest determining factor for whether or not a runner will finish. Training data would give the ability to create a model and determine what sort of training is most apt for this race, and could even allow for a better understanding of ultramarathons as a whole. By creating a model that is inclusive of elevation gain and miles per week, coaches and runners could use training tools such as Strava and Training Peaks to really hone their approach for a specific race.
Predict finishing time and place and anticipated arrival at aid stations.
What this model shows is the ability to focus in on specific variables and help a runner determine whether or not they have the tools to finish. But this information is sometimes only applicable during race day. If I were to be able to tie this in with training data, or understand how a runner has performed at previous races, I may be able to get a better picture of how this race would go.
The other aspect of the model developed here is that it tries to measure the physical requirement for finishing Western States. However, ultras are said to be 60% mental, as you've got to be able to fight through the pain, stomach issues, and whatever else comes up on race day that can't be planned for. So what is discussed above won't get you to the finish, but can at least help you understand some of the physical challenges of competing in Western States.
I hope to be able to continue the development of this project along with the accuracy and usefulness of this model.
My sincerest thanks go out to Matt Fornito for helping me to refine this project and turn it into something meaningful and fun, and hopefully insightful to some. I know I have enjoyed it, and couldn't have gotten this far without his mentorship.
Secondly, I want to thank YOU reading this now. If you made it this far (or just skipped to the bottom), you are reading this, and that is 1 more person than I thought would read this. Thank you.
Please don't hesitate to add commentary or thoughts on future improvements.