A different kind of race calls for a different way of evaluating results. A trail-running friend of mine created a race that will have its first running in May 2021. It was originally planned for the evening of January's full moon and called the "Wolf Moon 25k". The spirit stays the same, but now that it is in May and a little safer with COVID protocols, the race will be a hilly night team race.
For most team trail races (are there many established team trail races other than cross country?), a team's score is the sum of the finish places of everyone on the team, and the team with the lowest score wins. For this race, the winning team will be determined in the same fashion, but each team consists of only 3 runners.
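As a minimal sketch of the place-sum scoring described above (the team names and finish order here are made up for illustration, and `team_scores` is a hypothetical helper):

```python
from collections import defaultdict

def team_scores(finishers):
    """finishers: list of (place, team) tuples. Returns {team: sum of places}."""
    scores = defaultdict(int)
    for place, team in finishers:
        scores[team] += place
    return dict(scores)

# Hypothetical finish order for two 3-runner teams.
results = [(1, "A"), (2, "B"), (3, "B"), (4, "A"), (5, "B"), (6, "A")]
scores = team_scores(results)

# Lowest summed place wins.
winner = min(scores, key=scores.get)
print(scores, winner)
```

Note that Team A has the race winner here but still loses on summed places, which is the usual behavior of this kind of scoring.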
It would be simple enough to rank the teams by their overall finish places, but unlike cross country, this race is open to all, and there is no restriction on the makeup of a team. My goal was to come up with a scaling factor to apply to each racer, based on age and gender, to normalize team results.
To develop this scale properly, I needed some data to build the model on. Based on some information about the course, I knew it would have around 200 ft/mile of climbing (~9 miles and 1800 ft of vertical gain). I have looked at some races around PA before and opted to pull results from the following 3 races:
- Rothrock Trail Challenge (RTC) - 25 km & 4000' vert >> 232 VAM (vertical ascent per mile)
- Greenwood Furnace Trail Challenge (GFTC) - 13.1 miles & 3000' vert >> 229 VAM
- Green Monster Trail Challenge (Green Monster) - 15 km and 25 km & 4020' vert >> 242 VAM (for the 25 km)
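The ft/mile figures are just vertical gain divided by distance in miles. A quick sketch of that arithmetic, using the mileages quoted elsewhere in this article for each course (`vert_per_mile` is a name I made up for the example):

```python
def vert_per_mile(vert_ft, miles):
    """Feet of climbing per mile of course."""
    return vert_ft / miles

# (vertical gain in feet, distance in miles) as quoted in the article.
courses = {
    "Wolf Moon (approx.)": (1800, 9.0),
    "RTC 25 km": (4000, 17.2),
    "GFTC": (3000, 13.1),
    "Green Monster 25 km": (4020, 16.6),
}

for name, (vert_ft, miles) in courses.items():
    print(f"{name}: {vert_per_mile(vert_ft, miles):.1f} ft/mile")
```

All three comparison races come out a bit steeper than the roughly 200 ft/mile expected for the new course, but in the same general range.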
I pulled the results from either UltraSignup or Falcon Race Timing, and a sample of what the data look like is below. In total, the data cover just over 3000 runners across all the races.
|RACE||Year||PLACE||NAME||CITY||AGE||GENDER||Age Rank||Age Cat||Pace||pace_dec||race_cat|
|GFTC||2012||2||DARREN BALKEY||STATE COLLEGE PA||20||M||2||0-99||8:36/M||8.60||13.1mi|
|GFTC||2012||3||NICK WOJTASIK||ERIE PA||20||M||3||0-99||8:40/M||8.67||13.1mi|
|GFTC||2012||4||JAMES RAYBURN||YORK PA||50||M||4||0-99||8:45/M||8.75||13.1mi|
|GFTC||2012||5||MEIRA MINARD||STATE COLLEGE PA||38||F||1||0-99||8:52/M||8.87||13.1mi|
|GFTC||2012||6||ANDREW MEDLYN||TOWSON MD||22||M||5||0-99||8:55/M||8.92||13.1mi|
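The pace_dec column is the Pace column converted from minutes:seconds to decimal minutes per mile. A small sketch of that conversion, assuming every pace string follows the "M:SS/M" pattern shown in the sample (the function name is hypothetical):

```python
def pace_to_decimal(pace):
    """Convert a pace like '8:36/M' to decimal minutes per mile (e.g. 8.6)."""
    clock = pace.split("/")[0]           # drop the '/M' suffix
    minutes, seconds = clock.split(":")
    return round(int(minutes) + int(seconds) / 60, 2)

for pace in ["8:36/M", "8:40/M", "8:45/M", "8:52/M", "8:55/M"]:
    print(pace, pace_to_decimal(pace))
```

Working in decimal minutes makes it possible to average and scale paces directly.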
Setting Up the Data for Analysis
In an effort to not do too much work, I opted for age groupings by decades. It's worth pointing out that UltraSignup does not have age groupings, while Falcon Race Timing and RunSignup both use decades for age groupings. Not that the latter two sites are the authority on timing results by age group, but it makes sense to follow a convention set forth by established companies, as most trail races in PA use 1 of those 3 for results tabulation.
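One way to assign the decade groupings is sketched below. The label format and the handling of boundary ages (a 30-year-old lands in "30-40") are my assumptions for the example, as is lumping everyone 60 and older into a single "60+" group:

```python
def age_group(age, oldest=60):
    """Map an age to a decade label such as '20-30', or '60+' for the oldest group."""
    if age >= oldest:
        return f"{oldest}+"
    low = (age // 10) * 10   # floor to the start of the decade
    return f"{low}-{low + 10}"

for age in [20, 38, 50, 65]:
    print(age, age_group(age))
```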
We are also going to distinguish by the gender listed in the results, M or F. By using both gender and age group, we can explore the data and come up with a scale to normalize the team results for this race.
Understanding the Races
First, a few more details about the races that were chosen. The Rothrock Trail Challenge used to be held every June, was discontinued after 2018, and is returning in 2021. I ran this race in 2018 and thought it would be a good technical race to include. It has the most participants (~300) taking on the 17.2 miles (25 km category) and roughly 4000' of vert.
Greenwood Furnace Trail Challenge is held in central PA, close to where Rothrock is held. I'm not all too familiar with the course, but based on the profile and from what I can gather about the area, it has some big climbs and stunning beauty mixed with rocks and rhododendrons. The course has close to 3000' vertical gain for the 13 miles (recall - 229 VAM).
Lastly, the Green Monster is a race near Wellsboro, PA and features some healthy climbing and multiple distance options: a 15 km and a 25 km. The 15 km is ideal since it's about the same distance as the upcoming race (~9 miles), and maybe has more climbing? I didn't see the full profile or vert for the 15 km, but the 25 km is below, coming in at 16.6 miles and ~4000' gain (so 242 VAM).
Naturally, what's an article of mine without some fun graphs! First, a look at the different paces (x-axis) and each racer's finishing place (y-axis). Typically, there is a bit of separation between the faster runners finishing in the top slots, then a lot of folks with a similar (and reasonable) pace, and then some more separation in pace at the back of the pack. The expected shape, when comparing pace vs place, is something like an S-curve.
The taller the graph, the more participants there were in the race, meaning Rothrock Trail Challenge (RTC) had the most participants. A few other things to note: the steeper the middle section of the S, the closer together the finishers were; the more the curve spreads out along the x-axis, the more spaced out the finishers were.
It is a bit challenging to tell overall finish times, but I think Greenwood Furnace has faster times than the other 2 races, potentially due to the fact that the race is 2 very large climbs instead of the short, punchy climbs that the other races have.
|RACE NAME||Avg. Runners||Avg. Pace [min/mile]|
|Green Monster (25 km)||132||15.5|
Pace Distribution By Age and Gender
This gives us an understanding of how many runners take on each race, the relative speed of each one, and how finishers, as a whole, cross the finish line. However, since we're looking at this through the lens of a team race with no limits on team composition, we'll look at a few stats on age groupings and gender.
Comparing the average pace by 10-year groupings, instead of by race, we can better understand why there's a need for normalization if teams are made up of racers spanning multiple decades. While not always the case, on average, if Team A had 3 runners aged 18, 19, and 22 and Team B was made up of runners aged 33, 45, and 58, Team A would have a greater chance of winning.
Let's further categorize the data by gender and see how the data is distributed by age group and gender.
The top plot in the above figure illustrates the difference in total sample size per age group and gender. Across the board, more men enter into these races than women, with the 20-30 decade being very close to equal. This is hopefully something that changes in the future, and I think these races have a better split than some other trail races; however, by looking at the bottom plot of the figure above, there is a distinction between paces of both genders within an age group.
The lower plot is called a raincloud plot and is quite helpful for showing a lot of information in one plot. One thing to notice is the density distribution plot (the bell-shaped area plot). The distribution is largely normal (bell curve), but for males there seems to be a long tail, whereas for females the distribution is more normal (aside from the 60+ age group).
The other thing to see on the raincloud plot is the box plot, where the middle of each box is the median pace. The separation between the boxes within each group is indicative of a difference between male and female runners for each decade.
Pace Normalization Within Age Group By Gender
To begin the normalization, we'll center the pace distribution of each gender within an age group around that age group's overall mean.
|age_group||GENDER||Total Count In Age Group||Gender Count In Age Group||pace_gender||pace_age_group||scale_factor|
There was no fancy scaling here. Instead, I just took the average pace for each gender and scaled it toward the average pace for the respective age group. The formula is scale_factor = pace_age_group / pace_gender. I then multiplied each individual's pace by the scale factor for their gender and age group.
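A minimal sketch of this first step, using the formula above on made-up paces (not the article's data). The record layout and the function name are assumptions for the example; the original analysis may well have been done with a dataframe library instead:

```python
from collections import defaultdict
from statistics import mean

def gender_scale_factors(runners):
    """Return {(age_group, gender): pace_age_group / pace_gender}."""
    by_group = defaultdict(list)         # all paces in an age group
    by_group_gender = defaultdict(list)  # paces per gender within an age group
    for r in runners:
        by_group[r["age_group"]].append(r["pace"])
        by_group_gender[(r["age_group"], r["gender"])].append(r["pace"])
    return {
        key: mean(by_group[key[0]]) / mean(paces)
        for key, paces in by_group_gender.items()
    }

# Made-up decimal paces (min/mile) for illustration only.
runners = [
    {"age_group": "20-30", "gender": "M", "pace": 12.0},
    {"age_group": "20-30", "gender": "M", "pace": 14.0},
    {"age_group": "20-30", "gender": "F", "pace": 15.0},
    {"age_group": "20-30", "gender": "F", "pace": 17.0},
]
factors = gender_scale_factors(runners)

# Multiply each runner's pace by the factor for their age group and gender.
normalized = [r["pace"] * factors[(r["age_group"], r["gender"])] for r in runners]
print(factors, normalized)
```

After scaling, the average pace of each gender within the age group equals the age group's average, which is what pulls the boxplots together.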
The result of this scaling is a more centered pace within each age group. Using the raincloud plot, we can see how it changed the distributions and brought the boxplots together.
Notice how the major hump of each cloud and the boxes are all lined up a lot closer together. Success (for the initial pass)!
Normalization of Age Groups
Part 2 of the normalization is to normalize across age groups. This is accomplished in a similar way as above, but this time by scaling towards the average pace of the 20-30 age group. Why 20-30? This age group was the second fastest (about 11 seconds slower than the 10-20 group); however, it had over 300 more data points than its counterpart. A little arbitrary, I admit, but for now that's going to be our standard. The pace to normalize around: 14.39 min/mile.
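A sketch of what this second step could look like, assuming the age-group factor is built the same way as the gender factor (reference pace divided by the age group's average pace) and that the two factors are simply multiplied together. Both of those are my assumptions, and the example paces are made up; only the 14.39 min/mile reference comes from the data:

```python
REFERENCE_PACE = 14.39  # min/mile, average pace of the 20-30 age group

def age_group_factor(pace_age_group, reference_pace=REFERENCE_PACE):
    """Scale an age group's average pace toward the reference pace."""
    return reference_pace / pace_age_group

def combined_factor(pace_gender, pace_age_group, reference_pace=REFERENCE_PACE):
    """Gender-within-age-group factor times the across-age-group factor."""
    gender_factor = pace_age_group / pace_gender
    return gender_factor * age_group_factor(pace_age_group, reference_pace)

# Hypothetical averages for one age group and one gender within it.
print(age_group_factor(16.0))
print(combined_factor(pace_gender=17.0, pace_age_group=16.0))
```

Under these assumptions the age-group average cancels out, so the combined factor reduces to the reference pace divided by the average pace for that gender and age group.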
By applying the scale factor, we were able to get the distributions (clouds) and boxplots to line up, roughly, with each other, signifying a successful normalization! With this data created and the scale factor generated, we can then apply it to the finish times of this upcoming race! There may be some refinement required, but the goal of this was to get something that's fairly robust and applicable to a hilly trail race.
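One possible way to apply the factors on race day is sketched below: multiply each finish time by the runner's factor, then re-rank on the adjusted times before summing places per team. The re-ranking step is my assumption about how the adjusted times feed into team scoring, and the names, times, and factors are all hypothetical:

```python
def adjusted_places(finishers, factors):
    """finishers: list of (name, team, age_group, gender, minutes).

    Returns {name: place} after scaling each time by the runner's factor.
    """
    ranked = sorted(finishers, key=lambda f: f[4] * factors[(f[2], f[3])])
    return {f[0]: place for place, f in enumerate(ranked, start=1)}

# Hypothetical factors and finish times (minutes).
factors = {("20-30", "M"): 1.0, ("50-60", "F"): 0.85}
finishers = [
    ("Ann", "A", "50-60", "F", 110.0),
    ("Bob", "B", "20-30", "M", 100.0),
]
places = adjusted_places(finishers, factors)
print(places)
```

Here Ann finishes 10 minutes behind Bob on the clock, but her adjusted time (93.5 minutes) puts her ahead once the scale factor is applied.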
Here is the data table with the scale factor.
|age_group||GENDER||Total Count in Age Group||Gender Count in Age Group||pace_gender||pace_age_group||scale_factor||overall_pace||overall_scale_factor|