The gory details.
Here is a fuller accounting of the details and assumptions behind the "Do Great Individuals Make Great Teams?" analysis. I assume some familiarity with linear regression and the R language.
Transformation
The fundamental data I'm using are the ranks, i.e. the overall worldwide finishes from the 2012-2014 Open and Games. In particular, the team analyses use those data for each of the 43 Games-qualifying teams in each of those three years, and the individual analyses use them for the 43 or more individual Games qualifiers in each year.
We're interested in predicting Games rank on the basis of the team members' individual ranks. I transformed the ranks into standard deviations from the mean, or z-scores, before doing the prediction. My reasons for doing that were both pragmatic and theoretical. The pragmatic reason was that the R^2 was poor, like less than 1 percent, if you don't do any transformation, and my standard deviation transform produces a much better fit. The theoretical reason was that it seems reasonable to work in terms of standard deviations for performance data like these. For example, a log transform also fixes the R^2, but thinking in terms of standard deviations of finish makes more sense to me than in terms of log-ranks.
This transformation entails one design decision that becomes relevant for our Invitational simulations below. In order to convert from a rank to a percentile, on your way to a z-score, you have to pick a number that represents the field size, call it n_f. To make this more concrete, with the males' Open ranks, the finishes run from very low, like top 100 in the world, to pretty modest, like 20,000th in the world. Should n_f be 129 (43 teams times 3 men each, the number of males in the team competition), or should it be more like 20,000? If it's 20,000, then the z-scores of most of the field are going to be clumped together to the left of zero. Again I let pragmatism be the guide, and found that the fit is better with an n_f of 129 than of 20,000. So, to do the transform, I re-ranked the 129 male team members with respect to just the pool of male team members, so the best in the Open is ranked first and the worst is ranked 129th. In R that's like:
mens.rank.adjusted <- order(order(mens.worldwide.open.ranks))
Then I converted that rank to a percentile, like so:
mens.percentile.adjusted <- mens.rank.adjusted / (1 + length(mens.rank.adjusted))
And finally, I moved into units of standard deviations using the inverse Normal CDF, R's qnorm:
mens.zscore.adjusted <- qnorm(mens.percentile.adjusted)
So now a top-100-in-the-world finish like Tommy Hackenbruck's will look like roughly -2.4 standard deviations from the mean of the men's Games team field. I performed the same transformation for the women's side.
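As a sanity check on that number, here is the transform end-to-end for whoever holds the best Open finish in the 129-man team pool, i.e. adjusted rank 1:

# the best Open finisher among the 129 male team members has adjusted rank 1
best.rank <- 1
best.percentile <- best.rank / (1 + 129)   # 1/130, about 0.0077
qnorm(best.percentile)                     # about -2.42, matching the -2.4 above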
Regression Model
I ended up using a model like:
games.z ~ average.mens.individual.z + average.womens.individual.z
and fitting it with robust regression as opposed to regular linear regression.
The reason I used that model, as opposed to a model where you have each of the six team members' z-scores on the right-hand side, is that the larger model's fit wasn't better, via adjusted R^2, and the smaller model was more naturally suited to the Invitational, which has a team size of four rather than six.
I fit the model using lmrob from R's robustbase package. My reason for doing that was that the diagnostics from the regular regression showed some high-leverage data points. I confirmed via re-running the regression with a few of these points removed, and via bootstrapping the input data, that the regression estimates and R^2 were sensitive to outliers.
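In sketch form, that bootstrap check is something like the following (object names here are placeholders, not my actual variable names):

# resample team-years with replacement, refit the ordinary regression,
# and look at how much the coefficients and R^2 move around
set.seed(1)
boot.fits <- replicate(1000, {
  resampled <- teams[sample(nrow(teams), replace = TRUE), ]
  fit <- lm(games.z ~ average.mens.individual.z + average.womens.individual.z, data = resampled)
  c(coef(fit), r.squared = summary(fit)$r.squared)
})
apply(boot.fits, 1, quantile, probs = c(0.025, 0.975))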
Rather than removing high-leverage points, I turned to robust regression, via lmrob. The values that I report in the tables as "Percent contribution of individuals to team performance," i.e. the variance explained by individuals, are the adjusted R^2 values produced by lmrob.
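In code, with the same placeholder names, the robust fit and the quantity I report amount to something like:

library(robustbase)
# robust regression of team Games z-score on average individual z-scores;
# the adjusted R^2 in the summary output is what the tables report
team.fit <- lmrob(games.z ~ average.mens.individual.z + average.womens.individual.z, data = teams)
summary(team.fit)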
For the individual Games competitors, I used the same transform and robust regression, and fit the model:
games.zscore ~ open.adjusted.zscore
The estimates of variance explained are the adjusted R^2 again. The pooled estimates were obtained the same way as for the teams. One would probably get a better fit using Cross-Regional Comparison (CRC) ranks instead of Open ranks, since most elite Games athletes don't need to go all-out to get through the Open, and many explicitly acknowledge that they train through that stage. But it seemed more apples-to-apples to use the Open ranks; after all, it may be true that competitors on Games-level teams don't need to go all-out to get their teams to regionals either.
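The individual fit is the same recipe, again sketched with placeholder names:

# robust regression of individual Games z-score on adjusted Open z-score;
# the adjusted R^2 of this fit is the variance-explained estimate
indiv.fit <- lmrob(games.zscore ~ open.adjusted.zscore, data = individuals)
summary(indiv.fit)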
Simulation of Invitational
To simulate the Invitational, I built a pooled model for the teams. To do this, I simply concatenated the three years' datasets, after transforming to z-scores, and re-ran lmrob. One could argue that a mixed model with a random effect for year might also have been reasonable, but I didn't explore it here. The pooled model looks like:
games.z = 0.01438 + 0.40439 * avg.mens.z + 0.34859 * avg.womens.z + error
where error is Normal(0, σ = 0.8959469), and that σ value is the lmrob estimate of the residual standard error. Naturally, the prediction/forecast standard error is probably a bit higher than that, because it would also fold in the uncertainty in the estimated coefficients and in σ itself. A higher SE would tend to bring the predictions a little closer together. I just used the residual SE as estimated in this case.
For the men's z-scores and women's z-scores, I used the 2014 Open, Regional (CRC) and Games ranks, transformed as above, with an n_f of 43, the Games field size. That is, when computing the Open, CRC, and Games ranks, I used the athletes' ranks relative to the 2014 Games field at each competition stage. I transformed their ranks at each stage into z-scores, and then I averaged those z-scores. One could argue for a field size of eight here, which would push the predictions apart. I reasoned that the representative pool for this competition is Games athletes, so I used 43.
Sam Briggs didn't compete in the Games, and Kara Webb pulled out due to injury. Rather than make up numbers for them, tempting as that is, their z-scores are based only on their 2014 Open and CRC performances.
One could integrate these distributions analytically, but for convenience of computation I ran a million simulated Invitationals. In each simulated Invitational, I generated one performance for each of the four teams from the model above, on the basis of their averaged z-scores, and noted which team won. I aggregated the wins into a table, and that's what's shown. One could also look at the distribution of second place, third place and so on, if one were interested.
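In sketch form, with a placeholder data frame invitational.teams holding one row per team and columns avg.mens.z and avg.womens.z for the averaged z-scores described above, the simulation is something like:

# pooled-model coefficients and residual SE from the lmrob fit above
intercept <- 0.01438
b.men     <- 0.40439
b.women   <- 0.34859
resid.se  <- 0.8959469

n.sims  <- 1e6
winners <- replicate(n.sims, {
  # one simulated Invitational: draw a Games-style z-score for each team
  sim.z <- intercept +
           b.men   * invitational.teams$avg.mens.z +
           b.women * invitational.teams$avg.womens.z +
           rnorm(nrow(invitational.teams), mean = 0, sd = resid.se)
  which.min(sim.z)   # lower z-score means a better finish
})
table(winners) / n.sims   # estimated win probability for each team

At a million draws the Monte Carlo noise in those proportions is tiny, on the order of a few hundredths of a percentage point, so the simulation error is negligible next to the model uncertainty discussed above.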
About the Author
Mike Macpherson (@datawod) has been doing CrossFit since 2010, and analyzing Games data for about as long. He teaches genetics and statistics at Chapman University.