Sunday, June 26, 2011

Empirical Measure of Reliability: Part Four

This fourth installment on Twitter predictions for the NBA Finals will wrap up, for now, my pilot study on measuring reliability as a function of qualities that are detectable in the writings of the individuals who make the predictions. I've made some observations in the three preceding posts, and here I will make a few more and sum up the overall results.

First, the lay of the land: I collected predictions that individuals made on Twitter regarding the 2011 NBA Finals. The predictions were of which team would win and how many games the best-of-seven series would last. This gave eight possible outcomes, ranging from one team winning in a four-game sweep to the other winning in a four-game sweep. Besides the individuals making predictions, there were other ways to quantify the soundness of each possible outcome, and the odds posted by Las Vegas sports books offer one "rationalist" baseline. After all, if there were a systematic error in the sports books, someone who knew better could get rich exploiting it.

The sports books predicted a fairly close series, with Miami only slightly favored and Miami-in-seven the most likely single outcome. Accordingly, the least-favored outcomes were the blowouts -- Miami-in-four or Dallas-in-four. Even before the Finals began, those seemed to be the brashest/wrongest predictions. And that proved true very soon: each team won one of the first two games, making both of those predictions already wrong. Note that in some other year, where one team was dramatically superior to the other, predicting a four-game series might have been the smart bet -- but that was not true this year.
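To make the "rationalist baseline" concrete, here is a sketch of how posted odds on the eight series outcomes convert into implied probabilities. The actual 2011 lines aren't reproduced in this post, so the odds below are invented placeholders chosen only to mirror the shape described above (Miami-in-seven most likely, the two sweeps least likely):

```python
# Sketch: turning hypothetical decimal odds for the eight series
# outcomes into normalized implied probabilities. These odds are
# INVENTED for illustration, not the actual 2011 sports-book lines.
hypothetical_odds = {   # outcome: decimal odds (total payout per $1 staked)
    "DAL in 4": 34.0, "DAL in 5": 12.0, "DAL in 6": 7.0, "DAL in 7": 6.0,
    "MIA in 7": 5.0,  "MIA in 6": 5.5,  "MIA in 5": 8.0, "MIA in 4": 21.0,
}

raw = {k: 1.0 / v for k, v in hypothetical_odds.items()}
vig = sum(raw.values())                         # books price in a margin, so this exceeds 1
implied = {k: p / vig for k, p in raw.items()}  # renormalize so probabilities sum to 1

for outcome, p in sorted(implied.items(), key=lambda kv: -kv[1]):
    print(f"{outcome}: {p:.3f}")
```

Systematic crowd deviations from these implied probabilities (like the Dallas-in-six surplus discussed below) are exactly the kind of signal this study is after.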

The heart of this study is to analyze the text that these 165 Twitter users have offered in other tweets (not the specific single prediction itself) and see if there is a fingerprint for reliability in the text that people generate.

Here are the observations that I have previously noted:

1) The people who made the rashest and most-incorrect predictions, a four-game series, differed from the crowd in being less prone to punctuate their tweets. (Note: There were only six such individuals, so this finding may not be significant.) They did not stand out in other obvious ways, such as the use of capitalization.

2) The people who made more extreme predictions (four or five-game series; including the "5s" raised the sample size to 32) were more likely to use the modal verb "must" than the modal verb "might." This suggests that some people carry a black-and-white worldview versus those who see things in shades of gray; the "must" people predicted a lopsided outcome despite the less extreme judgment of Las Vegas (and the actual outcome of the series).

3) While the frequency of the eight possible predictions roughly matched the Las Vegas probabilities (Pearson correlation r=0.56), the crowd deviated from this primarily in over-predicting Dallas-in-six. Interestingly, this ended up being the actual outcome of the series! Is there a significant minority out there (about 25% of the individuals) who know better than Vegas?!
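The observations above rest on simple per-user text features. A minimal sketch of the kind of feature extraction involved -- punctuation rate, mixed capitalization, and must/might counts -- might look like the following. The exact definitions used in the original analysis aren't published here, so these are illustrative reconstructions:

```python
import re
from collections import Counter

def tweet_features(tweets):
    """Crude per-user features of the kind discussed in this study.
    These definitions are illustrative guesses, not the study's own."""
    text = " ".join(tweets)
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(w.lower() for w in words)
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return {
        # fraction of characters that are punctuation marks
        "punct_rate": sum(c in ".,;:!?" for c in text) / n_chars,
        # fraction of words with mixed capitalization (initial capital,
        # not ALL-CAPS) -- the style examined in observation 1 and B below
        "cap_rate": sum(w[0].isupper() and not w.isupper() for w in words) / n_words,
        # positive = black-and-white "must" talk outweighs hedged "might"
        "must_vs_might": counts["must"] - counts["might"],
        "mean_tweet_len": sum(len(t) for t in tweets) / max(len(tweets), 1),
    }

feats = tweet_features(["The Mavs MUST win tonight!", "no way they lose"])
```

Running such a function over each user's tweet history (excluding the prediction tweet itself) yields the feature table that the correlations below are computed from.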

Now, two observations that I have not reported previously:

4) Do those Dallas-in-six predictors show some sign of brilliance in their vocabulary? The words most apt for those people to use more often than the other 75% of individuals were: {awesome, follow, through, morning, maybe}. The words they used less often than the others were: {watching, fun, free}. If there's anything plausible about a worldview here (as with the "must" vs. "might" observation) I don't see it.

5) We can rate the full set of predictions according to correctness (Dallas-in-six being exactly right, and every other prediction being a certain number of games away from it, ranging from one game off to five games off). Then we can correlate correctness with gross properties of the individuals' tweets:

A) Tweet length: people who typed longer tweets were more accurate than those who typed shorter tweets (Pearson correlation, r=0.48).

B) Mixed capitalization: people who more often capitalized just the first letters of words (rather than leaving whole words lowercase or all-uppercase) were more accurate (r=0.46).

C) An overall measure of "fluency" -- the use of common English function words like "and" and "be" -- correlated only slightly with correctness (r=0.08).

I stress again that this study was not painstakingly scientific, but I would like to use it as a pathfinder toward more informative studies in the coming months. An ambitious-enough goal would be to assess which individuals write in such a way as to seem systematically deluded, showing the world that they dispense with facts and wisdom and draw their own conclusions anyway. A yet more ambitious goal would be to distinguish individuals of exceptional reliability from those of average reliability, and perhaps to use the crowd as a predictor of the future that is better than anyone has yet systematically recognized (imagine if this led to predictions smarter than those Las Vegas tends to post).

While it will be fun and informative to continue this work with other sports events (a ready source of quantitative predictions that can be graded objectively), it would be even more rewarding to evaluate the soundness of predictions regarding politics, policy, technology, and science. It would of course be useful to analyze qualities that are deeper and more meaningful than punctuation. And I will confess to an ultimate goal of collecting empirical statistics on the soundness of various kinds of higher reasoning. Can we do the same as I've done here with arguments that intelligent astronomers made in the Twenties through Fifties, grading those predictions with the correct answers that we now in many cases have? Can we have a sort of truth-o-meter based on this sort of empirical work? Or, at least, can we show people that if they write badly they will seem less reliable? Watch this space in the months to come.
