New data: old jokes

One of the joys of data science / Big Data / the quantitative turn* in my area is the chance to recycle ancient maths jokes. So, for instance, take this old chestnut:

Theorem: All positive integers are interesting.

Proof: Assume the contrary. If that is the case, there must be an integer x such that x is the smallest integer that is not interesting in any way. But that makes x a pretty interesting number – contradiction! Therefore all positive integers are interesting.

We can apply that to data dredging – the process of ploughing through a large dataset without much of an idea of what you’re looking for, in search of ‘interesting’ findings – along with some mischief about probability, and we get:

Theorem: All data dredging exercises yield interesting findings.

Proof: Assume the contrary. If that is the case, a data dredging exercise x could find no significant correlations despite testing many hundreds, if not thousands, of potential correlations. But we’d expect 5% of correlations to be significant at the p < 0.05 level by pure chance. To look so hard and find so few would be extremely unusual. So that would be a pretty interesting finding from exercise x – contradiction! Therefore all data dredging exercises yield interesting findings.
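For anyone who wants to check the mischief numerically, here is a minimal sketch of how unusual an empty-handed dredge would be. It assumes every test is independent and that no real effect exists anywhere in the data, neither of which the proof actually establishes; the function names `prob_no_hits` and `expected_hits` are made up for illustration.

```python
def expected_hits(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of chance 'significant' results among n_tests null tests."""
    return n_tests * alpha


def prob_no_hits(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that none of n_tests tests comes out significant.

    Assumption: the tests are independent and every null hypothesis is true.
    """
    return (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(
            f"{n:>5} tests: expect {expected_hits(n):.1f} chance hits, "
            f"P(no hits at all) = {prob_no_hits(n):.3g}"
        )
```

Under those assumptions, the chance of finding nothing at all shrinks geometrically as the number of tests grows, which is the whole joke.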

May your data analysis in 2014 be at least that interesting.

* If nobody else has used ‘the quantitative turn’ as a phrase to characterise what’s happening with analytics at the moment, particularly in education, I am totally claiming bagsies. Although actually I’m more keen on an empirical turn than a quantitative one per se.

This work by Doug Clow is copyright but licensed under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.

University of Autocomplete

With tuition fees so high, it’s never been more important to give prospective students good information about university. This report is the result of a highly sophisticated social media sentiment mining and semantic content analysis, providing summary overviews. Or, in other words, I put “X is” and “X are” into Google and picked the things that seemed most helpful from the autocomplete list that popped up. (Previous related work explored the geography of the UK, and human nature. It’s also been used as a source of found poetry.)


Here, for the first time, is Autocomplete University: a comprehensive exploration of the collective Internet view of the university in all its component parts. Universities did reasonably well when this methodology was applied to US States. Sadly, things don’t look quite so rosy for universities themselves. The study found: “The academy is almost here. Academics are jerks.” There’s a mixed picture for students: “undergraduates are miserable”, but on the other hand “graduates are prepared for the world of work”. Some highlights:

  • Classics is a good degree. Classicists are smart.
  • Biology is destiny. Biologists create zombie cells. Genetics is not an excuse.
  • Psychology is bullshit. Psychologists are least likely to suggest that.
  • Linguistics is hard. Linguists are hot.
  • Dentistry is a bad career. Dentists are sadists.
  • Librarians are hiding something: Library is no longer working.
  • VCs are liars. VCs are not your friends.
  • Research is the door to tomorrow. Door is stuck.
  • Web pages are not responding. Forums are full of idiots. Instant messaging is available only from AOL.

Without further ado, here is the complete report.
