COVID-19 coronavirus and data

How does COVID-19 coronavirus illustrate timeless truths about data?

This is a version of two Twitter threads: one on how the outbreak illustrates timeless truths about data, and the other on authoritative information about the outbreak.

NB I do have some health and medical background but I am a data and learning professional, not a clinician, epidemiologist, or public health person.

Canarian Raven
“The virus is called COVID-19. No R. Call it CORVID one more time and I’ll peck your eyes out.”

Data science, BI and in fact any statistics all start with simple counting. That first bit is surprisingly hard, and getting it right is often most of the work. The current COVID-19 coronavirus outbreak shows this up nicely.

You can make as sophisticated a model as you like for an outbreak, to estimate things like the basic reproduction number (R_0: how many new cases each case is expected to cause, on average, which tells you how far it’s spreading), the case fatality rate (what proportion of the people who catch it die of it), the likely extent of the outbreak (how many people might catch it), and all sorts of stuff about how fast this is all happening or likely to happen.
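To see how much the simple counting matters, here is a minimal sketch of the naive case fatality rate calculation. All the figures are invented for illustration, not real outbreak data; the point is that the same deaths divided by a differently-counted case total gives a very different answer.

```python
# Illustrative only: these figures are made up, not real outbreak data.

def naive_cfr(deaths, cases):
    """Naive case fatality rate: deaths divided by confirmed cases."""
    return deaths / cases

deaths = 100
cases_narrow = 4_000    # e.g. only lab-confirmed cases counted
cases_broad = 15_000    # e.g. clinically diagnosed cases included too

print(f"Narrow counting: CFR = {naive_cfr(deaths, cases_narrow):.1%}")  # 2.5%
print(f"Broad counting:  CFR = {naive_cfr(deaths, cases_broad):.1%}")   # 0.7%
```

Same model, same deaths; the headline rate more than triples depending on what counts as a case.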

But all that crucially depends on simple counting: how many have it at a given point, how many have died, etc. With the latest figures, we see how hard that is. The number of new cases in China jumped from 2,000 on Weds to 15,500 Thurs – because counting methodology was changed.

The numbers have changed retrospectively, too – I copied those numbers down yesterday, but today it looks like the increase was 400 on Weds and 15,100 on Thurs. This would be hard to get right even if there weren’t a massive health crisis there.
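Retrospective revisions are especially awkward because most headline figures are daily *increases* computed from cumulative totals, so revising one day’s total silently changes two days’ increases. A sketch, with cumulative totals invented to mirror the kind of revision described above:

```python
# Hypothetical cumulative case totals for the same three days,
# as reported on two successive dates (numbers invented for illustration).
report_day1 = {"Tue": 44_600, "Wed": 46_600, "Thu": 62_100}
report_day2 = {"Tue": 46_200, "Wed": 46_600, "Thu": 61_700}

def daily_increases(cumulative):
    """New cases per day, from cumulative totals given in date order."""
    days = list(cumulative)
    return {later: cumulative[later] - cumulative[earlier]
            for earlier, later in zip(days, days[1:])}

print(daily_increases(report_day1))  # {'Wed': 2000, 'Thu': 15500}
print(daily_increases(report_day2))  # {'Wed': 400, 'Thu': 15100}
```

The Wednesday ‘jump’ shrinks from 2,000 to 400 purely because Tuesday’s total was revised upwards after the fact.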

Counting infections is always tricky, but surely deaths are easier? Turns out there are surprisingly difficult edge cases at the edge of life, but those don’t come up often, and almost everyone agrees about most deaths.

But even then, you’re probably getting data from multiple sources and combining them and that can lead to problems. Like today’s news that 108 deaths have been removed from the figures because they were double-counted.
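The double-counting problem is easy to state and fiddly to fix: it only goes away if the sources share an identifier you can de-duplicate on. A sketch with invented records, keyed on a hypothetical patient ID:

```python
# Sketch of merging death records from two sources.
# Records and identifiers are invented for illustration.
source_a = [("patient-001", "2020-02-10"), ("patient-002", "2020-02-11")]
source_b = [("patient-002", "2020-02-11"), ("patient-003", "2020-02-12")]

merged_naive = source_a + source_b                # double-counts patient-002
merged_dedup = sorted(set(source_a) | set(source_b))

print(len(merged_naive))   # 4 -- overstates the death toll
print(len(merged_dedup))   # 3 -- each record counted once
```

In practice the hard part is that real sources rarely share a clean common key, which is exactly how double-counted deaths end up in official figures.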

It’s easy as a data scientist to say we need to invest in better data, and sometimes that’s right. But getting good basic counting data is hard, and expensive, and cannot be the absolute priority. The data you’re dealing with will always be messy to some degree.

Speaking of degrees, this crops up in education and learning. ‘How many learners do we have right now?’ is the basic question that is the denominator for pretty much any learning or teaching metric you care about.

And that is surprisingly hard to answer sometimes. There are late registrations, retrospective registrations, de-registrations, retrospective de-registrations, provisional versions of all those, and that’s just dealing with individuals.

When you have organisations buying in learning, it gets even worse: how many are provisionally ordered, how many are finally ordered, how many are catered for, how many show up, and how many are invoiced for are all different, and not the same as how many learned anything.
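Put concretely: the same cohort gives you a different headcount depending on which definition of ‘learner’ you pick. A sketch with an invented cohort and hypothetical status flags:

```python
# Invented cohort: each learner's status flags (field names hypothetical).
learners = [
    {"name": "A", "provisional": True, "confirmed": True,  "attended": True,  "invoiced": True},
    {"name": "B", "provisional": True, "confirmed": True,  "attended": False, "invoiced": True},
    {"name": "C", "provisional": True, "confirmed": False, "attended": False, "invoiced": False},
]

# 'How many learners do we have?' -- four flags, four different answers.
counts = {flag: sum(l[flag] for l in learners)
          for flag in ("provisional", "confirmed", "attended", "invoiced")}
print(counts)  # {'provisional': 3, 'confirmed': 2, 'attended': 1, 'invoiced': 2}
```

Whichever of those you pick as the denominator changes every per-learner metric downstream.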

Speaking of invoices, cash at least should be easy to count? It should be clear when a customer paid us, right? Oh, my sweet summer child. That sound you hear is the entire accounting profession sniggering.

Suffice to say that the same payment can legitimately have different dates for cashflow, annual accounting, VAT, other taxes, and who knows what other purposes. Organisations are incentivised to manipulate this data, and most organisations respond to incentives.
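A sketch of what that looks like as data: one payment record carrying several legitimate dates, so ‘which month’s revenue is this?’ has different correct answers depending on the accounting basis. The field names and figures are invented for illustration:

```python
from datetime import date

# Invented example: one payment, several legitimate dates.
payment = {
    "amount": 1200.00,
    "invoice_date": date(2020, 3, 30),   # accrual basis: revenue in March
    "received_date": date(2020, 4, 2),   # cash basis: revenue in April
}

def month_of(d):
    """The (year, month) a date falls in, for period-based counting."""
    return (d.year, d.month)

print(month_of(payment["invoice_date"]))   # (2020, 3)
print(month_of(payment["received_date"]))  # (2020, 4)
```

Sum a year’s payments by one date field and you get a different annual total than by the other – both defensible, and both open to being gamed.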

Summarising a wide ramble: Even the simplest of data, like ‘How many people have COVID-19?’ can be surprisingly hard to get authoritatively. Getting better data is rarely a business imperative. Be cautious about interpreting your advanced statistical models.

Also, be kind to people who are working hard to do really difficult jobs in really difficult circumstances. And don’t make it harder by spreading misinformation.

Check with authoritative sources before passing on information. It does not help to spread stuff you think might be dodgy or far-fetched ‘just in case’. Most people who pass on misinformation don’t mean to cause problems. Check it’s right first. You can help protect your friends and colleagues from this hazard.

So, having said that, how do you know what’s right? Check with authoritative sources, as Wikipedia always tells you. Wikipedia has excellent info, which is being updated rapidly as the situation changes:

In the UK the risk of infection is very low, as in most places outside Hubei.

In short: If you think you may have the virus, stay where you are and call 111.

Outside the UK, there are authoritative sources like the US CDC and the WHO.

For hard research info, there’s stuff on the WHO site (currently under ‘technical information’ and ‘global research’), and many publishers have made research freely available; some have free-access portals on the topic, e.g.

If you like statistics and numbers, here’s some good aggregation from Johns Hopkins University, with a few visualisations and a link to a well-maintained GitHub CSV.  
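If you pull that CSV data yourself, note that time-series files of this kind are typically wide format: one row per region, one column per date of cumulative counts, which you usually want to reshape into long form before analysis. A self-contained sketch using an invented sample shaped like such a file (not the real data, and the real column names may differ):

```python
import csv
import io

# Invented sample shaped like a wide time-series CSV
# (one row per region, one cumulative-count column per date).
raw = """\
Province/State,Country/Region,2/10/20,2/11/20,2/12/20
Hubei,China,100,150,230
,Singapore,2,2,3
"""

rows = list(csv.DictReader(io.StringIO(raw)))
date_cols = [c for c in rows[0] if c not in ("Province/State", "Country/Region")]

# Reshape to long form: one (region, date, cumulative_count) record per cell.
long_form = [(r["Country/Region"], d, int(r[d])) for r in rows for d in date_cols]
print(long_form[:3])  # [('China', '2/10/20', 100), ('China', '2/11/20', 150), ('China', '2/12/20', 230)]
```

Long form makes it straightforward to filter by region, compute daily increases, or join with other data.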

This data is pretty good, but don’t treat it – or any data! – as representing the objective truth.

This work by Doug Clow is copyright but licensed under a Creative Commons BY Licence.
No further permission is needed to reuse or remix it (with attribution), but it’s nice to be notified if you do use it.

Author: dougclow

Experienced project leader, data scientist, researcher, analyst, teacher, developer, educational technologist and manager. I particularly enjoy rapidly appraising new-to-me contexts, and mediating between highly technical specialisms and others, from ordinary users to senior management. After 20 years at the OU as an academic, I am now a self-employed consultant, building on my skills and experience in working with people, technology, data science, and artificial intelligence, in the education field and beyond.
