How does COVID-19 coronavirus illustrate timeless truths about data?
This is a version of two Twitter threads: one on how it illustrates timeless truths about data https://twitter.com/dougclow/status/1228248715728740352 and the other about authoritative information on the outbreak https://twitter.com/dougclow/status/1228246890287968256.
NB I do have some health and medical background but I am a data and learning professional, not a clinician, epidemiologist, or public health person.
Data science, BI and in fact any statistics all start with simple counting. That first bit is surprisingly hard, and getting it right is often most of the work. The current COVID-19 coronavirus outbreak shows this up nicely.
You can make as sophisticated a model as you like for an outbreak, to estimate things like the basic reproduction ratio (R_0: how many new cases each case is expected to cause, on average, which tells you how readily it spreads), the case fatality rate (what proportion of people who catch it die of it), or the likely extent of the outbreak (how many people might catch it), along with all sorts of stuff about how fast this is all happening or likely to happen.
But all that crucially depends on simple counting: how many have it at a given point, how many have died, etc. With the latest figures, we see how hard that is. The number of new cases in China jumped from 2,000 on Weds to 15,500 on Thurs – because the counting methodology was changed.
The numbers have changed retrospectively, too – I copied those numbers down yesterday, but today it looks like it was a 400 increase on Weds and a 15,100 on Thurs. This would be hard to get right even if there wasn’t a massive health crisis there.
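To make the retrospective change concrete, here is a minimal Python sketch that compares two snapshots of the same daily series. The figures are the rounded ones quoted above, not an official dataset, and the names `snapshot_yesterday`, `snapshot_today` and `revisions` are just illustrative.

```python
# Daily new-case counts for China as copied down on two consecutive days.
# These are the rounded figures quoted in the text, not an official dataset.
snapshot_yesterday = {"Wed": 2_000, "Thu": 15_500}
snapshot_today = {"Wed": 400, "Thu": 15_100}


def revisions(old, new):
    """Return the per-day change between two snapshots of the same series."""
    return {day: new[day] - old[day] for day in old if day in new}


changes = revisions(snapshot_yesterday, snapshot_today)
for day, delta in changes.items():
    print(f"{day}: revised by {delta:+,}")
```

Keeping dated snapshots like this, rather than overwriting yesterday’s numbers, is one simple way to see how much the “same” figure has moved.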
Counting infections is always tricky, but surely deaths are easier? Turns out there are surprisingly difficult edge cases at the edge of life, but those don’t come up often, and almost everyone agrees about most deaths.
But even then, you’re probably getting data from multiple sources and combining them and that can lead to problems. Like today’s news that 108 deaths have been removed from the figures because they were double-counted.
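As a rough illustration of how double-counting creeps in, the sketch below combines two lists of reports that overlap. Everything here is hypothetical: the source names and patient IDs are invented, and it assumes both sources share a reliable identifier, which real-world sources often do not.

```python
# Hypothetical death reports from two reporting channels; all IDs are invented.
hospital_reports = ["p001", "p002", "p003"]
registry_reports = ["p002", "p003", "p004"]

# Naive approach: add the two counts together.
naive_total = len(hospital_reports) + len(registry_reports)

# De-duplicated approach: count each identifier once across both sources.
unique_total = len(set(hospital_reports) | set(registry_reports))

double_counted = naive_total - unique_total
print(f"naive={naive_total} unique={unique_total} double-counted={double_counted}")
```

When there is no shared identifier, matching records becomes a judgement call, and that is where the hard work is.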
It’s easy as a data scientist to say we need to invest in better data, and sometimes that’s right. But getting good basic counting data is hard, and expensive, and cannot be the absolute priority. The data you’re dealing with will always be messy to some degree.
Speaking of degrees, this crops up in education and learning. ‘How many learners do we have right now?’ is the basic question that is the denominator for pretty much any learning or teaching metric you care about.
And that is surprisingly hard to answer sometimes. There are late registrations, retrospective registrations, de-registrations, retrospective de-registrations, provisional versions of all those, and that’s just dealing with individuals.
When you have organisations buying in learning, it gets even worse: how many are provisionally ordered, how many are finally ordered, how many are catered for, how many show up, and how many are invoiced for are all different, and not the same as how many learned anything.
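One way to see why the headcount is slippery is to track two dates for every event: when it takes effect, and when it was recorded. The sketch below is a minimal illustration with invented learners and dates; the `Registration` class and `headcount` function are hypothetical, not any real student-records system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Registration:
    learner: str
    starts: date                         # date the registration takes effect
    recorded: date                       # date it was entered in the system
    ends: Optional[date] = None          # date a de-registration takes effect
    end_recorded: Optional[date] = None  # date the de-registration was entered


def headcount(regs, on, known_by):
    """Count learners registered on `on`, using only what was recorded by `known_by`."""
    total = 0
    for r in regs:
        if r.recorded > known_by or r.starts > on:
            continue
        ended = (
            r.ends is not None
            and r.ends <= on
            and r.end_recorded is not None
            and r.end_recorded <= known_by
        )
        if not ended:
            total += 1
    return total


regs = [
    Registration("A", starts=date(2020, 2, 1), recorded=date(2020, 1, 20)),
    # Retrospective registrations: entered after they take effect.
    Registration("B", starts=date(2020, 2, 1), recorded=date(2020, 2, 10)),
    Registration("C", starts=date(2020, 2, 1), recorded=date(2020, 2, 17)),
    # Retrospective de-registration: entered after it takes effect.
    Registration("D", starts=date(2020, 1, 15), recorded=date(2020, 1, 10),
                 ends=date(2020, 2, 1), end_recorded=date(2020, 2, 12)),
]

census_day = date(2020, 2, 3)
as_seen_then = headcount(regs, census_day, known_by=census_day)
as_seen_later = headcount(regs, census_day, known_by=date(2020, 2, 29))
print(as_seen_then, as_seen_later)
```

The question is identical both times (“how many learners on 3 February?”), but the answer depends on when you ask it.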
Speaking of invoices, cash at least should be easy to count? It should be clear when a customer paid us, right? Oh, my sweet summer child. That sound you hear is the entire accounting profession sniggering.
Suffice to say that the same payment can legitimately have different dates for cashflow, annual accounting, VAT, other taxes, and who knows what other purposes. Organisations are incentivised to manipulate this data, and most organisations respond to incentives.
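A toy example shows how much the choice of date matters. The payments, amounts and the field names `invoiced` and `received` below are all invented for illustration, and they do not represent any actual accounting or tax rule.

```python
from collections import defaultdict
from datetime import date

# Hypothetical payments; each carries more than one date that could count as "when".
payments = [
    {"amount": 1000, "invoiced": date(2020, 1, 28), "received": date(2020, 2, 3)},
    {"amount": 500, "invoiced": date(2020, 2, 10), "received": date(2020, 2, 14)},
]


def monthly_totals(rows, date_field):
    """Sum amounts per (year, month), bucketed by the chosen date field."""
    totals = defaultdict(int)
    for row in rows:
        d = row[date_field]
        totals[(d.year, d.month)] += row["amount"]
    return dict(totals)


by_invoice = monthly_totals(payments, "invoiced")
by_receipt = monthly_totals(payments, "received")
print(by_invoice)
print(by_receipt)
```

Same payments, same total, but January’s income is either 1,000 or nothing depending on which date you bucket by.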
Summarising a wide ramble: Even the simplest of data, like ‘How many people have COVID-19?’ can be surprisingly hard to get authoritatively. Getting better data is rarely a business imperative. Be cautious about interpreting your advanced statistical models.
Also, be kind to people who are working hard to do really difficult jobs in really difficult circumstances. And don’t make it harder by spreading misinformation.
Check with authoritative sources before passing on information. It does not help to spread stuff you think might be dodgy or far-fetched ‘just in case’. Most people who pass on misinformation don’t mean to cause problems. Check it’s right first. You can help protect your friends and colleagues from this hazard.
So, having said that, how do you know what’s right? Check with authoritative sources, like Wikipedia always tells you. Wikipedia has excellent info, which is being updated rapidly as the situation changes: https://en.wikipedia.org/wiki/2019%E2%80%9320_Wuhan_coronavirus_outbreak
In the UK the risk of infection is very low, as in most places outside Hubei.
- Info on the situation from the UK Gov https://www.gov.uk/guidance/wuhan-novel-coronavirus-information-for-the-public
- Advice from the NHS https://www.nhs.uk/conditions/coronavirus-covid-19/
In short: If you think you may have the virus, stay where you are and call 111.
Outside the UK, there are authoritative sources like the US CDC https://www.cdc.gov/coronavirus/index.html and the WHO https://www.who.int/emergencies/diseases/novel-coronavirus-2019
For hard research info, there’s stuff on the WHO site (currently under ‘technical information’ and ‘global research’) https://www.who.int/emergencies/diseases/novel-coronavirus-2019 and many publishers have made research freely available, and some have free-access portals on the topic, e.g.
- Lancet https://www.thelancet.com/coronavirus
- Cell Press https://www.cell.com/2019-nCOV
- Elsevier https://www.elsevier.com/connect/coronavirus-information-center
If you like statistics and numbers, here’s some good aggregation from Johns Hopkins University, with a few visualisations and a link to a well-maintained GitHub CSV. https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
This data is pretty good, but don’t treat it – or any data! – as representing the objective truth.
This work by Doug Clow is copyright but licensed under a Creative Commons BY Licence.
No further permission is needed to reuse or remix it (with attribution), but it’s nice to be notified if you do use it.