These are some of the principles I’ve drawn from more than 20 years’ experience in what we now call data science: using data to understand and improve human systems. They’re more guidelines than rules, particularly the ones rating one thing above another.
I’ve split this into two lists. This one is aimed at project sponsors, but obviously it’d be useful for anyone who has to interact with project sponsors. I have another one for project leaders. The idea is that this one has more of the things you need to bear in mind in setting up and resourcing a project and connecting it to the rest of the organisation, and the other has more of the things you need to bear in mind in getting the project done.
The numbers are people
Don’t let a focus on the numbers distract you from what really matters. Data projects are human projects. Data is only valuable in so far as it’s valuable to people. (Here’s an illustration of some of what I mean by ‘the numbers are people’ from 2013.)
There’s no use building a technically brilliant data system if it doesn’t actually help real people do something they want to do. The interplay between the technical system and the human systems should be a central focus of any project, not a side issue or left for later work.
Empower humans, don’t turn them into meat robots. Make the computers do the drudgery and get the humans to do the more meaningful, higher-value work. That usually means giving them real power and autonomy. ‘Putting a human in the loop’ is a great idea when it means exactly that, with excellent support from the technical system. But if they’re just babysitting an automated process and trying to spot the rare occasions it goes wrong, you’re setting them – and the entire system – up for failure.
Just because you can doesn’t mean you should
Data can be enormously powerful, and with that great power comes great responsibility.
Modern data capabilities open up all sorts of possibilities, some of which are life-enhancing and some of which are nightmarish. Whether or not it’s profitable to build a system that tries to infer people’s toilet habits and posts them online is a poor guide to whether you should make one.
There are legal responsibilities here (e.g. GDPR), and often policy requirements. But it is legal to do all sorts of bad things. I can’t tell you to be a better person than you are, but if you aren’t instinctively humane, remember that organisations are dependent on their ‘social licence to operate’, and that can be lost very quickly.
Even if it’s not inappropriate, it might not be worth doing in resource terms. It doesn’t matter how good your model’s root-mean-square error is, or how brilliant your classifier’s precision and recall stats are, if the whole thing doesn’t make business sense.
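To make that concrete, here is a rough back-of-envelope sketch of the arithmetic, not a real costing model. Everything in it is an assumption for illustration: the function name, the parameters and the figures are all made up, and it ignores the cost of the cases the classifier misses.

```python
def net_value_per_period(flagged, precision, value_per_true_positive,
                         cost_per_review, running_cost):
    """Rough net value of acting on a classifier's flags over one period.

    All parameters are hypothetical, for illustration only:
      flagged                 -- number of cases the classifier flags
      precision               -- fraction of flagged cases that are real (0 to 1)
      value_per_true_positive -- money gained by acting on a real case
      cost_per_review         -- staff cost of looking at any flagged case
      running_cost            -- cost of building and running the system
    """
    true_positives = flagged * precision
    benefit = true_positives * value_per_true_positive
    review_cost = flagged * cost_per_review
    return benefit - review_cost - running_cost


if __name__ == "__main__":
    # Made-up figures: a classifier with 90% precision that still loses money,
    # because reviewing each flag costs nearly as much as a real case is worth.
    print(net_value_per_period(
        flagged=1000,
        precision=0.9,
        value_per_true_positive=20.0,
        cost_per_review=15.0,
        running_cost=5000.0,
    ))
```

With these invented numbers the result comes out negative, even though 90% precision would look excellent on a model report. The point is only that the business case depends on costs and values that sit outside the model’s own metrics.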
Inclusion over speed
You can get something done much faster if you don’t involve many other people. But if you build a customer support insight dashboard without involving any of the people who actually do the job, it might well turn out to be no help at all – and instead of them better supporting customers, they’re now resentful at having had a useless system foisted on them and you have a turnover problem.
The best projects include all stakeholders, and that diversity of voices enables you to build something that works for everyone. You already know your team needs diversity in terms of more visible characteristics like race and gender. This applies all the more to data projects. And you also need a diversity of work roles.
Action over insight
Don’t just analyse, do something!
There’s a big emphasis in data science and business intelligence on ‘actionable insight’, which is great. But there’s no point investing in a system giving you actionable insight if you have no actions to take, or no resource to take them.
If you know you have problems with, say, some of your branches, better to invest in fixing those problems than in an expensive data-driven anomaly detection system that simply beleaguers your overworked staff with things they already know but can’t do anything about.
A data system can tell you where the problems are. A good one can tell you something about the nature of those problems. But it won’t fix them on its own.
(A really great data system can also tell you how well different fixes worked on similar problems in the past.)
Open over closed
Transparency and openness are very valuable: they let outsiders know what you’re doing and be reassured about your activity. They also enable you to be part of a wider community working together in good faith.
However, an open approach is often even more valuable in how much it helps the organisation work internally. Most outsiders don’t care about what you’re doing, but insiders do. They can do much more if they have fewer hoops to jump through and don’t have to wait for permission to get at things they need to do their jobs better.
As with all of these, this is a preference not an absolute: in some contexts a closed approach is absolutely required and appropriate. And there are degrees in between: often ‘open to all in the organisation’ is a good balance.
Data is a liability
Collecting more data is a bit like leasing more buildings. Even if the nominal cost is cheap, keeping them maintained and available for use on short notice is a significant resource drain, and if you neglect that, they’ll deteriorate more rapidly than you might imagine.
The more data you have, the more you need to spend on curating it, and the harder it will be to find the data that’s valuable.
Data can also be a liability in the hard legal sense. The more you have, the longer it takes to audit it and be confident you are discharging your responsibilities.
Anonymisation is hard. Security is hard. Those deserve full posts of their own, but as a summary, Cory Doctorow’s view that “the best way to secure data is never to collect it in the first place” gets you a long way.
If the data isn’t valuable – and looked after – it’s worse than useless.
The numbers are people
This is the first and last point on purpose.
I’m a data person. It’s great to get into the technical details and pore over the data. People who’ve worked with me often remark that you can always make me happy by showing me the data.
But I’m also a people person, and the data only matters because it affects people. It’s vital to remember what the numbers represent and why you care about them.
A final word from our sponsor
If you like the sound of this approach, I am currently available for data science consultancy.