These are some of the principles I’ve drawn from more than 20 years’ experience in what we now call data science: using data to understand and improve human systems. They’re more guidelines than rules, particularly the ones rating one thing above another.
I have two lists: this one for project leaders and managers, and that one for project sponsors. This one is focused on issues around how you deliver a project and what sort of a project it is; the project sponsor one is more about how you frame and resource a project. Obviously, they overlap and have great synergy.
The numbers are people
This was the first and last item on the list for project sponsors, and it applies here just as much. Obviously, a project has to deliver what it’s set up to deliver, but those objectives matter because they matter to people, and how it’s delivered matters to people.
Comprehensibility over precision
A simple, comprehensible metric is often better than a more precise but complex one that nobody understands apart from the clever techie who devised it. Part of the technical work of building a machine learning model (feature selection) tells you which of your many data fields carry the most predictive or actionable value. In many situations, most of that value comes from one or two simple measures – so consider working directly with those, rather than with the output of a big machine learning system.
For example, it’s much easier to understand that a user who’s not logged in for ages might not come back than it is to make sense of a predicted dropout propensity probability yielded by a weighted vote of a panel of six machine learning algorithms using a dataset with thirty derived features from the complete interaction log datastream that etc etc …
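As a minimal sketch of how simple such a measure can be, here is a "days since last login" check in Python. The 30-day threshold, the function names, and the sample users are all made up for illustration; what counts as "ages" will depend on your product.

```python
from datetime import date

# Hypothetical threshold: how long counts as "ages" is an assumption to tune.
INACTIVE_AFTER_DAYS = 30


def days_since_last_login(last_login: date, today: date) -> int:
    """Whole days between a user's last login and today."""
    return (today - last_login).days


def at_risk_users(last_logins: dict[str, date], today: date,
                  threshold_days: int = INACTIVE_AFTER_DAYS) -> list[str]:
    """Users whose last login is more than threshold_days ago, sorted by name."""
    return sorted(
        user for user, last in last_logins.items()
        if days_since_last_login(last, today) > threshold_days
    )


# Illustrative data only, not from any real system.
last_logins = {
    "alice": date(2024, 5, 30),
    "bob": date(2024, 3, 1),
    "carol": date(2024, 4, 20),
}
print(at_risk_users(last_logins, today=date(2024, 6, 1)))  # ['bob', 'carol']
```

Anyone on the team can read that rule, argue about the threshold, and check it against what they see day to day.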
And when you do go for machine learning, favour interpretable algorithms such as decision trees over black boxes like neural networks, even at the cost of some precision. A system involving humans works better when humans understand it. If you’re into that stuff, it can be great fun playing with representations of the middle layers of neural networks, but that’s not something you should put in front of people trying to get their job done.
Nowcasts over forecasts
As the old saying has it, prediction is hard, particularly about the future. But it can be surprisingly valuable to know what’s happening right now and what happened in the immediate past. It’s also often much more accurate.
You’ll notice that weather forecasts spend about half their time telling you what the weather is right now rather than what their guess is about what will happen. This is in truth a great technical achievement, requiring a huge investment in data capture (satellites, weather stations) and sensemaking.
So, for instance, it’s much easier to make a system that can accurately tell you what your performance figures are right now than one that can predict what they’ll be tomorrow.
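To give a sense of scale, a basic nowcast can be little more than a summary of the most recent window of data. The sketch below assumes a hypothetical event log where each record has a start time and a completed flag; the field names and the 24-hour window are my assumptions, not a standard.

```python
from datetime import datetime, timedelta


def nowcast_completion_rate(events, now, window=timedelta(hours=24)):
    """Share of tasks started within the last `window` that were completed.

    Returns None when there is no recent data, rather than guessing.
    """
    recent = [e for e in events if now - window <= e["started"] <= now]
    if not recent:
        return None
    return sum(e["completed"] for e in recent) / len(recent)


# Illustrative records only; the "started" and "completed" fields are assumed.
events = [
    {"started": datetime(2024, 6, 1, 9, 0), "completed": True},
    {"started": datetime(2024, 5, 31, 20, 0), "completed": True},
    {"started": datetime(2024, 5, 31, 18, 0), "completed": False},
    {"started": datetime(2024, 5, 31, 13, 0), "completed": True},
    {"started": datetime(2024, 5, 30, 10, 0), "completed": True},  # outside window
]
now = datetime(2024, 6, 1, 12, 0)
print(nowcast_completion_rate(events, now))  # 0.75
```

There is no model to train here and nothing to get wrong about the future: the hard part is getting reliable, timely data into the log in the first place.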
Whole data over sampling
If you have it all, use it all.
Statisticians, researchers, and data scientists have a wide range of tools and techniques for addressing the thorny problems around sampling theory and trying to generalise about a whole population from a sample.
You can avoid all that by simply using data from the whole population.
In the past, this was simply infeasible. If you know what you’re doing, it’s now easy to collect huge volumes of data and analyse it very quickly. It’s not always easy to do it well, of course.
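A small sketch of the difference, using synthetic data (the numbers are generated, not real): the figure computed from the whole population is a single exact answer, while each sample gives a slightly different estimate that then needs sampling theory to interpret.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic "population": 100,000 made-up scores, purely for illustration.
population = [rng.gauss(50, 15) for _ in range(100_000)]

# With the whole population there is one exact answer and no sampling error.
whole_mean = mean(population)

# Five samples of 100 each give five slightly different estimates.
sample_means = [mean(rng.sample(population, 100)) for _ in range(5)]

print(f"whole population: {whole_mean:.2f}")
for estimate in sample_means:
    print(f"sample estimate:  {estimate:.2f} (off by {estimate - whole_mean:+.2f})")
```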
Qualitative feedback from humans over quantitative
If you’re going to ask humans for feedback, it’s better to get some words than numbers and click boxes. Get the numbers from an automated system, and only bother the people themselves to find out what the numbers mean.
We are all hopelessly over-surveyed. These days everything wants tickbox feedback or a star rating, so most of us don’t bother most of the time. That means those ratings are dependent on the weird subgroup who do respond, which is bound to be wildly unrepresentative.
That’s not to say getting information from people isn’t important. But use that scarce resource to get qualitative information – the meaning behind the numbers – rather than misleading numbers from the oddballs who still fill in surveys.
Say a problem crops up with your application process. If you ask for ratings, you won’t get many and they’ll be low, but it won’t tell you why. If instead you use an automated system to flag up that successful applications have fallen sharply, open text box responses can tell you that it’s (say) because they’re getting contradictory advice from the form and your reps.
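The automated flag in that example need not be sophisticated. Here is a rough sketch that compares the latest day's count of successful applications with the average of the preceding week; the function name, the seven-day baseline, and the 50% threshold are all assumptions for illustration.

```python
from statistics import mean


def sharp_drop(daily_successes, baseline_days=7, drop_fraction=0.5):
    """True if the latest daily count is below drop_fraction of the
    average over the preceding baseline_days."""
    if len(daily_successes) < baseline_days + 1:
        return False  # not enough history to judge
    baseline = mean(daily_successes[-(baseline_days + 1):-1])
    return baseline > 0 and daily_successes[-1] < drop_fraction * baseline


# Illustrative counts of successful applications per day.
normal_week = [40, 42, 38, 41, 39, 43, 40, 37]
problem_week = [40, 42, 38, 41, 39, 43, 40, 15]

print(sharp_drop(normal_week))   # False
print(sharp_drop(problem_week))  # True
```

When the flag goes up, that is the moment to go and read the open text responses, or to ask a few applicants directly what went wrong.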
Shared understanding over formal specs
Often, one of the biggest challenges in a data project is getting to a shared understanding of what the data is, how it’s generated, and what it means.
It’s tempting to sidestep that hard work. One smart person or team ‘just doing it’ and documenting what they do is better than them not writing anything down. But that’ll be really hard for anyone else to understand, unless they’re unusually skilled at that sort of communication. And without undue stereotyping, most people who are unusually skilled at TensorFlow, scikit-learn, and Weka are not also unusually skilled at communicating with ordinary mortals.
Documentation is still important! When you have built that precious shared understanding, write it down somewhere that people can find it again.
Almost all data is useless
The key word here is ‘almost’.
A good early step in building your data capability is a data potential audit, where you review all the data you currently collect and all the data you could easily collect, and work with a wide range of stakeholders to understand it and what the potential is for using it better. When you do that, you’ll almost certainly find that almost all of it is useless.
But not all of it! Finding those needles in the haystack is hard work but can be transformatory.
A final word from our sponsor
If you like the sound of this approach, I am currently available for data science consultancy.