Michael Li (@tianhuil) / 1:00 pm MDT • August 27, 2020
Michael Li, Contributor
Tianhui Michael Li is founder of The Data Incubator, an eight-week fellowship to help PhDs and postdocs transition from academia into industry. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan and D.E. Shaw.
“Will automation eliminate data science positions?”
This is a question I’m asked at almost every conference I attend, and it usually comes from one of two groups with a vested interest in the answer. The first is current or aspiring practitioners wondering about their future employment prospects. The second is executives and managers who are just starting on their data science journey.
They have often just heard that Target can determine whether a customer is pregnant from her shopping patterns and are hoping for similarly powerful tools for their own data. And they have heard the latest automated-AI vendor pitch promising to deliver what Target did (and more!) without data scientists. We argue that automation and better data science tooling will not eliminate or even reduce data science positions, including for use cases like the Target story. If anything, they will create more of them!
Here’s why.
Understanding the business problem is the biggest challenge
The most important question in data science is not which machine learning algorithm to choose or even how to clean your data. It is the pair of questions you need to answer before a single line of code is written: What data do you choose, and what questions do you choose to ask of that data?
What is missing (or wishfully assumed) from the popular imagination is the ingenuity, creativity and business understanding that goes into those tasks. Why do we care if our customers are pregnant? Target’s data scientists had built upon substantial earlier work to understand why this was a lucrative customer demographic primed to switch retailers. Which datasets are available and how can we pose scientifically testable questions of those datasets?
Target’s data science team happened to have baby registry data tied to purchase history and knew how to connect it to broader customer spending. How do we measure success? Translating nontechnical requirements into technical questions that can be answered with data is among the most challenging data science tasks, and probably the hardest to do well. Without experienced humans to formulate these questions, we would not even be able to start on the journey of data science.
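To make that concrete, here is a minimal Python sketch, with entirely hypothetical tables and column names (not Target’s actual data or schema), of how “which customers are likely expecting?” might be posed as a testable, supervised-learning question:

```python
import pandas as pd

# Hypothetical inputs: a baby-registry table and a purchase-history table.
# Column names and values here are illustrative, not any real retailer's schema.
registry = pd.DataFrame({
    "customer_id": [1, 2],
    "registry_created": pd.to_datetime(["2020-03-01", "2020-05-15"]),
})
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "purchase_date": pd.to_datetime(
        ["2020-01-10", "2020-02-20", "2020-04-01", "2020-04-05"]),
    "category": ["lotion_unscented", "supplements", "cotton_balls", "soda"],
    "amount": [12.5, 30.0, 4.0, 2.0],
})

# The "technical question": can purchases made before a registry is created
# predict that a registry will appear?  Customers who created a registry
# become positive labels; everyone else is a (noisy) negative.
labels = purchases[["customer_id"]].drop_duplicates().merge(
    registry, on="customer_id", how="left")
labels["is_expecting"] = labels["registry_created"].notna().astype(int)

print(labels)
```

Even in this toy version, a person had to decide that registry creation is an acceptable proxy label and that non-registry customers can stand in as negatives.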
Making your assumptions
After formulating a data science question, data scientists need to outline their assumptions. This often manifests itself in the form of data munging, data cleaning and feature engineering. Real-world data are notoriously dirty and many assumptions have to be made to bridge the gap between the data we have and the business or policy questions we are seeking to address. These assumptions are also highly dependent on real-world knowledge and business context.
In the Target example, data scientists had to make assumptions about proxy variables for pregnancy, the realistic time frame of their analyses and appropriate control groups for accurate comparison. They almost certainly had to make judgment calls about which extraneous data to throw out and how to correctly normalize features. All of this work depends critically on human judgment. Removing the human from the loop can be dangerous, as we have seen with the recent spate of bias-in-machine-learning incidents. It is perhaps no coincidence that many of them revolve around deep learning algorithms, which make some of the strongest claims to do away with feature engineering.
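Here is a rough sketch of how those judgment calls show up in code. Every constant, threshold and category list below is an invented assumption for illustration; these are exactly the choices a human analyst has to make and defend:

```python
import pandas as pd

# Continuing the hypothetical example above: every constant here encodes a
# judgment call that a person, not an AutoML tool, has to justify.
LOOKBACK_DAYS = 90          # assumed "relevant" shopping window before a registry
MIN_BASKET = 5.0            # assumed threshold for discarding trivial purchases
SIGNAL_CATEGORIES = {"lotion_unscented", "supplements", "cotton_balls"}

def build_features(purchases: pd.DataFrame, registry: pd.DataFrame) -> pd.DataFrame:
    df = purchases.merge(registry, on="customer_id", how="left")

    # Assumption: only the 90 days before registry creation (or, for
    # non-registry customers, before their latest purchase) is relevant.
    cutoff = df["registry_created"].fillna(df["purchase_date"].max())
    in_window = (df["purchase_date"] <= cutoff) & (
        df["purchase_date"] >= cutoff - pd.Timedelta(days=LOOKBACK_DAYS))
    df = df[in_window]

    # Assumption: very small baskets are noise and can be thrown out.
    df = df[df["amount"] >= MIN_BASKET]

    # Assumption: spend in hand-picked "signal" categories, normalized by
    # each customer's total spend, is the feature that matters.
    df["signal_spend"] = df["amount"].where(
        df["category"].isin(SIGNAL_CATEGORIES), 0.0)
    feats = df.groupby("customer_id").agg(
        total_spend=("amount", "sum"),
        signal_spend=("signal_spend", "sum"))
    feats["signal_share"] = feats["signal_spend"] / feats["total_spend"]
    return feats.reset_index()
```

Change the lookback window or the list of signal categories and the downstream model quietly answers a different business question, which is why this step resists safe automation.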
So while parts of core machine learning are automated (in fact, we even teach some of the ways to automate those workflows), the data munging, data cleaning and feature engineering (which together comprise 90% of the real work in data science) cannot be safely automated away.
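By contrast, the genuinely automatable part is mechanical. Here is a minimal scikit-learn sketch of an automated hyperparameter sweep, using synthetic data as a stand-in for the hand-engineered features described above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for engineered features; in practice this matrix is the output
# of the human-designed cleaning and feature work discussed above.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The automatable part: mechanically sweeping model settings and
# cross-validating each candidate.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Tools like this make the model-fitting loop cheaper, but they only operate on whatever features and labels a human has already decided to give them.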
A historical analogy
History offers a clear precedent suggesting data science will not be automated away. There is another field in which highly trained humans craft code to make computers perform amazing feats. These humans command a significant premium over those without the same training, and (perhaps not surprisingly) there are education programs specializing in the skill. The economic pressure to automate this field is equally intense, if not more so. That field is software engineering.
Indeed, as software engineering has become easier, the demand for programmers has only grown. This paradox is not new: automation increases productivity, driving down prices and ultimately driving up demand. We have seen it again and again in fields ranging from software engineering to financial analysis to accounting. Data science is no exception, and automation will likely drive up demand for this skill set, not down.