Sunday 20 October 2019

Twilight of the unicorn Data Scientist: how the sexiest job of this decade is going to change


Hello and welcome to Open Citizen Data Science!

After two very interesting days of full immersion in the world of citizen data science at Alteryx Inspire in London, some trends are starting to become clear.





Data Science is about to change radically: while Data Scientists will remain an extremely important asset for any business, they are destined to be a fixture mostly of bigger companies; smaller ones will either bring them in as consultants or have staff do data science part-time among other tasks.

This is not because the value of a Data Scientist is decreasing; ironically, it is precisely because they are a valuable and scarce resource that the market is looking for alternative ways to capture at least some of that value by better empowering existing staff.

How is this happening?
Let's briefly recap the classical definition of the unicorn data scientist:


As you can see, the stereotypical data scientist is supposed to be an expert in statistics, in programming and in the business domain they work in, all at the same time.
Of course, few people can realistically attain such a broad skillset, which means they are hard to find and rightfully expensive to keep.

The need was effectively unfulfilled, so what happened in this decade to compensate for this?

For medium and larger businesses that don't need or cannot afford a full-time Data Science team, the normal reaction is to ask for help from a trusted consulting partner, which will "of course" have plenty of Data Scientists to spare:

Typically, they will be programmers (bioinformatics is also a popular starting field for Data Science consultants) with some additional statistical training, or ones who simply learnt how to run the most popular statistics libraries. How do we know this?

One indicator is the explosion of Python usage in Data Science: although this is now changing in academia as well, the typical Data Scientist with a statistical background is more likely to come from R or (less often) Matlab.
Let us remember that the original Random Forest implementation was written in Fortran, a language that is all but extinct outside academia!

Python, on the other hand, is popular among programmers for projects that need to be done in a short amount of time (typical of consulting) and in programming "boot camps".
This professional figure will know very little about how the machine learning algorithms they use work from a mathematical standpoint; to them, these are just tools of the trade, a set of libraries to load and a few parameters to tweak.

Now, this does not necessarily mean that a consultant will not be an effective solution: individual consultants are hard-working and bright people (although I have very little praise for many of the consulting firms they work for).
Consultants with a programming focus will usually solve the business problem at hand, as for most projects a profound knowledge of statistics and of the mathematical intricacies of an algorithm is not a vital skill.

This does mean, however, that consultants left to their own devices tend to be weaker on the feature engineering side, which is one of the reasons "deep learning" is often proposed even for structured data: it allows one to skimp on feature engineering and still reach decent accuracy (although at a very heavy computational cost!).

The main problem with consultants, however, tends to be a more mundane one: they are very expensive for anything but a short-term solution.
This means something more cost-effective was needed and a second solution was found in Advanced Analytics software:

Advanced Analytics means that even software library set-up is hidden away from the user and little to no programming skill is required.
This means that with very little training your typical Business Analyst with Excel and Access knowledge can now build a dataset and train machine learning models without even knowing any programming language.

As the cost of a business analyst plus software tends to be less (often a lot less) than either a Data Scientist or a consultant, this is seen by many businesses as a bargain!
However, as tempting as it may sound, this is not without a price.

One rightfully criticized side of the trade-off between skill and software is that what you gain in ease of use, you tend to lose in flexibility: instead of having hundreds of machine learning libraries at your disposal, you now get a couple dozen at best.

Does it really matter?  

Your typical business problem involving machine learning will usually involve structured data (we're not talking about Silicon Valley unicorns here) and either classification or regression, with occasionally some clustering involved. Another typical business requirement is explainability, which tends to exclude anything related to neural networks and deep learning in general.

In this setting, what you will typically need is one or a few of those:

- Linear regression
- Logistic regression
- Decision trees
- Tools for boosting or bagging when you need extra performance
- Naive Bayes
- K-means
- Nearest Neighbour
- Market basket analysis
- Random and stratified sampling
- ARIMA or similar forecasting tools
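
To put this in perspective, here is a minimal sketch (assuming Python with scikit-learn and one of its bundled toy datasets, neither of which is part of the original discussion) of how little code two items from this list actually take, with the decision tree's rules printed out for explainability:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Any structured classification dataset would do; this toy one ships with sklearn.
    data = load_breast_cancer()
    X, y = data.data, data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Logistic regression: the explainable workhorse of business classification.
    logit = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print("Logistic regression accuracy:", logit.score(X_test, y_test))

    # A shallow decision tree: slightly weaker, but its rules can be printed
    # and handed to a business stakeholder as-is.
    tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
    print("Decision tree accuracy:", tree.score(X_test, y_test))
    print(export_text(tree, feature_names=list(data.feature_names)))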

The list above is of course not meant to be comprehensive, but suddenly a couple dozen libraries sounds a lot less restrictive, especially as businesses aren't trying to win a Kaggle contest.
Still, as tempting as it may sound to declare the problem solved, there is still a major issue that needs solving: building effective datasets.

A typical Business Analyst will know how to put together a dataset with even a large number of features and clean it decently; however, they will lack some statistical finesse (a rough code sketch of these checks follows the list):

- Are the variables too weakly or too strongly correlated with the target variable?
- Do they need normalization?
- Should you perform one-hot encoding on categorical variables?
- Do they have too much or too little variance?
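
As a rough illustration, these checks might look like the following in Python with pandas; the DataFrame df and the numeric target column are assumptions, standing in for an already-assembled feature table:

    import pandas as pd

    def dataset_checks(df: pd.DataFrame, target: str) -> None:
        """Print quick statistical sanity checks on a feature table."""
        numeric = df.select_dtypes("number").drop(columns=[target], errors="ignore")

        # Correlation with the target: too weak suggests a useless feature,
        # suspiciously strong often hints at target leakage.
        corr = numeric.corrwith(df[target]).abs().sort_values(ascending=False)
        print("Correlation with target:", corr.to_dict())

        # Near-constant features add noise and slow training down.
        variances = numeric.var()
        print("Near-constant features:", list(variances[variances < 1e-3].index))

        # Object/category columns are the usual candidates for one-hot
        # encoding, e.g. via pd.get_dummies(df, columns=[...]).
        cats = df.select_dtypes(["object", "category"]).columns
        print("One-hot encoding candidates:", list(cats))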

While this is currently an issue, two concepts are now being developed that will further simplify this step: assisted modeling and automated model selection.

While the latter is gaining recognition through Google's AutoML initiatives and products like DataRobot, the former was one of the main highlights of Alteryx Inspire 2019.



Assisted Modeling is the counterpart to AutoML: while AutoML is all about testing several models and picking the best-performing one, Assisted Modeling helps with the mundane but extremely necessary data preparation steps needed to properly run any kind of model (roughly sketched in code after this list):

- Assigning the correct data types
- Imputation for missing values
- Removing features that might skew the model or slow it down
- Performing one-hot encoding of variables that need it
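
As a very rough approximation (Assisted Modeling performs these steps through a guided interface, not through code), the same preparation could be sketched with scikit-learn primitives; the column names below are hypothetical placeholders:

    from sklearn.compose import ColumnTransformer
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical column names; deciding which columns are which is what
    # the data type assignment step figures out for you.
    numeric_cols = ["age", "income"]
    categorical_cols = ["region", "product"]

    numeric_prep = Pipeline([
        ("impute", SimpleImputer(strategy="median")),    # fill missing values
        ("variance", VarianceThreshold()),               # drop constant features
    ])
    categorical_prep = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])

    # Route each column group through its own preparation steps.
    prep = ColumnTransformer([
        ("num", numeric_prep, numeric_cols),
        ("cat", categorical_prep, categorical_cols),
    ])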

In the next beta it will also let you select the best model among a Decision Tree, a Logistic Regression and a Random Forest, giving you a performance estimate, although it doesn't really compare with more sophisticated tools.
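
For a flavour of what such a selection step amounts to, here is an illustrative sketch using scikit-learn's cross-validation on a bundled toy dataset; this shows the general technique, not how the Alteryx beta works internally:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    candidates = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "logistic regression": LogisticRegression(max_iter=5000),
        "random forest": RandomForestClassifier(random_state=0),
    }

    # Estimate each model's performance with 5-fold cross-validation
    # and keep the best one.
    scores = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores)
    print("Selected model:", best)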

While veterans of the field might argue that both automated and assisted modeling only take care of the "easy" stuff, assisted modeling is potentially a game changer.

On datasets with a lot of features, crafting and evaluating them is a long, tedious and repetitive job that can and does often take precious time away from modeling and from reaching business objectives. Anything that helps speed up this task is therefore both a huge productivity boost and, for less experienced analysts, a huge help in avoiding pitfalls, turning them into effective Citizen Data Scientists.

Is the sexiest job of the decade being automated away? 
Not exactly; however, in the next couple of years it is likely you will not need a full-fledged Data Scientist to obtain good results in 80% of tasks.
This does not mean that businesses will do away with them; rather, Data Scientists will become more of a specialist role, employed where extracting every last percentage point of performance matters, where exceptionally tricky datasets need assembling, or where hard problems involve unstructured data. In those settings a business analyst will not be able to perform well, even when assisted by advanced analytics tools.

Data Science unicorns will be used sparingly, much as C and Assembly are used in programming: powerful assets where employed, but with the bulk of the work done in a more mundane and industrialized way.
In most other tasks involving machine learning, Citizen Data Scientists will be the slightly lower-performing but more cost-effective alternative.

With the proper advanced analytics support, very little custom optimization is going to be needed and machine learning will be a very powerful tool of the trade but not an arcane skill like it was in the past.

This concludes our article, stay tuned for new content!


