Sunday 20 September 2020

"Full-stack" data scientist: an outdated solution to a disappearing problem

KDnuggets recently published an article arguing that data scientists should be more end-to-end. In theory this is a fine idea: developing a more comprehensive skill set can bring them closer to the fabled "Unicorn Data Scientists" and lead to a highly paid position.



In my opinion, this is a mistake and reflects an outdated concept of the role. This kind of "full-stack" data scientist tends to become a jack of all trades and master of none.
There is already a job title for that: "Business Analyst with knowledge of statistical modelling and R or Python proficiency".
Incidentally, this is often listed as an entry-level position on LinkedIn, while a slightly better paid counterpart is "Python programmer with knowledge of machine learning libraries".

Today, most of the data science stack can easily be handled in a low-code environment, with comparable results for anything that does not involve deep learning.
As for deep learning, GPT-3 is the herald of the AI-via-API era, in which most companies will use very complex models simply by feeding them their unstructured data and getting results back, at prices low enough to make custom development unattractive in the majority of use cases.

Data scientists still have a role today, but it lies in two very specific niches:

1) Academic cutting edge: either in a research position or translating the latest academic papers into working code. These data scientists are employed by companies where every single decimal point of additional performance matters and moves million-dollar decisions every day. This is where coding is needed in addition to statistics.

2) Dataset artisan: at the very opposite end, working closely with data engineers. This role specialises in building high-quality datasets fine-tuned to the business problem, and in finding ways to integrate data from the most disparate sources and translate it into relevant variables. Here, knowledge of the data pipelines is useful, but most of the work can easily be done in a code-less environment.

Most other positions will come under pressure from consulting firms selling fresh graduates as "senior" data scientists (more realistically, Python programmers with a refresher on statistics and some training in the relevant libraries) on one side, and business analysts using low-code, workflow-based solutions on the other.

You might see this as an extreme or unpopular opinion, but Gartner shows otherwise:


There are several products that fit the description, some of which also push AutoML solutions:

- Alteryx is on top by ability to execute thanks to its extreme ease of use. They recently introduced some AutoML features as well. Its extremely versatile tool set allows an analyst to potentially match the output of a small consulting firm.

- SAS is an established contender with a long history. Recent versions are becoming easier for analysts as well, but it still has a long way to go on UX and its costs remain prohibitive.

- IBM SPSS is another well-known product; its Modeler suite enables competent machine learning workflows without coding.

- Tibco Spotfire is a relatively new contender with the ambition of creating a true end-to-end product able to handle everything from data ingestion to advanced visualisation. With some UX improvements and new features they might become a true leader of the pack in a few years.

- Dataiku has also been on a simplification route, although it is strongly focused on machine learning and lacks flexibility in other use cases.

- Similarly, DataRobot aims to make AutoML easy to use, and succeeds at this, although its data preparation capabilities are limited and the high cost makes it a niche product.

- Knime has long been known as a simpler, visual solution in academia and business, with a good data science feature set hampered by very basic data cleaning tools and slower performance.

- RapidMiner, along with Knime, is another fallen leader. It offers a relatively simple interface coupled with good ML capabilities, but lacks versatility when compared to newer solutions.


Unless you're working as a data scientist in a start-up, where staff is always short and the ability to cover other roles is a genuine need, the "jack of all trades" data scientist role is dying:

- Most medium and large companies now have a consolidated data environment, especially after GDPR.

- You will rarely work on new data sources and will mostly rely on an enterprise data warehouse, usually controlled by IT operations.

- Machine learning libraries are now fairly stable for non-deep-learning tasks, so it is hard to differentiate yourself from a consultant; low-code tools can integrate those libraries into a visual workflow with relatively little effort, especially since classification and clustering tend to make up the majority of use cases.

- Deep learning could make a difference, but despite being widely publicised, unstructured data is not easy to integrate into many business cases; in practice it mostly comes down to sentiment analysis, text transcription and similar scenarios, for which APIs are extremely cheap.

- GPT-3 and similar generalist models are also reducing the number of use cases where a custom deep learning setup is justified, especially as GPU-based models are extremely CAPEX-heavy.


This is also based on my own personal experience: in four years we went from a custom R model (consultant-developed), to a Python model integrated into Alteryx (dataset developed internally, model written by consultants), to finally using Alteryx ML tools for everything. With increased data quality control, simple random forests now deliver better results than the old approach, a cutting-edge multi-model stacking algorithm.
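To illustrate how commoditised the modelling step itself has become, here is a minimal sketch of a random-forest classifier in scikit-learn; the bundled iris dataset stands in for a well-curated internal dataset (the dataset choice and hyperparameters here are illustrative, not the setup described above):

```python
# Minimal random-forest sketch with scikit-learn.
# The iris dataset is a stand-in for a clean, internally curated dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# With good input data, default hyperparameters are often competitive.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

The point is that the entire modelling step fits in a handful of lines that any low-code tool can wrap in a visual node; the differentiating work is in the quality of the dataset feeding it.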

For unstructured data we rely on third-party APIs, both for common use cases like sentiment analysis and for more specialised ones, at low cost and with results that are more than sufficient for the business needs, although we are able to develop our own solutions internally if needed.
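The integration effort for such an API is typically a single HTTP call. A hedged sketch, using only the standard library; the endpoint URL, payload shape and header names below are illustrative assumptions, not any real provider's contract:

```python
# Hypothetical sentiment-analysis API call; the endpoint and payload
# shape are placeholders, not a real provider's API.
import json
import urllib.request

API_URL = "https://api.example.com/v1/sentiment"  # placeholder endpoint


def build_request(texts, api_key):
    """Package a batch of documents as a JSON POST request."""
    payload = json.dumps({"documents": texts}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_request(["Great product", "Terrible support"], api_key="...")
# urllib.request.urlopen(req) would then return per-document sentiment
# scores from the provider; no model development happens in-house.
```

This is the whole "deep learning" footprint for many business cases: serialise the text, send it, read back scores.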

All in all: unless you work in a start-up and plan to take a management role later on, specialise or be marginalised; that's what the market demands today.

An earlier version of this article was published by me on Linkedin: https://www.linkedin.com/pulse/mistake-full-stack-data-scientists-marco-zara/?trackingId=7WgX4J3MTzYEI2nNTlKfEA%3D%3D