Wednesday, 2 January 2019

5 Myths about Data Scientists

Hello and welcome to Open Citizen Data Science!

It's 2019, and while Data Science is becoming better understood by businesses, there are still many myths about what a Data Scientist can do.

Do you remember the movies of the 1980s and early 1990s that featured programmers or hackers? They were portrayed as magicians who could do the impossible as soon as a computer was handed to them.

Likewise, today we live in an era where Machine Learning is enabling self-driving cars and voice assistants, making some of the myths of the 1980s closer to reality:

The old Knight Rider series is getting close to reality: "Alexa, drive my car here" is a plausible scenario today

Let's start with a personal note:

I started working with Data Scientists relatively recently, in 2016.
The company I work for had tried machine learning before, with mixed results that were very hard to use: the process was fully outsourced, and communication delays meant the predictions arrived far too late to be useful.

As a Business Analyst I was assigned to support external consultants in a project to insource at least part of the process in order to make the predictions fast enough to be useful.

Back then (not so long ago!), at least here in Italy, "Data Scientist" was a fairly obscure title and the claims made about their prowess were rather outlandish: "Give us your data, we'll give you the future".

This was not surprising, because this is how a Data Scientist was sold:

Don't worry, our data wizards will solve your problems!
Today this is a bit less prevalent, but in many ways only because the buzzwords have changed, which brings us to our first myth.

Myth 1: Data Scientists use complex and hard to explain Deep Learning "AI" to solve problems

How often have you heard about a "Deep Learning" model or "AI" being used to beat humans at some task or to create content, usually in ways that are impossibly complicated to explain?

Personally, I have lost count of the start-ups that claim to solve every problem with a slightly different flavor of Neural Network, only for their offer to turn out to apply mostly to use cases that are pretty niche for businesses whose core competencies do not revolve around data collection.

Reality: Most business cases are focused on regression or classification of structured data

From a business management perspective, the prevalent needs and use cases tend to fall into roughly three categories:

- Increasing sales
- Reducing costs
- Protecting the customer base

Most of the available data usually comes from orders, billing and CRM systems, where it is mostly accessible either through relational databases of some SQL flavor or in a bunch of text files and Excel tables.
The business also needs to be able to actually act on the results, which leads to a requirement for the model to be explainable, something Deep Learning is not good at and where a Decision Tree or a simple Logistic Regression excels.

This also brings us to our second myth.

Myth 2: The most important skill of a Data Scientist is the ability to create powerful models

Still too often today, what is sold is a "customized model that will bring your business a competitive advantage" and what is delivered is a bunch of R or Python code that has to be maintained, earning consulting firms a steady revenue stream while the business remains convinced that it's too complex to implement internally.

Reality: Most Data Science projects are done using open Machine Learning libraries

What is really inside the code is usually 90% data wrangling to assemble datasets and create variables, with relatively few lines calling one or a few Machine Learning libraries to train a model and then score it on test or production data.

When taken away from the code and explained in a logical way, the process will sound awfully familiar to many analysts, who have often been assembling the same data for analysis for years. This leads to our third myth.

Myth 3: Every Data Science project requires advanced Statistics and Machine Learning skills

First of all, a cautionary note: a Data Science project will not be successful without knowing at least some statistics. Medians, quartiles, skewness and distribution analysis in general are a must, as is knowing how to properly create training and test sets.
Likewise, not knowing which models to apply to a use case will just result in wasted time and computing power.
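To give an idea of how basic these checks can be, here is a minimal sketch in Python using only the standard library. The function names, the made-up order values and the 25% test fraction are assumptions chosen for illustration, not a recommended recipe.

```python
import random
import statistics


def describe(values):
    """Return the median, first and third quartiles and population skewness."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    # Population skewness: average cubed z-score (0 for a symmetric distribution).
    skew = sum(((v - mean) / stdev) ** 3 for v in values) / len(values) if stdev else 0.0
    return {"median": statistics.median(values), "q1": q1, "q3": q3, "skewness": skew}


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical order values with one large outlier.
order_values = [12, 15, 14, 13, 16, 15, 14, 250]
stats = describe(order_values)
train, test = train_test_split(order_values)
```

A strongly positive skewness next to a small median, as in this toy data, is the kind of signal that should prompt a closer look at outliers before any model is trained.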

That said, too often what is being sold as a Data Scientist is a CS graduate who took some Machine Learning classes or obtained a certification somewhere.
Most of the time, they will get the job done and the customer will be at least temporarily satisfied.

Reality: Data Preparation matters

What will differentiate a skilled Data Scientist is the amount of time spent analyzing and preparing the data itself.
Most inexperienced practitioners will focus on quickly assembling a dataset and trying the latest (or most familiar) models, while more skilled ones will often claim to be using a "Kaggle-winner" system.

This will often bring results in the short term, even good ones. Most Machine Learning libraries are very good at extracting correlations from a dataset and will deliver decent results even with sub-optimal data.

Again, this results in a happy customer at the delivery date, but the difference is usually seen within a few months, as the model's performance starts to become unstable in a production environment.
When called in for maintenance, an inexperienced practitioner will have trouble finding the root cause, and most fixes will be temporary, if successful at all.

What makes the difference then? 

The least appreciated part of the job, yet the most important


An experienced Data Scientist will make sure the data is always consistent and will remove variables that are too closely correlated with each other or have too little variance in their distribution.
This takes a lot of time, of course, and makes the business side uncomfortable with the apparent lack of results for most of the project, especially as, too often, a less careful practitioner would have started churning out test scores pretty early.
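As a rough illustration of that pruning step, the sketch below drops near-constant columns and then one column of each highly correlated pair, using only the Python standard library. The column names and both thresholds are hypothetical, and variance thresholds in particular depend on the scale of each variable.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def select_variables(columns, min_variance=1e-6, max_correlation=0.95):
    """Drop near-constant columns, then one of each highly correlated pair."""
    kept = {}
    for name, values in columns.items():
        if statistics.pvariance(values) < min_variance:
            continue  # too little variance to carry information
        if any(abs(pearson(values, other)) > max_correlation for other in kept.values()):
            continue  # nearly duplicates a column that was already kept
        kept[name] = values
    return list(kept)


# Hypothetical customer variables.
columns = {
    "monthly_spend": [10, 20, 30, 40, 50],
    "yearly_spend": [120, 240, 360, 480, 600],  # just 12 x monthly_spend
    "country_flag": [1, 1, 1, 1, 1],            # constant
    "support_calls": [3, 0, 4, 1, 2],
}
selected = select_variables(columns)
```

Which column of a correlated pair survives depends here only on the order of the columns; in practice that choice is better made by someone who knows what the fields mean.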

Myth 4: Data Scientists can manage a project alone effectively

The first approach to Data Science by a business is often to prepare a bunch of data, encrypt the sensitive parts and hand it to an external team, expecting results within a few months.
Usually there is even a success fee involved "to make sure the business partner will give their best effort".

This is usually paired with a "Proof of Concept" approach, which is very attractive because it's relatively cheap and doesn't tie up valuable internal staff, who are often too busy with the daily business to effectively follow a complex project.

The approach seems great at first: the only effort required is usually a few e-mails giving a brief introduction to the data, and the status update meetings are filled with enticing presentations about the data itself, with excellent results anticipated.

Towards the end, accuracy scores are presented as the result and everything seems to be great!
Managers are happy, anticipating being hailed as "innovators on a small budget", and the partners are probably already counting the success fee.

Sometimes this works well, but often there are unexpected turns: maybe the great score is tied to an error in data extraction (some system may be "leaking" current data into what are supposed to be past datasets), or the results turn out not to be actionable because their conditions are too complex.

Reality: Domain expertise and team work are the key to success

The "Lone wolf" approach tends to be sub-optimal for a simple reason: Data Science usually involves much more data than a BI team normally needs, and it often puts a strain on IT systems.

That bunch of data being handed over tends to come with a few embedded questions:

- Has it been properly historicized, or is the system giving you current data?
- Is the data consistent? Did the fields change with time or have they been replaced by data from a different place?
- Are field names meaningful? Is data properly explained?
- Can all the tables be linked easily?
- Is there any data reclassification needed?
- Can this amount of data be provided by IT at regular intervals and fast enough to be used?
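Some of these questions can be partly automated. The sketch below, in plain Python, flags records that were captured after the date the model is supposed to predict from (a sign that the extract is not properly historicized) and lists fields that differ between two extracts. The field names `recorded_on`, `status` and `state` and the cutoff date are hypothetical.

```python
from datetime import date


def find_leaky_rows(rows, cutoff):
    """Return rows whose data was recorded after the prediction cutoff date."""
    return [row for row in rows if row["recorded_on"] > cutoff]


def fields_changed(old_rows, new_rows):
    """Return field names present in one extract but not in the other."""
    old_fields = set().union(*(row.keys() for row in old_rows))
    new_fields = set().union(*(row.keys() for row in new_rows))
    return sorted(old_fields ^ new_fields)


# Hypothetical extract that is supposed to describe customers as of end of 2018.
rows = [
    {"customer_id": 1, "recorded_on": date(2018, 11, 30), "status": "active"},
    {"customer_id": 2, "recorded_on": date(2019, 1, 15), "status": "churned"},
]
leaky = find_leaky_rows(rows, cutoff=date(2018, 12, 31))

# Hypothetical newer extract where a field was renamed.
newer_rows = [{"customer_id": 1, "recorded_on": date(2019, 2, 1), "state": "active"}]
changed = fields_changed(rows, newer_rows)
```

Checks like these only surface candidates; whether a late record or a renamed field is actually a problem still has to be answered by whoever owns the source system.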

An external Data Science team will not be able to properly diagnose all of this on its own, so someone inside the business is needed to check the validity of the data handed over at the start.
Not only that: during statistical analysis correlations will be found, but the p-value alone will not be enough.

Only someone with domain expertise (someone who uses the data daily) can answer "Does this correlation make business sense?" and "Can we use the results obtained in this way?"

Myth 5: Data Science is something only large firms that can afford expensive teams can use

Without a doubt, Data Scientists can be expensive, as demand is still high and certified personnel are scarce.
Someone who can manage large-scale projects will cost as much as a senior developer, and even more junior positions will cost at least 20% more than a Business Analyst of similar seniority.

Still, the skillset needed for Data Science is not something new in itself but rather a mix of Statistical, IT and Business skills.
Your business may not have an individual with all the required skills, but it might be able to find someone enthusiastic about analysis who could effectively do Machine Learning with a little investment in training, or to assemble a team with the needed skill mix.

Reality: Advanced Analytics software can help reduce the skill gap

Senior Data Scientists may find this controversial, but commercial advanced analytics solutions have become mature enough to greatly simplify the management of the most common Data Science use cases.

With enough choice for different needs and budgets, these tools can often fill the gap
From Data Preparation to model deployment, the right software can often manage most aspects of a project in a nearly code-less environment, leaving more time to focus on the business problem and allowing contributions from people with great domain expertise but little to no coding skills.

Depending on the business needs and budget, they range from free (KNIME and RapidMiner, for example, have free options for small volumes of data) to products with a dazzling array of features aimed at large firms with substantial Data Science teams.

You can find excellent solutions for smaller businesses in the 3-5K ($/€/£) per year range, often with features that could also improve a BI team's productivity.

Conclusion: Different skills for different needs

Would you hire a Picasso-level painter to redecorate your house?
Unlikely, unless you had an unlimited budget.
In the same way, not every Data Science project needs a superstar-level engineer.

It is possible for a business to grow a Domain Expert into an in-house Data Scientist who, with some limitations and with the right team and software support, will bring sustainable results at realistic costs.

This role is starting to become relevant: it's the Citizen Data Scientist.


