Sunday 4 October 2020

Accessible AI: Elon Musk is focusing on the wrong issue

Recently, Elon Musk stated that licensing GPT-3 exclusively to Microsoft is a mistake.

This, of course, gained general media attention, and both Microsoft and OpenAI are being criticised for the decision, with the general mood being that exclusivity would harm AI's accessibility.


Broad access to artificial intelligence is of course a very important issue and one of the aims of the OpenAI consortium. However, accessibility is a very broad concept, and ironically Microsoft's exclusivity could in many ways make GPT-3 more accessible than it would have been otherwise.

Counter-intuitive? Indeed it is. Let's explore some of the limits of GPT-3 and OpenAI's accessibility:

GPT-3 was never going to be accessible to everyone:
OpenAI publicly states that potential users are subject to a pre-approval process.

GPT-3's training process is not fully reproducible:
Yes, you can see the code on GitHub and the paper describes the data sources, but anyone who works in data science knows very well that the data cleaning and optimisation process has a profound impact on model performance, and that process is not publicly available.

GPT-3's deep learning training cost is beyond the means of most companies:
With the training cost alone estimated between $4M and $12M, it's obviously priced well above what most companies can afford. Renting the API is going to be relatively affordable, but there are limits on control and customisation.

Finally, GPT-3 has an extremely limited ability to explain its decision process, just like any other deep learning-based system:
Ironically, Elon himself warned about the dangers of unrestricted AI, but he seems blind to the issues of a black-box system.

This is not meant as criticism of the validity of OpenAI's achievement, which is indeed technically impressive, but deep learning is clearly an increasingly capital-intensive road to AI, with serious limitations on transparency.

Truly accessible AI is still an unsolved problem that lacks a clear framework:
No commercial entity is likely to develop a truly transparent and easily reproducible process on its own, for obvious reasons (among them the difficulty of keeping a profit margin on something easily copied). The development of accessible, transparent AI would instead require broad, interdisciplinary research from academia to define a theoretical base, along with advances in both transparent machine learning and AI ethics.

While this article doesn't claim to have a solution, we can at least try to point at some very generic minimum requirements for a truly accessible artificial intelligence.

Truly accessible AI must be transparent in its decisions:
This is probably the most technically challenging point. The current state of the art forces a decision maker to choose between performance (especially on non-structured data) and transparency.
While there is on-going research on making deep learning decisions easier to explain, it still lacks the clarity of older classifiers such as decision trees.
The latter are much weaker in performance (and are nowadays used mostly in ensembles), especially on non-structured data, so further research is needed to find a higher-performing, transparent alternative.
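To make the performance-versus-transparency trade-off concrete, here is a minimal sketch of what transparency looks like on the decision tree side: the full decision logic can be printed as human-readable rules, something no large deep learning model offers. The dataset is a standard toy example, used purely for illustration.

```python
# Sketch: a decision tree's logic can be dumped as explicit threshold rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Every prediction can be traced to a chain of human-readable checks.
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Nothing comparable exists for a 175-billion-parameter network, which is exactly the gap the research above would need to close.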

Training set, data sources and any data treatment must be accessible:
Just like humans, machine learning algorithms are strongly influenced by the quality of the training set, and this translates into inheriting both the conscious and unconscious biases of whoever creates it. Xiaoice vs. Zo is an excellent example of how training data influences AI behaviour.
To compensate for this issue, the entire data treatment process must be transparent to ensure that any potential bias and error in generating training variables can be detected and fixed by the broadest possible audience.
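As a hedged sketch of the kind of audit a transparent data pipeline enables, consider checking whether the target label is balanced across a sensitive attribute; the column names and data here are entirely hypothetical.

```python
# Sketch: a basic bias check that only works if the training data is open.
import pandas as pd

train = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "label":  [1, 0, 0, 1, 1, 1],
})

# Label rate per group; a large gap hints at a bias the model may inherit.
rate_by_group = train.groupby("gender")["label"].mean()
print(rate_by_group)
```

Simple as it is, this check is impossible for an outside auditor when the training set and its treatment are kept private.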

The algorithms must be trainable on consumer or prosumer-grade hardware:
Although this might seem a controversial point, if training an AI requires a multi-million dollar investment in hardware, then its use will always be restricted to the small circle of entities able to afford the cost, with everyone else forced to simply rent it.
Limiting the hardware requirements to what can be obtained with a HEDT platform ensures that an effective AI could be trained and customised by smaller academic institutions and businesses, creating a fair competitive environment for all actors involved.

Ideally, as much of the process as possible should happen in a low-code environment:
While this is not strictly a requirement, the use of low-code principles ensures that a much broader audience can both audit and influence an AI training process.
Making the logical process flow accessible to non-coders would give domain experts direct access to the AI development process, instead of having them lay down a list of requirements and effectively delegate the implementation to a programmer or data scientist.
Microsoft is offering an alternative solution through its machine teaching framework, but further simplification can be obtained with existing software.
Tools like Knime or Alteryx can be used to create a visual logical flow and limit code to the necessary machine learning libraries, instead of coding the entire ETL process.
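To illustrate how little code remains once the visual workflow handles the ETL, here is a sketch of the kind of script a code node in such a tool might contain: it receives an already-clean table from the upstream visual nodes and only fits the model. The column names and data are illustrative assumptions, not from any real workflow.

```python
# Sketch: the only code left when ETL is done in visual nodes upstream.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_node(clean_table: pd.DataFrame) -> RandomForestClassifier:
    """Code node: joins and cleaning already happened in the visual flow."""
    features = clean_table.drop(columns=["churned"])
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(features, clean_table["churned"])
    return model

# Example input as it would arrive from the visual ETL nodes.
table = pd.DataFrame({"tenure": [1, 30, 2, 45],
                      "spend": [10.0, 90.0, 5.0, 120.0],
                      "churned": [1, 0, 1, 0]})
model = train_node(table)
print(model.predict(table.drop(columns=["churned"])))
```

Everything a non-coder needs to audit — data sources, joins, filters — stays visible in the visual flow; only the final modelling step is code.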

All in all, Elon Musk might be right to be concerned about GPT-3's accessibility, but no matter how powerful GPT-3 is, it's not the accessible AI we're looking for.


Sunday 20 September 2020

"Full-stack" data scientist: an outdated solution to a disappearing problem

KDnuggets recently published an article arguing that data scientists should be more end-to-end: in theory this is a fine idea, as developing a more comprehensive skill set can bring them closer to the "unicorn data scientist" and lead to a highly paid position.



In my opinion, this is a mistake and reflects an outdated concept of the role. This kind of "full-stack" data scientist tends to become a jack of all trades and master of none.
There is already a job title for those: "Business Analyst with knowledge of statistical modelling and R or Python proficiency".
By the way, this is often seen as an entry-level position on LinkedIn, while a slightly better paid counterpart is "Python programmer with knowledge of machine learning libraries".

Today, most of the data science stack can easily be handled in a low-code environment with similar results for anything unrelated to deep learning.
As for deep learning, GPT-3 is the herald of the AI-via-API era, where most companies will use very complex models simply by feeding them their non-structured data and getting results in return, with prices low enough to make custom development unattractive in the majority of use cases.

Data scientists have a role today but it's in two very specific niches:

1) Academic cutting edge: either in a research position or translating the latest academic papers into working code, they are employed in companies where every single decimal point of additional performance matters and moves million-dollar decisions every day. This is where coding is needed in addition to statistics.

2) Dataset artisan: at the very opposite end, working closely with data engineers, this role specialises in building high-quality datasets that are fine-tuned to the business problem, finding ways to integrate data from the most disparate sources and translating them into relevant variables. Here, knowledge of data pipelines is useful, but most of the work can easily be done in a code-less environment.

Most other positions will be under pressure from consulting firms selling fresh graduates as "senior" data scientists (more realistically, Python programmers with a refresher on statistics and some training in the relevant libraries) on one side, and business analysts using low-code, workflow-based solutions on the other.

You might see this as an extreme or unpopular opinion, but Gartner shows otherwise:


There are several products that fit the description, with some pushing AutoML solutions at the same time:

- Alteryx is on top by ability to execute thanks to its extreme ease of use. They recently introduced some AutoML features as well. Its extremely versatile tool set allows an analyst to potentially match the output of a small consulting firm.

- SAS is an established contender with a long history. Recent versions are starting to become easier for analysts as well, but it has a long way to go in UX and still comes at prohibitive costs.

- IBM SPSS is another well-known product, their modeler suite enables competent machine learning workflows without coding

- Tibco Spotfire is a relatively new contender with the ambition of creating a true end-to-end product able to handle everything from data ingestion to advanced visualisation. With some UX improvements and new features they might become a true leader of the pack in a few years.

- Dataiku has also been on a simplification route, although it is strongly focused on machine learning, lacking flexibility on other use cases

- Similarly, Datarobot aims to make autoML easy to use and is successful on this, although its data preparation capabilities are limited and the high cost makes it a niche product

- Knime has long been known as a simpler, visual solution in academia and business, with a good feature set for data science hampered by very basic data cleaning tools and slower performance

- Rapidminer along with Knime is another fallen leader. It offers a relatively simple interface coupled with good ML capabilities but lacks in versatility when compared to newer solutions


Unless you're working as a Data Scientist in a start-up where personnel is always short and being able to cover other roles is a specific need, the "jack of all trades" data scientist role is dying:

- Most medium and large companies now have a consolidated data environment, especially after GDPR.

- You will rarely be working on new data sources and will be mostly relying on an enterprise data warehouse, usually controlled by IT operations.

- Machine learning libraries are now pretty stable for non-deep learning tasks, so it's hard to differentiate yourself from a consultant, and low-code tools can integrate those libraries into a visual workflow with relatively little effort, especially as classification and clustering tend to make up the majority of use cases.

- Deep learning could make a difference, but despite being widely publicised, non-structured data is not easy to integrate with many business cases, and it's often mostly about sentiment analysis, text transcription and similar scenarios, for which APIs are extremely cheap.

- GPT-3 and similar generalist models are also reducing the number of use-cases where a custom deep learning setup is justified, especially as GPU-based models are extremely CAPEX-heavy.
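The point about stable libraries is easy to demonstrate: a standard clustering task, one of the most common business use cases, is a handful of lines that any trained analyst can reproduce, leaving little room to differentiate on code alone. The data below is synthetic, standing in for something like customer spend segments.

```python
# Sketch: standard clustering in a few stable, library-provided lines.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two obvious synthetic customer groups, e.g. low and high spenders.
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(5, 0.5, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_[:5], model.labels_[-5:])
```

When this is the whole modelling step, the value shifts to the data preparation around it — which is exactly what low-code tools absorb.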


This is also based on my own personal experience: in 4 years we went from a custom R model (consultant-developed), to integrating a Python model into Alteryx (dataset developed internally, model written by consultants), to finally using Alteryx's ML tools for everything, with increased data quality control allowing simple random forests to deliver better results than the old approach, a cutting-edge, complex multi-model stacking algorithm.

For non-structured data we rely on third party APIs, both for common use cases like sentiment analysis and more specialised ones, with low costs and results that are more than sufficient for the business needs, although if needed we're able to develop our own solutions internally.

All in all: unless you work in a start-up and plan to take a management role later on, specialise or be marginalised. That's what the market demands today.

An earlier version of this article was published by me on Linkedin: https://www.linkedin.com/pulse/mistake-full-stack-data-scientists-marco-zara/?trackingId=7WgX4J3MTzYEI2nNTlKfEA%3D%3D

Wednesday 26 August 2020

Applied AI: The future is low-code

Hello and welcome to Open Citizen Data Science!

2020 is still far from over, yet we've already seen two major NLP releases: Microsoft Turing-NLG and OpenAI GPT3.
Both represent the apex of natural language processing models and apply deep learning on a massive scale, with 17 and 175 billion parameters respectively. In some ways, this also marks the end of an age where a start-up could easily come up with competitive AI.


While it's still too early to compare the models and their benefits over older iterations, one thing is clear: the industry is placing its bets on ever-more massive deep learning models that require capital investments in the order of millions of dollars or more to train and deploy, supported by data-center-scale hardware that fewer and fewer companies can afford.

These models are also growing in terms of general capabilities, covering an increasing number of use cases. This means that smaller start-ups can only compete in relatively small niches of highly specialised data, at least until a niche becomes attractive to one of the big players and is integrated into their AI ecosystem.

On the other side, while this makes creating a successful AI startup much harder, it greatly simplifies the deployment of applied projects by replacing custom developments with AI as a service.

Cognitive API services: a low-cost project enabler

Even just 5 years ago, developing a machine learning project was still a complex and often risky endeavour: while there were plenty of R and Python libraries available for free, most of the work was still programming-related, which meant lots of custom code that ended up hard to modify and integrate into the business environment.

Today, the possibilities offered by advanced analytics software and cloud-based APIs mean that as long as you have access to the data, you can develop your own AI projects with skill requirements similar to what is expected of an advanced Excel/Access user: lots of formulas and a bit of extra scripting.


5 years ago, developing a machine learning-based project would easily take 6 months and $100k or more if you employed consultants.
Today, if you've got the data available from the start you can drastically shorten development time so that an equivalent use case would take 2 months or less and could be mostly or completely developed internally by skilled analysts.

What about costs? Let's take as example a typical churn prediction project.

Hardware: a workstation powerful enough to run non-deep learning algorithms (so no GPU required) can be obtained for approximately $5000 (assuming an i9- or Ryzen 9-based configuration with 128GB RAM; double that for Xeon- and Threadripper-based configurations with 256GB RAM). We can also estimate $1000-2000 for set-up and back-up infrastructure costs.

Advanced analytics: approximately $5000 per year, but your mileage may vary depending on your choice.

APIs: cost here is volume-based. For a proof of concept you might even manage to stay within the free tier of many services, while for a reasonably sized project you can expect to spend from $1k to $10k per year for each service. Let's assume you will need 3 different APIs at $4k each.

This means that the same result or better than 5 years ago will take approximately one-third or less of the time and cost, leaving budget for other projects or plenty of money to expand the scope of the current ones.
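Tallying the estimates above makes the comparison explicit; this is a rough sketch using the article's own figures (taking the midpoint of the set-up range).

```python
# Rough tally of the cost estimates listed above.
hardware_one_off = 5_000 + 1_500     # workstation + midpoint of set-up/back-up
analytics_per_year = 5_000           # advanced analytics licence
apis_per_year = 3 * 4_000            # three API services at ~$4k each

first_year_total = hardware_one_off + analytics_per_year + apis_per_year
print(first_year_total)  # well under the ~$100k consultant-driven baseline
```

Even with generous rounding, the first-year total lands around a quarter of the old $100k figure, and subsequent years drop the one-off hardware cost entirely.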

You can put together a prototype in a week or less (meaning that even using consultants costs will be low) and build it in a modular way, keeping costs under control.

What are the downsides of this kind of architecture?



While simplicity and low set-up costs are strong advantages, you must exercise some caution before basing mission-critical services on an API-based architecture.

First of all, while set-up is extremely cheap, costs are volume-based.
There are usually discounts, but as requests grow it's easy to go from a few thousand dollars to tens of thousands or more. While this is usually competitive with on-premise costs, it becomes harder to justify to management, who are now exposed to expenses that would normally fall under the IT budget in an on-premise project.

Another issue is unexpected downtime. While this is usually a minor issue if you're using one of the major players (Azure, AWS or Google), smaller companies offering niche or low-cost services may have the occasional reliability issue. For example, the recent troubles in Belarus (eastern Europe and the Baltic countries are home to many extremely interesting services) caused some downtime at a few companies based there.
Outages are usually fixed within hours to days, so if you're working in batches it's less of an issue, but it will translate into your own downtime.

Finally, remember you're renting a service: this means you're subject to terms of use and pricing changes, along with the remote possibility of the service itself being closed. While it's easy to change to a similar service, it can bring a temporary disruption to your projects.
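One common mitigation for the vendor risks above is to hide each third-party API behind a small internal interface, so a pricing change or shutdown means swapping one adapter rather than reworking the project. The provider class and return values below are hypothetical placeholders, not any real vendor's API.

```python
# Sketch: an internal interface isolating business logic from any one vendor.
from abc import ABC, abstractmethod

class SentimentProvider(ABC):
    @abstractmethod
    def score(self, text: str) -> float:
        """Return a sentiment score in [-1, 1]."""

class VendorA(SentimentProvider):
    def score(self, text: str) -> float:
        # A real implementation would call the vendor's REST endpoint here;
        # this placeholder just returns a neutral score.
        return 0.0

def analyse(texts: list[str], provider: SentimentProvider) -> list[float]:
    # Business logic depends only on the interface, never on a vendor.
    return [provider.score(t) for t in texts]

print(analyse(["great service", "slow delivery"], VendorA()))
```

Switching vendors then becomes writing one new adapter class, keeping the "temporary disruption" mentioned above to a minimum.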

Conclusion: my personal experience

I started experimenting with this kind of hybrid environment last year and recently developed a business project involving 3 different APIs that bring in external data and analyse them for sentiment and semantic affinity, enabling significant marketing campaign optimisations with some reuse of existing hardware and advanced analytics assets.

Setup was done in a few days and yearly costs are projected to stay under $10k, so I can attest to the increased agility. New features can be added at an extremely low cost. However, it's not without issues: using several different service providers makes for a complex purchasing structure, which makes keeping a tight timing on project start a challenge, or pushes you to rely on an intermediary to abstract the process away at an added (albeit not significant) cost.

I can definitely recommend this kind of process for a batch-processing project, it has the potential to be used on near-real time but that's an avenue in need of further exploration.

This concludes our article, stay tuned for new content!

Saturday 15 August 2020

Big data gaming: the paradigm shift has come

 Hello and welcome to Open Citizen Data Science!

While gaming usually isn't something easily associated with Data Science, 2020 is bringing huge changes in a business that is still often seen as child's play.


Data Science isn't just numbers for a presentation anymore

The gaming world right now has most eyes fixed on the console generation change; however, while it will be a significant technological jump with vastly improved performance (and also a focus on faster data access), that's an evolutionary change rather than a revolution.

The real change is instead coming from Flight Simulator 2020, a relatively niche game, from a definitely not niche company: Microsoft.
The genre and the franchise are not new and have been around for the better part of the last 40 years (the first Flight Simulator is from 1982), but the latest edition brings something no game has ever done before: a unique blend of real-world data and AI content generation.

Big Data and data science for content generation

Flight Simulator 2020 is not the first game to use real-world map data, nor the first to use AI to generate gaming environments. Procedural content generation is nothing new and has been used to create entire universes in real time before.

The difference here is in the blend of several data sources to reproduce the world as we know it in a simulated environment instead of an abstraction. How has this been possible?

Through data, made available in extreme amounts and from different sources and using many parts of the data science stack as we know it.

Flight Simulator's world generation has its foundation on Bing Maps data: over 2 petabytes of data that are accessed in real-time from the cloud as the players fly through the world.
This allows a high degree of fidelity in recreating city and terrain layouts and is the first layer of real-world data used, while a second layer comes from satellite and fly-by pictures, giving access to photogrammetry data.

The blend of those two sources is fed to an Azure machine learning environment, which is used to classify buildings, landmarks, and environment types from trees to building materials, which in turn generates multiple terabytes of textures and height map data.

Strategic partnerships for external data and AI optimisations

This isn't the only source available to the game. To ensure maximum fidelity, Microsoft partnered with other companies, employing specific AI and real-time data optimisations.

Blackshark.ai provided the algorithms for content generation, turning raw data into objects:


Through deep learning, content is generated on Azure servers and streamed in real time to the user, recreating every single building and object detected via Bing Maps as realistically as possible, literally enabling players to find their own home in the game if they wish.

Meteoblue provides real-time weather data:


This allows for an extremely accurate weather simulation, so that if it's raining in a particular location the player will experience the same condition in-game (with the option of custom weather as well).
They also provide a very transparent weather forecast, where you can see their averaged model or what has been predicted by any single weather model for any location.

Finally, VATSIM is used to provide real-time air traffic control feedback:


This will allow for realistic ATC feedback in the game, making the experience even more immersive.

Raising the bar and creating new business cases

For all its AI prowess, the game is of course not a 100% faithful reproduction of the real world and especially in smaller locations the algorithms are going to fail to properly recreate the environment.
Microsoft itself acknowledges this and several airports have been manually optimised in order to ensure maximum realism. 

However, this represents a major shift in expectations: instead of navigating a virtual world and looking for similarities with the real world, players will be able to get in and look for differences, while the general world is going to be an accurate representation of reality. 

This enables use cases that were not imaginable before: while simulators have always been used for training, one can easily imagine such software being employed by travel agencies to give previews of tourist destinations: an immersive view of the world with a freedom never possible before (and something that 20 years ago would have been a dream, in a world of QuickTime-powered software that tried to recreate virtual visits).

Similar scenarios could be applied in logistics (real-time traffic data is available for route optimisation) and training, and I'm sure our readers could imagine something closer to their own domain.

Personally, I will await with trepidation more examples of what we could call Big Data Gaming, both as an analyst and as a gamer. I haven't seen something this revolutionary since the advent of 3D accelerators, which are now also the GPUs fueling the deep learning algorithms of most major players.

This concludes our article, stay tuned for new content!