Wednesday 26 August 2020

Applied AI: The future is low-code

Hello and welcome to Open Citizen Data Science!

2020 is still far from over, yet we've already seen two major NLP releases: Microsoft's Turing-NLG and OpenAI's GPT-3.
Both represent the apex of natural language processing models and apply deep learning at massive scale, with 17 and 175 billion parameters respectively. In some ways, they also mark the end of an era in which a start-up could easily come up with competitive AI.


While it's still too early to compare the models and their benefits over older iterations, one thing is clear: the industry is placing its bets on ever-more massive deep learning models that require capital investments in the order of millions of dollars or more to train and deploy, supported by data-centre-scale hardware that fewer and fewer companies can afford.

These models are also growing in general capability, covering an increasing number of use cases. This means that smaller start-ups can only compete in relatively small niches of highly specialised data, at least until a niche becomes attractive enough for one of the big players to integrate it into their AI ecosystem.

On the other hand, while this makes creating a successful AI start-up much harder, it greatly simplifies the deployment of applied projects by replacing custom development with AI as a service.

Cognitive API services: a low-cost project enabler

Even just five years ago, developing a machine learning project was a complex and often risky endeavour: while there were plenty of free R and Python libraries available, most of the work was still programming, which meant lots of custom code that ended up hard to modify and hard to integrate into the business environment.

Today, the possibilities offered by advanced analytics software and cloud-based APIs mean that, as long as you have access to the data, you can develop your own AI projects with skill requirements similar to those expected of an advanced Excel/Access user: lots of formulas and a bit of extra scripting.
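To give a sense of how little code "a bit of extra scripting" can mean, here is a minimal sketch of scoring text through a cloud sentiment API. The endpoint, key and response fields are hypothetical placeholders rather than any specific vendor's API, so adapt them to the service you actually subscribe to.

```python
# Minimal sketch: scoring customer comments through a cloud NLP API.
# The endpoint, key and response fields are hypothetical placeholders.
import requests

API_URL = "https://api.example-cognitive.com/v1/sentiment"  # hypothetical endpoint
API_KEY = "your-api-key-here"

def score_sentiment(texts):
    """Send a batch of texts and return one sentiment score per text."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"documents": texts},
        timeout=30,
    )
    response.raise_for_status()
    return [doc["score"] for doc in response.json()["documents"]]

comments = ["Great support, solved my issue quickly.",
            "The latest invoice was wrong again."]
print(score_sentiment(comments))
```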


Five years ago, developing a machine learning project would easily take six months and $100k or more if you employed consultants.
Today, if the data is available from the start, development time shrinks dramatically: an equivalent use case takes two months or less and can be mostly or entirely developed in-house by skilled analysts.

What about costs? Let's take a typical churn prediction project as an example.

Hardware: a workstation powerful enough to run non-deep-learning algorithms (so no GPU required) can be obtained for approximately $5000 (assuming an i9- or Ryzen 9-based configuration with 128 GB RAM; double that for Xeon- or Threadripper-based configurations with 256 GB RAM). Add an estimated $1000-2000 for set-up and back-up infrastructure.

Advanced analytics: approximately $5000 per year, though your mileage may vary depending on the software you choose.

APIs: costs here are volume-based. For a proof of concept you might even manage to stay within the free tier of many services, while for a reasonably sized project you can expect to spend from $1k to $10k per year per service; let's assume you will need three different APIs at $4k each.

This means that the same result as five years ago, or better, will take roughly one-third or less of the time and cost, leaving budget for other projects or plenty of money to expand the scope of the current ones.
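To make the comparison concrete, here is a quick back-of-the-envelope calculation of the first-year budget using the figures above (taking the mid-point of the infrastructure estimate):

```python
# Back-of-the-envelope first-year budget using the figures discussed above.
hardware = 5000           # workstation, non-deep-learning workloads
infrastructure = 1500     # mid-point of the $1000-2000 set-up/back-up estimate
analytics_licence = 5000  # advanced analytics software, per year
api_services = 3 * 4000   # three volume-based APIs at roughly $4k/year each

first_year_total = hardware + infrastructure + analytics_licence + api_services
print(f"First-year total: ${first_year_total:,}")                          # $23,500
print(f"Share of the old $100k budget: {first_year_total / 100_000:.0%}")  # 24%
```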

You can put together a prototype in a week or less (meaning that even if you use consultants, costs will stay low) and build it in a modular way, keeping spending under control.

What are the downsides of this kind of architecture?



While simplicity and low set-up costs are strong advantages, you must exercise some caution before basing mission-critical services on an API-service-based architecture.

First of all, while set-up is extremely cheap, running costs are volume-based.
There are usually discounts, but as request volumes grow it's easy to go from a few thousand dollars a year to tens of thousands or more. While this is usually still competitive with on-premise costs, it is harder to justify to management, who are now directly exposed to expenses that would normally sit under the IT budget in an on-premise project.
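As a purely illustrative example (the per-call price and discount tiers below are invented, not any vendor's actual rates), this is how quickly a volume-based bill grows even when discounts kick in:

```python
# Illustrative only: hypothetical per-call pricing with volume discounts.
def yearly_api_cost(calls_per_month, price_per_1000_calls=1.0):
    yearly_calls = calls_per_month * 12
    # Hypothetical discount tiers: 10% above 1M calls/year, 20% above 10M.
    discount = 0.20 if yearly_calls > 10_000_000 else 0.10 if yearly_calls > 1_000_000 else 0.0
    return yearly_calls / 1000 * price_per_1000_calls * (1 - discount)

for volume in (100_000, 1_000_000, 10_000_000):
    print(f"{volume:>10,} calls/month -> ${yearly_api_cost(volume):,.0f}/year")
# 100k calls/month is roughly $1.1k/year; 10M calls/month is already ~$96k/year.
```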

Another issue is unexpected downtime. This is usually minor if you're using one of the major players (Azure, AWS or Google), but smaller companies offering niche or low-cost services may have the occasional reliability issue. For example, the recent troubles in Belarus (Eastern Europe and the Baltic countries are home to many extremely interesting services) caused some downtime at a few companies based there.
Outages are usually fixed within hours to days, so if you're working in batches it's less of a problem, but it will still translate into downtime of your own.
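For batch workloads, a cheap mitigation is to wrap API calls in a retry with a generous back-off, so a short outage only delays the batch instead of failing it. A minimal sketch, where call_service stands in for whatever API client you actually use:

```python
# Minimal retry-with-back-off sketch for batch jobs hitting a flaky API.
import time

def call_with_retries(call_service, payload, attempts=5, base_delay=60):
    """Retry a failing API call with exponential back-off (60s, 120s, 240s, ...)."""
    for attempt in range(attempts):
        try:
            return call_service(payload)
        except Exception as error:  # narrow this to your client's own exception types
            if attempt == attempts - 1:
                raise
            wait = base_delay * 2 ** attempt
            print(f"Call failed ({error}); retrying in {wait}s")
            time.sleep(wait)
```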

Finally, remember that you're renting a service: you're subject to changes in terms of use and pricing, along with the remote possibility of the service itself being shut down. While it's usually easy to switch to a similar service, doing so can temporarily disrupt your projects.
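One way to limit that disruption is to hide each provider behind a thin interface of your own, so that switching vendors means rewriting a single adapter rather than the whole pipeline. A minimal sketch with hypothetical vendor clients and response shapes:

```python
# Provider-abstraction sketch: the pipeline depends on SentimentProvider,
# not on any specific vendor, so a replacement service only needs a new adapter.
from typing import Protocol

class SentimentProvider(Protocol):
    def score(self, text: str) -> float: ...

class VendorAAdapter:
    """Wraps a hypothetical vendor A client behind our own interface."""
    def __init__(self, client):
        self._client = client

    def score(self, text: str) -> float:
        return self._client.analyse(text)["sentiment"]  # hypothetical response shape

class VendorBAdapter:
    """Drop-in replacement if vendor A changes its pricing or shuts down."""
    def __init__(self, client):
        self._client = client

    def score(self, text: str) -> float:
        return self._client.sentiment(text).score  # hypothetical response shape

def score_batch(provider: SentimentProvider, texts):
    return [provider.score(text) for text in texts]
```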

Conclusion: my personal experience

I started experimenting with this kind of hybrid environment last year and recently developed a business project involving three different APIs: it brings in external data, analyses it for sentiment and semantic affinity, and enables significant marketing campaign optimisations, with some reuse of existing hardware and advanced analytics assets.

Set-up took a few days and yearly costs are projected to stay under $10k, so I can attest to the increased agility. New features can be added at very low cost. It's not without issues, however: using several different service providers makes for a complex purchasing structure, which makes it hard to keep a tight schedule at project start, or pushes you to rely on an intermediary that abstracts the process away at an added (though not significant) cost.
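For readers curious about the overall flow, here is a heavily simplified outline of such a batch pipeline. Every URL and field name is a hypothetical placeholder, not the actual providers used in the project described above:

```python
# Heavily simplified outline of a batch pipeline combining three services.
# Every URL and field name is a hypothetical placeholder.
import csv
import requests

def fetch_external_mentions():
    """Pull external data (e.g. brand mentions) from a data-provider API."""
    r = requests.get("https://api.example-data.com/v1/mentions", timeout=30)
    r.raise_for_status()
    return r.json()["mentions"]

def enrich(mention):
    """Add sentiment and semantic-affinity scores from two more APIs."""
    sentiment = requests.post("https://api.example-nlp.com/v1/sentiment",
                              json={"text": mention["text"]}, timeout=30).json()
    affinity = requests.post("https://api.example-nlp.com/v1/affinity",
                             json={"text": mention["text"], "topic": "our_brand"},
                             timeout=30).json()
    return {**mention, "sentiment": sentiment["score"], "affinity": affinity["score"]}

def run_batch(output_path="campaign_input.csv"):
    """Write enriched records to a CSV that feeds the marketing campaign analysis."""
    rows = [enrich(m) for m in fetch_external_mentions()]
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```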

I can definitely recommend this kind of process for batch-processing projects; it also has the potential to work in near real time, but that's an avenue that needs further exploration.

This concludes our article, stay tuned for new content!

Saturday 15 August 2020

Big data gaming: the paradigm shift has come

Hello and welcome to Open Citizen Data Science!

While gaming usually isn't something easily associated with Data Science, 2020 is bringing huge changes to a business that is still often seen as child's play.


Data Science isn't just numbers for a presentation anymore

The gaming world right now has most eyes fixed on the console generation change; however, while that will be a significant technological jump with vastly improved performance (and a new focus on faster data access), it is an evolutionary change rather than a revolution.

The real change is instead coming from Flight Simulator 2020, a relatively niche game from a definitely not niche company: Microsoft.
The genre and the franchise are not new and have been around for the better part of the last 40 years (the first Flight Simulator dates back to 1982); however, the latest edition brings something no game has ever done before: a unique blend of real-world data and AI content generation.

Big Data and data science for content generation

Flight Simulator 2020 is not the first game to use real-world map data or AI to generate its environments. Procedural content generation is nothing new and has been used to create entire universes in real time before.

The difference here is in the blend of several data sources to reproduce the world as we know it in a simulated environment, rather than an abstraction. How has this been possible?

Through data: made available in extreme amounts, coming from different sources, and processed with many parts of the data science stack as we know it.

Flight Simulator's world generation is built on Bing Maps data: over two petabytes accessed in real time from the cloud as players fly through the world.
This allows a high degree of fidelity in recreating city and terrain layouts and forms the first layer of real-world data; a second layer comes from satellite and fly-by imagery, providing photogrammetry data.

The blend of those two sources is fed into an Azure machine learning environment, which classifies buildings, landmarks and environment types, from trees to building materials, and in turn generates multiple terabytes of textures and height-map data.
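Purely as a conceptual illustration (this is not Microsoft's or Blackshark.ai's actual pipeline), the classify-then-generate idea can be shown in miniature: classify each map tile, then let downstream code choose the 3D assets for that class. The features, labels and thresholds below are invented:

```python
# Conceptual illustration only: a tiny classifier that labels map tiles
# from a handful of invented numeric features, in the spirit of the
# classify-then-generate pipeline described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic tile features: [avg_height_m, green_ratio, roof_density]
X = rng.random((500, 3)) * [50, 1, 1]
labels = np.where(X[:, 1] > 0.6, "vegetation",
                  np.where(X[:, 2] > 0.5, "residential", "industrial"))

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

# A new tile gets a class, which downstream code could map to 3D assets.
print(clf.predict([[12.0, 0.2, 0.7]]))  # e.g. ['residential']
```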

Strategic partnerships for external data and AI optimisations

This isn't the only source available to the game. To ensure maximum fidelity, Microsoft partnered with other companies, employing specific AI and real-time data optimisations.

Blackshark.ai provided the algorithms for content generation, turning raw data into objects:


Through deep learning, content is generated on Azure servers and streamed in real time to the user, recreating every building and object detected via Bing Maps as realistically as possible, literally allowing players to find their own home in the game if they wish.

Meteoblue provides real-time weather data:


This allows for an extremely accurate weather simulation, so that if it's raining in a particular location the player will experience the same conditions in-game (with the option of custom weather as well).
They also provide a very transparent weather forecast, where you can see their averaged model or the prediction of any single weather model for any location.

Finally, VATSIM is used to provide real-time air traffic control feedback:


This will allow for realistic ATC feedback in the game, making the experience even more immersive.

Raising the bar and creating new business cases

For all its AI prowess, the game is of course not a 100% faithful reproduction of the real world, and especially in smaller locations the algorithms will fail to recreate the environment properly.
Microsoft itself acknowledges this, and several airports have been manually optimised to ensure maximum realism.

However, this represents a major shift in expectations: instead of navigating a virtual world and looking for similarities with the real one, players will be able to get in and look for differences, while the world at large remains an accurate representation of reality.

This enables use cases that were hardly imaginable before: simulators have always been used for training, but one can easily imagine such software being employed by travel agencies to preview tourist destinations: an immersive view of the world with a freedom never possible before (something that 20 years ago would have been a dream, in a world of QuickTime-powered software trying to recreate virtual visits).

Similar scenarios could be applied in logistics (real-time traffic data is available for route optimisation) and training, and I'm sure our readers can imagine something closer to their own domain.

Personally, both as an analyst and as a gamer, I will be eagerly awaiting more examples of what we could call Big Data Gaming; I haven't seen something this revolutionary since the advent of 3D accelerators, whose descendants are now the GPUs fuelling the deep learning algorithms of most major players.

This concludes our article, stay tuned for new content!