Saturday 27 August 2022

Data Science unsung heroes: data democracy and domain expertise

 Hello and welcome to Open Citizen Data Science!

Very often (one could say too often!) we keep hearing about this or that algorithm earning world-beating performance and accolades at some kind of problem solving.

While new, more efficient libraries are indeed a great way of improving performance, this tends to lead to a narrow definition of what makes a data scientist, often just being a glorified machine learning engineer.
Even worse, in my experience a worrisome trend is happening: the Data Science equivalent of what script kiddies are to hacking.

In order to understand this phenomenon we should remember the Data Science skill triad: computer science, statistics and domain expertise.


How do they translate in the real business world?

- Computer Science is often reduced to R and Python knowledge: data wrangling in code and finding the best ways to apply machine learning libraries are indeed valuable skills, along with some SQL (or NoSQL for big data environments) to actually retrieve the data from many sources, and they are often the most sought-after skills. Bonus points if you have knowledge of deep learning libraries.

- Statistics is the other valuable pillar of business data science: after all, machine learning is the application of inference, and without at the very least decent descriptive statistics it would be extremely hard to obtain decent results and especially difficult to explain them in any way!

What about domain expertise?

Domain expertise in simple terms is the knowledge of what your data actually means.
While business use cases tend to be relatively uniform within a certain industry (for example Churn Prediction in telco or applied time series forecast in finance), domain expertise is what actually allows you to leverage your knowledge of how data is structured to obtain results and also what the limits and potential of your available data are.

No two companies (in a business environment of course) share the exact same dataset, and differences in how data is gathered and stored and in how business procedures are applied are what make your data unique.
Furthermore, even within the same class of business machine learning problem, it's very likely that your target variable is calculated differently than your competitor's.

Let's take an example I have some familiarity with: Telco churn prediction

This is as bog-standard a use case as it gets, yet let's consider a few potential variables:

- How is your target calculated? Are you going to predict customer deactivation or a churn request?

- How much do you know about your customers? Is it just age and gender or can you effectively leverage geographic data to better understand their socio-demographic context?

- How much of your CRM data can you effectively use? Is it calls? NPS score? Support tickets by typology? Can you do text mining on what they write to you or the conversations with customer care?

- How much do you know about their service usage patterns?

- How often does the competition renew its offers in your market? Can you exploit social media trends or store location data?
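To make the first question in the list concrete, here is a minimal sketch of how the same customer can end up with two different labels depending on how the target is defined. The field names (`churn_request`, `deactivation`), the dates and the 30-day horizon are all hypothetical and only there for illustration.

```python
from datetime import date, timedelta

def label(event_date, snapshot, horizon_days=30):
    """Return 1 if the event falls within the horizon after the snapshot date."""
    if event_date is None:
        return 0
    return int(snapshot < event_date <= snapshot + timedelta(days=horizon_days))

# Hypothetical customer: asks to leave in March, line is switched off in late April
customer = {
    "churn_request": date(2022, 3, 20),
    "deactivation": date(2022, 4, 25),
}
snapshot = date(2022, 3, 1)

by_request = label(customer["churn_request"], snapshot)
by_deactivation = label(customer["deactivation"], snapshot)
print(by_request, by_deactivation)
```

With these made-up dates the customer is a positive case when the target is the churn request and a negative one when the target is the deactivation, so two teams using the same raw data would train on different targets.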

This is just a small sample of the data involved in one business problem. How much do you think the first two pillars of data science are going to help you in handling it?

Statistics can indeed help you decide how many instances are necessary to make a variable relevant, as well as which variables are correlated to your business issues, while computer science skills (better summed up as coding and libraries) can help you find and apply the best libraries for both data cleaning and machine learning.

Let's not underestimate them: good statistics are needed for optimising your target and good coding will help you both in efficient data cleaning and in using the best applied algorithms.
However, data is only as good as you can effectively understand it and apply the knowledge to empower effective business actions.

In most companies this is the realm of institutional knowledge (is that variable good to use? How do I calculate the specific KPI? Do I need a reclassification?) and business know-how (How do I define my target? How old can the data be before becoming irrelevant?), which is rarely where the technically oriented figures excel.

- A data engineer can explain to you how the data is structured and will provide the best ways to feed your algorithms, but has very little view of how the data is actively used

- A machine learning engineer can build and optimize a dataset as well as find the best algorithm for the target variable, but is often limited in contextual knowledge of the data, so many creative ways of combining seemingly uncorrelated data will escape them (and no, deep learning won't substitute that).

The best solutions often come from feedback given by the people who either create and interact with raw data (for example, people involved in customer care can give you a lot of feedback on how certain tickets are used on CRM and network support can often explain log behaviour in ways statistics cannot) or those who prepare and present data to management (typically reporting and business analysts), as they typically are much closer to how data is operationally used and have to explain it effectively in order to drive business plans.

Now, the mentioned professionals often do not have the tools and the skillset to deal with the amount and complexity of data we handle in data science, as they either deal with single cases or work mostly with pre-cleaned and aggregate data. However, they can and will give you extremely precious insights about possible anomalies, help spot nasty issues like data leakage and often bring suggestions on how to shape new features for a dataset.

Tunnel vision: a common issue that can significantly impact productivity

Bringing together wildly different fields of expertise in a productive way is of course easier said than done. While collaborative approaches are in theory more efficient, there's always a cost to pay in terms of agility, meaning that in a short-term view there is little time to listen (often limited to high-level talk in the starting phase) and a focus on obtaining quick wins.

Even with these kinds of limitations, machine learning in most cases will still bring significant improvements over classical data analysis methods, as the ability to deal with hundreds of KPIs in non-linear ways will produce stronger performance when set to a narrow enough objective. When measured in a synthetic way, it's easy to match in the real world the claims of X times the performance of the classical method; however, this is just the model's predicted score vs the target KPI and not the full picture.

In my personal experience, there are a few pitfalls that often prevent a model from being fully exploited once operational:

- Creating the prediction takes too long, meaning that a certain percentage of events will already have happened so they cannot be acted upon. This may or may not be remedied with better data engineering or narrower feature selection.

- The model is basing its scores on predictor variables that focus on customers that are hard to act upon. For example, in churn prediction a model might give a very high score to customers that queried customer care about contract termination terms, meaning that either the customer has already been dealt with in a reactive way or is likely to have already made up their mind.

- The model does not explain which KPIs are affecting the score on a single row, which means that finding the appropriate action will be time consuming or there's a risk of ineffective attempts. 

- It might not be economical to act upon a large part of detected cases, meaning that either we can focus on the very top of the prediction (not a bad approach in itself) or the highest margin cases

- The dataset and sampling are made using "common wisdom", often derived from academia or sector studies, ignoring peculiarities of the dataset and leaving performance on the table. For example, 1 year of past data is often cited as the golden rule for churn prediction, however there are cases where performance doubled by using much shorter time frames!

While not critical failures in themselves, these factors mean that a machine learning solution might have trouble reaching a positive ROI even if it technically reached good results in accurately predicting the target KPI. Most of the time, this is the result of focusing purely on what is statistically correct while ignoring the wider context, a tunnel vision common in teams that are composed purely of data scientists and data engineers.

Using data democracy to improve your model part 1: Start with the right questions!

- Do not just find out the available data and the target variable: ask who is going to operationally use the model prediction and which actions are planned. This can both help focus the model on a smaller but more actionable sample and reduce training times

- Actions have a cost. No matter if it's sending a maintenance team or having customer care give discounts, there might be significant portions of your population where a proactive approach could be too expensive. Finding out which parts can be prioritised can help guide the prescriptive part of a model or further focus your sampling
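As a rough illustration of the second point, the sketch below ranks scored cases by expected gain instead of raw model score and stops spending when the action budget runs out. Everything in it is an assumption made for the example: the field names, the 50% success rate of the action and the amounts are invented, not taken from a real project.

```python
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    score: float          # model probability of the event (e.g. churn)
    value_at_risk: float  # margin lost if the event happens
    action_cost: float    # cost of the proactive action
    success_rate: float   # assumed chance that the action prevents the event

def expected_gain(case: Case) -> float:
    """Expected margin saved by acting, net of the cost of the action."""
    return case.score * case.success_rate * case.value_at_risk - case.action_cost

def prioritise(cases, budget):
    """Pick cases by descending expected gain until the budget is used up."""
    selected, spent = [], 0.0
    for case in sorted(cases, key=expected_gain, reverse=True):
        if expected_gain(case) <= 0:
            break  # every remaining case costs more than it is expected to save
        if spent + case.action_cost <= budget:
            selected.append(case)
            spent += case.action_cost
    return selected

cases = [
    Case("A", score=0.90, value_at_risk=100.0, action_cost=30.0, success_rate=0.5),
    Case("B", score=0.60, value_at_risk=400.0, action_cost=30.0, success_rate=0.5),
    Case("C", score=0.95, value_at_risk=40.0, action_cost=30.0, success_rate=0.5),
]
selected = prioritise(cases, budget=60.0)
print([case.case_id for case in selected])
```

In this toy data the case with the highest score (C) is the one not worth acting on, because the action costs more than the margin it could save, while the lowest-scoring case (B) comes first.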

Using data democracy to improve your model part 2: Be modular!

- Start your development with a small model based on parts of the data most people are familiar with: this has the advantage of giving end users a prototype to familiarise themselves with and start testing the operational phase. This can help the wider team to troubleshoot possible bugs with the data and give confidence in the project. Remember, even a half-working model can often give benefits over linear analysis!

- Divide your data wrangling by areas of expertise: find out who is most familiar with a certain set of KPIs and share your data cleaning and feature engineering approach with them. More often than not, they have valuable input on handling missing data and anomalies, plus this can allow you to create intermediate models showing how integrating that data source improves performance

Using data democracy to improve your model part 3: Use a white box and prescriptive approach

- Unless you're working with unstructured and non-text data, it pays off to show how the variables behave in relation to your target. If you're short on time you might just use a decision tree (works beautifully both for regression and binary classification, plus it can be moderately effective to explain a k-means clustering), however if you're aiming for maximum clarity it might pay off to show how your target variable behaves with each class or bin of your features.

- Aiming for a prescriptive approach can also be useful: defining a set of available actions and potential triggers based on the model features can lead to an effectively impactful model, especially if it's able to show the most important KPIs that led to each suggestion. This can create a positive feedback loop from the users that helps create ever more effective features for future tuning
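The "target by bin" view mentioned in the first point can be sketched in a few lines. This is only a minimal, dependency-free illustration: the `tickets` and `churned` columns, the bin edges and the rows are invented for the example, and in practice the same table would more likely come from your analytics tool of choice.

```python
from collections import defaultdict

def target_rate_by_bin(rows, feature, target, edges):
    """Return {bin label: (row count, share of rows where the target is 1)}."""
    bounds = [float("-inf")] + list(edges) + [float("inf")]
    counts = defaultdict(lambda: [0, 0])  # label -> [rows, positives]
    for row in rows:
        value = row[feature]
        for low, high in zip(bounds, bounds[1:]):
            if low <= value < high:
                label = f"[{low}, {high})"
                counts[label][0] += 1
                counts[label][1] += row[target]
                break
    return {label: (n, positives / n) for label, (n, positives) in counts.items()}

# Hypothetical sample: support tickets opened vs. whether the customer churned
rows = [
    {"tickets": 0, "churned": 0},
    {"tickets": 0, "churned": 0},
    {"tickets": 1, "churned": 0},
    {"tickets": 1, "churned": 1},
    {"tickets": 3, "churned": 1},
    {"tickets": 4, "churned": 1},
    {"tickets": 2, "churned": 0},
    {"tickets": 0, "churned": 0},
]
result = target_rate_by_bin(rows, "tickets", "churned", edges=[1, 3])
for label, (n, rate) in result.items():
    print(f"{label}: {n} customers, churn rate {rate:.0%}")
```

A table like this is something a customer care or reporting colleague can read and challenge directly, which is the whole point of a white box approach.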

Using data democracy in the long run: empowering the end users

Almost no model can be considered a final version: there is always room for improvement! New data sources, features that come from experience and applied usage feedback, or new ways of looking at existing data all mean that a machine learning project is part of a continuous improvement process.

After all, data democratization is not just about giving end users access to the data or providing them with reports, powerpoints and dazzling dashboards. It is also all that of course, but there is more.
Data democratization is all about impact. No matter how accurate your model and how complex your dataset is, it's all for nothing if the end users cannot easily act on the provided information and readily give feedback on what works, what needs to be improved and where more explanation is needed.

This means that there are several layers to consider:

- Access to data: both raw and ready-made datasets are useful, both to enable collaboration on feature engineering and to find the best way to produce actionable information

- To work with data it's extremely important to provide the right tools: this means putting emphasis on flexibility, ease of use and ease of sharing. Low code or no-code tools are the best way to reach an audience that mostly uses Excel so that there is as little friction as possible, transforming your users into Citizen Data Scientists wherever possible

- Every business unit should be engaged in this process. No matter if marketing, sales, product development or customer care, all can contribute in their domain and data scientists can greatly benefit from better understanding process funnels

- Finally, building a business culture revolving around using data to solve issues and find solutions is necessary for long term success. Data scientists should be part of business talks and not relegated to a data Deus ex machina to be called only when standard solutions fail. At the same time, business users should be free to ask any kind of data question and be trained on how to ask questions about data and on how to translate business phenomena into variables

Conclusion: Empowering all data workers creates more than the sum of the skillsets 

This is of course a very quick overview of how providing data literacy to domain experts and data democratization can help businesses grow and solve challenges. There is no step by step system on how to achieve this and a lot depends on starting conditions and IT architecture, however in the medium term (3 to 5 years) collaboration brings benefits that no data science team alone can guarantee, improving performance and actionability and ensuring business-wide support when it comes to adding new data sources to machine learning models.

In my personal experience using this approach, I've seen model performance go over 3X that of the starting model and give up to 50% better results than what a data science team using the latest feature selection libraries and up-to-date machine learning algorithms could achieve alone.

All in all, we could sum it up this way:

- A data science team alone can achieve good results but it would focus on raw data and tend to overprioritize linear correlations to the target variable

- Domain experts by themselves will focus on process and common sense variables, relying on feedback from the operations but unable to see wider trends on data

- Collaboration between the two will bring unexpected variables and a better understanding of both data and business environment, creating high performing, actionable models


Sunday 4 October 2020

Accessible AI: Elon Musk is focusing on the wrong issue

Recently, Elon Musk stated that licensing GPT-3 exclusively to Microsoft is a mistake.

This, of course, started gaining general media attention and both Microsoft and OpenAI are getting criticised for this decision, with the general mood being that exclusivity would harm AI's accessibility.


 Broad access to artificial intelligence is of course a very important issue and one of the aims of the OpenAI consortium. However, being accessible is a very broad definition and ironically Microsoft's exclusivity could in many ways make GPT-3 more accessible than it would have been otherwise.

Counter-intuitive? Indeed it is. Let's explore some of the limits of GPT-3 and OpenAI's accessibility:

GPT-3 was never going to be accessible to everyone: 
OpenAI publicly states that potential users are subject to a pre-approval process

GPT-3's training process is not fully reproducible:
Yes, you can see the code on GitHub and the paper describes the data sources, but anyone who works in data science knows very well that the data cleaning and optimisation process has a profound impact on model performance, and that's not available publicly

GPT-3 deep learning training cost is beyond the means of most companies:
With a training cost alone estimated between $4M and $12M, it's obvious that it's priced well above what most companies can afford. Renting the API is going to be relatively affordable but there would be limits on control and customisation.

Finally, GPT-3 has extremely limited ability to explain its decision process, just like any other deep learning-based process:
Ironically, Elon himself warned about the dangers of unrestricted AI, but he seems blind to the issues of a black-box system.

This is not meant as criticism of the validity of OpenAI's achievement, which is indeed technically impressive, but deep learning is clearly an increasingly capital-intensive road to AI, with serious limitations on transparency.

Truly accessible AI is still an unsolved problem that lacks a clear framework:
No commercial entity is likely to develop a truly transparent and easily reproducible process on its own for obvious reasons (among which the difficulty of keeping a profit margin on something easily copied). The development of accessible, transparent AI would instead require broad, inter-disciplinary research work from academia to define a theoretical base, along with some advancements in both transparent machine learning and AI ethics.

While this article doesn't claim to have a solution, we can at least try to point at some very generic minimum requirements for a truly accessible artificial intelligence.

Truly accessible AI must be transparent in its decisions:
This is probably the most technically challenging point. The current state of the art forces a decision maker to choose between performance (especially on non-structured data) and transparency.
While there is on-going research on making deep learning decisions easier to explain, it still lacks the clarity of older classifiers such as decision trees.
The latter are much weaker in performance (and are nowadays used mostly as ensembles), especially on non-structured data, so further research is necessary to find a better-performing, transparent alternative.

Training set, data sources and any data treatment must be accessible:
Just like humans, machine learning algorithms are strongly influenced by the quality of the training set and this translates into inheriting both conscious and unconscious biases of whoever creates the training set. Xiaoice vs Zo is an excellent example of how training data influences AI behaviour.
To compensate for this issue, the entire data treatment process must be transparent to ensure that any potential bias and error in generating training variables can be detected and fixed by the broadest possible audience.

The algorithms must be trainable on consumer or pro-sumer-grade hardware:
Although this might seem a controversial point, if training an AI requires a multi-million dollar investment in hardware, then its use will always be restricted to a small circle of entities able to afford the cost, with everyone else forced to just rent it. 
Limiting the hardware requirements to what can be obtained with an HEDT platform would ensure that an effective AI could be trained and customised by smaller academic institutions and businesses, ensuring a fair competitive environment for all involved actors.

Ideally, as much of the process as possible should be done in a low-code environment:
While this is not strictly a requirement, the use of low-code principles ensures that a much broader audience could both audit and influence an AI training process. 
Making the logical process flow accessible to non-coders could ensure that domain experts have effective access to the AI development process, instead of laying down a list of requirements and delegating its implementation in code to a programmer or data scientist.
Microsoft is offering an alternative solution through its machine teaching framework, however further simplification can be obtained by using existing software solutions.
Software like Knime or Alteryx can be used to create a visual logical flow and limit the code only to the necessary machine learning libraries instead of coding the entire ETL process.

All in all, Elon Musk might be right in being concerned about GPT-3 accessibility, however no matter how powerful GPT-3 is, it's not the accessible AI we're looking for. 


Sunday 20 September 2020

"Full-stack" data scientist: an outdated solution to a disappearing problem

KDNuggets recently published an article arguing that Data Scientists should be more end-to-end: in theory this is a fine idea, as developing a more comprehensive skill-set can bring them closer to the "Unicorn Data Scientists" and lead to a highly paid position.



In my opinion, this is a mistake and reflects an outdated concept of the role. This kind of "full-stack" data scientist tends to become a jack of all trades and master of none.
There is already a job title for those: "Business Analyst with knowledge of statistical modelling and R or Python proficiency".
By the way, this is often seen as an entry-level position on LinkedIn, while a slightly better paid counterpart is "Python programmer with knowledge of machine learning libraries".

Today, most of the data science stack can easily be done in a low-code environment with similar results for anything related to non-deep learning tasks.
As for deep learning, GPT-3 is the herald of the AI-via-API era, where most companies will use very complex models simply by feeding them their non-structured data and getting results in return, with prices low enough to make a custom development not attractive in a majority of use cases.

Data scientists have a role today but it's in two very specific niches:

1) Academic cutting edge: either in a research position or translating the latest academic papers into working code, they are employed in companies where every single decimal point of additional performance matters and moves million-dollar decisions every day. This is where coding is needed in addition to statistics.

2) Dataset artisan: on the very opposite end, working closely with data engineers, they specialise in building high-quality datasets that are fine-tuned to the business problem and are able to find ways to integrate data from the most disparate sources, translating it into relevant variables. Here knowledge of the data pipelines is useful, but most of the work can easily be done in a code-less environment.

Most other positions will be under pressure from consulting firms selling fresh graduates as "senior" data scientists (more realistically, Python programmers with a refresher on statistics and some training on how to use relevant libraries) on one side, and business analysts using low-code workflow-based solutions on the other.

You might see this as an extreme or unpopular opinion but Gartner shows otherwise:


There are several products that fit the description, with some pushing AutoML solutions at the same time:

- Alteryx is on top by ability to execute thanks to its extreme ease of use. They recently introduced some AutoML features as well. Its extremely versatile tool set allows an analyst to potentially match the output of a small consulting firm.

- SAS is an established contender with a long history. Recent versions are starting to become easier for analysts as well, but it has a long way to go in UX and its costs are still prohibitive

- IBM SPSS is another well-known product, their modeler suite enables competent machine learning workflows without coding

- Tibco Spotfire is a relatively new contender with the ambition of creating a true end-to-end product able to handle everything from data ingestion to advanced visualisation. With some UX improvements and new features they might become a true leader of the pack in a few years.

- Dataiku has also been on a simplification route, although it is strongly focused on machine learning, lacking flexibility on other use cases

- Similarly, Datarobot aims to make autoML easy to use and is successful on this, although its data preparation capabilities are limited and the high cost makes it a niche product

- Knime has long been known as a simpler, visual solution in academia and business, with a good feature set for data science hampered by very basic data cleaning tools and slower performance

- Rapidminer along with Knime is another fallen leader. It offers a relatively simple interface coupled with good ML capabilities but lacks in versatility when compared to newer solutions


Unless you're working as a Data Scientist in a start-up where personnel is always short and being able to cover other roles is a specific need, the "jack of all trades" data scientist role is dying:

- Most medium and large companies now have a consolidated data environment, especially after GDPR.

- You will rarely be working on new data sources and will be mostly relying on an enterprise data warehouse, usually controlled by IT operations.

- Machine learning libraries are now pretty stable for non-deep learning tasks, so it's hard to differentiate yourself from a consultant, and low-code tools are able to integrate them in a visual workflow with relatively low effort, especially as classification and clustering tend to be the majority of use cases.

- Deep learning could make a difference but, despite being widely publicised, non-structured data is not easy to integrate with many business cases and often it's mostly about sentiment analysis, text transcription and similar scenarios, for which APIs are extremely cheap

- GPT-3 and similar generalist models are also reducing the number of use-cases where a custom deep learning setup is justified, especially as GPU-based models are extremely CAPEX-heavy.


This is also based on my own personal experience: in 4 years we went from a custom R model (consultant-developed), to integrating a Python model into Alteryx (dataset internally developed, model written by consultants), to finally using Alteryx ML tools for everything, with increased data quality control providing better results with simple random forests than the old approach, a cutting-edge, complex multi-model stacking algorithm.

For non-structured data we rely on third party APIs, both for common use cases like sentiment analysis and more specialised ones, with low costs and results that are more than sufficient for the business needs, although if needed we're able to develop our own solutions internally.

All in all: unless you work in a start-up and plan to take a management role later on, specialise or be marginalised. That's what the market demands today.

An earlier version of this article was published by me on LinkedIn: https://www.linkedin.com/pulse/mistake-full-stack-data-scientists-marco-zara/?trackingId=7WgX4J3MTzYEI2nNTlKfEA%3D%3D

Wednesday 26 August 2020

Applied AI: The future is low-code

Hello and welcome to Open Citizen Data Science!

2020 is still far from over, yet we've already seen two major NLP releases: Microsoft Turing-NLG and OpenAI GPT-3.
Both represent the apex of natural language processing models and apply deep learning on a massive scale, with respectively 17 and 175 billion parameters. In some ways, they also mark the end of an age where a start-up could easily come up with competitive AI.


While it's still too early to compare the models and their benefits over older iterations, one thing is clear: the industry is placing its bets on ever-more massive deep learning models that require capital investments in the order of millions of dollars or more to train and deploy, supported by data-center scale hardware, something fewer and fewer companies are able to deploy.

Their models are also growing in terms of general capabilities, covering an increasing number of use-cases. This means that smaller start-ups can only compete on relatively small niches of highly-specialised data, at least until these become attractive to one of the big players and are integrated in their AI ecosystem.

On the other side, while this makes creating a successful AI startup much harder, it greatly simplifies the deployment of applied projects by replacing custom developments with AI as a service.

Cognitive API services: a low-cost project enabler

Even just 5 years ago developing a machine learning project was still a complex and often risky endeavour: while there were plenty of R and Python libraries available for free, most of the work was still code and programming related, which meant lots of custom code that ended up hard to modify and integrate in the business environment.

Today, the possibilities offered by advanced analytics software and cloud-based APIs mean that as long as you have access to the data you can develop your own AI projects with skill requirements that are similar to what is expected of an advanced Excel/Access user: lots of formulas and a bit of extra scripting.


5 years ago, developing a machine learning-based project would easily take 6 months and $100k or more if you employed consultants.
Today, if you've got the data available from the start you can drastically shorten development time so that an equivalent use case would take 2 months or less and could be mostly or completely developed internally by skilled analysts.

What about costs? Let's take as an example a typical churn prediction project.

Hardware: a workstation powerful enough to run non-deep learning algorithms (so no GPU required) can be obtained for approximately $5000 (assuming an i9 or Ryzen 9 based configuration and 128GB RAM, double that for Xeon and Threadripper based configurations with 256GB RAM). We can also estimate $1000-2000 for set-up and back-up infrastructure cost.

Advanced Analytics: approximately $5000 per year, but the figure may vary depending on your choice of software.

APIs: cost here is volume-based. For a proof of concept you might even manage to stay within the free tier of many services, while for a reasonably sized project you can expect to spend from $1k to $10k per year for each service. Let's assume you will need 3 different APIs at $4k each.

This means that the same result or better than 5 years ago will take approximately one-third or less of the time and cost, leaving budget for other projects or plenty of money to expand the scope of the current ones.
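For those who like to see the sums, here is a quick sketch that adds up the figures quoted above. It only covers the items listed in this article (hardware, set-up, analytics licence and 3 APIs); internal staff time is not quantified here, so it is left out, and the $100k used for comparison is the "or more" lower bound mentioned earlier.

```python
# Figures quoted in the article, in USD
HARDWARE = 5_000
SETUP_LOW, SETUP_HIGH = 1_000, 2_000
ANALYTICS_PER_YEAR = 5_000
API_COUNT, API_COST_PER_YEAR = 3, 4_000
CONSULTANT_PROJECT_5_YEARS_AGO = 100_000  # lower bound ("$100k or more")

def first_year_cost(setup):
    """One-off purchases plus the first year of recurring costs."""
    return HARDWARE + setup + ANALYTICS_PER_YEAR + API_COUNT * API_COST_PER_YEAR

low = first_year_cost(SETUP_LOW)
high = first_year_cost(SETUP_HIGH)
recurring = ANALYTICS_PER_YEAR + API_COUNT * API_COST_PER_YEAR
share_of_old_cost = high / CONSULTANT_PROJECT_5_YEARS_AGO

print(f"First year: ${low:,} to ${high:,}")
print(f"Following years: ${recurring:,} per year")
print(f"Share of the old project cost: {share_of_old_cost:.0%}")
```

On these numbers the first year comes in at roughly a quarter of the old consultant-led budget, which is in line with the "one-third or less" estimate.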

You can put together a prototype in a week or less (meaning that even when using consultants, costs will be low) and build it in a modular way, keeping costs under control.

What are the downsides of this kind of architecture?



While simplicity and low set-up costs are strong advantages, you must exercise some caution before basing mission-critical services on an API service-based architecture.

First of all, while set-up is extremely cheap, costs are volume-based.
There are usually discounts, but as requests grow it's easy to go from a few thousand dollars to tens of thousands or more, and while this is usually competitive with on-premise costs, it becomes harder to justify to management as they are now exposed to expenses that are usually under the IT budget in an on-premise project.

Another issue is unexpected downtime. While this is usually a minor issue if you're using one of the major players (Azure, AWS or Google), smaller companies offering niche or low-cost services may have the occasional reliability issue. For example, the recent troubles in Belarus (Eastern Europe and the Baltic countries are home to many extremely interesting services) caused some downtime in a few companies based there.
They are usually fixed within hours to days, so if you're working in batches it's less of an issue but it will translate into your own down-time. 

Finally, remember you're renting a service: this means you're subject to terms of use and pricing changes, along with the remote possibility of the service itself being closed. While it's easy to change to a similar service, it can bring a temporary disruption to your projects.

Conclusion: my personal experience

I started experimenting with this kind of hybrid environment last year and recently developed a business project involving 3 different APIs that bring in external data and analyse it for sentiment and semantic affinity, enabling significant marketing campaign optimisations with some reuse of existing hardware and advanced analytics assets.

Setup took a few days and yearly costs are projected to be under $10k, so I can attest to the increased agility benefits. New features can be added at an extremely low cost. However, it's not without issues: using several different service providers makes for a complex purchasing structure, which makes it challenging to start projects on schedule or pushes you to rely on an intermediary to abstract the process away at an added cost (albeit not a significant one).

I can definitely recommend this kind of process for a batch-processing project. It has the potential to be used in near-real time, but that's an avenue in need of further exploration.

This concludes our article, stay tuned for new content!

Saturday 15 August 2020

Big data gaming: the paradigm shift has come

 Hello and welcome to Open Citizen Data Science!

While gaming usually isn't something easily associated with Data Science, 2020 is bringing huge changes in a business that is still often seen as child's play.


Data Science isn't just numbers for a presentation anymore

The gaming world right now has most eyes fixed on the console generation change. However, while it will be a significant technological jump with vastly improved performance (and also a focus on faster data access), that's an evolutionary change rather than a revolution.

The real change is instead coming from Flight Simulator 2020, a relatively niche game, from a definitely not niche company: Microsoft.
The genre and the franchise are not new and have been around for the better part of the last 40 years (the first Flight Simulator is from 1982), however the latest edition brings something no game has ever done before: a unique blend of real world data and AI content generation.

Big Data and data science for content generation

Flight Simulator 2020 is not the first game to use real-world map data or to use AI to generate gaming environments. Procedural content generation is nothing new and has been used to create entire universes in real time before.

The difference here is in the blend of several data sources to reproduce the world as we know it in a simulated environment instead of an abstraction. How has this been possible?

Through data, made available in extreme amounts and from different sources, and processed using many parts of the data science stack as we know it.

Flight Simulator's world generation has its foundation on Bing Maps data: over 2 petabytes of data that are accessed in real-time from the cloud as the players fly through the world.
This allows a high degree of fidelity in recreating city and terrain layouts and is the first layer of real-world data used, while a second layer comes from satellite and fly-by pictures, giving access to photogrammetry data.

The blend of those two sources is fed to an Azure machine learning environment, which is used to classify buildings, landmarks, and environment types from trees to building materials, which in turn generates multiple terabytes of textures and height map data.

Strategic partnerships for external data and AI optimisations

This isn't the only source available to the game. To ensure maximum fidelity, Microsoft partnered with other companies, employing specific AI and real-time data optimisations.

Blackshark.ai provided the algorithms for content generation, turning raw data into objects:


Through deep learning, content is generated on Azure servers and streamed in real-time to the user, recreating every single building and object detected via Bing Maps as realistically as possible, literally enabling the player to find his own home in the game if he wishes.

Meteoblue provides real-time weather data:


This allows for an extremely accurate weather simulation, so that if it's raining in a particular location the player will experience the same condition in-game (with the option of custom weather as well).
They also provide a very transparent weather forecast, where you can see their averaged model or what has been predicted by any single weather model for any location.

Finally, VATSIM is used to provide real-time air traffic control feedback:


This will allow for realistic ATC feedback in the game, making the experience even more immersive.

Raising the bar and creating new business cases

For all its AI prowess, the game is of course not a 100% faithful reproduction of the real world and especially in smaller locations the algorithms are going to fail to properly recreate the environment.
Microsoft itself acknowledges this and several airports have been manually optimised in order to ensure maximum realism. 

However, this represents a major shift in expectations: instead of navigating a virtual world and looking for similarities with the real world, players will be able to get in and look for differences, while the general world is going to be an accurate representation of reality. 

This enables use cases that were not imaginable before: while simulators were always used for training, one can easily imagine such software being employed by travel agencies to give previews of tourist locations: an immersive view of the world with a freedom never possible before (and something that 20 years ago would have been a dream in a world of QuickTime-powered software that tried to recreate virtual visits).

Similar scenarios could be applied in logistics training (real-time traffic data is available for route optimization), and I'm sure our readers could imagine something closer to their domain.

Personally, I eagerly await more examples of what we could call Big Data Gaming, both as an analyst and as a gamer: I haven't seen something this revolutionary since the advent of 3D accelerators, which are now also the GPUs fueling the deep learning algorithms of most major players.

This concludes our article, stay tuned for new content!

Sunday 20 October 2019

Twilight of the unicorn Data Scientist: how the sexiest job of this decade is going to change


Hello and welcome to Open Citizen Data Science!

After a very interesting two-day full immersion in the world of citizen data science at Alteryx Inspire in London, some trends are starting to become clear.



Thursday 26 September 2019

6 Red Flags to avoid for a successful partnership with Data Science Consulting firms

Hello and welcome to Open Citizen Data Science!

 In many industries Data Science is still a relatively new field and as such it's likely that hiring a consultant to try new ways to harness the power of available data is seen as a relatively low-risk path.