Sunday, 20 January 2019

Data Wrangling with Alteryx Part II: Understanding your data to avoid pitfalls

Hello and welcome to Open Citizen Data Science!

Following our previous article, we tried to optimize our search by finding the census areas with the largest population inside.
Searching the top 10 most populated places we find that our research covers areas that tend to have wildy different demographic profiles:


Apparently our search also found out that a church has over 4000 people registered as residents!
Some areas also seems to be actually highly populated just because the census area is actually pretty large, meaning that the population density is actually not as high as we hoped.
If our objective is to do market targeting for commercial purposes this is definitely not going to work, so we need to refine our search.



Using the total population alone as ranking factor proved to be misleading so let's review our fields to see what could be useful and select a few:

1) "Popolazione residente - totale" is the total population of the area: while it proved misleading by itself, it's still likely to be useful when used in association with other fields.

2) "Popolazione residente - totale di 15 anni e più appartenente alle forze di lavoro totale" is the total employed population: If they are employed they will likely have an income and that can be used to refine our search

3) "Abitazioni occupate da almeno una persona residente" is the total number of habitational units that are in use: This could be useful to filter out places like the church above

4) "Edifici e complessi di edifici - totale" is the number of different buildings in the area: This can be useful to filter out low density areas

5) "Edifici ad uso residenziale" is how many buildings are used for residential purposes: This could be useful to see if the area is mainly residential or if its use is actually mixed

With a few more fields and all of them being numerical, what can be done is to create new KPIs by making a few indexes out of our fields. Luckily for us, this requires very few steps in Alteryx:

First of all, let's use the Select tool and just select the fields we're going to need to our tables are going to be less cluttered:

- "Codice Nazionale Sezione" is the key field

- "Popolazione residente - totale"

- "Popolazione residente - totale di 15 anni e più appartenente alle forze di lavoro totale"

- "Abitazioni occupate da almeno una persona residente"

- "Edifici e complessi di edifici - totale"

- "Edifici ad uso residenziale"

- "SpatialObj" is the geographical data we need to see where the section actually is


Now that we have a cleaner set of information to work with, let's see how to extract useful information from our data!

The Formula tool allows to create new fields or to modify existing ones by using formulas, in a pretty similar way as it's done with Excel for example. Let's identify a few useful questions we can answer with the available data:

- How many people are employed compared to the resident population?
- How many buildings are used for residential puposes compared to the area total?
- How many people are there in average per residential unit?
- How many residents per residential building?

As you can see, with a few quick calculations we created some useful indexes and also have a few divide by zero errors. This means that there are areas with no residential units, a clear anomaly!

Now let's see how our old top ten would fare with the newly created indexes:

- Employed: This doesn't look bad: even in the worst case, about one people in 4 has a job

- Residential building percentage: There is one clear outlier, where only 33% of the buildings have residential purposes. This could be an anomaly for highly populated areas

- People per residential unit: as a unit should mean an apartment or equivalent we can see 3 problematic values, one with null (meaning there are no apartments!) and two with an enormous number of people inside. Clearly, there are errors in either the number of reported units in in their reported purpose

- Residents per building: the same areas that seems to be anomalies for the previous field are also problematic here.
While over 800 people could be plausible for a pretty big residential complex, values of over 5000 would put those buildings in the world top 10, which clearly are not looking at the very first picture in the article.
On the same field, we also see that there are values of less than 10 residents per building, which implies a low population density, something that is not desired if we wanted to reach a lot of people with little effort!

Two fields are clearly showing problematic areas for us, so let's start with the "people per unit" field and let's see what could be saner values to use.
The summarize tool can be used to get some very quick stats about a field, so let's see averages first:


Average, median and mode shows that in most cases we should expect between 2 and 3 people per unit. Using 1st and 99th percentiles gives us a range between 1 and 5 people per unit, which means that the large majority of people in Italy has relatively few people living in the same place.

This gives us an useful first filter, so let's see what happens to our top ten:


This looks better, and we can see a few benefits:

- Employed people rose to be approximately 1 every 3, meaning richer areas
- Lowest residential usage doubled, meaning that every area is mostly residential

While total population is still in the thousands, residents per building are however still relatively low in 3 areas, meaning that we still need to find more densely inhabited zones.
Using summarize on residents per building tells us that part of the problem is that most areas have relatively small buildings, with even the 95th percentile falling below 40 people:


Let's use the 95th percentile as filter and we can now take a look at a new top 10:
Better but still not good enough. 135 separate buildings still means a  pretty big area and big is not efficient! Summarizing our subset however tells us that 95% of the areas do have 18 buildings or less, so we might be getting closer:

 
Looks like that it did the job, the lowest populated area of our top 10 still has over 1500 people!
Out data wrangling, augmented by some enrichment, gave us a suitable set of areas to be presented as marketing campaign candidates.

Will it be successful? We will see that in our next article!



No comments:

Post a Comment