Sunday, 3 February 2019

Digging deeper into the census data, part 2: Age segmentation

Hello and welcome to Open Citizen Data Science!

In this article we will look into the census variables related to population age.
Knowing whether a neighbourhood is populated mostly by working-age people as opposed to pensioners can make a world of difference, depending on what is being researched.

Not only do the general tastes of different generations change, but so does the kind of services they require. A lack of schools near a zone with a high percentage of pre-schoolers, for example, could indicate either a potential business niche for private child care or an area of troubled socio-economic status.


Age alone will not be sufficient to tell these apart, of course, but knowing the prevalent segment can help, for example, in choosing the tone of advertisement in the area.
In our previous article we created a list of census variables and gave a first classification (the variable names are in Italian: "Popolazione residente - età 5 - 9 anni" means "resident population, age 5 to 9 years"):

- Popolazione residente - età < 5 anni: pre-schoolers
- Popolazione residente - età 5 - 9 anni: children
- Popolazione residente - età 10 - 14 anni: children
- Popolazione residente - età 15 - 19 anni: high school students
- Popolazione residente - età 20 - 24 anni: university students and young workers
- Popolazione residente - età 25 - 29 anni: university students and young workers
- Popolazione residente - età 30 - 34 anni: working age
- Popolazione residente - età 35 - 39 anni: working age
- Popolazione residente - età 40 - 44 anni: working age
- Popolazione residente - età 45 - 49 anni: working age
- Popolazione residente - età 50 - 54 anni: working age
- Popolazione residente - età 55 - 59 anni: late career
- Popolazione residente - età 60 - 64 anni: late career
- Popolazione residente - età 65 - 69 anni: retirement age
- Popolazione residente - età 70 - 74 anni: retirement age
- Popolazione residente - età > 74 anni: older retirement
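The classification above could be expressed as a simple lookup table. This is only a minimal sketch: the band labels are copied from the list, the real column names in the census file may differ, and `segment_totals` and the example area are hypothetical names and numbers invented for illustration.

```python
from collections import defaultdict

PREFIX = "Popolazione residente - età "

# Age-band variables mapped to the macro-segments listed above.
SEGMENT_BY_BAND = {
    PREFIX + "< 5 anni": "pre-schoolers",
    PREFIX + "5 - 9 anni": "children",
    PREFIX + "10 - 14 anni": "children",
    PREFIX + "15 - 19 anni": "high school students",
    PREFIX + "20 - 24 anni": "university students and young workers",
    PREFIX + "25 - 29 anni": "university students and young workers",
    PREFIX + "30 - 34 anni": "working age",
    PREFIX + "35 - 39 anni": "working age",
    PREFIX + "40 - 44 anni": "working age",
    PREFIX + "45 - 49 anni": "working age",
    PREFIX + "50 - 54 anni": "working age",
    PREFIX + "55 - 59 anni": "late career",
    PREFIX + "60 - 64 anni": "late career",
    PREFIX + "65 - 69 anni": "retirement age",
    PREFIX + "70 - 74 anni": "retirement age",
    PREFIX + "> 74 anni": "older retirement",
}


def segment_totals(area):
    """Sum the age-band counts of one area into macro-segment counts."""
    totals = defaultdict(int)
    for band, segment in SEGMENT_BY_BAND.items():
        # Bands missing from the area are treated as zero residents.
        totals[segment] += area.get(band, 0)
    return dict(totals)


# Hypothetical area, invented for illustration only.
example_area = {
    PREFIX + "< 5 anni": 4,
    PREFIX + "5 - 9 anni": 5,
    PREFIX + "10 - 14 anni": 7,
    PREFIX + "30 - 34 anni": 6,
    PREFIX + "35 - 39 anni": 3,
    PREFIX + "> 74 anni": 2,
}

for segment, count in segment_totals(example_area).items():
    print(f"{segment}: {count}")
```

Keeping the mapping in one place makes it easy to try a different grouping later without touching the rest of the analysis.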

We could also assign each class a median age (roughly the midpoint of its range) in order to calculate a rough average age for each area:


- Popolazione residente - età < 5 anni: 3
- Popolazione residente - età 5 - 9 anni: 7
- Popolazione residente - età 10 - 14 anni: 12
- Popolazione residente - età 15 - 19 anni: 17
- Popolazione residente - età 20 - 24 anni: 22
- Popolazione residente - età 25 - 29 anni: 27
- Popolazione residente - età 30 - 34 anni: 32
- Popolazione residente - età 35 - 39 anni: 37
- Popolazione residente - età 40 - 44 anni: 42
- Popolazione residente - età 45 - 49 anni: 47
- Popolazione residente - età 50 - 54 anni: 52
- Popolazione residente - età 55 - 59 anni: 57
- Popolazione residente - età 60 - 64 anni: 62
- Popolazione residente - età 65 - 69 anni: 67
- Popolazione residente - età 70 - 74 anni: 72
- Popolazione residente - età > 74 anni: 78 (median with an average life expectancy of 82 in 2011)
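A rough average built from these values is just a weighted mean: each class median is weighted by the number of residents in that class. The sketch below assumes each area is stored as a list of 16 counts in the same order as the list above; the area names and counts are made up for illustration and `average_age` is a hypothetical helper.

```python
# Median age assigned to each class, in the same order as the list above.
CLASS_MEDIANS = [3, 7, 12, 17, 22, 27, 32, 37, 42, 47, 52, 57, 62, 67, 72, 78]


def average_age(counts):
    """Weighted mean of the class medians; None for an empty area."""
    total = sum(counts)
    if total == 0:
        return None
    weighted = sum(age * n for age, n in zip(CLASS_MEDIANS, counts))
    return weighted / total


# Hypothetical areas: 16 counts each, one per age class.
areas = {
    "area_A": [0] * 15 + [10],             # only residents over 74
    "area_B": [2, 0, 0, 0, 0, 0, 2] + [0] * 9,  # two under 5, two aged 30 - 34
    "area_C": [0] * 16,                    # nobody registered
}

# Drop empty areas, then sort from youngest to oldest.
averages = {name: average_age(c) for name, c in areas.items()}
ranked = sorted(
    (name for name, avg in averages.items() if avg is not None),
    key=lambda name: averages[name],
)

print("youngest first:", ranked[:10])
print("oldest first:", ranked[::-1][:10])
```

Empty areas return `None` rather than zero so they cannot sneak into the "youngest" ranking by accident.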

Let's use these values and see what the top 10 oldest and youngest zones would be...

The zones with the oldest population tend to be almost deserted and are usually in rural areas

The oldest 10 zones don't look exactly like thriving areas for marketing:


What about the youngest?


Hmm, either we found the real Neverland or we got more dirty data!
A more rational explanation would be that adults do live there but aren't registered as residents. Either way, this still means we cannot measure their age.

Unlike previous datasets we need to rely on common sense rather than averages and medians.
In this case, what we want is to filter for areas that have at least one person over 15 years of age, which would still be a pretty extreme case (teen parent living alone with child) but at least explainable in some way.
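Since the census only gives five-year bands, the closest available test is "at least one resident in the 15 - 19 band or any older band". A minimal sketch of that filter follows, assuming each area is a list of 16 counts ordered from the youngest band to the oldest; the areas themselves are invented.

```python
# Index of the 15 - 19 band when bands are ordered youngest to oldest
# (< 5, 5 - 9, 10 - 14, 15 - 19, ...).
FIRST_BAND_15_PLUS = 3


def has_resident_15_plus(counts):
    """True if at least one resident falls in the 15 - 19 band or above."""
    return sum(counts[FIRST_BAND_15_PLUS:]) >= 1


# Hypothetical areas: 16 counts each, one per age band.
areas = {
    "only_children": [3, 2, 1] + [0] * 13,
    "teen_parent": [1, 0, 0, 1] + [0] * 12,
    "family": [1, 1, 0, 0, 0, 0, 2] + [0] * 9,
}

kept = {name: c for name, c in areas.items() if has_resident_15_plus(c)}
print(sorted(kept))
```

Note that this keeps areas whose oldest resident is exactly 15, which is slightly more permissive than "over 15" but is as precise as the bands allow.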

Let's give it a try:

Still extremely young on average, but now we know there is at least one resident over 15 in each area.
Let's take a look at the youngest zone:
Areas in development during the census may not give a realistic picture
Google Maps photos are more recent than the census, and the pictured area looks like building work is still going on. It is possible that this is a fairly recent redevelopment and that people had only just started moving in at the time of the census.

That said, it looks like both the upper and lower ranges are made up of sparsely populated areas, which lend themselves to extremes. Time to get back to statistics to find a good lower cut-off:


As you can see, even after excluding areas without at least one resident over 15 years of age, we still have a lot of areas with very few people in them.
The most populated area of our top 10 youngest hovers just above the first decile, so the first decile might be a good new lower cut-off, especially considering that population doesn't grow much in the bottom quarter of the distribution anyway:
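A decile-based cut-off of this kind could be computed with Python's standard `statistics.quantiles`. This is a sketch under assumptions: the population figures are invented, and keeping areas at or above the first decile is one reasonable reading of the cut-off described here, not necessarily the exact rule used for the tables.

```python
import statistics

# Hypothetical total population per area, invented for illustration.
population = {
    "a": 2, "b": 5, "c": 8, "d": 12, "e": 40,
    "f": 90, "g": 150, "h": 210, "i": 300, "j": 450,
}

# quantiles(..., n=10) returns the 9 cut points between deciles;
# the first element is the first decile.
first_decile = statistics.quantiles(population.values(), n=10)[0]

# Keep only the areas at or above the cut-off.
kept = {name: p for name, p in population.items() if p >= first_decile}

print("first decile:", first_decile)
print("areas kept:", sorted(kept))
```

On real data it is worth printing all nine cut points to see how slowly population grows across the lower deciles before settling on a threshold.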

Still very young, and the youngest is pretty extreme. Let's take a look at it:
This is another good example of what happens when focusing on a single metric: non-residential areas might be populated!
 
Not good, this is a hospital! Looks like we need to filter again, in this case making sure there is at least one residential unit with residents in it:
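Such a filter is a one-line condition once the right census variable is identified. In the sketch below the field name `occupied_dwellings` is a hypothetical stand-in for whichever variable counts residential units occupied by residents, and the three areas are invented.

```python
# Hypothetical areas; "occupied_dwellings" is an assumed field name standing
# in for the census variable that counts residential units with residents.
areas = [
    {"id": "hospital", "population": 40, "occupied_dwellings": 0},
    {"id": "suburb", "population": 120, "occupied_dwellings": 45},
    {"id": "farmhouse", "population": 11, "occupied_dwellings": 1},
]


def is_residential(area):
    """True if the area has at least one residential unit with residents."""
    return area["occupied_dwellings"] >= 1


residential = [area for area in areas if is_residential(area)]
print([area["id"] for area in residential])
```

Chaining this with the earlier filters keeps each rule small and easy to justify on its own.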

The youngest area now has an average age of 16 (based on the class median ages). Let's see what this area looks like:
Urban area, buildings tend to have an old and not well-maintained look
What about the oldest ones, given that we changed criteria?
 

A completely new distribution, since we excluded zones with single-digit populations and areas without residential units. But will our oldest area be significantly different?
Even using cleaned data, the oldest areas tend to be rural.


A little more populated, but not extremely different.

Looking at the two extremes of the age range gives us a picture of very different lives, but is it enough to give a clear picture of the people living there?

Using just the last two pictures we might simply confirm the "old and rural vs. young and urban" stereotype, but is it always true?
Most populated youngest top 10 area
VS
Most populated oldest top 10 area
Both are still in the top 10, yet you can see a very young area in a less urban setting (not as rural as the oldest one but definitely a low density area) and a very old area in an urban setting, although this is a special case: a retirement home.

Stereotypes may hold a speck of truth but, as we just demonstrated, we should not let extreme cases guide our perception.

We have now brought special cases down to an acceptable level, so let's try the macro-classification we did earlier.
Let's see the top 10 areas for each segment: 

The top 10 for pre-schoolers shows an average age skewed towards the young, though not as much as in our earlier distribution, and there are no areas where this class is dominant.
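A per-segment ranking like this could be sketched as follows. Two assumptions are made here and are not confirmed by the tables: areas are ranked by the segment's share of the total population rather than by raw counts, and a segment is called "dominant" when it accounts for at least half of the residents. All names and figures are hypothetical.

```python
def segment_share(segments, name):
    """Fraction of an area's residents that belong to the given segment."""
    total = sum(segments.values())
    return segments.get(name, 0) / total if total else 0.0


def top_areas(areas, name, n=10):
    """Area ids ordered by descending share of the given segment."""
    ranked = sorted(
        areas,
        key=lambda area_id: segment_share(areas[area_id], name),
        reverse=True,
    )
    return ranked[:n]


def is_dominant(segments, name, threshold=0.5):
    """Assumed definition: the segment holds at least `threshold` of residents."""
    return segment_share(segments, name) >= threshold


# Hypothetical areas with residents already grouped by macro-segment.
areas = {
    "area_1": {"pre-schoolers": 3, "children": 12, "working age": 5},
    "area_2": {"children": 2, "working age": 18},
    "area_3": {"children": 5, "working age": 5, "retirement age": 10},
    "area_4": {},
}

for area_id in top_areas(areas, "children", n=3):
    share = segment_share(areas[area_id], "children")
    flag = "dominant" if is_dominant(areas[area_id], "children") else ""
    print(f"{area_id}: {share:.0%} {flag}")
```

Ranking by share rather than by count is what makes small special-purpose areas, such as schools or student residences, float to the top.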

The top 10 for children is more varied. Average age is still low, but most entries show population in more age segments. In all areas at least half of the population belongs to this age segment, which might hide some more special cases like this:
An elementary school. Apparently there also was a residential building with 12 people there.



The top 10 for the high school segment shows something interesting, almost like a generational gap: these areas tend to have more population in older segments but much less in younger ones. Some areas are also dominated by this segment. Let's see the top one:
A student residence. Apparently there are 2 residential buildings but only 1 residential unit!
Looks like areas where minors are a majority also tend to be special cases.

Let's see if that is the same for the next segment:

Even more concentrated, let's see the top entry:
Another student residence
Italian universities usually don't look like colleges in the USA, but there are still facilities dedicated to housing students, and those are also exceptional cases.

What about after university?

Working age is the class covering the widest age range. Despite this, it looks like there is still very high concentration at the top:
Some sort of tourist village in the mountains?

Late career shows a bit less of the issue but still has dominant spots:
A rural area, often associated with an older median age.

What about the last 2 classes?


Retirement age still shows high concentration and again rural takes the top spot:
Retired people who are still in relatively good shape tend to live more in the countryside.




The older retirement segment seems to be more evenly split between rural and urban areas, with the higher urban concentrations likely being retirement homes:
A retirement home, in this case a church-managed building
Even when using age ranges, it looks like most extremes lead to either low population density areas (rural settings for older ages) or areas that tend to be the exception rather than the norm (student houses and tourist facilities).

While there might still be some inaccuracies in the data, our basic filtering led us to find a mix of conditions that are not erroneous yet represent exceptions. These might be useful as warnings and might warrant separate labelling depending on the use case.

This concludes our article on age segmentation. Stay tuned for our next chapter on educational levels!
