Sunday, 10 February 2019

Digging deeper in the census data part 3: Determining Education Levels

Hello and welcome to Open Citizen Data Science!

In this article we will look into the census variables related to educational levels.
Education is closely tied to the demographic segments we covered in our previous article and is an important variable in defining them, especially in Italy, where the share of university graduates is lower than the EU average.





Let's do a quick recap of the available educational variables:

- Popolazione residente con laurea vecchio e nuovo ordinamento + diplomi universitari + diplomi terziari di tipo non universitario vecchio e nuovo ordinamento: university educated
- Popolazione residente con diploma di scuola secondaria superiore (maturità + qualifica): high school
- Popolazione residente con media inferiore: middle school
- Popolazione residente con licenza elementare: elementary school
- Popolazione residente - alfabeti: no title - can read and write
- Popolazione residente - analfabeti: illiterate

How is this related to age groups?


- Popolazione residente con laurea vecchio e nuovo ordinamento + diplomi universitari + diplomi terziari di tipo non universitario vecchio e nuovo ordinamento:  either over 22 or over 24 years old
- Popolazione residente con diploma di scuola secondaria superiore (maturità + qualifica): generally over 19 years old, in some cases over 17
- Popolazione residente con media inferiore: over 14 years old
- Popolazione residente con licenza elementare: over 11 years old
- Popolazione residente - alfabeti: could be assumed over 5 years old
- Popolazione residente - analfabeti: illiterate

Let's see how an unadjusted distribution would work:


Educational levels look pretty dire this way, but is it a fair way to look at them, given that a sizeable share of the population is too young to hold some of these qualifications?
Let's try again by computing each percentage against the population that is actually old enough to hold that qualification:
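To make the difference between the two views concrete, here is a minimal sketch of the calculation. The field names (`residents`, `residents_over_24`, `university`) and the numbers are invented for illustration and are not taken from the census files; the only point is that the denominator changes from the whole population to the eligible one.

```python
def pct(part, whole):
    """Percentage of `part` over `whole`; None when the denominator is zero."""
    return None if whole == 0 else 100.0 * part / whole

# Toy census area with invented numbers (illustration only).
area = {
    "residents": 1000,          # total resident population
    "residents_over_24": 700,   # hypothetical eligible population for a degree
    "university": 140,          # residents holding a university degree
}

# Unadjusted: graduates over everyone, children included.
unadjusted = pct(area["university"], area["residents"])

# Adjusted: graduates over residents old enough to have graduated.
adjusted = pct(area["university"], area["residents_over_24"])

print(f"unadjusted: {unadjusted:.1f}%  adjusted: {adjusted:.1f}%")
```

The same function can be reused for every education level by swapping in the age cut-off listed above for that level.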



Still not exactly stellar, but a bit better. Let's see how this behaves at the local level:


As educational levels are closely tied to age, we will reuse some filters we applied previously: we will drop areas with fewer than 10 people, as well as areas with no habitative units occupied by residents.

Not residential

The most educated area in Italy appears to be a university. That is not an error, but it is hardly what you would call a residential or typical area.
Can we find a way to filter out this kind of anomaly? Let's see how the population is distributed:


1 habitative unit with 18 residents, and more residential buildings than habitative units with residents.
This last factor might indicate a place with many non-residents compared to residents. So let's require that the number of habitative units with at least one resident be >= the number of residential buildings in the area. We'll then pick the top area by % of university degrees over the eligible population, using % of university degrees over the total population as a tie-breaker:
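The filter and the two-level ranking can be sketched as follows. This is only an illustration under assumed field names (`occupied_units`, `residential_buildings`, `eligible`, `university`) and invented toy records, not the code used to produce the results in this article.

```python
def rank_most_educated(areas):
    """Filter out likely non-residential areas, then rank by education."""
    candidates = [
        a for a in areas
        if a["residents"] >= 10                               # at least 10 people
        and a["occupied_units"] >= 1                          # some unit has residents
        and a["occupied_units"] >= a["residential_buildings"]  # the new rule
        and a["eligible"] > 0                                 # avoid dividing by zero
    ]
    # Primary key: degrees over eligible population.
    # Secondary key (tie-breaker): degrees over total population.
    return sorted(
        candidates,
        key=lambda a: (a["university"] / a["eligible"],
                       a["university"] / a["residents"]),
        reverse=True,
    )

# Invented toy records (illustration only).
areas = [
    {"id": "campus", "residents": 18, "occupied_units": 1,
     "residential_buildings": 3, "eligible": 18, "university": 18},
    {"id": "A", "residents": 40, "occupied_units": 20,
     "residential_buildings": 5, "eligible": 30, "university": 30},
    {"id": "B", "residents": 50, "occupied_units": 25,
     "residential_buildings": 6, "eligible": 45, "university": 45},
    {"id": "C", "residents": 100, "occupied_units": 40,
     "residential_buildings": 10, "eligible": 80, "university": 20},
]

ranked = rank_most_educated(areas)
print([a["id"] for a in ranked])
```

In the toy data, the campus-like record is dropped by the buildings rule, while areas A and B tie at 100% of the eligible population and are separated by their share over the total population.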

Definitely residential

We hit the center of Pavia, not far from its university. 100% of the residents old enough to hold a university degree have one, which amounts to over 90% of the total population.

What about the opposite, the area with the lowest education level?
Let's try the same approach as before, but with the illiterate population:

Again, not residential. Religious structures so far are proving to be a major data distortion.

A religious structure. A quick Google search shows that it's apparently a school.
With 75% illiterate adults and no resident children, it's another anomaly, which we should find a way to filter out:


Looks like we have the same problem as in a few articles ago, when we tried to find highly populated areas: a lot of people in a single habitative unit, which should be equivalent to an apartment.
This seems to be pretty common in religious structures, so let's try to find a more reasonable threshold in our dataset:


It looks like this many people in a single habitative unit is almost double the 999/1000 case (the 99.9th percentile), so it is definitely an outlier. Let's see what happens when we set our upper threshold there:
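One way to derive such a threshold is sketched below. The article does not say which quantile method was used, so the nearest-rank definition here is an assumption, and the ratios are invented toy values standing in for residents per habitative unit.

```python
def nearest_rank_quantile(values, num, den):
    """Nearest-rank quantile num/den: the smallest value such that at least
    num/den of the data is at or below it (integer arithmetic only)."""
    ordered = sorted(values)
    rank = max(1, -(-num * len(ordered) // den))  # ceil(num * n / den)
    return ordered[rank - 1]

# Invented toy data: residents per habitative unit for 1000 areas,
# 999 ordinary ones plus a single extreme case.
ratios = [1 + (i % 5) for i in range(999)] + [75]

# The "999/1000 case": value below which 99.9% of the areas fall.
threshold = nearest_rank_quantile(ratios, 999, 1000)

# Keep only the areas at or below the threshold.
kept = [r for r in ratios if r <= threshold]

print(threshold, len(kept))
```

Passing the quantile as an integer fraction avoids floating-point surprises when computing the rank, which matters for cut-offs this close to the top of the distribution.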

A small village in the mountains, definitely a rural area.
This looks valid, although areas with low education seem to be pretty varied.

Looking at all those outliers, one might question the data. While the data is legitimate, I'd say this shows the limits of traditional data collection.
A Census is a process with a long, established tradition and in a way it's the ancestor of Big Data.

Gathering information about literally millions of people through interviews and a multitude of data sources is a very complex task carried out by professionals, yet in many ways the result is still pretty much raw data.

There are 402678 census areas in our starting dataset, 366863 of which have some kind of data about families. After all our filtering we narrowed them down to 254677 areas: 63% of the total census areas and 69% of the areas with family data.

Looking at it in terms of pure area, we narrowed down from over 300000 square kilometers (of which 75% has data about families) to less than 100000, or less than a third of the starting area.
While this may sound like a major loss of information, in reality it's a lot less dramatic:


Most data loss is concentrated in sparsely populated, mountainous or mainly touristic areas with few residents, while everywhere else it's mostly small "holes" surrounded by areas with information, meaning that one could choose to "average the area" or other solutions to close the gap if needed.

In terms of population, we go from 59433744 people to 53505596, roughly a 10% loss mostly caused by areas where non-residents are prevalent.
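As a quick sanity check, the shares quoted above can be recomputed from the raw counts given in this article. The variable names are just labels chosen for this snippet.

```python
# Counts quoted in the article.
areas_total = 402678
areas_with_family_data = 366863
areas_kept = 254677
pop_before = 59433744
pop_after = 53505596

def pct(part, whole):
    """Percentage of `part` over `whole`."""
    return 100.0 * part / whole

share_of_all_areas = pct(areas_kept, areas_total)
share_of_family_areas = pct(areas_kept, areas_with_family_data)
population_loss = pct(pop_before - pop_after, pop_before)

print(f"kept areas over all areas:        {share_of_all_areas:.1f}%")
print(f"kept areas over family-data ones: {share_of_family_areas:.1f}%")
print(f"population lost by filtering:     {population_loss:.1f}%")
```

Rounded to whole percentages, these match the 63%, 69% and roughly 10% figures in the text.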

This concludes our exploration of educational levels. Stay tuned for our next article!
