Every Data Science project starts with Data Preparation.
While it may be considered a fairly mundane task, it is instead a fundamental step to ensure proper and coherent results and avoid inconsistent performance in your future data modeling.
As a first example, we will start with a very user-friendly data set, the 2011 Italian Census: ISTAT Open Data.
This is a solid data source, with data that has been collected and verified by statisticians, formatted as text tables. It is also a great place to start for Data Science beginners as it uses a great variety of structured data about something that can be easily related with: people and geography.
In this post we will use an extremely user-friendly tool, able to cover most common data science use cases: Alteryx Designer.
With a 14 day trial available, it should be enough start familiarizing with the product and make a small project with it, make sure you install the R predictive package as well.
As first step, we'll download the "sezioni di censimento", which are the smallest aggregation available. As you can notice, they are both divided per region and most of the header file is not in a meaningful format, but there is a separate "tracciato_2011_sezioni" which actually has the header descriptions.
Once done, we should have a list of files similar to this:
Time to open Alteryx and start data wrangling!
First of all, let's drag and drop the "tracciato_2011_sezioni" file in Alteryx.
To be properly read, you will need to set the column separator, field length and charset:
This file will give you a list of field names in the first colum with their description in the second one.
Let's now drag the "R01_indicatori_2011_sezioni" as well and set the separator to ";".
It will load the first Italian region along with all the fields, but what about the other 19?
Alteryx has an useful feature that allows you to define wildcards in its input in order to load multiple files as a single input. By using "R*_indicatori_2011_sezioni" you will be able to load the entire territory with a single input action. Be careful! This works only when files have the exact same field format and names, otherwise it will load only the first one or the ones that match the first.
If you try running the flow with CTRL+R you can verify the loading and by clicking on the green triangle next to the input file you should be able to see a preview:
As you can see, lots of colums, but not very user friendly names. Wouldn't it be great to use the definitions from the other file in a simple way?
The Crosstab tool allows you to dinamically transpose rows into columns, giving us the possibility of having the descriptions neatly arranged and ready for merging:
What we need now is to actually make it the header of our files, how to do it?
For this basic article, we'll use a quick and dirty way, using the Union tool:
Using the "Auto Config By Name" option, the Union tool will match the columns in the correct order, while we'll make sure the new header rows will be on top by setting the specific output order.
To finalize this, we'll also save the output as a CSV file by using the output tool, creating a single database.
This of couse is not completely correct, as we still have the old headers, followed by the ones we want.
How to fix this?
Let's start by importing the file we just created, this time using the 2nd row as field names:
This will finally give us the complete database with descriptively named fields, although the column order is not exactly neatly adjusted. All variables are also strings, while it's clear that most of them should be integers instead.
In theory, a good data scientist should analyize each single column and decide what is the most appropriate data type, however as we have so many of them we'll use a new nifty time-saver Alteryx feature.
First of all, in the "Preparation" tab you will find a tool named "Auto Field". It does exactly what it says on the tin: by looking at the data in each columns it will try to assign the most efficent field type.
Let's use it with the default settings and do the final touches using another tool from the Preparation tab, called "Select".
The Select tool is something that is going to be used very often as it allows to do some extremely useful things:
- Rename fields
- Change field type
- Change field order
- Give the possibility to "comment" a field, useful when some data needs further explanation
As the source file was clean to start with and we let Alteryx choose the field types for us not much is needed, so what I suggest in this case is to identify the various Key fields, rename them in a more concise way and move them in a more useful position:
Let's wrap this tutorial up by saving the new file as Alteryx database using the output tool and running the workflow.
We now have a clean and typized database instead of a mix of poorly labeled plain CSV files, ready to be worked on, and in our case perfect as enrichment base for our next tutorial!
No comments:
Post a Comment