Data preprocessing (also known as Data Preparation, but "Preprocessing" sounds more like magic) is the iterative process of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics, and machine learning applications.
Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
It's a core skill for any data scientist or data engineer: you must be able to manipulate, clean, and structure your data as part of your everyday work (and you should expect this to take up most of your working time!).
There are a lot of different data types out there, and they deserve different treatments.
As usual, the structure I've planned to get you started consists of a general overview, followed by a deep dive into each data processing situation you can encounter.
Here you have a gentle end-to-end panoramic view of the entire process.
The concepts we're going through are the following:
- Don't Joke With Data
- Business Questions
- Data Profiling
- Who To Leave Behind
- Start Small
- The Toolkit
- Data Cleaning
- Get Rid of Extra Spaces
- Select and Treat All Blank Cells
- Convert Values Type
- Remove Duplicates
- Change Text to Lower/Upper Case
- Spell Check
- Dealing with Special Characters
- Normalizing Dates
- Verification To Enrich Data
- Data Discretization
- Feature Scaling
- Data Cleaning Tools
- Merge Data Sets and Integration
- Sanity Check
- Automate These Boring Stuffs!
Let's Start!
First, data is King. In the data-driven epoch, having data quality issues means losing tremendous amounts of value for a company, in the present and in the future. So, respect your King and care a lot about him. The most immediate way to do this is to plan and work hard to produce good quality data. Your goal is to plan a data collection infrastructure that fixes problems beforehand. This means caring a lot about planning your database schemas well (do I need third normal form or not?), about how you collect data from sensors (physical or conceptual), and so on. These are the problems you face if you're building a system from the ground up, but most of the time you're going to face real-world problems that someone wants to solve with already available data.
Asking the right business questions is hard, but it has the biggest impact on how well you solve a particular problem. Remember, you want to solve a problem, not to create new ones!
According to the (cold as ice) Wikipedia definition: "Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics and informative data summaries."
So Wikipedia is subtly suggesting that we have a coffee with the data.
During this informal meeting, ask the data questions like:
- which business problem are you meant to solve? (what is important, and what is not)
- how were you collected? (with noise, missing values...)
- how many friends of yours are there and where can I find them? (data dimensions and retrieving from storages)
Eventually, you may find the data too quiet; maybe they're just shy!
Anyway, you're going to ask these questions to the business user!
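To make the coffee chat concrete, here is a minimal profiling sketch with Pandas. The toy dataset and its column names are made up for illustration; on real data you would load your own source instead.

```python
import pandas as pd

# Hypothetical toy dataset standing in for your real source.
df = pd.DataFrame({
    "customer": ["Ann", "Bob", None, "Dan"],
    "age": [34, None, 29, 41],
    "city": ["Rome", "Milan", "Rome", "Turin"],
})

# How many friends are there? (rows and columns)
n_rows, n_cols = df.shape

# How were you collected? Count the missing values per column.
missing_per_column = df.isna().sum()

# Summary statistics for every column, numeric or not.
summary = df.describe(include="all")

# How many distinct values does a categorical column take?
unique_cities = df["city"].nunique()

print(n_rows, n_cols)
print(missing_per_column)
print(summary)
```

These few calls will not answer the business question for you, but they usually tell you which questions to bring to the business user.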
Best practices and exercises: 1, 2
During the data profiling process, it's common to realize that some of your data are useless. Your data may have too much noise or be partial, and most likely you don't need all of them to answer your business questions. To drop or not to drop, that is the dilemma. Each time you're facing a data-related problem, try to understand what data you need and what you don't. That is, for each piece of information, ask yourself (and ask the business user):
- How is this data going to help me?
- Is it possible to use them after reducing noise or handling missing values?
- Considering the benefits and costs of the preparation process versus the business value created, is this data worth it?
It's stupid to handle GBs of data each time you want to try a data preparation step. Just use small subsets of the data (but take care that they are representative and you catch all the problems). Remember, if you want to experiment with string cleaning, you don't need to launch your script on 10M rows.
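A minimal sketch of how you might carve out a small working subset with Pandas. The dataset below is synthetic and the `segment` column is a made-up example of a grouping you want to keep represented.

```python
import pandas as pd

# Hypothetical dataset: 10,000 rows with a categorical column.
df = pd.DataFrame({
    "id": range(10_000),
    "segment": ["a", "b", "c", "d"] * 2_500,
})

# Random 1% sample; random_state makes the experiment repeatable.
small = df.sample(frac=0.01, random_state=42)

# Stratified sample: 1% from each segment, so no group is lost by chance.
stratified = df.groupby("segment").sample(frac=0.01, random_state=42)

print(len(small), len(stratified))
```

The stratified version is one way to keep the subset representative; whether it catches all the problems still depends on your data.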
The tools we're going to use are Python 3 and its Pandas library, the de facto standard for manipulating datasets. The heavy lifting here is done by the DataFrame class, which comes with a bunch of useful functions for your daily data tasks. Hopefully, you already know Python; if not, start from there (follow the steps I suggest in the ML guide requirements), and then take this Beginner Pandas tutorial. Don't worry if some ideas are not totally clear yet, but try to get the big picture of the common Pandas operations.
Best practices and exercises: 1, 2, 3, 4, 5
Data cleaning is the general process of taking your data, once you have a clear big picture of them, and carrying out the actual work of replacing characters, dropping incomplete rows, filling missing values and so forth. In the next sections, we'll explore all the common data cleaning situations.
One of the first things you want to do is remove extra spaces. Take care! Some spaces can carry information, but it heavily depends on the situation. For example, in "Complete Name": "Giacomo Ciarlini" it's nice to have the space so we can later split this into "Name": "Giacomo" and "Surname": "Ciarlini". I want you to notice that in general, apart from recommender and personalization systems, unique identifiers like names or IDs are something you can generally drop. Often, they do not carry information. Bonus tip: learn how to use regex for pattern matching; it's one of the most powerful tools every data person needs to master.
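Here is a small sketch of trimming and collapsing spaces with the Pandas string methods and a regex, then using the one space that does carry information. The sample names are only illustrative.

```python
import pandas as pd

names = pd.Series(["  Giacomo   Ciarlini ", "Ada  Lovelace", " Alan Turing"])

# Trim leading/trailing whitespace, then collapse inner runs of
# whitespace into a single space.
clean = names.str.strip().str.replace(r"\s+", " ", regex=True)

# The remaining space is informative: split on the first one only.
parts = clean.str.split(" ", n=1, expand=True)
parts.columns = ["Name", "Surname"]

print(clean.tolist())
print(parts)
```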
Best practices and exercises: 1, 2, 3
Bonus Resource: A super useful tool for visualizing regular expressions and their effect on the text.
Real-world data is often incomplete, and it's necessary to handle this situation. These are two ways of dealing with it. Here you have a more in-depth tutorial.
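As a rough illustration, two common options in Pandas are dropping incomplete rows and filling the gaps. The columns and the fill values below (median, an "unknown" label) are assumptions chosen for the example, not a universal recipe.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 30.0, None],
    "category": ["book", "toy", None, "toy"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill the gaps, here with the column median for numbers
# and a placeholder label for text.
filled = df.fillna({"price": df["price"].median(), "category": "unknown"})

print(dropped)
print(filled)
```

Dropping is simple but can throw away a lot of rows (three out of four here), while filling keeps the rows at the cost of inventing values, so the right choice depends on your business question.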
Best practices and exercises: 1, 2, 3, 4
Different data types carry different information, and you need to care about this. Here is a good tutorial on how to convert value types. Remember that Python has some shortcuts for doing this (executing str(3) will give you back the "3" string), but I recommend you learn how to do it with Pandas.
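A minimal sketch of type conversion in Pandas, assuming a made-up table where numbers arrived as text and one value is not parseable.

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": ["1", "2", "three", "4"],
    "price": ["9.99", "19.50", "5.00", "7.25"],
})

# astype works when every value converts cleanly.
df["price"] = df["price"].astype(float)

# to_numeric with errors="coerce" turns unparseable values into NaN
# instead of raising an exception.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")

print(df.dtypes)
print(df)
```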
You don't want duplicated data: they are both noise and a waste of space! Learn how to handle them simply with Pandas.
You want to capitalize names, or maybe make them uniform (some people enter data with capital letters, some without!). Check here for the Pandas way to do it.
You want to correct misspelled words, for the sake of consistency. Check here for a good Python module to do it. Also, this is a good starting point to implement it.
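A short sketch of finding and dropping duplicates in Pandas. The table is invented for the example, and deciding that `email` identifies a record is an assumption you would have to check on your own data.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
    "plan": ["free", "pro", "free", "pro"],
})

# Count rows that are exact copies of an earlier row.
n_exact = int(df.duplicated().sum())

# Drop exact duplicates (all columns equal).
deduped = df.drop_duplicates()

# Keep only the first row per email, whatever the other columns say.
one_per_email = df.drop_duplicates(subset="email", keep="first")

print(n_exact, len(deduped), len(one_per_email))
```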
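For example, the Pandas string accessor offers one method per casing style. The names here are only sample values.

```python
import pandas as pd

names = pd.Series(["giacomo CIARLINI", "ADA lovelace"])

lower = names.str.lower()   # everything lower case
upper = names.str.upper()   # everything upper case
title = names.str.title()   # first letter of each word capitalized

print(title.tolist())
```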
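If you already know the list of valid values, a very simple alternative to a full spell checker is fuzzy matching against that list. The sketch below uses `difflib` from the standard library, not the module linked above; the vocabulary, the helper name `correct` and the 0.75 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical vocabulary of values we consider correct.
VALID_CITIES = ["Rome", "Milan", "Turin", "Naples"]

def correct(value, vocabulary, cutoff=0.75):
    """Return the closest vocabulary entry, or the value itself if none is close enough."""
    matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else value

raw = ["Rome", "Milna", "Turinn", "Paris"]
fixed = [correct(v, VALID_CITIES) for v in raw]

print(fixed)
```

Values that are not close to anything in the vocabulary (like "Paris" here) are left untouched, so you can review them by hand instead of silently rewriting them.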
Best practices and exercises: 1, 2, 3
Maybe you're going to feed your data into a neural network or show them in a colorful bar plot. Either way, you need to transform your data and give them the right shape for your data pipeline. Here is a very good tutorial for this task.
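One common case of reshaping is moving between a wide table (one column per month) and a long table (one row per observation). This is only a sketch with invented store sales, assuming that this kind of reshaping is what your pipeline needs.

```python
import pandas as pd

# Wide format: one column per month.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [10, 20],
    "feb": [15, 25],
})

# Wide -> long: one row per (store, month) pair.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Long -> wide again, with stores as the index.
back = long.pivot(index="store", columns="month", values="sales")

print(long)
print(back)
```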
Best practices and exercises: 1, 2.
UTF-8 encoding is the standard to follow, but remember that not everyone follows the rules (otherwise, we wouldn't need crime predictive analytics). You can learn here how to deal with strange accents or special characters.
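A minimal sketch using only the standard library: decode the raw bytes as UTF-8, then optionally strip accents when you need plain ASCII keys. The helper name `strip_accents` is made up, and removing accents loses information, so only do it when the use case allows it.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks, e.g. turn 'é' into 'e'."""
    # NFKD splits an accented character into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Raw bytes as they might arrive from a file or an API.
raw = b"Caf\xc3\xa9 M\xc3\xbcnchen"

text = raw.decode("utf-8")        # 'Café München'
ascii_text = strip_accents(text)  # accents removed

print(text, "->", ascii_text)
```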
Best practices and exercises: 1, 2, 3
I think there could be one hundred ways to write down a date. You need to decide on your format and make dates uniform across your dataset, and here you learn how to do it.
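As a sketch of the idea, you can try a list of known input formats and rewrite everything as ISO `YYYY-MM-DD`. The formats in `KNOWN_FORMATS` and the helper `normalize_date` are assumptions for this example; in particular, `04/03/2021` is read here as day first, which you must confirm for your own data. Month names are parsed according to the current locale (English is assumed).

```python
from datetime import datetime

# Assumed input formats, tried in order.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def normalize_date(text):
    """Return the date as 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

raw = ["2021-03-04", "04/03/2021", "March 4, 2021", "not a date"]
normalized = [normalize_date(d) for d in raw]

print(normalized)
```

Returning `None` for values that match no format keeps the bad entries visible, so you can handle them like any other missing value.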
Best practices and exercises: 1, 2, 3
Sometimes it can be useful to engineer some data. For example, suppose you're dealing with e-commerce data, and you have the prices of each object sold. You may want to add a new column in your dataset, with a label carrying handy information like a Price_level [low, medium, high] based on upper and lower bounds you decide. This is really simple in Pandas, check here. Another example is to add a Gender column (M, F) to easily explore data and gain insights in a customer dataset.
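The Price_level idea can be sketched with `pd.cut`. The bounds below (20 and 100) are arbitrary values picked for the example; you decide the real ones.

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 25, 60, 150, 20]})

# Hypothetical bounds: up to 20 is low, up to 100 is medium, above is high.
bins = [0, 20, 100, float("inf")]
labels = ["low", "medium", "high"]

# Intervals are right-inclusive by default, so a price of exactly 20 is "low".
df["Price_level"] = pd.cut(df["price"], bins=bins, labels=labels)

print(df)
```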
Best practices and exercises: 1, 2, 3
Some Machine Learning and Data Analysis methods cannot handle continuous data directly, and for others dealing with it can be computationally prohibitive. Here you find a good video explaining why and how you need to discretize data.
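One simple way to discretize a continuous column is equal-frequency binning with `pd.qcut`, sketched below on invented ages. The number of bins and the labels are assumptions; other strategies (equal-width bins, domain-driven thresholds) may fit your case better.

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67])

# Equal-frequency binning: each of the 4 bins gets the same number of values.
age_bins = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(age_bins.tolist())
print(age_bins.value_counts().sort_index())
```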
Best practices and exercises: 1, 2, 3
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. Here you find a serious tutorial about this fundamental step.
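Two widely used scaling recipes, min-max scaling and standardization (z-score), can be sketched with plain Pandas arithmetic. The columns are invented, and the population standard deviation (`ddof=0`) is a choice made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [20_000.0, 40_000.0, 60_000.0, 80_000.0],
})

# Min-max scaling: every column ends up in the [0, 1] range.
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: every column gets mean 0 and standard deviation 1.
standardized = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(standardized)
```

After scaling, height and income live on comparable ranges, so neither dominates a distance-based or gradient-based method just because of its units.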
Best practices and exercises: 1, 2, 3
You're not going to hunt tigers without a rifle! There are a ton of tools out there that will help you during the data cleaning process; the one I want to suggest is this open source tool from Google. Check here for more.
Now that you have hopefully been successful in your data cleaning process, you can merge data from different sources to create big de-normalized data tables, ready to be explored and consumed. This is why.
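A minimal sketch of merging two sources into one de-normalized table with Pandas. The `customers` and `orders` tables and their columns are invented for the example.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 4],
    "amount": [9.5, 20.0, 5.0],
})

# Left join: keep every order, attach customer data where it exists.
# indicator=True adds a "_merge" column telling where each row came from.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)

# Orders whose customer is missing from the customers table.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched)
```

Checking the unmatched rows right after the merge is a cheap way to spot integration problems before they reach your analysis.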
Best practices and exercises: 1, 2, 3, 4, 5
You always want to be sure that your data are exactly how you want them to be, so it's a good rule of thumb to apply a sanity check after each complete iteration of the data preprocessing pipeline (i.e. each step we have seen until now). Look here for a good overview. Depending on your case, the sanity check can vary a lot.
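What a sanity check looks like depends on your case, but as a rough sketch it can be a function that returns the list of rules your data breaks. The rules, the column names and the helper `sanity_check` are all hypothetical.

```python
import pandas as pd

def sanity_check(df):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id")
    if df["amount"].isna().any():
        problems.append("missing amount")
    if (df["amount"] < 0).any():
        problems.append("negative amount")
    return problems

good = pd.DataFrame({"order_id": [1, 2, 3], "amount": [5.0, 7.5, 1.0]})
bad = pd.DataFrame({"order_id": [1, 1, 2], "amount": [5.0, None, -3.0]})

print(sanity_check(good))
print(sanity_check(bad))
```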
Best practices and exercises: 1, 2, 3
As I told you at the very beginning, data preprocessing can take a long time and be very tedious. Because of this, you want to automate as much as you can. Also, automation goes hand in hand with iteration, so this is the way you need to plan your data preprocessing pipelines. Here you find a good command line tool for doing that. I'm almost sure you'll need to build your own (remember, each problem is unique!), but this is a good starting point.
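If you end up building your own, one lightweight pattern is to write each cleaning step as a small function and chain them with `DataFrame.pipe`. The step functions and the `name` column below are made up; this is a sketch of the pattern, not a replacement for the tool mentioned above.

```python
import pandas as pd

def strip_spaces(df):
    out = df.copy()                      # never modify the caller's data
    out["name"] = out["name"].str.strip()
    return out

def lowercase_names(df):
    out = df.copy()
    out["name"] = out["name"].str.lower()
    return out

def drop_dupes(df):
    return df.drop_duplicates().reset_index(drop=True)

def preprocess(df):
    """Run every cleaning step in order; easy to rerun on each iteration."""
    return df.pipe(strip_spaces).pipe(lowercase_names).pipe(drop_dupes)

raw = pd.DataFrame({"name": [" Ann", "ann ", "Bob"]})
clean = preprocess(raw)

print(clean)
```

Because every step takes a DataFrame and returns a new one, you can reorder, remove or test the steps one at a time as your pipeline evolves.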
Best practices and exercises: 1, 2, 3, 4, 5
Now you're ready to take your data and play with them in a variety of ways, and you have a nice panoramic overview of the entire process. You can refer to this page when you clean data, to check that you're not missing any steps. Remember that each situation probably requires only a subset of these steps.