Data wrangling exercises covered in Udacity's Data Analyst Nanodegree (Module 4).
Data Wrangling is an essential skill for Data Science, since you cannot build advanced Machine Learning models on top of "messy data".
It can be divided into 3 main tasks:
- Gather: acquiring/collecting data and importing it into your programming environment. Examples: downloading a file, scraping a web page, querying an API, etc.
- Assess: evaluating data quality and tidiness, identifying what needs fixing.
- Quality: low-quality data = dirty data. Issues with content, such as missing, invalid (impossible values), or inconsistent (different units) data. Data should be clean enough to serve its purpose, so "clean" depends on what the data is going to be used for.
- Tidiness: untidy data = messy data. Issues with structure that should be addressed in order to facilitate analysis, where:
- Each variable forms a column;
- Each observation forms a row; and
- Each type of observational unit forms a table.
- Clean: actions taken, based on the previous assessment, to improve data quality and tidy the structure. This task should be broken down into three parts:
- Define: a clear action plan - in writing. This "Cleaning Plan" serves as an instruction list for reproducibility.
- Code: translate the action plan from words into executable, efficient code.
- Test: assert the cleaning operations performed as intended.
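The define-code-test cycle above can be sketched with pandas. This is a minimal, hypothetical example (the column name and unit fix are assumptions, not from any of the projects below) showing how a quality issue - inconsistent units - is defined in writing, fixed in code, and then verified:

```python
import pandas as pd

# Hypothetical messy data: heights recorded in mixed units (a quality issue)
df = pd.DataFrame({"height": [180, 1.75, 165, 1.68]})  # centimetres and metres mixed

# Define: convert values recorded in metres (anything < 3) to centimetres
# Code:
df.loc[df["height"] < 3, "height"] *= 100

# Test: assert all heights now fall in a plausible centimetre range
assert df["height"].between(100, 250).all()
```

Keeping the "Define" step as a written comment next to the code makes the cleaning plan reproducible and auditable.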
Note: not to be confused with Exploratory Data Analysis (EDA). In fact, Data Wrangling is all about getting everything ready so the dataset can then be explored through descriptive statistics and charts.
This repository contains exercises and small projects focusing on each of the main tasks of Data Wrangling.
1) Armenian Online Job Posting database
The [dataset](https://www.kaggle.com/udacity/armenian-online-job-postings) consists of 19,000 job postings from 2004 to 2015, with 24 columns full of string descriptions instead of simple categorical values.
2) Rotten Tomatoes: 100 best movies
This project focuses on Data Gathering, using `BeautifulSoup` to parse HTML files and extract the Critics and Audience ratings, and the `requests` library to access URLs and save data locally - both text and images (using `PIL.Image` and `io.BytesIO`) - storing text reviews from Roger Ebert's website and movie poster images from MediaWiki. Lastly, all datasets are merged to generate rating visualizations and a themed word cloud, built from the movie reviews and drawn over the poster image.
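The HTML-parsing step can be sketched as follows. The snippet below is a simplified illustration, not the project's actual code: the HTML structure and class names are invented stand-ins for whatever the saved Rotten Tomatoes pages actually contain, and the network download via `requests` is omitted so the example runs offline:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment resembling a saved movie page
html = """
<div id="scorePanel">
  <span class="critics-score">97%</span>
  <span class="audience-score">93%</span>
</div>
"""

# Parse the document and pull out both ratings by class name
soup = BeautifulSoup(html, "html.parser")
critics = soup.find("span", class_="critics-score").text
audience = soup.find("span", class_="audience-score").text
print(critics, audience)  # 97% 93%
```

In the real project the same pattern is applied to files fetched with `requests` and saved to disk, one per movie.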
3) Project 4: WeRateDogs Twitter
This project is part of the requirements to graduate from Udacity's Data Analyst Nanodegree (DAND). It provides the opportunity to put Data Wrangling into practice: gathering data from different sources, assessing it for quality and tidiness issues, and then performing the necessary cleaning - programmatically.
Finally, once the data is properly cleaned and stored as two `csv` files, a brief analysis is conducted with visualizations, highlighting interesting insights. The data for this project was provided in partnership with the WeRateDogs Twitter channel, containing over 2,300 observations about dogs.
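A typical programmatic cleaning step on tweet data of this kind can be sketched with pandas. The sample tweets and column names below are hypothetical (WeRateDogs ratings famously take the form "13/10", but the exact fields used in the project are not reproduced here); the example shows tidying a rating embedded in free text into proper numeric columns:

```python
import pandas as pd

# Hypothetical sample mimicking WeRateDogs tweet text
df = pd.DataFrame({"text": [
    "This is Doug. 13/10 would pet",
    "Meet Luna. 12/10 good girl",
]})

# Define: extract the "N/M" rating into separate numeric columns (tidiness fix)
# Code:
extracted = df["text"].str.extract(r"(\d+)/(\d+)").astype(int)
df["rating_numerator"] = extracted[0]
df["rating_denominator"] = extracted[1]

# Test: assert the extraction performed as intended
assert df["rating_denominator"].eq(10).all()
```

The same define/code/test rhythm described above carries through each cleaning operation before the cleaned data is written out.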