Data wrangling exercises covered in Udacity's Data Analyst Nanodegree (Module 4)

marcellovictorino/DAND_4_Data_Wrangling

Data Wrangling

Data wrangling exercises covered in Udacity's Data Analyst Nanodegree (Module 4).

Data Wrangling is an essential skill for Data Science, since advanced Machine Learning models cannot be reliably built on top of "messy data".

It can be divided into three main tasks:

  1. Gather: acquiring/collecting data and importing that data into your programming environment. Examples: downloading a file, scraping a web page, querying an API, etc.

  2. Assess: evaluating data quality and tidiness to identify what needs fixing.

    • Quality: low quality data = dirty data. Issues with content, such as missing, invalid (impossible values), or inconsistent data (different units). Data should be clean enough to serve its purpose, so the required level of quality depends on what it is going to be used for.
    • Tidiness: untidy data = messy data. Issues with structure that should be addressed in order to facilitate analysis. In tidy data:
      1. Each variable forms a column;
      2. Each observation forms a row; and
      3. Each type of observational unit forms a table.
  3. Clean: actions to be taken, according to the previous data assessment, to improve data quality and make the structure properly tidy. This task should be broken down into three parts:

    • Define: a clear action plan, in writing. This "Cleaning Plan" serves as an instruction list for reproducibility.
    • Code: translate the action plan from words into executable and efficient code.
    • Test: assert that the cleaning operations performed as intended.
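
As a rough illustration of the Assess and Clean tasks above, the sketch below works on a tiny made-up table. The records, column names, and the `assess` and `clean` helpers are hypothetical and are not taken from the exercises in this repository. Plain Python is used so the example runs without extra libraries; a library such as pandas would typically be used on real datasets.

```python
# Hypothetical raw records: one row per person, weight recorded in mixed units.
RAW = [
    {"name": "Ana", "weight": "70", "unit": "kg"},
    {"name": "Ben", "weight": "154", "unit": "lb"},
    {"name": "Cy", "weight": "", "unit": "kg"},     # missing value
    {"name": "Dee", "weight": "-5", "unit": "kg"},  # impossible value
]

LB_TO_KG = 0.45359237


def assess(rows):
    """Return (row index, issue) pairs describing quality problems."""
    issues = []
    for i, row in enumerate(rows):
        if row["weight"] == "":
            issues.append((i, "missing weight"))
        elif float(row["weight"]) <= 0:
            issues.append((i, "impossible weight"))
        elif row["unit"] != "kg":
            issues.append((i, "inconsistent unit"))
    return issues


def clean(rows):
    """Cleaning plan (Define):
    1. Drop rows whose weight is missing or not positive.
    2. Convert every remaining weight to kilograms, stored as a float.
    """
    cleaned = []
    for row in rows:
        if row["weight"] == "" or float(row["weight"]) <= 0:
            continue
        weight = float(row["weight"])
        if row["unit"] == "lb":
            weight *= LB_TO_KG
        cleaned.append({"name": row["name"], "weight_kg": round(weight, 1)})
    return cleaned


issues = assess(RAW)
cleaned = clean(RAW)

# Test: assert that the cleaning did what the plan says.
assert len(cleaned) == 2
assert all(row["weight_kg"] > 0 for row in cleaned)
print(issues)
print(cleaned)
```

Keeping the written plan in the docstring next to the code, and the assertions right after it, makes it easier to check that each step of the plan was actually carried out.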

Note: Data Wrangling is not to be confused with Exploratory Data Analysis (EDA). Data Wrangling gets everything ready so that the dataset can then be explored through descriptive statistics and charts.

This repository contains exercises and small projects focusing on each of the main tasks of Data Wrangling.

Project List

1) Armenian Online Job Posting database
The [dataset](https://www.kaggle.com/udacity/armenian-online-job-postings) consists of 19,000 job postings from 2004 to 2015, with 24 columns full of string descriptions instead of simple categorical values.
2) Rotten Tomatoes: 100 best movies
This project focuses on Data Gathering. It uses Beautiful Soup to parse HTML files and extract the Critics and Audience Ratings, and the Requests library to access URLs and save data locally, both text and images (using PIL.Image and io.BytesIO): text reviews from the Roger Ebert website and movie poster images from MediaWiki. Lastly, all datasets are merged to generate rating visualizations and a themed WordCloud built from each movie's review and drawn over its poster image.
3) Project 4: WeRateDogs Twitter
This project is a graduation requirement of Udacity's Data Analyst Nanodegree (DAND).
It provides the opportunity to put Data Wrangling into practice by gathering data from different sources, assessing it for quality and tidiness issues, and then performing the necessary cleaning tasks programmatically.
Finally, once the data is properly cleaned and stored as two CSV files, a brief analysis is conducted with visualizations, highlighting interesting insights.
The data for this project was provided in partnership with the WeRateDogs Twitter account and contains over 2,300 observations about dogs.
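
The HTML-parsing part of the gathering step described for the Rotten Tomatoes project can be sketched roughly as follows. This is not the project's code. The project uses Beautiful Soup, whereas this sketch swaps in Python's standard-library `html.parser` so that it runs without third-party packages. The HTML fragment, the class names, and the `ScoreParser` class are invented for illustration and do not reflect the real Rotten Tomatoes markup.

```python
from html.parser import HTMLParser

# Hypothetical saved page fragment; real pages use different markup.
HTML = """
<h1 class="title">Example Movie (1999)</h1>
<span class="critic-score">97%</span>
<span class="audience-score">88%</span>
"""


class ScoreParser(HTMLParser):
    """Collect the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class of the element being read, if wanted
        self.found = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls

    def handle_data(self, data):
        # Store the first non-blank text that follows a wanted start tag.
        if self.current is not None and data.strip():
            self.found[self.current] = data.strip()
            self.current = None


parser = ScoreParser(["title", "critic-score", "audience-score"])
parser.feed(HTML)

# Convert the percentage strings into integers so they can be analysed later.
record = {
    "title": parser.found["title"],
    "critic_score": int(parser.found["critic-score"].rstrip("%")),
    "audience_score": int(parser.found["audience-score"].rstrip("%")),
}
print(record)
```

One such record per movie could then be collected into a list and merged with the other gathered datasets.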
