The information in this GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). This folder contains general Exploratory Data Analysis (EDA) information as well as EDA case studies using Malaysian datasets. This case study was created by a Bachelor of Computer Science (Data Engineering), Universiti Teknologi Malaysia student.
Exploratory data analysis (EDA) involves using graphics and visualizations to explore and analyze a data set. The goal is to explore, investigate and learn, as opposed to confirming statistical hypotheses.
When do I use it? Exploratory data analysis is a powerful way to explore a data set. Even when your goal is to perform planned analyses, EDA can be used for data cleaning, for subgroup analyses or simply for understanding your data better. An important initial step in any data analysis is to plot the data.
- developers.google: Good Data Analysis
- Towardsdatascience: What is Exploratory Data Analysis?
- Wikipedia: Exploratory data analysis
- r4ds: Exploratory Data Analysis
- careerfoundry: What Is Exploratory Data Analysis?
- Intellipaat: Exploratory Data Analysis Tutorial | What Is EDA | How EDA Works | EDA In Python

* * * * *
| No | Title | Colab | GitHub |
|---|---|---|---|
| 1 | Introduction to Exploratory Data Analysis | | |
| 2 | Exploratory data analysis in Python | | |
| 3 | Housing Dataset | | |
| 4 | Exploring data and missing values | | |
Your submission will be evaluated using the following criteria:
- Dataset must contain at least 5 columns and 1500 rows of data
- You must ask and answer at least 5 questions about the dataset
- Your submission must include at least 5 visualizations (graphs)
- Your submission must include explanations using markdown cells, apart from the code.
- Your work must not be plagiarized, i.e., copy-pasted from somewhere else.
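The size criterion above is easy to check before starting any analysis. The following is a minimal sketch, assuming Pandas is installed; the helper name `meets_size_criteria` and the `demo` frame are hypothetical and only stand in for your real dataset.

```python
import pandas as pd

# Thresholds taken from the evaluation criteria above.
MIN_COLUMNS = 5
MIN_ROWS = 1500


def meets_size_criteria(df: pd.DataFrame) -> bool:
    """Return True if the frame has at least MIN_ROWS rows and MIN_COLUMNS columns."""
    n_rows, n_cols = df.shape  # DataFrame.shape is a (rows, columns) tuple
    return n_rows >= MIN_ROWS and n_cols >= MIN_COLUMNS


# Hypothetical stand-in for a real dataset: 2000 rows and 5 columns.
demo = pd.DataFrame({f"col_{i}": range(2000) for i in range(5)})
print(demo.shape, meets_size_criteria(demo))
```

Replace `demo` with the data frame you loaded from your chosen dataset to run the same check on it.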
Follow this step-by-step guide to work on your project.
- The Malaysian dataset must be used for your case study.
- The dataset is available at:
- Load the dataset into a data frame using Pandas
- Explore the number of rows & columns, ranges of values etc.
- Handle missing, incorrect and invalid data
- Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets etc.)
- Compute the mean, sum, range and other interesting statistics for numeric columns
- Explore distributions of numeric columns using histograms etc.
- Explore relationships between columns using scatter plots, bar charts etc.
- Make a note of interesting insights from the exploratory analysis
- Ask at least 5 interesting questions about your dataset, in line with the evaluation criteria
- Answer the questions either by computing the results using NumPy/Pandas or by plotting graphs using Matplotlib/Seaborn
- Create new columns, merge multiple datasets and perform grouping/aggregation wherever necessary
- Wherever you use a library function from Pandas/NumPy/Matplotlib etc., briefly explain what it does
- Write a summary of what you've learned from the analysis
- Include interesting insights and graphs from previous sections
- Share ideas for future work on the same topic using other relevant datasets
- Share links to resources you found useful during your analysis
- Upload your notebook to GitHub.
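The data preparation, exploration and question-answering steps above can be sketched in a few lines of Pandas and Matplotlib. This is only an illustrative outline, not a model answer: the inline CSV, its column names (`date`, `region`, `sales`, `units`) and all of its values are made up, and median imputation is just one possible way to handle missing data. In your project, load the Malaysian dataset you selected instead.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for a real dataset; every value here is made up.
RAW_CSV = """date,region,sales,units
2023-01-01,North,120.0,10
2023-01-01,South,80.0,
2023-01-01,East,-1,7
2024-01-01,North,140.0,12
2024-01-01,South,100.0,9
2024-01-01,East,60.0,6
"""

# read_csv loads delimited text into a DataFrame; parse_dates converts the
# named column to datetime values.
df = pd.read_csv(io.StringIO(RAW_CSV), parse_dates=["date"])

# Explore the size of the data and count missing values per column.
print(df.shape)
print(df.isna().sum())

# Handle invalid data: negative sales are treated as missing (an assumption
# about this made-up data), then gaps are filled with the column median.
df.loc[df["sales"] < 0, "sales"] = float("nan")
df["sales"] = df["sales"].fillna(df["sales"].median())
df["units"] = df["units"].fillna(df["units"].median())

# Additional step: derive a new column from the parsed dates.
df["year"] = df["date"].dt.year

# describe() reports count, mean, std, min, quartiles and max per column.
print(df[["sales", "units"]].describe())

# Example question: which region has the highest mean sales?
# groupby splits rows by region; mean() aggregates each group.
mean_by_region = df.groupby("region")["sales"].mean().sort_values(ascending=False)
print(mean_by_region)

# Two of the required visualizations: a histogram and a bar chart.
fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(df["sales"], bins=5)
ax_hist.set_title("Distribution of sales")
ax_hist.set_xlabel("sales")
ax_bar.bar(mean_by_region.index, mean_by_region.values)
ax_bar.set_title("Mean sales by region")
ax_bar.set_ylabel("mean sales")
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, call `plt.show()` or `fig.savefig(...)` to see it. Each question you ask should get its own markdown cell explaining what the code computes and what the result tells you.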
Refer to these projects for inspiration: