🌟 Hit the star button to save this repo to your profile
The objective of this assignment is to perform Exploratory Data Analysis (EDA) on a large dataset using big data tools and techniques. EDA is a critical step in understanding the characteristics of a dataset and uncovering insights that can inform further analysis and decision-making.
- Choose a large dataset (> 1 GB file size or > 1 million records) that aligns with your interests or the project's requirements. It should be large enough to justify the use of big data tools and techniques. You can obtain datasets from various sources, such as public data repositories, Kaggle, government websites, or your own project dataset.
- Obtain the selected dataset in a format that can be processed using big data tools. Common formats include CSV, Parquet, JSON, or databases compatible with big data frameworks.
- Make sure you have access to a big data environment. Install the necessary tools and libraries.
- If required, clean the dataset by handling missing values, removing duplicates, and addressing any data quality issues.
- Perform the following EDA tasks using big data tools:
  - a. Summary Statistics: Compute basic statistics such as mean, median, standard deviation, and quantiles for relevant numerical variables.
  - b. Data Visualization: Create visualizations like histograms, box plots, scatter plots, and heatmaps to understand data distributions, correlations, and outliers.
  - c. Data Exploration: Explore the dataset's structure and identify any patterns, trends, or anomalies. Pay attention to variables' distributions, relationships, and potential insights.
  - d. Feature Engineering: If applicable, create new features or transform existing ones to better support your analysis.
- Document your analysis, including the tools, libraries, and scripts used. Explain the key findings and insights you derived from the EDA.
- Prepare a concise presentation of your EDA findings. Use visual aids and clear explanations to communicate your insights effectively.
- Submit your analysis report, code/scripts, and presentation to your instructor as specified in the assignment submission guidelines.
- Make sure to use big data tools efficiently to handle large datasets.
- Pay attention to data privacy and ethics, especially when dealing with sensitive information.
- Collaborate with classmates or seek help from your instructor if you encounter challenges during the assignment.
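
As a rough illustration of the summary-statistics task and the tip about handling large datasets efficiently, the sketch below computes the count, mean, sample standard deviation, minimum, and maximum of one numeric column in a single pass, so the file never has to fit in memory. It uses only the Python standard library and Welford's algorithm. That choice is an assumption made for illustration, not a required approach, and a framework such as PySpark or Dask would normally do this for you. The column names `trip_id` and `fare` and the inline sample data are hypothetical.

```python
import csv
import io
import math


class RunningStats:
    """Single-pass count, mean, std, min, and max (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def std(self):
        # Sample standard deviation; undefined for fewer than two values.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else float("nan")


def summarize_column(rows, column):
    """Stream over dict-like rows, counting missing or non-numeric values."""
    stats = RunningStats()
    missing = 0
    for row in rows:
        raw = (row.get(column) or "").strip()
        try:
            stats.update(float(raw))
        except ValueError:
            missing += 1  # empty or non-numeric cell
    return stats, missing


# Hypothetical sample data; for a real file, pass csv.DictReader(open(path, newline="")).
sample = io.StringIO("trip_id,fare\n1,10.0\n2,\n3,20.0\n4,30.0\n")
stats, missing = summarize_column(csv.DictReader(sample), "fare")
print(stats.n, stats.mean, stats.std, stats.minimum, stats.maximum, missing)
```

Exact medians and quantiles cannot be computed in one pass with constant memory, so for those it is usually better to rely on the quantile routine of your chosen framework, which is often approximate.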
Your assignment will be assessed based on the quality of your EDA, the insights gained, documentation, and presentation.
If you have any questions or need clarification on any part of this assignment, please don't hesitate to reach out to your instructor for guidance. Good luck with your Exploratory Data Analysis using big data!
🚀 Form project teams of three or four students. Teamwork is essential for this assignment. Please fill in your group information on the Google Sheets page.
🚫 Uphold the highest standards of academic integrity. Any candidate suspected of cheating in the assignment will face disciplinary action, which may include suspension or expulsion from the University. Moreover, any materials or devices found to be in violation of examination rules and regulations will be confiscated.
📝 Prepare a comprehensive document that outlines the step-by-step process for creating the case study. The deadline for submission is 26 November 2023, at 5:00 PM. Late submissions will not be accepted.
You must place your files in the submission folder. Within the `bdm/` folder, create a folder named after your group. Name the default file `readme.md`. Suggested folder structure for this project:
bdm/your_group/
├── 📄 ass3.ipynb
├── 📄 readme.md
└── 📄 report.md
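
If it helps, the suggested layout can be created with a short script. The sketch below is only one way to do it: the function name `scaffold` is hypothetical, `your_group` is the placeholder from the structure above, and the notebook is written as a minimal empty nbformat 4 document so that Jupyter can open it. Existing files are left untouched.

```python
import json
from pathlib import Path

# Minimal empty notebook in nbformat 4, so that ass3.ipynb is a valid notebook file.
EMPTY_NOTEBOOK = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}


def scaffold(group_name, root="bdm"):
    """Create root/group_name with the three suggested files, without overwriting."""
    group_dir = Path(root) / group_name
    group_dir.mkdir(parents=True, exist_ok=True)

    notebook = group_dir / "ass3.ipynb"
    if not notebook.exists():
        notebook.write_text(json.dumps(EMPTY_NOTEBOOK, indent=1), encoding="utf-8")

    for name in ("readme.md", "report.md"):
        (group_dir / name).touch(exist_ok=True)

    return group_dir


# Example (run from the repository root): scaffold("your_group")
```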
| No | Group | Dataset |
|---|---|---|
| 0. | Sample | GTZAN: Music Genre Classification |
| 1. | RAM | Water Quality Prediction |
| 2. | Avengers | 10+ M. Beatport Tracks / Spotify Audio Features |
| 3. | Ayam Rendang | Airline Delay and Cancellation Data, 2009 - 2018 |
| 4. | Truth Archive | NYC Yellow Taxi Trip Data |
| 5. | F4 | YouTube Trending Video Dataset |
| 6. | TheBoys | Brooklyn Home Sales |
| 7. | KicapSambal | Amazon Books Reviews |
Please create an Issue for any improvements, suggestions or errors in the content.
You can also contact me on LinkedIn for any other queries or feedback.