The information in this GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). This folder contains general big data information as well as big data case studies using Malaysian datasets. This case study was created by a Bachelor of Computer Science (Data Engineering) student at Universiti Teknologi Malaysia.
Your submission will be evaluated using the following criteria:
- The dataset must be larger than 100 MB.
- Please implement data processing related to big data concepts.
- You must ask and answer at least 5 questions about the dataset
- Your submission must include explanations using markdown cells, apart from the code.
- Your work must not be plagiarized, i.e. copy-pasted from somewhere else.
Follow this step-by-step guide to work on your project.
- The dataset is available at:
- Write a summary of what you've learned from the analysis
- Include interesting insights and graphs from previous sections
- Share links to resources you found useful during your analysis
The Pandas library has become the de facto library for data manipulation in Python and is widely used by data scientists and analysts. However, when a dataset is too large, Pandas may run into memory errors. Here are 8 alternatives to Pandas for dealing with large datasets. For each alternative library, we will examine how to load data from CSV and perform a simple groupby operation. Fortunately, many of these libraries have syntax similar to Pandas, making the learning curve less steep.
- DataTable
- Polars
- Vaex
- Pyspark
- Koalas
- cuDF
- Dask
- Modin
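To make the task concrete, here is a minimal sketch of the baseline Pandas version of the load-and-groupby operation that each alternative library re-implements. The inline CSV and the `category`/`amount` column names are hypothetical stand-ins for a real dataset.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for a real (much larger) file;
# "category" and "amount" are hypothetical column names.
csv_data = io.StringIO(
    "category,amount\n"
    "food,10\n"
    "food,5\n"
    "toys,7\n"
)

# Load from CSV, then group and aggregate.
df = pd.read_csv(csv_data)
totals = df.groupby("category")["amount"].sum()
print(totals)  # food -> 15, toys -> 7

# The Polars equivalent (not executed here) reads almost identically:
#   pl.read_csv(...).group_by("category").agg(pl.col("amount").sum())
```

Most of the libraries listed above offer a near drop-in replacement for this pattern, which is why switching away from Pandas for large files is usually a small syntactic change.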
This case study is divided into two parts:
- Case Study: 2a
- Please use the appropriate dataset.
- You need to explain the basic concepts of the library.
- Please show its implementation in code, step by step.
- Case Study: 2b
- You are required to compare Pandas with the selected library.
- Make sure you use the same dataset when making comparisons.
- You can also use visualization to show the comparison.
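One way to structure the Case Study 2b comparison is a small timing harness: run the same operation (e.g. a CSV load or a groupby) once with Pandas and once with your selected library, and record the elapsed time for each. The `time_op` helper and the stand-in workloads below are our own illustrative names, not part of any library.

```python
import time

def time_op(label, fn):
    """Run fn once and return (label, elapsed seconds)."""
    start = time.perf_counter()
    fn()
    return label, time.perf_counter() - start

# Stand-in workloads: replace these lambdas with the real
# Pandas call and the equivalent call in your chosen library.
timings = [
    time_op("pandas", lambda: sum(range(100_000))),
    time_op("alternative", lambda: sum(range(100_000))),
]
for label, seconds in timings:
    print(f"{label}: {seconds:.4f}s")
```

The resulting `(label, seconds)` pairs can be passed straight to a matplotlib bar chart for the visual comparison the rubric asks for.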
Team | Title | Colab | GitHub |
---|---|---|---|
1 | DataTable | ||
2 | Polars | ||
3 | Vaex | ||
4 | Pyspark | ||
5 | Koalas | ||
6 | cuDF | ||
7 | DataTable | ||
8 | Polars | ||
9 | Vaex | ||
10 | Pyspark | ||
11 | Koalas |
- You need to use a dataset that is larger than 1 GB. You can get the dataset from Kaggle or Dataset Search. The dataset file must be of CSV type.
- The dataset must be stored in Google Drive.
- Make sure you create a link to enable your dataset to be used on Google Colab.
- Please implement big data processing operations that make the dataset usable for analysis.
- You need to use at least three libraries related to big data processing, such as Pandas, Dask, Vaex and Modin.
- Please compare the processing results from the selected libraries.
- You need to use the concept of Exploratory Data Analysis (EDA) on this project.
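The setup steps above can be sketched as follows. The Drive path and file name are hypothetical placeholders for your own CSV, and the `try`/`except` fallback simply lets the same cell run outside Colab.

```python
# Mount Google Drive so the >1 GB CSV is readable by path in Colab.
try:
    from google.colab import drive  # present only inside Google Colab
    drive.mount("/content/drive")
    csv_path = "/content/drive/MyDrive/dataset.csv"  # hypothetical path
except ImportError:
    csv_path = "dataset.csv"  # fallback when running outside Colab

print("reading from:", csv_path)

# A first EDA pass with one of the big data libraries could then be, e.g.:
#   df = dask.dataframe.read_csv(csv_path)
#   df.describe().compute()
```

From there, repeat the same loading and EDA steps in each of your three chosen libraries so the comparison uses an identical dataset and identical operations.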
Team | Libraries for data science | Colab | GitHub |
---|---|---|---|
1 | DataTable | ||
2 | Polars | ||
3 | Vaex | ||
4 | Pyspark | ||
5 | Koalas | ||
6 | cuDF | ||
7 | DataTable | ||
8 | Polars | ||
9 | Vaex | ||
10 | Pyspark | ||
11 | Koalas |