Don't forget to hit the ⭐ if you like this repo.
The information in this GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). This folder contains general big data information as well as big data case studies using Malaysian datasets. This case study was created by a Bachelor of Computer Science (Data Engineering) student at Universiti Teknologi Malaysia.
- Notes
- Big Data: Pandas
- Big Data: Alternatives to Pandas for Processing Large Datasets
- Modin
- Dask
- Datatable
- Comparison between libraries
- Big Data: Case study
- Lab
- Pandas
- Modin
- Dask
- Comparison between libraries
- Assignment
- Assignment1: Pandas - Data Processing
- Assignment2: Alternatives to Pandas for Processing Large Datasets
- Project
- Top 10 Python Libraries Data Scientists should know
- Top 5 Python Libraries For Big Data
- Python Pandas Dataframe Tutorial for Beginners
- 4 strategies how to deal with large datasets in Pandas
- Scaling to large dataset
- 3 ways to deal with large datasets in Python
- Reducing Pandas memory usage
- How To Handle Large Datasets in Python With Pandas
- Efficient Pandas: Using Chunksize for Large Datasets
- Video: How to work with big data files (5gb+) in Python Pandas!
- Loading large datasets in Pandas
- Video: How to Read Very Big Files With SQL and Pandas in Python
- Scaling to large datasets
- Video: How to Handle Very Large Datasets in Python Pandas (Tips & Tricks)
- Video: 3 Tips to Read Very Large CSV as Pandas Dataframe
- Kaggle: Largest Datasets
- EDA for Amazon books reviews
- 8 Alternatives to Pandas for Processing Large Datasets
- Tutorial compilation for handling larger datasets
- Modin
- Github Modin
- How to Speed Up Pandas with Modin
- Kaggle: Speed up Pandas Workflow with Modin
- Video: Do these Pandas Alternatives actually work?
- Video: Dask: An Introduction
- Dask | Scale the Python tools you love
- Dask – How to handle large dataframes in python using parallel computing
- Dask (software)
- Parallel Computing with Dask: A Step-by-Step Tutorial
- Faster Pandas with parallel processing: cuDF vs. Modin
- Scaling Interactive Data Science with Modin and Ray
- Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS
- 7 Amazing companies that really get big data
- Data Science Case Studies: Solved using Python
- 10 Real World Data Science Case Studies Projects with Example
- Top 8 Data Science Case Studies for Data Science Enthusiasts
Pandas
- Lab 1: 1,000,000 Sales Records
- Lab 2: NYC Yellow Taxi Trip Data
- Lab 3: NYC Taxi Trip Duration EDA notebook
- Lab 4: Strategies to Deal With Large Datasets Using Pandas
- Lab 5: eCommerce behavior data from multi category store (285 million users)
Modin
- Lab 1: How to use Modin
- Lab 2: Speed improvements
- Lab 3: Not Implemented
- Lab 4: Experimental Features
- Lab 5: Modin for Distributed Pandas
Dask
- Lab 1: Introducing Dask
- Lab 2: Loading Data Into DataFrames
- Lab 3: Introducing Dask DataFrames
- Lab 4: Learning Dask With Python Distributed Computing
- Lab 5: Parallelize code with dask.delayed
Comparison between libraries
Please create an Issue for any improvements, suggestions or errors in the content.
You can also contact me on LinkedIn for any other queries or feedback.