Don't forget to hit the ⭐ if you like this repo.
The information in this GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). This folder contains general big data information as well as big data case studies using Malaysian datasets. This case study was created by a Bachelor of Computer Science (Data Engineering) student at Universiti Teknologi Malaysia.
Big data processing is the systematic handling and analysis of datasets that are too large or complex for traditional data processing methods. It covers the storage, retrieval, and manipulation of massive volumes of information to extract useful insights. Key steps include data ingestion, where large datasets are collected from various sources, and preprocessing, where the data is cleaned and transformed to ensure quality. Advanced analytics, machine learning, and data mining techniques are then applied to uncover patterns, trends, and correlations within the data. These results support informed decision-making, helping organizations optimize operations and gain a competitive edge.
Pandas is a powerful Python library for data manipulation and analysis, but by default it loads an entire dataset into memory, so working with large datasets requires deliberate strategies. One is to process the data in smaller chunks using the 'chunksize' parameter of the Pandas read_csv function, which reads the file in manageable portions and prevents memory overload. Another is data type optimization: choosing more memory-efficient data types where possible, such as smaller integer and float types or the category type for repeated strings. Skipping unneeded rows with the 'skiprows' parameter and loading only the required columns with the 'usecols' parameter during import can further reduce memory use and load time. Together, these strategies make it possible to manage and analyze large datasets in Python with Pandas while keeping both computation and memory use under control.
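The strategies above can be sketched in a few lines. This is a minimal illustration, not part of the course labs: the column names, the tiny in-memory CSV, and the helper summarize_in_chunks are all hypothetical stand-ins for a real large file on disk. It combines 'chunksize', 'usecols', and explicit dtypes in a single read_csv call and aggregates chunk by chunk.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
rows = ["order_id,region,units,unit_price,notes"]
for i in range(1000):
    region = ["North", "South", "East", "West"][i % 4]
    rows.append(f"{i},{region},{i % 10 + 1},{2.5 * (i % 4 + 1)},free text {i}")
csv_text = "\n".join(rows)


def summarize_in_chunks(source, chunksize=250):
    """Sum revenue per region without holding the whole file in memory."""
    totals = {}
    reader = pd.read_csv(
        source,
        # Load only the columns the analysis needs.
        usecols=["region", "units", "unit_price"],
        # Smaller dtypes reduce the memory used by each chunk.
        dtype={"region": "category", "units": "int16", "unit_price": "float32"},
        # With chunksize set, read_csv yields DataFrames of at most this many rows.
        chunksize=chunksize,
    )
    for chunk in reader:
        revenue = chunk["units"] * chunk["unit_price"]
        partial = revenue.groupby(chunk["region"], observed=True).sum()
        # Fold this chunk's partial sums into the running totals.
        for region, value in partial.items():
            totals[region] = totals.get(region, 0.0) + float(value)
    return pd.Series(totals).sort_index()


region_totals = summarize_in_chunks(io.StringIO(csv_text))
print(region_totals)
```

Only aggregations that can be combined from partial results (sums, counts, minima, maxima) fit this pattern directly. Operations that need all rows at once, such as a global sort or a median, usually call for a different approach or a library such as Dask or Modin.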
- Top 10 Python Libraries Data Scientists should know
- Top 5 Python Libraries For Big Data
- Python Pandas Dataframe Tutorial for Beginners
- 4 strategies how to deal with large datasets in Pandas
- Scaling to large dataset
- 3 ways to deal with large datasets in Python
- Reducing Pandas memory usage
- How To Handle Large Datasets in Python With Pandas
- Efficient Pandas: Using Chunksize for Large Datasets
- How did I convert the 33 GB Dataset into a 3 GB file Using Pandas?
- Video: How to work with big data files (5gb+) in Python Pandas!
- Loading large datasets in Pandas
- Video: How to Read Very Big Files With SQL and Pandas in Python
- Scaling to large datasets
- Video: How to Handle Very Large Datasets in Python Pandas (Tips & Tricks)
- Video: 3 Tips to Read Very Large CSV as Pandas Dataframe
- Kaggle: Largest Datasets
- EDA for Amazon books reviews
- 8 Alternatives to Pandas for Processing Large Datasets
- Tutorial compilation for handling larger datasets
- Modin
- GitHub Modin
- How to Speed Up Pandas with Modin
- Kaggle: Speed up Pandas Workflow with Modin
- Video: Do these Pandas Alternatives actually work?
- Video - Dask: An Introduction
- Dask | Scale the Python tools you love
- Dask – How to handle large dataframes in python using parallel computing
- Dask (software)
- Parallel Computing with Dask: A Step-by-Step Tutorial
- Faster Pandas with parallel processing: cuDF vs. Modin
- Scaling Interactive Data Science with Modin and Ray
- Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS
- 7 Amazing companies that really get big data
- Data Science Case Studies: Solved using Python
- 10 Real World Data Science Case Studies Projects with Example
- Top 8 Data Science Case Studies for Data Science Enthusiasts
Pandas
- Lab 1: 1,000,000 Sales Records
- Lab 2: NYC Yellow Taxi Trip Data
- Lab 3: NYC Taxi Trip Duration EDA notebook
- Lab 4: Strategies to Deal With Large Datasets Using Pandas
- Lab 5: eCommerce behavior data from multi category store (285 million users)
Modin
- Lab 1: How to use Modin
- Lab 2: Speed improvements
- Lab 3: Not Implemented
- Lab 4: Experimental Features
- Lab 5: Modin for Distributed Pandas
Dask
- Lab 1: Introducing Dask
- Lab 2: Loading Data Into DataFrames
- Lab 3: Introducing Dask DataFrames
- Lab 4: Learning Dask With Python Distributed Computing
- Lab 5: Parallelize code with dask.delayed
Comparison between libraries
Please create an Issue for any improvements, suggestions or errors in the content.
You can also contact me on LinkedIn for any other queries or feedback.