Spark Scheduler Comparison and Algorithm Performance Analysis

This repository contains an analysis of Spark's job schedulers and of the performance of algorithms implemented in Spark MLlib. We ran experiments under the FIFO and FAIR schedulers and measured the execution time of FP Growth and Random Forest, incorporating hyperparameter tuning for both algorithms via grid-search cross-validation.

Overview

In distributed computing frameworks like Spark, efficient job scheduling is crucial for optimizing resource utilization and improving overall performance. Spark provides different schedulers, each with its own scheduling policies and strategies. Understanding how these schedulers impact job execution time and resource allocation can provide valuable insights for optimizing Spark applications.

Moreover, the choice of algorithm and its parameter settings can significantly affect the performance of machine learning tasks. In this analysis, we focused on popular algorithms available in Spark MLlib and evaluated their performance under different scheduling configurations.

Project Architecture

[Architecture diagram]

Experiments

Spark Schedulers

  1. FIFO Scheduler: The First-In-First-Out (FIFO) scheduler is Spark's default. Jobs run in the order they are submitted: the first job gets priority on all available resources while its stages have tasks to launch, and later jobs receive only whatever capacity is left over.

  2. Fair Scheduler: The FAIR scheduler distributes resources evenly among jobs (and, via pools, among users or applications). Tasks are assigned between jobs in a round-robin fashion, so short jobs submitted while a long job is running can start receiving resources immediately instead of waiting for it to finish. A configuration sketch for switching between the two modes follows below.
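
The snippet below is a minimal PySpark sketch of how the scheduler mode can be switched when building the session. The repository does not show its exact setup, so the application name, allocation-file path, and pool name here are illustrative placeholders.

```python
from pyspark.sql import SparkSession

# Enable the FAIR scheduler (the default mode is FIFO).
# The allocation file path and pool name are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("scheduler-comparison")
    .config("spark.scheduler.mode", "FAIR")
    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    .getOrCreate()
)

# Jobs submitted from this thread are routed to the named pool
# defined in the allocation file.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fair_pool")
```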

Algorithms

  1. FP Growth: FP Growth is a frequent-pattern mining algorithm that finds frequent itemsets in transaction data efficiently, without the candidate generation step used by Apriori. We evaluated its performance under different scheduling configurations (a minimal usage sketch follows this list).

  2. Random Forest: Random Forest is an ensemble learning method used for classification and regression tasks. We measured its execution time and accuracy under various scheduling settings.
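
For context, this is a minimal sketch of running MLlib's FP Growth in PySpark. The tiny inline dataset and the minSupport/minConfidence values are illustrative, not the data or settings used in the experiments.

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-sketch").getOrCreate()

# Each row holds one transaction as an array of items (toy data for illustration).
transactions = spark.createDataFrame(
    [(["bread", "milk"],), (["bread", "butter", "milk"],), (["butter", "jam"],)],
    ["items"],
)

# minSupport / minConfidence are illustrative starting points, not tuned values.
fp = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.5)
model = fp.fit(transactions)

model.freqItemsets.show()       # frequent itemsets with their counts
model.associationRules.show()   # association rules derived from the itemsets
```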

Hyperparameter Tuning

Hyperparameter tuning is essential for optimizing the performance of machine learning models. We employed grid-search cross-validation to find the best combination of hyperparameters for the FP Growth and Random Forest algorithms; a sketch of this setup for Random Forest is shown below.
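
The following is a minimal PySpark sketch of grid-search cross-validation for Random Forest using MLlib's ParamGridBuilder and CrossValidator. The DataFrame name train_df, the grid values, and the fold count are assumptions for illustration, not the settings used in our experiments.

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# train_df is assumed to be a DataFrame with "features" and "label" columns.
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# Illustrative hyperparameter grid; not the values used in the experiments.
grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [20, 50, 100])
    .addGrid(rf.maxDepth, [5, 10])
    .build()
)

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")

# 3-fold cross-validation over the grid. Each fit launches many Spark jobs,
# which is where the scheduler choice (FIFO vs. FAIR) affects execution time.
cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=4,  # evaluate several parameter combinations concurrently
)

cv_model = cv.fit(train_df)
best_rf = cv_model.bestModel
```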

Conclusion and Future Work

This project compares Spark's FIFO and FAIR schedulers on hyperparameter tuning workloads for the FP Growth and Random Forest algorithms. In our experiments, the FAIR scheduler consistently achieved lower execution times across algorithms and cluster sizes, particularly as the number of worker nodes increased, underscoring its scalability and efficiency. Future work includes evaluating the FAIR scheduler on clusters with a larger number of nodes and mitigating network overhead so that the measurements more cleanly reflect scheduling behavior.
