This repository contains my project notebooks for Distributed Systems and Cloud Computing.
In this introductory laboratory:
- Get acquainted with Pandas, Matplotlib, and NumPy
- Gain familiarity with PySpark and how to interact with HDFS
- Gain hands-on experience with two distributed programs:
  - Word Count
  - Analysis of night flights
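
As a rough illustration of the Word Count exercise, here is a minimal sketch of the map, shuffle, and reduce stages in plain Python. It is not the notebook's PySpark code: the function names and the sample lines are hypothetical, and the stages only mimic what a `flatMap` / `map` / `reduceByKey` pipeline would do across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group the emitted counts by word, as the shuffle between stages would."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {
        word: reduce(lambda a, b: a + b, counts)
        for word, counts in groups.items()
    }


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input standing in for a text file stored on HDFS.
sample_lines = ["to be or not to be", "that is the question"]
counts = word_count(sample_lines)
```

In a real cluster the map and reduce phases run in parallel on separate partitions; this single-process version only shows the data flow.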
This notebook contains multiple implementations of the gradient descent algorithm.
- Generate synthetic data for a regression problem and use SciPy to obtain a baseline regression result.
- A NumPy implementation of Batch Gradient Descent.
- A NumPy implementation of Stochastic Gradient Descent and Mini-Batch Stochastic Gradient Descent.
- A distributed version of Mini-Batch Gradient Descent implemented using PySpark.
I also compare these four implementations with the SciPy baseline. A detailed analysis examines how each algorithm's results change with parameters such as the learning rate and the number of iterations. Finally, the performance of all the algorithms is analyzed as a function of dataset size.
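
To give a flavour of the comparison, here is a minimal NumPy sketch of mini-batch gradient descent on synthetic linear-regression data. It is an illustration under assumptions, not the notebook's code: the function name, hyperparameters, and data are made up, and `np.linalg.lstsq` stands in for the SciPy baseline.

```python
import numpy as np


def minibatch_gd(X, y, lr=0.1, n_epochs=200, batch_size=32, seed=0):
    """Minimise the mean squared error of y ~ X @ w with mini-batch updates."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # MSE gradient on the batch
            w -= lr * grad
    return w


# Hypothetical synthetic data: y = 2 + 3x plus Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=500)
X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=500)

w_gd = minibatch_gd(X, y)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form least-squares baseline
```

Setting `batch_size` to the number of samples gives batch gradient descent, and setting it to 1 gives stochastic gradient descent, so one function covers the three serial variants.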
This notebook covers the implementation and analysis of the k-means algorithm on a 2D dataset:
- A NumPy implementation of vanilla k-means.
- Analysis of algorithm convergence using heterogeneity (total within-cluster sum of squares), and an implementation of the k-means++ technique for smart centroid initialization, which improves the quality of the local optima.
- An implementation of the Elbow Method to help find the optimal number of clusters.
- A distributed version of k-means implemented with PySpark, analyzed and compared with the serial implementation.
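
The serial pieces above can be sketched roughly as follows: k-means++ seeding, Lloyd iterations, heterogeneity, and an elbow curve. This is a minimal NumPy illustration under assumptions, not the notebook's implementation. The function names, the three synthetic blobs, and the choice of five restarts per `k` are all hypothetical.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """Sample each new centroid with probability proportional to its
    squared distance from the nearest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)


def kmeans(X, k, n_iter=50, seed=0):
    """Return (centroids, labels, heterogeneity) after Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    heterogeneity = d2[np.arange(len(X)), labels].sum()  # within-cluster sum of squares
    return centroids, labels, heterogeneity


# Hypothetical 2D dataset: three well-separated Gaussian blobs.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(100, 2)) for c in centers])

# Elbow curve: best heterogeneity over a few restarts for each k.
elbow = {k: min(kmeans(X, k, seed=s)[2] for s in range(5)) for k in range(1, 6)}
```

Plotting `elbow` against `k` should show a sharp drop up to `k = 3` and a nearly flat curve afterwards, which is the bend the Elbow Method looks for.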
This notebook studies the DataFrame API in Spark, which lets developers impose a structure on a distributed collection of data and so work at a higher level of abstraction. We learn how to use the DataFrame API and its advantages for manipulating distributed, structured data.
The hands-on part is an analysis of flight data using Spark SQL. Data exploration is divided into three main sections: basic queries, flight volume statistics, and additional queries. Data visualization is performed using the seaborn module.