This repository contains notebook projects for Distributed Systems and Cloud Computing

JZ-LIANG/Distributed-Systems-and-Cloud-Computing

Distributed Systems and Cloud Computing [CLOUDS]

This repository contains my project notebooks for Distributed Systems and Cloud Computing.


Notebooks

This introductory laboratory aims to:

  • Get acquainted with Pandas, Matplotlib, and NumPy
  • Gain familiarity with PySpark and how to interact with HDFS
  • Gain hands-on experience with two distributed programs:
    • Word Count
    • Analysis of night flights
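The Word Count exercise follows the classic map and reduce pattern: each partition of the input is tokenized and counted locally, and the partial counts are then merged. The sketch below imitates that pattern in plain Python so it runs without a cluster; it is not the notebook's PySpark code, and the function names (`count_partition`, `merge_counts`, `word_count`) and the sample sentences are hypothetical.

```python
from collections import Counter
from functools import reduce


def count_partition(lines):
    """Map side: tokenize one partition and count its words locally."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce side: merge two partial counts by summing per-word totals."""
    return left + right


def word_count(partitions):
    """Count words over all partitions (a stand-in for a distributed job)."""
    partials = [count_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


# Hypothetical input: two "partitions" of text lines.
partitions = [
    ["the quick brown fox", "the lazy dog"],
    ["the fox jumps"],
]
totals = word_count(partitions)
print(totals.most_common(2))  # [('the', 3), ('fox', 2)]
```

Because merging counts is associative and commutative, the partial results can be combined in any order, which is what makes this computation easy to distribute.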

This notebook contains multiple implementations of the gradient descent algorithm.

  • Generate synthetic data for a regression problem, and use the SciPy library to obtain a baseline regression result.
  • A NumPy implementation of Batch Gradient Descent.
  • A NumPy implementation of Stochastic Gradient Descent and Mini-Batch Stochastic Gradient Descent.
  • A distributed version of Mini-Batch Gradient Descent implemented using PySpark.

In addition, I compare these four implementations with the SciPy baseline. An in-depth analysis examines how each algorithm's results change with parameters such as the learning rate and the number of iterations. Finally, the performance of all the algorithms is analyzed as a function of dataset size.
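As a rough illustration of the gradient descent variants described above, here is a minimal NumPy sketch of mini-batch gradient descent for least-squares linear regression. It is not taken from the notebook: the function name `minibatch_gd`, the hyperparameter values, and the synthetic data are assumptions, and `np.linalg.lstsq` stands in for the SciPy baseline.

```python
import numpy as np


def minibatch_gd(X, y, lr=0.1, epochs=200, batch_size=32, seed=0):
    """Minimize the mean squared error of X @ w against y with mini-batches."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of the mean squared error over the current batch.
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w


# Hypothetical synthetic data: y = 1.5 - 3.0 * x + Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=500)
X = np.column_stack([np.ones_like(x), x])  # intercept column plus feature
true_w = np.array([1.5, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=500)

w_gd = minibatch_gd(X, y)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form baseline
print("mini-batch GD:", w_gd)
print("least squares:", w_ls)
```

In this sketch, setting `batch_size` to the number of samples gives batch gradient descent, and setting it to 1 gives stochastic gradient descent. A distributed variant would typically compute the batch gradient as a sum of per-partition gradients before applying the same update.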

This notebook focuses on implementing and analyzing the k-means algorithm on a 2D dataset:

  • A NumPy implementation of vanilla k-means.
  • Analysis of algorithm convergence using heterogeneity (total within-cluster sum of squares), and an implementation of the k-means++ technique for smart centroid initialization, which improves the quality of the local optima reached.
  • An implementation of the Elbow Method to help choose the number of clusters.
  • A distributed version of k-means implemented with PySpark, analyzed and compared with the serial implementation.
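The serial pieces listed above can be sketched in a few lines of NumPy: k-means++ seeding, Lloyd's iterations, and heterogeneity tracked per iteration. This is an illustrative sketch rather than the notebook's code; the function names (`kmeans_pp_init`, `assign`, `kmeans`) and the three synthetic blobs are assumptions.

```python
import numpy as np


def squared_distances(X, centroids):
    """Squared Euclidean distance from every point to every centroid."""
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)


def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: sample each new centroid with probability
    proportional to its squared distance from the nearest chosen one."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = squared_distances(X, np.array(centroids)).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)


def assign(X, centroids):
    """Return the nearest-centroid labels and the heterogeneity."""
    d2 = squared_distances(X, centroids)
    return d2.argmin(axis=1), d2.min(axis=1).sum()


def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, k, rng)
    history = []  # heterogeneity after each assignment step
    for _ in range(max_iter):
        labels, heterogeneity = assign(X, centroids)
        history.append(heterogeneity)
        # Move each centroid to the mean of its points; keep it if empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels, history


# Hypothetical 2D data: three Gaussian blobs.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

centroids, labels, history = kmeans(X, k=3)
print("heterogeneity per iteration:", np.round(history, 2))
```

Heterogeneity never increases from one iteration to the next, so `history` can be used as a convergence check. For the Elbow Method, one option is to run `kmeans` for several values of `k`, plot the final heterogeneity `history[-1]` against `k`, and look for the point where the curve flattens.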

This notebook studies the DataFrame API in Spark, which lets developers impose a structure on a distributed collection of data and work at a higher level of abstraction. We learn how to use the DataFrame API and its advantages for manipulating distributed, structured data.

The hands-on part is an analysis of flight data using Spark SQL. Data exploration is divided into three main sections: basic queries, flight volume statistics, and additional queries. Data visualization is performed using the seaborn module.
