Distributed Systems and Cloud Computing [CLOUDS]

This repository contains my project notebooks for the Distributed Systems and Cloud Computing (CLOUDS) course.

Notebooks

Lab1 Introduction

In this introductory laboratory:

- Get acquainted with Pandas, Matplotlib, and NumPy
- Gain familiarity with PySpark and how to interact with HDFS
- Gain hands-on experience with two distributed programs:
  - Word Count
  - Analysis of night flights
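
As a rough illustration of the Word Count exercise, here is a minimal sketch of the map and reduce steps in plain Python, so it runs without a Spark cluster. It is not the notebook's code: the helper names `map_partition` and `merge_counts`, and the sample partitions, are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count the words found in one partition of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial counts (order of merging does not matter)."""
    return left + right


# Hypothetical partitions, standing in for blocks of a text file stored on HDFS.
partitions = [
    ["spark makes word count easy", "word count is the hello world of spark"],
    ["hdfs stores the input file"],
]

# Each partition is counted independently, then the partial results are merged.
partial_counts = [map_partition(p) for p in partitions]
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(3))
```

In PySpark the same pattern is commonly expressed on an RDD with `flatMap`, `map`, and `reduceByKey`.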

Lab2 Distributed Gradient Descent

This notebook contains multiple implementations of the gradient descent algorithm.

- Synthetic data generation for a regression problem, with a baseline regression result obtained using SciPy.
- A NumPy implementation of Batch Gradient Descent.
- A NumPy implementation of Stochastic Gradient Descent and Mini-Batch Stochastic Gradient Descent.
- A distributed version of Mini-Batch Gradient Descent implemented with PySpark.

I also compare these four implementations against the SciPy baseline. The notebook analyzes in detail how each algorithm's results change with parameters such as the learning rate and the number of iterations. Finally, it analyzes how the performance of all the algorithms varies with dataset size.
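
To make the distributed mini-batch idea concrete, here is a minimal sketch in plain Python rather than PySpark. It assumes a one-feature linear model `y = w * x + b` and a halved squared-error loss; the function names, the partitioning scheme, and all hyperparameter values are illustrative assumptions, not taken from the notebook. Each partition computes a partial gradient over its sampled points, and the driver sums the partial results before updating the parameters.

```python
import random


def partial_gradient(partition, w, b, batch_fraction, rng):
    """Worker step: sum the gradient of the halved squared error over a sampled mini-batch."""
    grad_w = grad_b = 0.0
    n = 0
    for x, y in partition:
        if rng.random() < batch_fraction:  # sample this point into the mini-batch
            err = (w * x + b) - y
            grad_w += err * x
            grad_b += err
            n += 1
    return grad_w, grad_b, n


def distributed_minibatch_gd(partitions, lr=0.5, iterations=2000,
                             batch_fraction=0.5, seed=0):
    """Driver loop: aggregate partial gradients from all partitions, then update."""
    rng = random.Random(seed)
    w = b = 0.0
    for _ in range(iterations):
        parts = [partial_gradient(p, w, b, batch_fraction, rng) for p in partitions]
        n = sum(g[2] for g in parts)
        if n == 0:  # empty mini-batch, skip this update
            continue
        w -= lr * sum(g[0] for g in parts) / n
        b -= lr * sum(g[1] for g in parts) / n
    return w, b


# Synthetic, noise-free data from y = 3x + 1 with x in [0, 1), split into 4 partitions.
data_rng = random.Random(42)
points = [(x, 3.0 * x + 1.0) for x in (data_rng.random() for _ in range(200))]
partitions = [points[i::4] for i in range(4)]

w, b = distributed_minibatch_gd(partitions)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Because only sums and counts cross the partition boundary, the per-partition work can run in parallel, which is what makes this formulation a natural fit for a map and reduce style of execution.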

Lab3 Distributed k-Means

This notebook covers the implementation and analysis of the k-means algorithm on a 2D dataset:

- A NumPy implementation of vanilla k-means.
- Analysis of algorithm convergence using heterogeneity (total within-cluster sum of squares), and an implementation of the k-means++ technique for smart centroid initialization, which improves the quality of the local optima.
- An implementation of the Elbow Method to help find the optimal number of clusters.
- A distributed version of k-means implemented with PySpark, analyzed and compared with the serial implementation.
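
The sketch below illustrates k-means++ seeding and the heterogeneity measure on a small 2D dataset. It is written in plain Python so it has no dependencies, whereas the notebook uses NumPy and PySpark; the function names, the synthetic blobs, and the iteration count are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: draw each new centroid with probability proportional
    to its squared distance from the nearest centroid chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def heterogeneity(points, centroids, labels):
    """Total within-cluster sum of squared distances."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    history = []
    for _ in range(iterations):
        labels = assign(points, centroids)
        history.append(heterogeneity(points, centroids, labels))
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, assign(points, centroids), history


# Three hypothetical Gaussian blobs in 2D.
data_rng = random.Random(1)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(50)]

centroids, labels, history = kmeans(points, k=3)
print(history[0], history[-1])
```

Running `kmeans` for several values of `k` and plotting the final heterogeneity against `k` gives the curve used by the Elbow Method.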

Lab4 Flights Data Analysis

This notebook studies the DataFrame API in Spark, which lets developers impose a structure on a distributed collection of data and work at a higher level of abstraction. We learn how to use the DataFrame API and its advantages for manipulating distributed, structured data.

The hands-on part is an analysis of flights data using SparkSQL. Data exploration is divided into three main sections: basic queries, flight volume statistics, and additional queries. Data visualization is performed using the seaborn module.
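
As an illustration of the kind of flight volume query involved, the sketch below counts flights per month with a SQL aggregate. It uses Python's built-in `sqlite3` as a stand-in for SparkSQL so that it runs without a Spark session; the table name, column names, and rows are all hypothetical and do not come from the notebook's dataset.

```python
import sqlite3

# In-memory database standing in for a registered Spark table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (month INTEGER, carrier TEXT, dep_hour INTEGER)")

# Hypothetical sample rows: (month, carrier code, scheduled departure hour).
rows = [
    (1, "AA", 8), (1, "UA", 23), (1, "AA", 14),
    (2, "AA", 9), (2, "UA", 22), (2, "UA", 7), (2, "AA", 18),
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)

# Flight volume statistic: number of flights per month.
flights_per_month = conn.execute(
    "SELECT month, COUNT(*) AS num_flights "
    "FROM flights GROUP BY month ORDER BY month"
).fetchall()

print(flights_per_month)
conn.close()
```

In Spark, a comparable query could be run by registering a DataFrame as a temporary view with `createOrReplaceTempView` and passing the SQL string to `spark.sql`.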
