# PyTorch model on Spark using SparkTorch

This repository contains notebooks for training and evaluating a neural network on Spark's engine. The dataset is the MNIST handwritten digits dataset by Y. LeCun: all 60,000 training samples are used for training, and the model is evaluated on all 10,000 test samples. The model is a simple CNN with linear layers at the end; its definition can be found in the NNCode.py file in this directory. Because of the limitations of SparkML, we use a PyTorch model without giving up the benefits of Spark's engine. To get the best of both worlds, we rely on the SparkTorch library.
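
As a point of reference, the snippet below is a minimal sketch of how a PyTorch model can be trained through SparkTorch inside a SparkML pipeline. The network architecture, column names, and hyperparameters are illustrative placeholders and do not reproduce the exact contents of NNCode.py or the notebook.

```python
# Minimal SparkTorch sketch (illustrative; not the exact NNCode.py model or notebook code).
import torch
import torch.nn as nn
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline
from sparktorch import serialize_torch_obj, SparkTorch


class SimpleCNN(nn.Module):
    # Placeholder CNN with linear layers at the end, in the spirit of the model described above.
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Linear(16 * 14 * 14, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        # SparkTorch feeds the assembled 784-element feature vector; reshape it into an image.
        x = x.view(-1, 1, 28, 28)
        x = self.conv(x)
        return self.fc(x.view(x.size(0), -1))


# Serialize the model, loss, and optimizer so SparkTorch can ship them to the executors.
torch_obj = serialize_torch_obj(
    model=SimpleCNN(),
    criterion=nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam,
    lr=1e-3,
)

# Assemble the 784 pixel columns into a single 'features' vector column (column names are placeholders).
assembler = VectorAssembler(inputCols=[f"pixel{i}" for i in range(784)], outputCol="features")

# SparkTorch estimator: trains the PyTorch model distributed over the Spark cluster.
spark_model = SparkTorch(
    inputCol="features",
    labelCol="label",
    predictionCol="predictions",
    torchObj=torch_obj,
    iters=50,
    verbose=1,
)

# train_df is assumed to be a Spark DataFrame with the pixel columns and a 'label' column.
pipeline_model = Pipeline(stages=[assembler, spark_model]).fit(train_df)
```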

An important thing to note is that the notebooks are Databricks notebooks, so they differ in several ways from traditional Jupyter notebooks (for example, a Spark context is never explicitly defined, since the Spark engine is built into Databricks). In addition, the data are retrieved from Azure Data Lake Storage Gen2 following Microsoft's recommended practices, to make the setup more representative of industry applications, where ingested data typically land in such storage accounts rather than being loaded manually from local files.
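
To illustrate that setup, the snippet below sketches one common way of reading data from ADLS Gen2 in a Databricks notebook, authenticating with a service principal whose credentials are kept in a secret scope. The storage account, container, secret scope, keys, and file path are all hypothetical placeholders, not the exact configuration used in the notebooks.

```python
# Hedged sketch of reading data from Azure Data Lake Storage Gen2 on Databricks.
# All names below (storage account, container, secret scope, keys, path) are hypothetical.
storage_account = "mystorageaccount"   # placeholder storage account name
container = "mnist"                    # placeholder container name

# Configure OAuth authentication with a service principal stored in a Databricks secret scope.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="sp-client-id"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="sp-client-secret"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",  # <tenant-id> is a placeholder
)

# Read a (hypothetical) training CSV directly from the abfss:// path into a Spark DataFrame.
train_df = spark.read.option("header", "true").option("inferSchema", "true").csv(
    f"abfss://{container}@{storage_account}.dfs.core.windows.net/mnist_train.csv"
)
```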

Two notebooks exist in this repository for the following reason: the main notebook (SparkTorch version) presents the analysis described in the paragraphs above. We have also included an additional notebook (Traditional version), in which we perform the same analysis with a more traditional approach (perhaps "educational" or "academic" is a more fitting word than "traditional"). That is, we load the data into Pandas dataframes, perform all pre-processing steps with Scikit-learn, and then train the PyTorch model using PyTorch's standard practices (defining datasets and dataloaders, writing explicit training and evaluation routines, etc.), as sketched below. This is included only to demonstrate that, even though the two processes are roughly equivalent, the framework in which they run matters so much that the resulting pipelines look quite different from one another. The main difference, of course, is that the traditional approach does not use Spark's engine and therefore does not scale when the number of data samples grows sharply.
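
For comparison, the sketch below outlines the kind of standard PyTorch training routine used in the Traditional version notebook. The variable names, architecture, and hyperparameters are illustrative, not copied from the notebook.

```python
# Illustrative sketch of the "traditional" PyTorch workflow (names and hyperparameters are placeholders).
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

# X_train / y_train are assumed to be NumPy arrays produced by the Pandas / Scikit-learn preprocessing.
train_ds = TensorDataset(
    torch.tensor(X_train, dtype=torch.float32).view(-1, 1, 28, 28),
    torch.tensor(y_train, dtype=torch.long),
)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

model = SimpleCNN()  # same placeholder architecture as sketched earlier, not the actual NNCode.py model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Explicit training routine: one pass per epoch over the DataLoader.
for epoch in range(5):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```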