Spark Project Analysis Example

An example of machine learning at scale on distributed data. This repository develops a distributed PySpark pipeline that implements logistic regression and random forest algorithms to predict click-through rates.

Suggested order for working through the files in this repository

Load_Parquet_files.ipynb: takes the raw data and converts it to Parquet and dataframe formats

EDA_Pandas.ipynb: converts the raw data to a pandas dataframe and performs thorough exploratory data analysis (EDA) in pandas

EDA_Spark.ipynb: converts the Parquet files to a Spark dataframe and performs light EDA in Spark

FeatureEngineering-Spark.ipynb: takes the Spark dataframe and performs light EDA checks and data processing

Data_Processing.ipynb: takes the Spark dataframe and performs the data processing required to create the processed dataframe used by the algorithm implementations

Logistic_regression_implementation.ipynb: takes the processed dataframe and implements logistic regression (with and without a hash transformation)

Random_forest_implementation.ipynb: takes the processed dataframe and implements random forest
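
The hash transformation mentioned for the logistic regression notebook is not spelled out in this README. A common reading is feature hashing (the "hashing trick"), where high-cardinality categorical values are mapped into a fixed number of buckets before the model is fitted. The sketch below illustrates that idea in plain Python rather than Spark, so it is not the notebook's code. The column names, the bucket count, the choice of MD5, and the zero-initialised weights are all assumptions made for illustration.

```python
import hashlib
import math


def hash_features(row, num_buckets=16):
    """Map (column, value) pairs to a fixed-length count vector."""
    vec = [0.0] * num_buckets
    for column, value in row.items():
        token = f"{column}={value}".encode("utf-8")
        # MD5 is used only because it is stable across runs;
        # Python's built-in hash() is salted per process.
        index = int(hashlib.md5(token).hexdigest(), 16) % num_buckets
        vec[index] += 1.0
    return vec


def predict_click_probability(features, weights, bias=0.0):
    """Logistic regression prediction: sigmoid of the weighted sum."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical ad impression; these column names are not from the repository.
row = {"site_id": "a1b2", "device_type": "mobile", "banner_pos": "0"}

features = hash_features(row)
weights = [0.0] * len(features)  # untrained model: every weight is zero
p = predict_click_probability(features, weights)
```

The vector length depends only on `num_buckets`, not on how many distinct category values appear in the data, which is why hashing is attractive for click-through data with very many distinct IDs. Different values can collide in the same bucket, so the bucket count trades memory against accuracy.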

dalvarez83/Spark_Project_Analysis_Example
