Machine learning regression and classification model templates on multiple platforms, including PyTorch, scikit-learn, and Spark MLlib.
A variety of utility tools are also included in the package.
This repository was built to make your life easier when starting a new ML project.
The source code files came from various projects the developer has worked on;
they may not be directly applicable to your problem, but they should be easy to adapt.
If you have any questions or suggestions, you are very welcome to contact the developer at
raymondwang@u.northwestern.edu
This repository contains utility tools for data acquisition and feature engineering, enumerated in detail below.
[1] Toy datasets for both regression and classification tasks can be found in UCI_repo.tar.gz
[2] A template for scraping data from unstructured online resources (e.g. balance sheets)
is provided in scrape_website.py.
Google Chrome drivers for both macOS and Linux can be found here.
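For orientation, here is a minimal sketch of what Selenium-driven scraping with a Chrome driver looks like; this is an assumption about the approach, not the actual contents of scrape_website.py, and the URL is a placeholder:

```python
# Minimal Selenium sketch (assumed approach; not the actual scrape_website.py).
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes the Chrome driver is on your PATH
try:
    driver.get("https://example.com/balance-sheet")  # placeholder URL
    # collect the text of every table cell; real parsing depends on the page
    cells = [td.text for td in driver.find_elements(By.TAG_NAME, "td")]
    print(cells[:10])
finally:
    driver.quit()
```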
[3] An example of fetching materials data from the Materials Project
using their API is given in fetch_MPdata.py.
Parallel post-processing of the raw data is included in the source code,
where the function to be parallelized can easily be swapped out for your own task.
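The parallelization pattern itself looks roughly like the sketch below; process_entry is a hypothetical stand-in for whatever per-record work fetch_MPdata.py actually does:

```python
# Sketch of the parallel post-processing pattern only; process_entry is a
# hypothetical stand-in, not the repo's actual function.
from multiprocessing import Pool, cpu_count

def process_entry(entry):
    # replace this body with your own post-processing task
    return {"material_id": entry["material_id"],
            "n_elements": len(entry["elements"])}

def parallel_process(entries, n_workers=None):
    # fan the entries out across a pool of worker processes
    with Pool(n_workers or cpu_count()) as pool:
        return pool.map(process_entry, entries)

if __name__ == "__main__":
    demo = [{"material_id": "mp-1", "elements": ["Fe", "O"]}]
    print(parallel_process(demo))
```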
[4] An example of obtaining economic data from FRED is presented in
fred_VAR.py. Vector autoregression (VAR) is used to analyze and forecast economic trends.
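A hedged sketch of that idea, assuming pandas_datareader for the FRED download and statsmodels for the VAR fit (the series IDs GDP and CPIAUCSL are examples, not necessarily what fred_VAR.py uses):

```python
# Hedged sketch of the FRED + VAR idea (not the actual fred_VAR.py).
import pandas_datareader.data as web
from statsmodels.tsa.api import VAR

df = web.DataReader(["GDP", "CPIAUCSL"], "fred", start="2000-01-01")
df = df.resample("QS").mean().pct_change().dropna()  # quarterly growth rates

model = VAR(df)
results = model.fit(maxlags=4, ic="aic")  # lag order selected by AIC
print(results.forecast(df.values[-results.k_ar:], steps=4))  # 4-step forecast
```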
[5] high_frequency_trade.py
demonstrates some basic feature engineering techniques applicable to
limit order books
for high-frequency trading tasks.
Simple linear regression is used here for fundamental feature analysis.
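Typical order-book features of this kind are the mid-price, the bid-ask spread, and the volume imbalance; the sketch below illustrates the idea with made-up data and column names (it is not the actual high_frequency_trade.py):

```python
# Hedged sketch of basic limit-order-book features; data is made up.
import pandas as pd
from sklearn.linear_model import LinearRegression

lob = pd.DataFrame({  # best bid/ask price and size per snapshot
    "bid_px": [99.98, 99.99, 100.00, 100.00, 99.99],
    "ask_px": [100.02, 100.02, 100.03, 100.02, 100.01],
    "bid_sz": [500, 420, 610, 550, 480],
    "ask_sz": [300, 350, 280, 330, 400],
})
lob["mid"] = (lob.bid_px + lob.ask_px) / 2  # mid-price
lob["spread"] = lob.ask_px - lob.bid_px     # bid-ask spread
lob["imbalance"] = (lob.bid_sz - lob.ask_sz) / (lob.bid_sz + lob.ask_sz)

# regress the next-step mid-price change on the current features
X = lob[["spread", "imbalance"]][:-1]
y = lob["mid"].diff().shift(-1).dropna()
print(LinearRegression().fit(X, y).coef_)
```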
[6] stocks.py
is a small program for stock beta prediction. It is just a toy model.
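For reference, the textbook beta estimate is Cov(stock, market) / Var(market); a tiny sketch of that calculation (not the actual stocks.py, and the return series below are made up):

```python
# Hedged sketch of the usual beta estimate, Cov(stock, market) / Var(market).
import numpy as np

stock_ret = np.array([0.010, -0.020, 0.015, 0.005, -0.010])  # made-up returns
mkt_ret   = np.array([0.008, -0.015, 0.012, 0.004, -0.007])

beta = np.cov(stock_ret, mkt_ret)[0, 1] / np.var(mkt_ret, ddof=1)
print(f"beta = {beta:.2f}")
```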
This repository contains multiple machine learning regression and classification model templates using the scikit-learn package (a minimal sketch of the shared template pattern appears after the list), including:
Simple linear regression
Linear SVR
AdaBoost regression
Gradient boosting regression/classification
Kernel ridge regression
SGD regression
Lasso-Lars regression
Multilayer perceptron regression
XGBoost regression
KNN classification
Support vector machine classification
Gaussian process classification
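All of these follow the same fit/predict pattern; the sketch below shows it with one of the listed models (gradient boosting regression on a synthetic dataset), and the other templates simply swap in a different estimator:

```python
# Minimal sketch of the shared template pattern, not the repo's exact script.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)
print("test R^2:", r2_score(y_test, model.predict(X_test)))
```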
We also provide tools to plot a correlation heatmap (correlation_heatmap.py)
as well as a learning curve (learning_curve.py).
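A hedged sketch of the correlation-heatmap idea, assuming seaborn on top of matplotlib (the random DataFrame stands in for your feature matrix; this is not the actual correlation_heatmap.py):

```python
# Hedged sketch of a correlation heatmap over a stand-in feature matrix.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("ABCDE"))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.show()
```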
This repository contains the major components required for a typical deep learning project, i.e.
main.py # user/developer interface which defines model parameters and workflow
data.py # driver program for data loader
model.py # the deep neural network model is defined here
predict.py
is not strictly necessary, but we include it for user convenience:
it bypasses the model training process and
directly loads the pre-trained network to make predictions.
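A hedged sketch of that predict.py idea in PyTorch; the tiny Net module and the pretrained.pt filename below are assumptions for illustration, not the repo's actual model or checkpoint:

```python
# Hedged sketch: skip training, load saved weights, run inference.
import torch
import torch.nn as nn

class Net(nn.Module):  # stand-in for whatever model.py defines
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.fc(x)

model = Net()
model.load_state_dict(torch.load("pretrained.pt", map_location="cpu"))
model.eval()  # disable dropout/batch-norm updates for inference
with torch.no_grad():
    print(model(torch.randn(4, 10)))
```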
I hope you find the contents of this repository interesting.
Spark_setup.md
We provide a step-by-step tutorial on how to build your own mini computer cluster
with the Slurm job scheduler and Apache Spark.
In this tutorial, we use two Dell Precision Tower workstations and two Raspberry Pi machines
to build the Spark cluster.
However, it is recommended to use machines with the same or similar architecture throughout the cluster,
which will improve performance.
Slurm_config
directory contains the configuration files for the Slurm job scheduler.
Spark can be integrated into the Slurm system for better job management.
Once you have the Spark cluster ready for job submission, several simple example Spark applications are provided (a word-count sketch follows the list):
word_count.py # PySpark program for distributed word-count tasks
regression_models.py # linear regression and decision tree regression models
cross_validation.py # an example of doing cross-validation in PySpark
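A hedged sketch of the word-count idea (not the actual word_count.py; the HDFS input path is a placeholder):

```python
# Hedged sketch of a distributed word count; the input path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///path/to/input.txt")
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum counts per word
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)
spark.stop()
```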