Read me

Repo directory:

Projects are organised into folders by topic.

Topic areas

Causal inference

Causal regression - notebook
Causal regression with DoWhy - notebook
Double machine learning and marginal effects - notebook

Machine vision

Using Google's MediaPipe to try to simulate a 3D screen - folder
Using Google's MediaPipe to measure the distance of a face from the screen using a webcam feed - folder
FashionCNN - Convolutional neural network for predicting the Fashion MNIST dataset - notebook
FashionCNN - Batch normalisation layer applied to the above CNN model - notebook

Neural networks

Autoencoders - Using PCA to compress MNIST images - notebook
Autoencoders - Using a dense autoencoder to compress MNIST images - notebook
Implementing an elastic net model in PyTorch - notebook
Fitting distributions with variational inference - Simple example fitting a Gaussian distribution to data with Pyro - notebook
Fitting distributions with variational inference - Simple example fitting a beta distribution to data with Pyro - notebook
Fitting a multimodal beta distribution with PyTorch - notebook
Fitting a zero inflated Poisson distribution with PyTorch - notebook
PyTorch: Linear regression to non linear probabilistic neural network - notebook
TensorFlow Probability: Linear regression to non linear probabilistic neural network - notebook
Trying out PyTorch Lightning - notebook
TensorFlow - Do neural networks overfit? - notebook
Fitting a normal distribution with TensorFlow Probability - notebook
Binary loss functions - Is there a material difference between using BCEWithLogitsLoss and CrossEntropyLoss for binary classification tasks? - No - notebook
Does initialising the output of a neural net to match your target distribution help? - Yes - notebook

Recommenders

Exploring multi-armed bandit benchmarks - notebook

Regression

Bootstrapping regression coefficients - Confirming theoretical regression coefficient distributions with bootstrapped samples - notebook
Interaction coefficients regularisation - notebook
Sequential Bayesian linear regression model - notebook
Bayesian regression adapting to non-stationary data - notebook
Binomial regression vs logistic regression - notebook
Investigating double descent with linear regression - notebook

Time series

Fitting and prediction speed of NeuralProphet vs fbprophet - notebook
Can we fit long AR models with NeuralProphet? - notebook

Tools/Python

Dask vs multiprocessing - Comparing the API of dask to multiprocessing for general functions - python
Parquet datasets - Writing dataframes to partitioned Parquet files - notebook
Data generating functions from drawing data - notebook

Other

Analysis of European installed energy capacity - notebook
The Game of Life computed with convolution - folder
NBA - Analysis of LeBron James's playing minutes - notebook
TFL - Analysis of the number of bike trips taken per day in London - notebook
NBA Score Trajectories - Flask app to show scores of a basketball match against time - repo
NBA Shooting - Kedro data pipelines to plot player scoring probability distributions - repo
The classic birthday problem - notebook

Installation

The analyses were built with Python 3.

Virtual environment setup

Some projects have their own requirements/environment files. The general environment is set up with:

python3 -m venv dataAnalysisEnv
source dataAnalysisEnv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt

Markdown from Notebooks

jupyter nbconvert notebook.ipynb --to markdown

This is automated via GitHub Actions.
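
As a rough illustration of batch conversion, the sketch below builds the same `jupyter nbconvert ... --to markdown` command for every notebook under a folder. It is not the repo's actual automation: the function names `build_nbconvert_commands` and `convert_all`, and the choice to skip `.ipynb_checkpoints` folders, are assumptions made for this example.

```python
import subprocess
from pathlib import Path


def build_nbconvert_commands(root):
    """Return one nbconvert command (as an argument list) per notebook under root."""
    commands = []
    for notebook in sorted(Path(root).rglob("*.ipynb")):
        # Skip Jupyter's autosave copies (an assumption, not a repo convention).
        if ".ipynb_checkpoints" in notebook.parts:
            continue
        commands.append(["jupyter", "nbconvert", str(notebook), "--to", "markdown"])
    return commands


def convert_all(root, dry_run=True):
    """Print each command; only run it when dry_run is False."""
    for command in build_nbconvert_commands(root):
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if a conversion fails.
            subprocess.run(command, check=True)
```

Calling `convert_all(".")` only prints the commands; `convert_all(".", dry_run=False)` runs them and requires `jupyter` and `nbconvert` to be installed in the active environment.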

Standard library

Custom library installed as a dev library for continued development

VSCode

Use the settings.json file in the repo

Future areas

Tools/areas to explore

Deep learning
- PyTorch
- Embeddings
- TensorFlow/PyTorch - 1D functions
- FashionMNIST VAE
Causal inference
Data validation - Great Expectations
- https://github.com/tamsanh/kedro-great
Computer vision
NLP
- Keyword extraction from reviews etc.
- Sentiment analysis
Gaussian processes
Bayesian regression
Recommender systems
- Automatic playlist continuation
- Thompson sampling example
Quantile regression in PyTorch
- Lasso regression
- Dropout better than regularisation?
Docker

Datasets to explore

Tasks

Build project template repo
Publish interpret-ml piece
NBA
- Player position classification model
- Bayesian sequential team rating
- Player VAE - how are players related
  - College stats to NBA VAE
M5/M4 forecasting
- Walmart demand forecasting
- with LightGBM
- Greykite
  - https://arxiv.org/abs/2105.01098
  - https://towardsdatascience.com/linkedins-response-to-prophet-silverkite-and-greykite-4fd0131f64cb
  - Imputation of missing regressors
  - Change points in seasonalities
  - Quantile loss
  - Utilities for diagnostics
  - Faster inference
  - Autoregressive
- Orbit
  - https://eng.uber.com/orbit/
PCA via embedding layer
NN to predict tempo from song, generate dummy dataset
- NN to predict tab from music sections
Word embeddings plot with hiplot
- Plot with PCA first and compare with hiplot
Compare linear regression MC dropout to theoretical results
Optimal car charging schedule based on energy prices or carbon output
MediaPipe - 3D audio
- Face distance JavaScript web app with React
Covid UK plot against time on a map
- https://www.reddit.com/r/dataisbeautiful/comments/pay78n/oc_active_covid19_cases_per_capita_in_usa_1212020/
Autoencoder using transfer learning?
- what do we use for the decoder?
- MNIST auto-encoder to digit classifier
Fit a sinusoid to noisy data
- Fourier
- Gradient descent
- MCMC
- Variational inference
Double dip loss trajectories
Fitting NNs to common functions (exp etc.), deep vs wide, number of parameters for given error
Fit a NN to seasonal data with fourier series components
Causal inference
- DoubleML on heart data to find CATE
- DoubleML on dummy data vs other causal models. How robust are they to model mis-specification and missing confounders?
- Inverse propensity scoring - comparing different methods: manual Inverse Probability of Treatment Weighting, as variance in regression, sample weights, EconML based. Do they match?
Hierarchical models
- Mixed effects model - is it the same as a fixed effects model (lin/log regression) with one hot encoding for the categorical variables + a fixed effect?
- Hierarchical Bayesian models - for when we have categorical features with shared effects over other features
- Fit with MCMC
- Similarities to ridge regression - only some coefficients are regularised
- Generate data and fit each model
- Ref
  - https://www.youtube.com/watch?v=38yOWMMCeMk&list=WL&index=5
Linear regression = logistic regression, relationship to Linear Thompson Sampling
Blurred images classifier
- ImageNet based, using data augmentation to blur images.
Country embeddings - create country embeddings by predicting various macro level metrics (GDP, population etc. in a multi task model), from OHE through a NN. Does the embedding space make sense?
MovieLens dataset to get title embeddings, find nearest neighbour titles
- Using word2vec to predict similar titles. Train on movies watched. Similar given as titles streamed by the same customer
  - Train embedding for movies based on sequential ordering. Predict the next/middle movie.
Finding similar images in a photo library - given a few examples find similar photos
- Use an ImageNet model. Find new example images, positive and negative. Fine tune the model via a classification task. Predict the probability of a positive result for unseen images. Use the latent space embeddings to find the cosine similarity between images.
- Build a small image dataset from CIFAR-10. Compare models - PCA/logistic regression, CNN, EfficientNet, transfer learnt weights
- Build lookup table of image and its compact embedding. Given a new image find the inner product with the other images
Fourier transform via linear regression on sinusoids. Similar approach with Lasso regression to find compressed sensing approaches, with non-uniform sampling.
Multi task neural network training
- train a single model to predict multiple ready fields from a single dataset
A/B test distribution comparison
- We often compare just the means. Is a Q-Q plot more informative? Bootstrapping could be used to construct confidence intervals.
Non-stationarity with Adam
- Can Adam optimisers adapt to non-stationary datasets? If so, does batch ordering make a difference to the model coefficients?
- Compare against batch mode linear/logistic Bayesian regression and show that data ordering is irrelevant.
Beta-Bernoulli bandit vs logistic regression with no features
NN multi-row vs multi-column - do they perform similarly?
Multi horizon forecasting direct method - with shared NN architecture - compare separate models for each horizon with a NN that shares layers. Compare with sequence to sequence models.
Gaussian process from scratch
- ref - https://www.youtube.com/watch?v=HA-VHNVbvwQ&list=WL&index=26
Probabilistic neural networks
- Normalizing flows - model complex distributions with transformations of Gaussians
- Can we train an output layer as a Gaussian mixture to model complex distributions via gradient descent?
Common data science tasks
- Why do we need a model to find relationships? Conditional relationships are easier to define
- Association and causality
  - Fast associations - cluster propensity model predictions and find the average features in each group.
  - Causality - DoubleML etc. to remove confounding features and find conditional average treatment effects. How does this relate to groupby and average?
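
For the A/B test distribution comparison task above, one possible starting point is a percentile bootstrap on the difference between the quantiles of the two groups, which is the numeric counterpart of reading a Q-Q plot. This is only a sketch using the standard library: the function name `bootstrap_quantile_diff`, its parameters and the default of deciles are assumptions for illustration and are not taken from any notebook in the repo.

```python
import random
import statistics


def bootstrap_quantile_diff(a, b, n_quantiles=10, n_boot=500, alpha=0.05, seed=0):
    """Percentile-bootstrap intervals for (quantiles of b) minus (quantiles of a).

    Returns a list of (lower, upper) tuples, one per cut point
    (n_quantiles - 1 of them, so nine for deciles).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the original sizes.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        q_a = statistics.quantiles(resample_a, n=n_quantiles)
        q_b = statistics.quantiles(resample_b, n=n_quantiles)
        draws.append([qb - qa for qa, qb in zip(q_a, q_b)])

    intervals = []
    # zip(*draws) groups the bootstrap draws by quantile cut point.
    for per_quantile in zip(*draws):
        ordered = sorted(per_quantile)
        lower = ordered[int((alpha / 2) * (n_boot - 1))]
        upper = ordered[int((1 - alpha / 2) * (n_boot - 1))]
        intervals.append((lower, upper))
    return intervals
```

If every interval excludes zero by a similar amount, the variant looks like a pure shift of the control. Intervals that differ across quantiles suggest a change in shape that a comparison of means would miss.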

TODO

rename environment/requirements files to match the notebook
update blog articles for markdown images
- TFL, Pyro notebooks re-run
Add year to each analysis link
