stanton119/data-analysis

Repo directory:

  • Projects are split by folders

Topic areas

Causal inference

  • Causal regression - notebook
  • Causal regression with DoWhy - notebook
  • Double machine learning and marginal effects - notebook

Machine vision

  • Using Google's mediapipe to try to simulate a 3D screen - folder
  • Using Google's mediapipe, measure the distance of a face to the screen from a webcam feed - folder
  • FashionCNN - Convolutional neural network for predicting the Fashion MNIST dataset - notebook
  • FashionCNN - Batch normalisation layer applied to the above CNN model - notebook

Neural networks

  • Autoencoders - Using PCA to compress MNIST images - notebook
  • Autoencoders - Using a dense autoencoder to compress MNIST images - notebook
  • Implementing an elastic net model in PyTorch - notebook
  • Fitting distributions with variational inference - Simple example fitting a Gaussian distribution to data with Pyro - notebook
  • Fitting distributions with variational inference - Simple example fitting a beta distribution to data with Pyro - notebook
  • Fitting a multimodal beta distribution with PyTorch - notebook
  • Fitting a zero inflated Poisson distribution with PyTorch - notebook
  • PyTorch: Linear regression to non linear probabilistic neural network - notebook
  • TensorFlow Probability: Linear regression to non linear probabilistic neural network - notebook
  • Trying out PyTorch Lightning - notebook
  • TensorFlow - Do neural networks overfit? - notebook
  • Fitting a normal distribution with TensorFlow Probability - notebook
  • Binary loss functions - Is there a material difference between using BCEWithLogitsLoss and CrossEntropyLoss for binary classification tasks? - No - notebook
  • Does initialising the output of a neural net to match your target distribution help? - Yes - notebook

Recommenders

  • Exploring multi-armed bandit benchmarks - notebook

Regression

  • Bootstrapping regression coefficients - Confirming theoretical regression coefficient distributions with bootstrapped samples - notebook
  • Interaction coefficients regularisation - notebook
  • Sequential Bayesian linear regression model - notebook
  • Bayesian regression adapting to non-stationary data - notebook
  • Binomial regression vs logistic regression - notebook
  • Investigating double descent with linear regression - notebook
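
As a rough illustration of the bootstrapping entry above, the sketch below resamples (x, y) pairs from a synthetic dataset and compares the spread of the refitted slopes with the textbook standard error of the slope. It is not taken from the notebook: the synthetic data, the function names (`fit_slope`, `theoretical_slope_se`, `bootstrap_slopes`) and the choice of the standard library over NumPy are all assumptions made for this example.

```python
import math
import random
import statistics


def fit_slope(xs, ys):
    """Ordinary least squares slope for a single-feature regression."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx


def theoretical_slope_se(xs, ys):
    """Classical standard error of the slope: sqrt(sigma^2 / Sxx)."""
    n = len(xs)
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    slope = fit_slope(xs, ys)
    intercept = y_mean - slope * x_mean
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)  # unbiased residual variance estimate
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return math.sqrt(sigma2 / sxx)


def bootstrap_slopes(xs, ys, n_boot, rng):
    """Refit the slope on n_boot resamples of (x, y) pairs drawn with replacement."""
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(fit_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return slopes


# Hypothetical data: y = 1 + 2x + Gaussian noise with unit variance.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

slopes = bootstrap_slopes(xs, ys, n_boot=1000, rng=rng)
boot_se = statistics.stdev(slopes)
theory_se = theoretical_slope_se(xs, ys)
print(f"bootstrap SE: {boot_se:.4f}, theoretical SE: {theory_se:.4f}")
```

With homoscedastic noise the two standard errors should be close; they are not expected to match exactly, because both are estimates from a finite sample.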

Time series

  • Speed of fitting and predicting with neuralprophet vs fbprophet - notebook
  • Can we fit long AR models with neuralprophet? - notebook

Tools/Python

  • Dask vs multiprocessing - Comparing the API of dask to multiprocessing for general functions - python
  • Parquet datasets - Writing dataframes to partitioned parquet files - notebook
  • Data generating functions from drawing data - notebook

Other

  • Analysis into European installed energy capacity - notebook
  • The Game of Life computed with convolution - folder
  • NBA - Analysis into LeBron James playing minutes - notebook
  • TFL - Analysis into the number of bike trips taken per day in London - notebook
  • NBA Score Trajectories - Flask app to show scores of a basketball match against time - repo
  • NBA Shooting - Kedro data pipelines to plot player scoring probability distributions - repo
  • The classic birthday problem - notebook
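
For readers unfamiliar with the birthday problem in the last entry, the calculation can be sketched in a few lines. This is an independent illustration, not code from the notebook: the function name `shared_birthday_probability` is hypothetical, and the sketch assumes 365 equally likely birthdays.

```python
def shared_birthday_probability(n_people, n_days=365):
    """Probability that at least two of n_people share a birthday.

    Assumes birthdays are independent and uniform over n_days.
    """
    if n_people > n_days:
        return 1.0  # pigeonhole principle: a shared birthday is certain
    p_all_distinct = 1.0
    for i in range(n_people):
        # The (i + 1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (n_days - i) / n_days
    return 1.0 - p_all_distinct


for n in (10, 23, 50):
    print(n, round(shared_birthday_probability(n), 4))
```

Under these assumptions, 23 people is the smallest group for which the probability exceeds one half.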

Installation

The analyses were built in Python 3.

Virtual environment setup

Some projects have their own requirements/environment. The general setup is installed by:

python3 -m venv dataAnalysisEnv
source dataAnalysisEnv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt

Markdown from Notebooks

jupyter nbconvert notebook.ipynb --to markdown

This is automated via GitHub Actions.

Standard library

The repo's custom library is installed as a development (dev) library so that it can continue to be developed.

VSCode

Use the settings.json file in the repo

Future areas

Tools/areas to explore

  • Deep learning
    • PyTorch
    • Embeddings
    • TensorFlow/PyTorch - 1D functions
    • FashionMNIST VAE
  • Causal inference
  • Data validations - great expectations
  • Computer vision
  • NLP
    • Keyword extraction from reviews etc.
    • Sentiment analysis
  • Gaussian processes
  • Bayesian regression
  • Recommender systems
    • Automatic playlist continuation
    • Thompson sampling example
  • Quantile regression in PyTorch
    • Lasso regression
    • Dropout better than regularisation?
  • Docker

Datasets to explore

Tasks

  • Build project template repo
  • Publish interpret-ml piece
  • NBA
    • Player position classification model
    • Bayesian sequential team rating
    • Player VAE - how are players related
      • College stats to NBA VAE
  • M5/M4 forecasting
  • PCA via embedding layer
  • NN to predict tempo from song, generate dummy dataset
    • NN to predict tab from music sections
  • Word embeddings plot with hiplot
    • Plot with PCA first and compare with hiplot
  • Compare linear regression MC dropout to theoretical results
  • Optimal car charging schedule based on energy prices or carbon output
  • MediaPipe - 3D audio
    • Face distance javascript web app with react
  • Covid UK plot against time on a map
  • Autoencoder using transfer learning?
    • what do we use for the decoder?
    • MNIST auto-encoder to digit classifier
  • Fit a sinusoid to noisy data
    • Fourier
    • Gradient descent
    • MCMC
    • Variational inference
  • Double dip loss trajectories
  • Fitting NNs to common functions (exp etc.), deep vs wide, number of parameters for given error
  • Fit a NN to seasonal data with fourier series components
  • Causal inference
    • DoubleML on heart data to find CATE
    • DoubleML on dummy data vs other causal models. How robust are they to model mis-specification and missing confounders?
    • Inverse propensity scoring - comparing different methods - manual Inverse Probability of Treatment Weighting, as variance in regression, sample weights, econML based. Do they match?
  • Hierarchical models
    • Mixed effects model - is it the same as a fixed effects model (lin/log regression) with one hot encoding for the categorical variables + a fixed effect?
    • Hierarchical Bayesian models - for when we have categorical features with shared effects over other features
    • Fit with MCMC
    • Similarities to ridge regression - only some coefficients are regularised
    • Generate data and fit each model
    • Ref
  • Linear regression = logistic regression, relationship to Linear Thompson Sampling
  • Blurred images classifier
    • ImageNet based, data augment to blur images.
  • Country embeddings - create country embeddings by predicting various macro level metrics (GDP, population etc. in a multi task model), from OHE through a NN. Does the embedding space make sense?
  • MovieLens dataset to get title embeddings, find nearest neighbour titles
    • Using word2vec to predict similar titles. Train on movies watched. Similar given as titles streamed by the same customer
      • Train embedding for movies based on sequential ordering. Predict the next/middle movie.
  • Finding similar images in a photo library - given a few examples find similar photos
    • Use an image net model. Find new example images, positive and negative. Fine tune the model via a classification task. Predict prob of positive result for unseen images. Use the latent space embeddings to find cosine similarity between images.
    • Build small image dataset from cifar 10. Compare models - PCA/logistic regression, CNN, efficientNet, transfer learnt weights
    • Build lookup table of image and its compact embedding. Given a new image find the inner product with the other images
  • Fourier transform via linear regression on sinusoids. Similar approach with Lasso regression to find compressed sensing approaches, with non-uniform sampling.
  • Multi task neural network training
    • train a single model to predict multiple ready fields from a single dataset
  • A/B test distribution comparison
    • We often compare just the means. Is a Q-Q plot more informative? Bootstrapping could be used to construct confidence intervals.
  • Non-stationarity with Adam
    • Can Adam optimisers adapt to non-stationary datasets? If so, does batch ordering make a difference to the model coefficients?
    • Compare against batch mode linear/logistic bayesian regression and show that data ordering is irrelevant.
  • Beta-Bernoulli bandit vs logistic regression with no features
  • NN multi-row vs multi-column - do they perform similarly?
  • Multi horizon forecasting direct method - with shared NN architecture - compare separate models for each horizon with a NN that shares layers. Compare with sequence to sequence models.
  • Gaussian process from scratch
  • Probabilistic neural networks
    • Normalizing flows - model complex distributions with transformations of gaussians
    • Can we train an output layer as a Gaussian mixture to model complex distributions via gradient descent?
  • Common data science tasks
    • Why do we need a model to find relationships? Conditional relationships are easier to define
    • Association and causality
      • Fast associations - cluster propensity model predictions and find the average features in each group.
      • Causality - DoubleML etc. to remove confounding features and find conditional average treatment effects. How this relates to groupby and average.
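
To make one of the task ideas above concrete, here is a minimal sketch of the "Beta-Bernoulli vs logistic regression with no features" comparison. It is only one possible reading of that task: the function names, the uniform Beta(1, 1) prior, the gradient-ascent fit and the example counts are all assumptions. The point it illustrates is that an intercept-only logistic regression converges to the observed success rate, while the Beta posterior mean shrinks that rate slightly towards the prior.

```python
import math


def beta_posterior_mean(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Beta(prior_a, prior_b) prior after Bernoulli observations."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)


def intercept_only_logistic(successes, failures, lr=1.0, steps=500):
    """Fit a logistic regression with only an intercept by gradient ascent.

    Returns the fitted success probability, sigmoid(intercept).
    """
    n = successes + failures
    observed_rate = successes / n
    intercept = 0.0
    for _ in range(steps):
        predicted = 1.0 / (1.0 + math.exp(-intercept))
        # Gradient of the mean log-likelihood with respect to the intercept.
        intercept += lr * (observed_rate - predicted)
    return 1.0 / (1.0 + math.exp(-intercept))


# Hypothetical counts: 30 successes and 70 failures.
mle_rate = intercept_only_logistic(30, 70)
posterior_rate = beta_posterior_mean(30, 70)
print(f"logistic MLE: {mle_rate:.4f}, Beta posterior mean: {posterior_rate:.4f}")
```

The two estimates differ only through the prior pseudo-counts, so the gap closes as more data arrives.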

TODO

  • rename environment/requirements files to match the notebook
  • update blog articles for markdown images
    • TFL, Pyro notebooks re-run
  • Add year to each analysis link

About

Various data analysis work
