A conceptual roadmap to the available statistical approaches to managing & interpreting data

joaoDragado/dSc_roadmap

Data Science Roadmap

Embarking on a foolhardy journey into the world of statistical inference & predictive analytics.


Introduction

The repository contains an outline of the techniques available to an aspiring data scientist to manipulate data & glean information from it.

This repo was born out of necessity: the field of statistics has seen tremendous growth and has branched into paths that, while sharing the same aim, clearly diverge in philosophy & approach. Trying to organise these (and their supporting code bases) intelligently quickly becomes a job in itself.

The idea is that a clear conceptual roadmap to all approaches, coupled with their implementations on real-life examples, will assist the practitioner in answering the following questions:

  • "What kind of dataset do I need to investigate my claim/hypothesis?"

  • "Which statistical method/tool/test should I use with this dataset?"

  • "How can I interpret the results of my tests?"

  • "Are my results significant? Is my analysis worthy?"

The directory tree of the repository tries to distinguish between these different approaches, which should help in sorting through the breadth of available choices and, hopefully, in selecting the appropriate one(s).

Each concept is (ideally) demonstrated via a real-life example / short project.


Table of Contents

  • Descriptive a.k.a. Exploratory Data Analysis

    Collect summary information: mean, standard deviation, outliers, etc.

  • Inferential

    the predictive aspect of statistics.

    • Classical

      • Frequentist

      • Bayesian

    • Machine_Learning

      • Supervised

        Labeled data are provided

        • Regression

          Aims to predict a continuous-valued output, i.e. the answer lies within a range.

        • Classification

          Aims to predict a discrete-valued output, i.e. the answer will be True/False, Red/Blue/Green, etc.

      • Unsupervised

        The data are unlabeled; the model needs to distinguish significant features within the dataset.
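
As an illustration of the summary information listed under Descriptive (mean, standard deviation, outliers), here is a minimal sketch using only the Python standard library. The `summarize` function, the z-score threshold of 2.0 and the sample values are all hypothetical and are not taken from the repository's notebooks; a z-score cut-off is only one of several common ways to flag outliers.

```python
import statistics


def summarize(values, z_threshold=2.0):
    """Return the mean, sample standard deviation and z-score outliers.

    z_threshold is an assumed cut-off chosen for illustration only.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
    outliers = [
        v for v in values
        if stdev > 0 and abs(v - mean) / stdev > z_threshold
    ]
    return {"mean": mean, "stdev": stdev, "outliers": outliers}


# Hypothetical sample: nine similar readings and one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
summary = summarize(sample)
print(summary)
```

On this sample only the value 50 lies more than two standard deviations from the mean, so it is the single point flagged as an outlier.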


Directory Tree

  • generated via the tree bash command.
.
├── Descriptive
├── Inferential
│   ├── Classical
│   │   ├── Bayesian
│   │   └── Frequentist
│   │       └── Common_Statistical_Tests.html
│   └── Machine_Learning
│       ├── Choosing_ML_Algorithm_RoadMap.png
│       ├── data
│       │   ├── baseball.csv
│       │   ├── boston.csv
│       │   ├── nba_test.csv
│       │   ├── nba_train.csv
│       │   ├── quality.csv
│       │   ├── stevens.csv
│       │   ├── wine.csv
│       │   └── wine_test.csv
│       ├── Supervised
│       │   ├── Classification
│       │   │   ├── Classification_via_ItSL.html
│       │   │   ├── Medical_Diagnosis_via_Logistic_Regression.ipynb
│       │   │   └── Supreme_Court_Opinions.ipynb
│       │   └── Regression
│       │       ├── Boston_House_Prices--Decision_Trees.ipynb
│       │       ├── Linear_Regression_via_ItSL.html
│       │       ├── NBA_Moneyball.ipynb
│       │       └── Predicting_Wine_prices.ipynb
│       └── Unsupervised
└── README.md

11 directories, 18 files
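
Where the `tree` command is not available, the summary line above can be approximated in Python. This is a rough sketch, not how the listing in this README was produced: `count_tree` is a hypothetical helper, and it assumes hidden entries (names starting with a dot) should be skipped, which matches the default behaviour of `tree`. Counts may still differ from `tree` in edge cases such as symbolic links.

```python
import os


def count_tree(root):
    """Count the directories and files below root, skipping hidden entries."""
    n_dirs = 0
    n_files = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        n_dirs += len(dirnames)
        n_files += sum(1 for f in filenames if not f.startswith("."))
    return n_dirs, n_files


if __name__ == "__main__":
    dirs, files = count_tree(".")
    print("{} directories, {} files".format(dirs, files))
```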

Dependencies - Libraries used

  • Python 3.6 (code should also work for Python 2.7)
  • Anaconda Python Data Science Distribution: the easiest way to install & manage all the scientific libraries used in this repo.

Reference Links

  1. Efron Bradley, Hastie Trevor, "Computer Age Statistical Inference - Algorithms, Evidence, and Data Science", 2017, Cambridge University Press
