|
1 | 1 |
|
| 2 | + |
| 3 | + |
| 4 | +20221112 - nyc exercise |
| 5 | +- data |
| 6 | + - mon 11/07 - 65 row -> 65 |
| 7 | + - tue 11/08 - 0 -> 65 |
| 8 | + - wed 11/09 - 30 bike + 45 bike -> 140 |
| 9 | + - thu 11/10 - 30 bike + 45 bike + 5 stair -> 220 |
| 10 | + - fri 11/11 - 30 bike -> 250 |
| 11 | + |
| 12 | + |
| 13 | +20221111 - pydata nyc 2022 - cool stuff to look at |
| 14 | +- check env from the causal inf talk |
| 15 | + - https://github.com/ronikobrosly/pydata_nyc_2022/blob/main/check_environment.py |
| 16 | + |
| 17 | + |
| 18 | + |
| 19 | + |
| 20 | +20221111 - pydata nyc 2022 - causal inference |
| 21 | +- 3 types of causal relationships |
| 22 | + - confounder |
| 23 | + - a confounder is a third variable that causes both the treatment and the outcome |
| 24 | + - always need to control for it |
| 25 | + - ie: smoking leads to cancer and also to coffee drinking |
| 26 | + - but if smoking is not taken into account, it looks like coffee correlates with cancer |
| 27 | + - ie: a correlation between ice cream sales and violent crime |
| 28 | + - but hot weather is the confounder, since it leads to both ice cream sales and violent crime |
| 29 | + - colliders (don't want to control for) |
| 30 | + - inverse of confounder |
| 31 | + - if smoking is correlated to lung cancer |
| 32 | + - collider is # of sick days |
| 33 | + - mediators (don't want to control for) |
| 34 | + - sits between treatment and outcome |
| 35 | + - ie: clinical signs of lung damage |
| 36 | + - if control for lung damage, will lose the causal relationship between smoking and lung cancer |
| 37 | + |
| 38 | +- confounder |
| 39 | + - need to condition for it |
| 40 | + - can remove the data (bayes) |
| 41 | + - use a model |
| 42 | + |
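A minimal sketch of the ice cream / violent crime example, assuming made-up effect sizes and simulated data (nothing here comes from the talk's code). Hot weather drives both variables, ice cream has no effect on crime, and the correlation only disappears once we condition on the weather.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n=20000, seed=0):
    """Hot weather (the confounder) raises both ice cream sales and crime."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hot = rng.random() < 0.5
        ice_cream = 2.0 * hot + rng.gauss(0, 1)  # assumed effect size
        crime = 2.0 * hot + rng.gauss(0, 1)      # ice cream is not an input
        rows.append((hot, ice_cream, crime))
    return rows


rows = simulate()

# Ignoring the confounder: ice cream and crime look related.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Conditioning on the confounder: correlate within each weather stratum.
within = {
    hot: pearson([r[1] for r in rows if r[0] == hot],
                 [r[2] for r in rows if r[0] == hot])
    for hot in (False, True)
}

print(f"overall correlation: {overall:.2f}")
print({hot: round(c, 2) for hot, c in within.items()})
```

Stratifying works here only because the confounder is a single binary variable; with many or continuous confounders, a model is the more practical way to condition.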
| 43 | + |
| 44 | +- traditional variable importance methods don't tell you anything about causality |
| 45 | + - shap, permutation importance could be an issue |
| 46 | + - don't condition based on these metrics |
| 47 | + |
| 48 | +- assumptions of causal inference |
| 49 | + - temporality |
| 50 | + - |
| 51 | + |
| 52 | +- g-computation |
| 53 | + - also look into propensity score matching |
| 54 | + - want to avoid including collider vars in model vars that will be conditioned for |
| 55 | + - as opposed to doing linear regression, can use xgboost for the model |
| 56 | + - with LR, can get some information from the gradients for the causation effect |
| 57 | + |
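A rough sketch of the g-computation idea, under stated assumptions: the data is simulated, the confounder `z` and treatment `t` are binary, and the outcome model is just the mean outcome in each (t, z) cell rather than a linear regression or xgboost. The steps are: fit an outcome model on treatment plus confounders, predict every row with treatment forced to 1 and then to 0, and average the difference.

```python
import random
from collections import defaultdict


def simulate(n=20000, seed=1):
    """Simulated data where the true effect of t on y is 1.0 (assumed)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = int(rng.random() < 0.5)                  # confounder
        t = int(rng.random() < (0.8 if z else 0.2))  # treatment depends on z
        y = 1.0 * t + 2.0 * z + rng.gauss(0, 1)      # outcome
        rows.append((z, t, y))
    return rows


def fit_outcome_model(rows):
    """Outcome model E[y | t, z]: the mean of y in each (t, z) cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for z, t, y in rows:
        sums[(t, z)] += y
        counts[(t, z)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


def g_computation(rows):
    """Average predicted outcome under t=1 minus under t=0, over all rows."""
    model = fit_outcome_model(rows)
    treated = [model[(1, z)] for z, _, _ in rows]
    untreated = [model[(0, z)] for z, _, _ in rows]
    return sum(treated) / len(rows) - sum(untreated) / len(rows)


def naive_difference(rows):
    """Difference in mean outcome between treated and untreated rows."""
    y1 = [y for _, t, y in rows if t == 1]
    y0 = [y for _, t, y in rows if t == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


rows = simulate()
naive = naive_difference(rows)
ate = g_computation(rows)
print(f"naive difference: {naive:.2f}, g-computation estimate: {ate:.2f}")
```

The naive difference is inflated because treated rows are more likely to have `z = 1`; the g-computation estimate should land near the simulated effect of 1.0. Swapping the cell-mean model for a fitted regressor keeps the same predict-under-both-treatments structure.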
| 58 | + |
| 59 | + |
| 60 | + |
| 61 | +20221111 - pydata nyc 2022 - serving pytorch models in production |
| 62 | +- walmart |
| 63 | + - tf models used |
| 64 | + - java shop, so serve models using JNI (java native interface) |
| 65 | + - wanted to use BERT in pytorch |
| 66 | + |
| 67 | +- data used |
| 68 | + - amazon berkeley objects data |
| 69 | + |
| 70 | +- optimizing models for production |
| 71 | + - post training optimization -> quantization |
| 72 | + - store tensors at lower bit widths than floating point precision; reduces memory and speeds up computation |
| 73 | + - hardware int8 computations 2 to 4x faster than fp32 computations |
| 74 | + - ie |
| 75 | + model_int = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)  # model = trained fp32 module (name assumed) |
| 76 | + - |
| 77 | + |
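To make the memory/precision trade-off concrete, here is a simplified pure-Python sketch of symmetric int8 quantization. It is an illustration of the idea only, not what `quantize_dynamic` does internally (PyTorch's scheme also handles zero points and picks activation scales at runtime); the weight values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to ints in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [qi * scale for qi in q]


weights = [0.82, -1.5, 0.003, 0.47, -0.29]  # made-up fp32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q, scale)
print(f"max round-trip error: {max_error:.5f}")
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per value.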
| 78 | + |
| 79 | + |
| 80 | +20221111 - pydata nyc 2022 - DL for time series analysis |
| 81 | +- code |
| 82 | + - https://www.kaggle.com/isaacmg/code |
| 83 | + |
| 84 | +- history of time series DL |
| 85 | + - 2015 - vanilla lstm |
| 86 | + - 2017 - GRUs, DA-RNN |
| 87 | + - 2019 - transformers |
| 88 | + - 2020 - emergence of DNN |
| 89 | + |
| 90 | +- forecasting, trying to determine feature variable at a future time step |
| 91 | +- classification, trying to assign label to sequence of time steps |
| 92 | +- time series analysis - assign 0/1 label to sequence of time steps |
| 93 | + |
| 94 | +- pain points |
| 95 | + - how to incorporate additional information |
| 96 | + |
| 97 | +- time series forecasting industry is fractured |
| 98 | + |
| 99 | +- stack overflow q (igodfried) |
| 100 | + - training loss is NaN in keras nn |
| 101 | + |
| 102 | + |
| 103 | +- q/a session |
| 104 | + - need 50k rows for good |
| 105 | + - for transformers, can spend days on hyperparameter searching |
| 106 | + - harder to do transfer learning for time series NNs than for NLP |
| 107 | + |
| 108 | + |
| 109 | +20221111 - pydata nyc 2022 - dask tutorial |
| 110 | +- code |
| 111 | + - https://github.com/mrocklin/dask-tutorial |
| 112 | + |
| 113 | +- avocado forecast flow example |
| 114 | + - https://www.kaggle.com/code/isaacmg/avocado-price-forecasting-with-flow-forecast-ff |
| 115 | + - use 4 past time steps to forecast one time step |
| 116 | + - use https://wandb.ai/site for logging model training |
| 117 | + - can get figures of loss rate, etc |
| 118 | +- avocado GRU example, probabilistic model |
| 119 | + - https://www.kaggle.com/code/isaacmg/probablistic-gru-avocado-price-forecast |
| 120 | +- avocado multi region transformer |
| 121 | + - https://www.kaggle.com/code/isaacmg/multi-region-transformer |
| 122 | + |
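The "use 4 past time steps to forecast one time step" setup in the avocado flow example above amounts to a sliding window over the series. A minimal sketch, with a hypothetical helper name and made-up prices (this is not code from the linked notebooks):

```python
def make_windows(series, history=4):
    """Split a series into (past `history` values, next value) training pairs."""
    pairs = []
    for i in range(len(series) - history):
        pairs.append((series[i:i + history], series[i + history]))
    return pairs


prices = [1.33, 1.35, 0.93, 1.08, 1.28, 1.26]  # made-up weekly prices
for past, target in make_windows(prices):
    print(past, "->", target)
```

A series shorter than `history + 1` yields no pairs, so very short regions would contribute nothing to training.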
| 123 | + |
| 124 | +20221110 - pydata nyc interesting for DCC |
| 125 | + |
| 126 | +- products |
| 127 | + - DVC for data version control |
| 128 | + |
| 129 | + |
| 130 | + |
| 131 | + |
| 132 | +20221110 - pydata nyc 2022 - ML at scale for finances (quansight team) |
| 133 | +- look at |
| 134 | + - amd ROCM similar to nvidia CUDA |
| 135 | + - dask kubernetes |
| 136 | + |
| 137 | + |
| 138 | +- gpu ecosystem |
| 139 | + - pytorch, tf, numba, dask, rapids, heavy.ai, cuDF, blazingsql (deprecated in favor of dask-sql) |
| 140 | + |
| 141 | +- lessons |
| 142 | + - io ops from gpu to host are very expensive |
| 143 | + - python cuda ecosystem is great |
| 144 | + |
| 145 | +- libs |
| 146 | + - dask |
| 147 | + - prefect data workflow orchestration tool |
| 148 | + - argo workflows |
| 149 | + - specifically for kubernetes? |
| 150 | + |
| 151 | + |
| 152 | + |
| 153 | + |
| 154 | + |
| 155 | + |
| 156 | + |
| 157 | +20221110 - pydata nyc 2022 - data and model version control in drug discovery pipelines |
| 158 | +- barreto-ojeda, cyclica inc |
| 159 | + |
| 160 | +- numbers |
| 161 | + - genomics, 25k genes |
| 162 | + - rna, transcriptomics 1m transcripts |
| 163 | + - protein, proteomics 20m protein (from alphafold) |
| 164 | + - also meta published 1m proteins |
| 165 | + - metabolites, metabolomics 5k |
| 166 | + |
| 167 | +- data |
| 168 | + - low number of observations (samples), high number of variables (features) |
| 169 | + - ie: 1 sample can get 100 tumors |
| 170 | + - very high dimensional data |
| 171 | + - protocols are not always reproducible |
| 172 | + - research based data |
| 173 | + - more data on hot topics |
| 174 | + - complex biological data |
| 175 | + - dissimilar (diverse format and contents) |
| 176 | + - imbalanced (more data for a given feature) |
| 177 | + - redundant |
| 178 | + - sparse (lacks annotations) |
| 179 | + |
| 180 | +- DVC (data version control) |
| 181 | + - open source, works with all git providers (github, gitlab, bitbucket) |
| 182 | + |
| 183 | + |
| 184 | +- code |
| 185 | + - pip list | grep dvc |
| 186 | + |
| 187 | + - to display chem structure |
| 188 | + import pandas as pd |
| 189 | + data = pd.read_csv('./data/initial_data.csv') |
| 190 | + data.head() |
| 191 | + from rdkit import Chem |
| 192 | + from rdkit.Chem.Draw import IPythonConsole |
| 193 | + IPythonConsole.drawOptions.addAtomIndices = True |
| 194 | + IPythonConsole.molSize = 400,400 |
| 195 | + |
| 196 | + mol = Chem.MolFromSmiles(data['smiles'][1]) |
| 197 | + mol |
| 198 | + import pubchempy as pcp |
| 199 | + |
| 200 | + id_ = pcp.get_compounds(data['smiles'][1], 'smiles')[0] |
| 201 | + id_.synonyms[0] |
| 202 | + |
| 203 | + |
| 204 | +- dvc steps |
| 205 | + - dvc add |
| 206 | + - dvc dag -> will print dependencies |
| 207 | + |
| 208 | + |
| 209 | + |
| 210 | + |
2 | 211 | 20221110 - pydata nyc 2022 - workflow engines by sanjay from Akasa
|
3 | 212 | - akasa company
|
4 | 213 | - number 1 cause of bankruptcy is medical bills
|
|
7 | 216 |
|
8 | 217 | - use ML to automate tasks
|
9 | 218 |
|
| 219 | +- two types of automation |
| 220 | + - workflow automation |
| 221 | + - move data from EHR to insurance UI |
| 222 | + - read response from website UI |
| 223 | + - update state of the claim in the EHR |
| 224 | + - information extraction |
| 225 | + |
| 226 | +- saying in healthcare |
| 227 | + - if you've seen one hospital, you've seen one hospital |
| 228 | + |
| 229 | +- technical |
| 230 | + - the workflow steps are async |
| 231 | + - also not safe to redo steps if a later step fails |
| 232 | + - with 10 steps, if step 5 fails, it is not necessarily safe to redo steps 1 to 4 |
| 233 | + - make workflow steps replayable by having each step issue a token for its request |
| 234 | + - if restarted from the start, identical tokens will just return data from cache, not rerun the step |
| 235 | + - steps implemented in functions that are decorated with @workflow.task |
| 236 | + - the decorator pushes the function onto a queue |
| 237 | + - with workers getting workflow tasks from queue, never blocking |
| 238 | + - workers pick up other tasks that are queued up while waiting for the other tasks to complete |
| 239 | + |
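The token idea above can be sketched roughly as follows. This is a hypothetical, in-memory illustration, not Akasa's implementation: the `task` decorator, the token format, and `submit_claim` are all made-up names, and the queue and worker side of the real system is left out.

```python
import functools

_results = {}  # (step name, token) -> cached result; a real system would persist this


def task(func):
    """Hypothetical decorator: run a step at most once per token."""
    @functools.wraps(func)
    def wrapper(token, *args, **kwargs):
        key = (func.__name__, token)
        if key not in _results:
            _results[key] = func(*args, **kwargs)
        return _results[key]
    return wrapper


calls = []  # records real executions, to show that replays do not rerun the step


@task
def submit_claim(claim_id):
    calls.append(claim_id)  # stands in for a side effect that must not repeat
    return f"submitted {claim_id}"


def run_workflow(run_id):
    # Tokens are derived from the run id, so a restart reuses the same tokens.
    return submit_claim(f"{run_id}:submit", "claim-42")


first = run_workflow("run-1")
replayed = run_workflow("run-1")  # restart from the start: served from cache
print(first, replayed, calls)
```

Because the replay hits the cache, restarting a failed workflow from step 1 does not repeat the side effects of steps that already completed.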
| 240 | + |
| 241 | +- AWS example |
| 242 | + - when starting an EC2 instance |
| 243 | + - create network int, EBS FS, etc |
| 244 | + |
| 245 | + |
| 246 | + |
| 247 | + |
10 | 248 |
|
11 | 249 | 20221110 - pydata nyc 2022 - stitchfix feature engineering framework
|
12 | 250 | - book
|
|