Commit b25cb44

committed
notes: adding pydata nyc 2022 day 03 talks
1 parent 9ead65a commit b25cb44

1 file changed

+238
-0
lines changed

Notes/ML/2020mlScratchPad.txt

Lines changed: 238 additions & 0 deletions
@@ -1,4 +1,213 @@
11

2+
3+
4+
20221112 - nyc exercise
5+
- data
6+
- mon 11/07 - 65 row -> 65
7+
- tue 11/08 - 0 -> 65
8+
- wed 11/09 - 30 bike + 45 bike -> 140
9+
- thu 11/10 - 30 bike + 45 bike + 5 stair -> 220
10+
- fri 11/11 - 30 bike -> 250
11+
12+
13+
20221111 - pydata nyc 2022 - cool stuff to look at
14+
- check env from the causal inf talk
15+
- https://github.com/ronikobrosly/pydata_nyc_2022/blob/main/check_environment.py
16+
17+
18+
19+
20+
20221111 - pydata nyc 2022 - causal inference
21+
- 3 types of causal relationships
22+
- confounder
23+
- a confounder is a third variable that causes both the treatment and the outcome
24+
- always need to control for it
25+
- ie: smoking leads to cancer and leads to coffee drinking
26+
- but if smoking is not taken into account, it looks like coffee correlates with cancer
27+
- ie: correlation between ice cream sales and violent crime
28+
- but hot weather is the confounder, since it leads to both ice cream sales and violent crime
29+
- colliders (don't want to control for)
30+
- inverse of a confounder: a variable caused by both the treatment and the outcome
31+
- if smoking is correlated to lung cancer
32+
- collider is # sick days
33+
- mediators (don't want to control for)
34+
- sits between treatment and outcome
35+
- ie: clinical signs of lung damage
36+
- if you control for lung damage, you lose the causal relationship between smoking and lung cancer
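
A minimal simulation of the ice cream / violent crime confounder example, assuming made-up probabilities (0.8/0.2 for ice cream sales, 0.3/0.1 for crime) that are not from the talk. Crime depends only on hot weather here, so the naive gap between high and low ice cream days is spurious and should mostly vanish once the data is split by weather.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

rows = []
for _ in range(20000):
    hot = rng.random() < 0.5                           # confounder: hot weather
    ice_cream = rng.random() < (0.8 if hot else 0.2)   # "treatment" depends on the confounder
    crime = rng.random() < (0.3 if hot else 0.1)       # outcome depends ONLY on the confounder
    rows.append((hot, ice_cream, crime))


def crime_rate(subset):
    """Fraction of rows in the subset where crime occurred."""
    return sum(crime for _, _, crime in subset) / len(subset)


def rate_gap(subset):
    """Crime rate on high ice cream days minus crime rate on low ice cream days."""
    with_ic = [r for r in subset if r[1]]
    without_ic = [r for r in subset if not r[1]]
    return crime_rate(with_ic) - crime_rate(without_ic)


naive_gap = rate_gap(rows)                           # ignores the weather
gap_hot = rate_gap([r for r in rows if r[0]])        # conditioned on hot days
gap_cold = rate_gap([r for r in rows if not r[0]])   # conditioned on cold days

print(f"naive gap: {naive_gap:.3f}")
print(f"gap on hot days: {gap_hot:.3f}, gap on cold days: {gap_cold:.3f}")
```

Splitting by the confounder is the simplest form of "controlling" for it; doing the same split on a collider or mediator would distort the estimate instead.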
37+
38+
- confounder
39+
- need to condition for it
40+
- can remove the data (bayes)
41+
- use a model
42+
43+
44+
- traditional variable importance methods don't tell you anything about causality
45+
- shap, permutation importance could be issue
46+
- don't condition based on these metrics
47+
48+
- assumptions of causal inference
49+
- temporality
50+
-
51+
52+
- g-computation
53+
- also look into propensity score matching
54+
- want to avoid including collider vars among the model vars that will be conditioned on
55+
- as opposed to doing linear regression, can use xgboost for the model
56+
- with LR, can get some information from the gradients about the causal effect
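
A minimal sketch of g-computation on simulated data, assuming a single binary confounder `z`, a binary treatment `t`, and a true effect of 2.0 (all invented for illustration, not taken from the talk). The outcome model is just the mean of `y` in each `(t, z)` cell; a linear regression or xgboost model would slot into `predict` the same way.

```python
import random
from collections import defaultdict

rng = random.Random(1)
TRUE_EFFECT = 2.0  # assumed ground truth, known only because the data is simulated

data = []
for _ in range(20000):
    z = int(rng.random() < 0.5)                    # confounder
    t = int(rng.random() < (0.8 if z else 0.2))    # treatment, more likely when z = 1
    y = TRUE_EFFECT * t + 3.0 * z + rng.gauss(0, 1)
    data.append((z, t, y))

# Step 1: fit an outcome model E[Y | T, Z]; here simply the mean of y per (t, z) cell.
sums, counts = defaultdict(float), defaultdict(int)
for z, t, y in data:
    sums[(t, z)] += y
    counts[(t, z)] += 1


def predict(t, z):
    """Predicted outcome for treatment t and confounder value z."""
    return sums[(t, z)] / counts[(t, z)]


# Step 2: predict every row's outcome under T=1 and under T=0, keeping its own z.
# Step 3: average the difference over the whole sample.
ate = sum(predict(1, z) - predict(0, z) for z, _, _ in data) / len(data)

# Naive comparison for contrast: treated mean minus untreated mean, ignoring z.
treated = [y for _, t, y in data if t == 1]
untreated = [y for _, t, y in data if t == 0]
naive = sum(treated) / len(treated) - sum(untreated) / len(untreated)

print(f"g-computation estimate: {ate:.2f}, naive estimate: {naive:.2f}")
```

The g-computation estimate should land close to the simulated effect, while the naive difference is inflated because treated rows are more likely to have `z = 1`.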
57+
58+
59+
60+
61+
20221111 - pydata nyc 2022 - serving pytorch models in production
62+
- walmart
63+
- tf models used
64+
- java shop, so serve models using JNI (java native interface)
65+
- wanted to use BERT in pytorch
66+
67+
- data used
68+
- amazon berkeley objects data
69+
70+
- optimizing models for production
71+
- post training optimization -> quantization
72+
- store tensors at lower bit precision; reduces memory use and speeds up inference
73+
- hardware int8 computations 2 to 4x faster than fp32 computations
74+
- ie
75+
model_int8 = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)  # model: a trained fp32 torch.nn.Module
76+
-
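
To make the "lower bit precision" idea concrete, here is a rough pure-Python sketch of symmetric per-tensor int8 quantization. It is not how PyTorch implements it internally; the helper names and the sample weights are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one float scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.82, -1.27, 0.003, 0.5, -0.31]   # hypothetical fp32 weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, scale, max_error)
```

Each weight is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per value.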
77+
78+
79+
80+
20221111 - pydata nyc 2022 - DL for time series analysis
81+
- code
82+
- https://www.kaggle.com/isaacmg/code
83+
84+
- history of time series DL
85+
- 2015 - vanilla lstm
86+
- 2017 - GRUIs, DA-RNN
87+
- 2019 - transformers
88+
- 2020 - emergence of DNN
89+
90+
- forecasting, trying to determine feature variable at a future time step
91+
- classification, trying to assign label to sequence of time steps
92+
- time series analysis - assign 0/1 label to sequence of time steps
93+
94+
- pain points
95+
- how to incorporate additional information
96+
97+
- time series forecasting industry is fractured
98+
99+
- stack overflow q (igodfried)
100+
- training loss is NaN in keras NN
101+
102+
103+
- q/a session
104+
- need 50k rows for good
105+
- for transformers, can spend days on hyperparameter searching
106+
- harder to do transfer learning for time series NNs than for NLP
107+
108+
109+
20221111 - pydata nyc 2022 - dask tutorial
110+
- code
111+
- https://github.com/mrocklin/dask-tutorial
112+
113+
- avocado forecast flow example
114+
- https://www.kaggle.com/code/isaacmg/avocado-price-forecasting-with-flow-forecast-ff
115+
- use 4 past time steps to forecast one time step
116+
- use https://wandb.ai/site for logging model training
117+
- can get figures of loss rate, etc
118+
- avocado GRU example, probabilistic model
119+
- https://www.kaggle.com/code/isaacmg/probablistic-gru-avocado-price-forecast
120+
- avocado multi region transformer
121+
- https://www.kaggle.com/code/isaacmg/multi-region-transformer
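
A small sketch of the "4 past time steps to forecast one time step" setup as a sliding window. This is plain Python rather than the flow forecast API, and `make_windows` plus the price values are hypothetical.

```python
def make_windows(series, history=4, horizon=1):
    """Split a series into (past, future) pairs using a sliding window."""
    windows = []
    for start in range(len(series) - history - horizon + 1):
        past = series[start:start + history]                          # model input
        future = series[start + history:start + history + horizon]    # forecast target
        windows.append((past, future))
    return windows


prices = [1.10, 1.15, 1.08, 1.20, 1.25, 1.22, 1.30]  # made-up weekly avocado prices
pairs = make_windows(prices)

for past, future in pairs:
    print(past, "->", future)
```

Each pair would become one training example, with `history` and `horizon` controlling how far the model looks back and ahead.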
122+
123+
124+
20221110 - pydata nyc interesting for DCC
125+
126+
- products
127+
- DVC for data version control
128+
129+
130+
131+
132+
20221110 - pydata nyc 2022 - ML at scale for finances (quansight team)
133+
- look at
134+
- amd ROCM similar to nvidia CUDA
135+
- dask kubernetes
136+
137+
138+
- gpu ecosystem
139+
- pytorch, tf, numba, dask, rapids, heavy.ai, cuDF, blazingsql (deprecated in favor of dask-sql)
140+
141+
- lessons
142+
- io ops from gpu to host are very expensive
143+
- python cuda ecosystem is great
144+
145+
- libs
146+
- dask
147+
- prefect data workflow orchestration tool
148+
- argo workflows
149+
- specifically for kubernetes?
150+
151+
152+
153+
154+
155+
156+
157+
20221110 - pydata nyc 2022 - data and model version control in drug discovery pipelines
158+
- barreto-ojeda, cyclica inc
159+
160+
- numbers
161+
- genomics, 25k genes
162+
- rna, transcriptomics 1m transcripts
163+
- protein, proteomics 20m protein (from alphafold)
164+
- also meta published 1m proteins
165+
- metabolites, metabolomics 5k
166+
167+
- data
168+
- low number of observations (samples), high number of variables (features)
169+
- ie: 1 sample can get 100 tumors
170+
- very high dimensional data
171+
- protocols are not always reproducible
172+
- research based data
173+
- more data on hot topics
174+
- complex biological data
175+
- dissimilar (diverse format and contents)
176+
- imbalanced (more data for a given feature)
177+
- redundant
178+
- sparse (lacks annotations)
179+
180+
- DVC (data version control)
181+
- open source, works with all git providers (github, gitlab, bitbucket)
182+
183+
184+
- code
185+
- pip list | grep dvc
186+
187+
- to display chem structure
188+
import pandas as pd
189+
data = pd.read_csv('./data/initial_data.csv')
190+
data.head()
191+
from rdkit import Chem
192+
from rdkit.Chem.Draw import IPythonConsole
193+
IPythonConsole.drawOptions.addAtomIndices = True
194+
IPythonConsole.molSize = 400,400
195+
196+
mol = Chem.MolFromSmiles(data['smiles'][1])
197+
mol
198+
import pubchempy as pcp
199+
200+
id_ = pcp.get_compounds(data['smiles'][1], 'smiles')[0]
201+
id_.synonyms[0]
202+
203+
204+
- dvc steps
205+
- dvc add
206+
- dvc dag -> will print dependencies
207+
208+
209+
210+
2211
20221110 - pydata nyc 2022 - workflow engines by sanjay from Akasa
3212
- akasa company
4213
- number 1 cause of bankruptcy is medical bills
@@ -7,6 +216,35 @@
7216

8217
- use ML to automate tasks
9218

219+
- two types of automation
220+
- workflow automation
221+
- move data from EHR to insurance UI
222+
- read response from website UI
223+
- update state of the claim in the EHR
224+
- information extraction
225+
226+
- saying in healthcare
227+
- if you've seen one hospital, you've seen one hospital
228+
229+
- technical
230+
- the workflow steps are async
231+
- also not safe to redo steps if a later step fails
232+
- with 10 steps, if step 5 fails, it is not necessarily safe to redo steps 1 to 4
233+
- do replayable workflow steps by having the step issue a token for its request
234+
- if restarted from the start, identical tokens will just return data from cache, not rerunning the step
235+
- steps implemented in functions that are decorated with @workflow.task
236+
- the decorator pushes the function onto a queue
237+
- with workers getting workflow tasks from queue, never blocking
238+
- workers pick up other tasks that are queued up while waiting for the other tasks to complete
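
A rough single-process sketch of the replayable-step idea above, assuming an in-memory dict as the cache and a deque as the queue. The `task` decorator, `run_worker`, and `submit_claim` are hypothetical stand-ins, not Akasa's actual framework; a real system would use a durable store and separate worker processes.

```python
import hashlib
import json
from collections import deque

_results = {}      # token -> cached result (stands in for a durable store)
_queue = deque()   # pending (func, args) calls waiting for a worker
calls = []         # records real executions, only to show the effect of caching


def task(func):
    """Hypothetical decorator: calling the step enqueues it instead of running it."""
    def submit(*args):
        _queue.append((func, args))
    submit.__name__ = func.__name__
    return submit


def _token(func, args):
    """Derive a stable token from the step name and its arguments."""
    payload = json.dumps([func.__name__, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_worker():
    """Drain the queue; a step whose token is already cached is not rerun."""
    out = []
    while _queue:
        func, args = _queue.popleft()
        token = _token(func, args)
        if token not in _results:
            _results[token] = func(*args)
        out.append(_results[token])
    return out


@task
def submit_claim(claim_id):
    calls.append(claim_id)  # side effect that must not happen twice
    return f"submitted:{claim_id}"


submit_claim("c-1")
first = run_worker()

submit_claim("c-1")   # workflow restarted from the start
second = run_worker()

print(first, second, calls)
```

The second run returns the cached result without repeating the side effect, which is what makes restarting from step 1 safe.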
239+
240+
241+
- AWS example
242+
- when starting an EC2 instance
243+
- create network int, EBS FS, etc
244+
245+
246+
247+
10248

11249
20221110 - pydata nyc 2022 - stitchfix feature engineering framework
12250
- book
