244 changes: 244 additions & 0 deletions Notes/ML/2020mlScratchPad.txt
@@ -1,4 +1,248 @@

20221110 - pydata nyc 2022 - workflow engines by sanjay from Akasa
- akasa company
- number 1 cause of bankruptcy is medical bills
- $266 billion is spent on wasteful administrative medical work
- company tries to automate away this work

- use ML to automate tasks


20221110 - pydata nyc 2022 - stitchfix feature engineering framework
- book
- marshall goldsmith - what got you here won't get you there

- https://github.com/stitchfix/hamilton

- hamilton framework (minimal sketch below)
- has a visualization plug-in to view the DAG it creates
- can run hamilton on spark, dask, ray
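- a minimal hamilton sketch (my own, not from the talk; assumes the hamilton package and made-up spend/signups inputs) - function names become DAG nodes, parameter names become the edges

    # transforms.py - each function defines one output column
    import pandas as pd

    def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
        # hamilton wires `spend` and `signups` in by parameter name
        return spend / signups

    # run.py - build the DAG from the module and execute it
    import pandas as pd
    from hamilton import driver
    import transforms

    dr = driver.Driver({}, transforms)  # {} is the (empty) config
    out = dr.execute(
        ["spend_per_signup"],
        inputs={"spend": pd.Series([10.0, 20.0]), "signups": pd.Series([1, 4])},
    )
    print(out)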



20221109 - pydata nyc 2022 - shiny for python
- libs
- astropy -> for astronomy

- but what about streamlit?
- streamlit re-executes everything on each interaction
- has some clever caching
- great for simple or small apps
- for moderately ambitious apps, this becomes a liability

- shiny for ambitious apps (minimal example below)

- https://shiny.rstudio.com/py


- https://github.com/jcheng5/PyDataNYC2022-demos
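- a minimal shiny for python sketch (my own, not from the demos repo; assumes the shiny package) - only the reactive function re-runs on interaction, not the whole script

    from shiny import App, render, ui

    app_ui = ui.page_fluid(
        ui.input_slider("n", "N", min=1, max=100, value=50),
        ui.output_text("doubled"),
    )

    def server(input, output, session):
        @output
        @render.text
        def doubled():
            # re-executes only when input.n changes
            return f"n * 2 = {input.n() * 2}"

    app = App(app_ui, server)  # run with: shiny run app.py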





20221109 - pydata nyc 2022 - graph based ML ransomware detection
- ransomware
- phishing
- software bugs
- brute force credential attack

- $1.2 billion in 2021 vs $416 million in 2020
- conti group biggest actor, $25 million per attack

- data
- use 32 threads to encrypt
- speed over stealth
- ransomware will typically sit in the network for about a month, so there are ~30 days to detect and respond before damage starts
- once on a computer, attackers can see what other accounts have been used on that computer
- msticpy - security tool developed by Microsoft
- attackers usually go after privileged accounts, e.g. ones with backup privileges
- can start to do damage from there

- examples
- irish healthcare service
- $100 million cost to get back up
- didn't pay ransom
- some people used pen and paper for 35 days

- approach
- connect events as graph
- calculate centrality

- code (cleaned up; assumes df is a pandas DataFrame of events with source/target columns)
    import networkx as nx
    import matplotlib.pyplot as plt

    # build an event graph, one edge per source/target pair
    g = nx.from_pandas_edgelist(df, source="source", target="target")

    # nx.draw renders via matplotlib, so show with plt, not nx
    nx.draw(g, with_labels=True, font_size=10)
    plt.show()

    # score each node by closeness centrality to find pivot points
    centrality = nx.closeness_centrality(g)

- Msticpy library
- productionize with a streaming graph db -> surrealdb

- new surrealdb product to look at

- recommendations
- multi factor auth
- identity analytics (acct creation, permissions)
- response process for security logging, monitoring, alerts
- patch management
- supply chain












- met william, ex LMT worker, air force vet, philly based
- met matt, speaker, texas based

20221109 - pydata nyc 2022 - weather impact models
- why predict weather
- crops, stores, transportation, aviation, predicting product mix
- reduce utility downtime
- can schedule crews ahead of time

- data collection
- weather observations
- infrastructure density
- seasonal foliage, vegetation

- feature engineering/feature creation

- for utilities
- need to predict multiple times as the storm path is updated
- model predicts the number of outages in the next 72 hours
- weather data is at 4x4 km granularity

- for creating training data
- outages rolled up per 24 hours
- aggregate weather for the same 24 hours
- use pyspark as the engine (rough sketch below)
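- a rough pyspark sketch of the 24-hour roll-up (my reconstruction; table and column names are assumptions)

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("outage-training-data").getOrCreate()

    # hypothetical hourly weather table, one row per 4x4 km grid cell per hour
    weather = spark.read.parquet("weather_hourly.parquet")

    daily_weather = (
        weather
        .withColumn("day", F.date_trunc("day", F.col("observed_at")))
        .groupBy("grid_cell", "day")
        .agg(
            F.max("wind_speed").alias("wind_max"),
            F.avg("wind_speed").alias("wind_avg"),
            F.max("wind_gust").alias("gust_max"),
            F.sum("precip").alias("precip_cum"),
            F.avg("temp").alias("temp_avg"),
        )
    )
    # join against outages rolled up over the same 24h windows to get labels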

- model features
- wind - max, avg, min, max wind gust, min wind gust
- precipitation, snow, ice - max, avg cumulative, ice, snow density
- temps - max, avg, avg wet bulb temp
- infrastructure - length of line, ...

- problems
- data quality - high outages recorded after the event
- due to maintenance needed after the storm
- lag in outage reports as people come back/get service back
- very common for big storms
- data is confusing to the model, so remove the long tail from training
- also cases with lots of outages that are not weather related
- not needed for the model, so remove those outage calls
- outage not weather dependent
- low outages despite significant weather
- removed from training data

- outage data challenges
- wide range and sparsity
- extreme weather rarely happens, so data is sparse

- models
- best model is random forest
- measure routine false-alarm days
- use pareto boundary analysis for model selection (sketch below)
- clients are more interested in medium-storm prediction than large-storm prediction
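- sketch of a pareto boundary for model selection (my own illustration, not the speaker's code): keep only models no other model beats on both metrics, e.g. false-alarm rate vs prediction error

    import numpy as np

    def pareto_front(points: np.ndarray) -> np.ndarray:
        # boolean mask of non-dominated rows; lower is better on both columns
        keep = np.ones(len(points), dtype=bool)
        for i, p in enumerate(points):
            if keep[i]:
                # drop rows worse-or-equal on both metrics, strictly worse on one
                worse = np.all(points >= p, axis=1) & np.any(points > p, axis=1)
                keep[worse] = False
        return keep

    models = np.array([[0.10, 0.30], [0.20, 0.25], [0.15, 0.40]])
    print(pareto_front(models))  # [ True  True False]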

- explaining predicted outages
- shapley values (shap) - see sketch after this list
- helps determine what each feature contributes to a prediction
- need to remove correlated features before training
- or else the SHAP values will be skewed
- local interpretable model-agnostic explanations (LIME)
- contrastive analysis
- sensitivity analysis
- tree analysis (to tune tree depth)
- helps identify the cause of overprediction
- remove the events/training data that created the bad tree predictions
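- minimal shap sketch for a random forest (my own, on synthetic data)

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))                      # stand-in weather features
    y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=500)   # stand-in outage counts

    model = RandomForestRegressor(n_estimators=100).fit(X, y)

    # TreeExplainer is exact and fast for tree ensembles
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # per-feature contribution to each prediction, summarized globally
    shap.summary_plot(shap_values, X)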

- scoring models using forecast weather data
- using pyspark to aggregate data for 24/48/72 hour windows
- PMML model (export/scoring sketch below)

- look up
- PMML model
- https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
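- hedged sketch of PMML export and scoring (my own; assumes the sklearn2pmml and pypmml packages, neither named in the talk)

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    X, y = make_regression(n_samples=200, n_features=5, random_state=0)
    pipeline = PMMLPipeline([("rf", RandomForestRegressor(n_estimators=50))])
    pipeline.fit(X, y)
    sklearn2pmml(pipeline, "outage_model.pmml")  # portable model artifact

    # scoring side (e.g. inside a spark job) could then use pypmml:
    # from pypmml import Model
    # model = Model.load("outage_model.pmml")
    # model.predict({"x1": ..., "x2": ...})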








20221109 - pydata nyc 2022 - ML system deployment with kubernetes
- prod architecture
- kubernetes is a docker container management system
- can use a clusterip service to ping individual pods
- even if they share a VM/IP
- how to expose kubernetes service to outside
- use a nodeport service (python client sketch below)
- can send http requests to any worker
- a loadbalancer service creates a load balancer outside kubernetes
- it balances across nodes/VMs, not pods as kubernetes does
- knative service
- does automatic scaling, creates pods as needed
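- sketch of creating a nodeport service with the official kubernetes python client (my own example; names and ports are made up)

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside a pod

    service = client.V1Service(
        metadata=client.V1ObjectMeta(name="model-service"),
        spec=client.V1ServiceSpec(
            type="NodePort",            # or "LoadBalancer" / "ClusterIP"
            selector={"app": "model"},  # routes to pods carrying this label
            ports=[client.V1ServicePort(port=80, target_port=8080)],
        ),
    )
    client.CoreV1Api().create_namespaced_service(namespace="default", body=service)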


- resources
- watch the kubernetes youtube video by honeypot
- alternatives to kubernetes
- mesos, swarm, dockerflow?



20221109 - pydata nyc shopify ML
- shopify system
- catalog management
- product categorization
- fraud protection
- customer acquisition
- inventory
- sales
- finance
- POS
- p

- models
- multilingual BERT for text
- MobileNet v2 for images

- model arch (pytorch-style sketch below)
- bert and mobilenet run in parallel
- get text embeddings and image embeddings
- feed both into a perceptron layer
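- rough pytorch sketch of that fusion head (my reconstruction; 768/1280 are the usual BERT/MobileNetV2 embedding sizes, everything else is assumed)

    import torch
    import torch.nn as nn

    class ProductClassifier(nn.Module):
        def __init__(self, text_dim=768, image_dim=1280, hidden=256, n_classes=5000):
            super().__init__()
            # perceptron head on top of the concatenated embeddings
            self.head = nn.Sequential(
                nn.Linear(text_dim + image_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_classes),
            )

        def forward(self, text_emb, image_emb):
            # text_emb from multilingual BERT, image_emb from MobileNetV2
            return self.head(torch.cat([text_emb, image_emb], dim=-1))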

- inference architecture (toy cache sketch below)
- cache the embeddings of text and images as they come through
- then send to the next layer
- that way, can look up the cache and don't need to recompute embeddings
- savings in compute time and number of VMs
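- toy sketch of the embedding cache idea (my own; embed_text is a stand-in for the expensive BERT forward pass)

    import hashlib

    _cache: dict[str, list[float]] = {}

    def embed_text(text: str) -> list[float]:
        # stand-in for the real model call
        return [float(len(text))]

    def cached_embedding(text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in _cache:               # only embed unseen text
            _cache[key] = embed_text(text)
        return _cache[key]                  # repeat lookups skip the model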

- how to measure performance and impact
- also measure how it is used by merchants

- ml for customer acquisition
- go from product categories -> buyer pool -> audience set from ML -> ad platform
- use FB and Google ads, but replace their targeting engines

- attribute extraction for products
- problem: attributes are specific to product categories
- had domain experts annotate product categories



20201223 - pipeline work
- wrote test pipeline script using CA housing data
2 changes: 1 addition & 1 deletion Notes/Requirements/pySparkRequirements.txt
@@ -1,3 +1,3 @@
 pkg-resources==0.0.0
 py4j==0.10.9
-pyspark==3.0.0
+pyspark==3.2.2
2 changes: 1 addition & 1 deletion Notes/Requirements/tf2_37.txt
@@ -63,7 +63,7 @@ pyasn1-modules==0.2.8
 Pygments==2.6.1
 pyparsing==2.4.7
 pyrsistent==0.15.7
-pyspark==2.4.5
+pyspark==3.2.2
 python-dateutil==2.8.1
 pytz==2019.3
 pyzmq==19.0.0