From e3130d6068491e6b2aca7ead6ac1137c10160a55 Mon Sep 17 00:00:00 2001
From: Marc Duby <mduby@wm4ac-cf8.broadinstitute.org>
Date: Wed, 9 Nov 2022 11:42:12 -0500
Subject: [PATCH 1/4] notes: adding pydata nyc 2022 day 01 talks

---
 Notes/ML/2020mlScratchPad.txt | 66 +++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
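
Note on the "inference architecture" bullets in this patch: the caching idea (store each text and image embedding once, then look it up instead of re-running the encoder) can be sketched in a few lines. This is only an illustration under stated assumptions, not Shopify's implementation: `EmbeddingCache`, `toy_text_encoder`, `toy_image_encoder` and `features` are hypothetical names, and the toy encoders stand in for BERT and MobileNet.

```python
from typing import Callable, Dict, List, Tuple

Vector = Tuple[float, ...]


class EmbeddingCache:
    """Memoize embeddings by input so repeated inputs skip the encoder."""

    def __init__(self, encoder: Callable[[str], Vector]) -> None:
        self._encoder = encoder            # stand-in for BERT or MobileNet
        self._store: Dict[str, Vector] = {}
        self.misses = 0                    # number of real encoder calls

    def get(self, item: str) -> Vector:
        if item not in self._store:
            self.misses += 1
            self._store[item] = self._encoder(item)
        return self._store[item]


def toy_text_encoder(text: str) -> Vector:
    # Hypothetical encoder: not a real model, just deterministic numbers.
    return (float(len(text)), float(sum(map(ord, text)) % 97))


def toy_image_encoder(image_id: str) -> Vector:
    # Hypothetical encoder keyed by an image identifier.
    return (float(len(image_id)) * 0.5,)


text_cache = EmbeddingCache(toy_text_encoder)
image_cache = EmbeddingCache(toy_image_encoder)


def features(title: str, image_id: str) -> List[float]:
    """Concatenate cached text and image embeddings for the next layer."""
    return list(text_cache.get(title)) + list(image_cache.get(image_id))


first = features("red cotton shirt", "img-001")
second = features("red cotton shirt", "img-001")   # served from the cache
```

The `misses` counter makes the saving visible: the second call returns the same feature vector without invoking either encoder again.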

diff --git a/Notes/ML/2020mlScratchPad.txt b/Notes/ML/2020mlScratchPad.txt
index 012252f..e6bc4e3 100644
--- a/Notes/ML/2020mlScratchPad.txt
+++ b/Notes/ML/2020mlScratchPad.txt
@@ -1,5 +1,71 @@
 
 
+
+
+
+20221109 - pydata nyc 2022 
+- prod architecture 
+  - kubernetes is a docker container management system
+  - can use cluster ip service to ping individual pods
+    - even if they share a VM/IP
+  - how to expose kubernetes service to outside 
+    - use nodeport service 
+    - can send http requests to any of the workers
+  - loadbalancer service creates a loadbalancer outside kubernetes
+    - it balances across nodes/VMs, not across pods as kubernetes does
+  - knative service
+    - does automatic scaling, creates pods as needed
+
+
+- resources 
+  - watch kubernetes youtube video by honeypot
+  - alternatives to kubernetes
+    - mesos, swarm, dockerflow?
+
+
+
+20221109 - pydata nyc shopify ML 
+- shopify system
+  - catalog management
+    - product categorization 
+  - fraud protection
+  - customer acquisition 
+  - inventory
+  - sales
+  - finance 
+  - sales 
+  - POS 
+  - p
+
+- models 
+  - multilingual bert for text
+  - mobilenet v2
+
+- model arch 
+  - bert and mobilenet in parallel
+    - get text embeddings and image embeddings
+  - feed both into perceptron layer 
+
+- inference architecture
+  - cache the embeddings of text and images as they come through
+  - then send to the next layer 
+  - that way, can look up cache and don't need to run embeddings again 
+  - get savings in compute time and in VM usage
+
+- how to measure performance and impact 
+  - also measure how much it is used by merchants
+
+- ml for customer acquisition
+  - go from product categories -> buyer pool -> audience set from ML -> ad platform
+  - use FB and Google ads, but replace their targeting engines
+
+- attribute extraction for products
+  - problem: attributes are specific to product categories
+  - had domain experts annotate product categories
+
+- 
+
+
 20201223 - pipeline work
 - wrote test pipeline script using CA housing data
   - standard scalar didn't help with R2 score 

From 137965b11ee26b47a68c2a2fe30e69f82da6f333 Mon Sep 17 00:00:00 2001
From: Marc Duby <mduby@wm4ac-cf8.broadinstitute.org>
Date: Wed, 9 Nov 2022 16:03:45 -0500
Subject: [PATCH 2/4] notes: adding pydata nyc 2022 day 01 afternoon talks

---
 Notes/ML/2020mlScratchPad.txt | 139 +++++++++++++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 1 deletion(-)
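
Note on the "approach" and "code" bullets in this patch (connect events as a graph, then calculate centrality): the sketch below shows what a closeness centrality score measures, using only the standard library. The host and account names are made up for illustration, and the scaling by the reachable fraction is my understanding of the default behaviour of `nx.closeness_centrality`, not something stated in the talk.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Turn (source, target) event pairs into an undirected adjacency list."""
    adj: Dict[str, List[str]] = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def closeness(adj: Dict[str, List[str]]) -> Dict[str, float]:
    """Closeness = reachable nodes / sum of shortest-path distances,
    scaled by the fraction of the graph that is reachable."""
    n = len(adj)
    scores: Dict[str, float] = {}
    for start in adj:
        # Breadth-first search gives shortest hop counts from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        reachable = len(dist) - 1
        total = sum(dist.values())
        if total > 0 and n > 1:
            scores[start] = (reachable / total) * (reachable / (n - 1))
        else:
            scores[start] = 0.0
    return scores


# Hypothetical logon events: (machine or account, account or machine).
events = [
    ("ws-01", "svc-backup"),
    ("ws-02", "svc-backup"),
    ("ws-03", "svc-backup"),
    ("ws-03", "ws-04"),
]
scores = closeness(build_adjacency(events))
most_central = max(scores, key=scores.get)
```

In this toy graph the shared backup account comes out as the most central node, which matches the note that privileged backup accounts are where damage can start.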

diff --git a/Notes/ML/2020mlScratchPad.txt b/Notes/ML/2020mlScratchPad.txt
index e6bc4e3..8fb50cc 100644
--- a/Notes/ML/2020mlScratchPad.txt
+++ b/Notes/ML/2020mlScratchPad.txt
@@ -1,9 +1,146 @@
 
 
+20221109 - pydata nyc 2022 - graph based ML ransomware detection 
+- ransomware
+  - phishing
+  - software bugs 
+  - brute force credential attack 
 
+- $1.2 billion in 2021 vs $416 million in 2020
+- conti group biggest actor, $25 million per attack 
 
+- data
+  - use 32 threads to encrypt 
+    - speed over stealth 
+  - ransomware will be in network for a month, so have 30 days to fix before issues 
+  - once on computer, you can see what other accounts have been used on that computer
+  - msticpy - security tool developed by Microsoft
+  - usually with backup privileges, privileged acct
+    - can start to do damage from here
 
-20221109 - pydata nyc 2022 
+- examples
+  - irish healthcare 
+    - $100 million cost to get back up
+    - didn't pay ransom 
+    - some people used pen/paper for 35 days
+
+- approach 
+  - connect events as graph 
+  - calculate centrality
+
+- code 
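+  import networkx as nx
+  import matplotlib.pyplot as plt  # nx has no show(); matplotlib renders the plot
+  g = nx.from_pandas_edgelist(df, source, target)  # df: one row per event/edge
+  nx.draw(g, with_labels=True, font_size=1)
+  plt.show()
+  nx.closeness_centrality(g)  # dict of node -> closeness score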
+
+- msticpy library
+  - productionize with streaming graph db -> surrealdb
+
+- new surrealdb product to look at 
+
+- recommendations 
+  - multi factor auth 
+  - identity analytics (acct creation, permissions)
+  - response process for security logging, monitoring, alerts
+  - patch management 
+  - supply chain 
+
+
+
+
+
+
+
+
+
+
+
+
+- met william, ex LMT worker, air force vet, philly based 
+  - met matt, speaker, texas based 
+
+20221109 - pydata nyc 2022 - weather impact models
+- why predict weather 
+  - crops, stores, transportation, aviation, predict product mix
+  - reduce utility downtime 
+    - can schedule crews ahead of time
+
+- data collection 
+  - weather observations
+  - infrastructure density 
+  - seasonal foliage, vegetation
+
+- feature engineering/feature creation 
+
+- for utilities 
+  - need to predict multiple times as storm path is being updated
+  - model predicts number of outages in the next 72 hours
+  - weather data is at 4x4 km granularity 
+
+- for creating training data
+  - outages rolled up for 24 hours
+  - aggregate weather for 24 hours
+  - use pyspark as engine 
+
+- model features
+  - wind - max, avg, min, max wind gust, min wind gust
+  - precipitation, snow, ice - max, avg cumulative, ice, snow density
+  - temps - max, avg, avg wet bulb temp
+  - infrastructure - length of line, ...
+
+- problem 
+  - data quality - high outages recorded past the event
+    - due to maintenance needed after the storm
+    - lag in outage report as people come back/get service back 
+    - very common for big storms 
+    - data confusing to model, so remove the long tail from model 
+  - also if lots of outages but not weather related
+    - so not needed for model, so remove outage calls
+    - outage not weather dependent
+  - low outages, significant weather 
+    - removed 
+
+- outage data challenges 
+  - wide range and sparsity 
+  - extreme weather rarely happens, so sparse
+
+- models 
+  - best model is random forest 
+  - routine false alarm days measurement 
+  - use pareto boundary analysis for model selection 
+  - clients are more interested in medium storm prediction than large storm prediction
+
+- explaining predicted outages
+  - shapley values (shap)
+    - helps to determine what each feature contributes to the prediction
+    - need to remove correlated features before training
+      - or else SHAP results will be skewed
+  - local interpretable model agnostic explanations (LIME)
+  - contrastive analysis
+  - sensitivity analysis
+  - tree analysis (to tune for tree depth)
+    - to help identify the cause of overprediction
+    - remove events/training data that created the bad tree predictions
+
+- scoring models using forecast weather data 
+  - using pyspark to aggregate data for 24, 48 and 72 hours
+  - PMML model 
+
+- look up 
+  - PMML model 
+    - https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
+
+
+
+
+
+
+
+
+20221109 - pydata nyc 2022 - ML system deployment with kubernetes
 - prod architecture 
   - kubernetes is a docker container management system
   - can use cluster ip service to ping individual pods

From 9ead65a3dd3fcbf2e878e519e5eae83bff985843 Mon Sep 17 00:00:00 2001
From: Marc Duby <mduby@wm4ac-cf8.broadinstitute.org>
Date: Thu, 10 Nov 2022 11:55:42 -0500
Subject: [PATCH 3/4] notes: adding pydata nyc 2022 day 02 morning talks

---
 Notes/ML/2020mlScratchPad.txt | 41 +++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
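
Note on the "hamilton framework" bullets in this patch: Hamilton builds its DAG from plain Python functions, where the function name is the output node and the parameter names are its upstream nodes. The sketch below imitates that idea with the standard library only. It is not Hamilton's API: `execute`, `dag_edges` and the feature names are hypothetical, and there is no cycle detection.

```python
import inspect
from typing import Any, Callable, Dict, List, Tuple


def spend_doubled(spend: float) -> float:
    """Node 'spend_doubled' depends on the input 'spend'."""
    return spend * 2.0


def spend_per_signup(spend_doubled: float, signups: float) -> float:
    """Node 'spend_per_signup' depends on 'spend_doubled' and 'signups'."""
    return spend_doubled / signups


def dag_edges(funcs: List[Callable[..., Any]]) -> List[Tuple[str, str]]:
    """List (upstream, downstream) pairs derived from parameter names."""
    return sorted(
        (param, fn.__name__)
        for fn in funcs
        for param in inspect.signature(fn).parameters
    )


def execute(
    funcs: List[Callable[..., Any]],
    inputs: Dict[str, Any],
    wanted: List[str],
) -> Dict[str, Any]:
    """Resolve the requested nodes by recursively computing dependencies."""
    nodes = {fn.__name__: fn for fn in funcs}
    computed: Dict[str, Any] = dict(inputs)

    def resolve(name: str) -> Any:
        if name in computed:
            return computed[name]
        if name not in nodes:
            raise KeyError(f"no input or function provides {name!r}")
        fn = nodes[name]
        kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
        computed[name] = fn(**kwargs)
        return computed[name]

    return {name: resolve(name) for name in wanted}


FUNCS = [spend_doubled, spend_per_signup]
edges = dag_edges(FUNCS)
result = execute(FUNCS, {"spend": 100.0, "signups": 4.0}, ["spend_per_signup"])
```

Because the graph is implied by the function signatures, the edge list can be handed to any plotting tool, which is roughly what the visualization plug-in mentioned in the notes makes use of.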

diff --git a/Notes/ML/2020mlScratchPad.txt b/Notes/ML/2020mlScratchPad.txt
index 8fb50cc..e7afc78 100644
--- a/Notes/ML/2020mlScratchPad.txt
+++ b/Notes/ML/2020mlScratchPad.txt
@@ -1,4 +1,45 @@
 
+20221110 - pydata nyc 2022 - workflow engines by sanjay from Akasa
+- akasa company 
+  - number 1 cause of bankruptcy is medical bills 
+  - $266 billion is wasteful admin medical work
+  - company tries to automate away this work
+
+- use ML to automate tasks 
+
+
+20221110 - pydata nyc 2022 - stitchfix feature engineering framework
+- book
+  - marshall goldsmith - what got you here won't get you there 
+
+- https://github.com/stitchfix/hamilton
+
+- hamilton framework
+  - has a visualization plug in to view DAG it creates
+  - can run hamilton on spark, dask, ray
+
+
+
+20221109 - pydata nyc 2022 - shiny for python
+- libs 
+  - astropy -> for astronomy watching
+
+- but what about streamlit?
+  - streamlit re-executes everything on each interaction
+  - some clever caching
+  - great for simple or small apps
+  - for moderately ambitious apps, it becomes a liability
+
+- shiny for ambitious apps 
+
+- https://shiny.rstudio.com/py 
+
+
+- https://github.com/jcheng5/PyDataNYC2022-demos
+
+
+
+
 
 20221109 - pydata nyc 2022 - graph based ML ransomware detection 
 - ransomware

From f3fa974b846b6611895c804eb41a4fd006d7f1b2 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Fri, 11 Nov 2022 07:51:05 +0000
Subject: [PATCH 4/4] build(deps): bump pyspark from 2.4.5 to 3.2.2 in
 /Notes/Requirements

Bumps [pyspark](https://github.com/apache/spark) from 2.4.5 to 3.2.2.
- [Release notes](https://github.com/apache/spark/releases)
- [Commits](https://github.com/apache/spark/compare/v2.4.5...v3.2.2)

---
updated-dependencies:
- dependency-name: pyspark
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
---
 Notes/Requirements/pySparkRequirements.txt | 2 +-
 Notes/Requirements/tf2_37.txt              | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Notes/Requirements/pySparkRequirements.txt b/Notes/Requirements/pySparkRequirements.txt
index c02ce7c..f2291ae 100644
--- a/Notes/Requirements/pySparkRequirements.txt
+++ b/Notes/Requirements/pySparkRequirements.txt
@@ -1,3 +1,3 @@
 pkg-resources==0.0.0
 py4j==0.10.9
-pyspark==3.0.0
+pyspark==3.2.2
diff --git a/Notes/Requirements/tf2_37.txt b/Notes/Requirements/tf2_37.txt
index f91b1ea..bc803fd 100644
--- a/Notes/Requirements/tf2_37.txt
+++ b/Notes/Requirements/tf2_37.txt
@@ -63,7 +63,7 @@ pyasn1-modules==0.2.8
 Pygments==2.6.1
 pyparsing==2.4.7
 pyrsistent==0.15.7
-pyspark==2.4.5
+pyspark==3.2.2
 python-dateutil==2.8.1
 pytz==2019.3
 pyzmq==19.0.0