From e3130d6068491e6b2aca7ead6ac1137c10160a55 Mon Sep 17 00:00:00 2001
From: Marc Duby
Date: Wed, 9 Nov 2022 11:42:12 -0500
Subject: [PATCH 1/4] notes: adding pydata nyc 2022 day 01 talks

---
 Notes/ML/2020mlScratchPad.txt | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Notes/ML/2020mlScratchPad.txt b/Notes/ML/2020mlScratchPad.txt
index 012252f..e6bc4e3 100644
--- a/Notes/ML/2020mlScratchPad.txt
+++ b/Notes/ML/2020mlScratchPad.txt
@@ -1,5 +1,71 @@
+
+
+
+20221109 - pydata nyc 2022
+- prod architecture
+  - kubernetes is a docker container management system
+  - can use cluster ip service to ping individual pods
+    - even if they share a VM/IP
+  - how to expose kubernetes service to outside
+    - use nodeport service
+      - can send http requests to any worker
+    - loadbalancer service creates a loadbalancer outside kubernetes
+      - it balances across nodes/VMs, not pods as kubernetes does
+  - knative service
+    - does automatic scaling, creates pods as needed
+
+
+- resources
+  - watch kubernetes youtube video by honeypot
+  - alternatives to kubernetes
+    - mesos, swarm, dockerflow?
+
+
+
+20221109 - pydata nyc shopify ML
+- shopify system
+  - catalog management
+  - product categorization
+  - fraud protection
+  - customer acquisition
+  - inventory
+  - sales
+  - finance
+  - sales
+  - POS
+  - p
+
+- models
+  - multilingual bert for text
+  - mobilenet v2
+
+- model arch
+  - bert and mobilenet in parallel
+  - get text embeddings and image embeddings
+  - feed both into perceptron layer
+
+- inference architecture
+  - cache the embeddings of text and images as they come through
+  - then send to the next layer
+  - that way, can look up cache and don't need to run embeddings again
+  - get savings from a compute time and VM perspective
+
+- how to measure performance and impact
+  - also measure how used by merchants
+
+- ml for customer acquisition
+  - go from product categories -> buyer pool -> audience set from ML -> ad platform
+  - use FB and Google ads, but replace their targeting engines
+
+- attribute extraction for products
+  - problem: attributes are specific to product categories
+  - had domain experts annotate product categories
+
+-
+
+
 20201223 - pipeline work
 - wrote test pipeline script using CA housing data
 - standard scaler didn't help with R2 score
From 137965b11ee26b47a68c2a2fe30e69f82da6f333 Mon Sep 17 00:00:00 2001
From: Marc Duby
Date: Wed, 9 Nov 2022 16:03:45 -0500
Subject: [PATCH 2/4] notes: adding pydata nyc 2022 day 01 afternoon talks

---
 Notes/ML/2020mlScratchPad.txt | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 1 deletion(-)

diff --git a/Notes/ML/2020mlScratchPad.txt b/Notes/ML/2020mlScratchPad.txt
index e6bc4e3..8fb50cc 100644
--- a/Notes/ML/2020mlScratchPad.txt
+++ b/Notes/ML/2020mlScratchPad.txt
@@ -1,9 +1,146 @@
+20221109 - pydata nyc 2022 - graph based ML ransomware detection
+- ransomware
+  - phishing
+  - software bugs
+  - brute force credential attack
+- $1.2 billion in 2021 vs $416 million in 2020
+- conti group biggest actor, $25 million per attack
+- data
+  - use 32 threads to encrypt
+    - speed over stealth
+  - ransomware will be in network for a month, so have 30 days to fix before issues
+  - once on computer, you can see what other accounts have been used on that computer
+    - msticpy - security tool developed by Microsoft
+    - usually with backup privileges, privileged acct
+    - can start to do damage from here
-20221109 - pydata nyc 2022
+- examples
+  - irish healthcare
+    - $100 million cost to get back up
+    - didn't pay ransom
+    - some people used pen/paper for 35 days
+
+- approach
+  - connect events as graph
+  - calculate centrality
+
+- code
+  import networkx as nx
+  import matplotlib.pyplot as plt
+  # df: pandas edge list; "source"/"target" column names are assumed
+  g = nx.from_pandas_edgelist(df, source="source", target="target")
+  nx.draw(g, with_labels=True, font_size=1)
+  plt.show()
+
+  nx.closeness_centrality(g)
+
+- Msticpy library
+  - productionize with streaming graph db -> surrealdb
+
+- new surrealdb product to look at
+
+- recommendations
+  - multi factor auth
+  - identity analytics (acct creation, permissions)
+  - response process to security logging, monitoring, alerts
+  - patch management
+  - supply chain
+
+
+
+
+
+
+
+
+
+
+
+
+- met william, ex LMT worker, air force vet, philly based
+  - met matt, speaker, texas based
+
+20221109 - pydata nyc 2022 - weather impact models
+- why predict weather
+  - crops, stores, transportation, aviation, predict product mix
+  - reduce utility downtime
+    - can schedule crews ahead of time
+
+- data collection
+  - weather observations
+  - infrastructure density
+  - seasonal foliage, vegetation
+
+- feature engineering/feature creation
+
+- for utilities
+  - need to predict multiple times as storm path is being updated
+  - model predicts number of outages in the next 72 hours
+  - weather data is at 4x4 km granularity
+
+- for creating training data
+  - outages rolled up for 24 hours
+  - aggregate weather for 24 hours
+  - use pyspark as engine
+
+- model features
+  - wind - max, avg, min, max wind gust, min wind gust
+  - precipitation, snow, ice - max, avg cumulative, ice, snow density
+  - temps - max, avg, avg wet bulb temp
+  - infrastructure - length of line, ...
+
+- problem
+  - data quality - high outages recorded past event
+    - due to maintenance needed after storm
+    - lag in outage report as people come back/get service back
+    - very common for big storms
+    - data confusing to model, so remove the long tail from model
+  - also if lots of outages but not weather related
+    - so not needed for model, so remove outage calls
+    - outage not weather dependent
+  - low outages, significant weather
+    - removed
+
+- outage data challenges
+  - wide range and sparsity
+  - extreme weather rarely happens, so sparse
+
+- models
+  - best model is random forest
+  - routine false alarm days measurement
+  - use pareto boundary analysis for model selection
+  - clients are more interested in medium storm prediction than large storm prediction
+
+- explaining predicted outages
+  - shapley values (SHAP)
+    - helps to determine what each feature contributes to the prediction
+    - need to remove correlated features before training
+      - or else SHAP results will be skewed
+  - local interpretable model agnostic explanations (LIME)
+  - contrastive analysis
+  - sensitivity analysis
+  - tree analysis (to tune for tree depth)
+    - to help identify the cause of overprediction
+    - remove events/training data that created the bad tree predictions
+
+- scoring models using forecast weather data
+  - using pyspark to aggregate data for 24/48 and 72 hours
+  - PMML model
+
+- look up
+  - PMML model
+    - https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
+
+
+
+
+
+
+
+
+20221109 - pydata nyc 2022 - ML system deployment with kubernetes
 - prod architecture
   - kubernetes is a docker container management system
   - can use cluster ip service to ping individual pods
From 9ead65a3dd3fcbf2e878e519e5eae83bff985843 Mon Sep 17 00:00:00 2001
From: Marc Duby
Date: Thu, 10 Nov 2022 11:55:42 -0500
Subject: [PATCH 3/4] notes: adding pydata nyc 2022 day 02 morning talks

---
 Notes/ML/2020mlScratchPad.txt | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/Notes/ML/2020mlScratchPad.txt b/Notes/ML/2020mlScratchPad.txt
index 8fb50cc..e7afc78 100644
--- a/Notes/ML/2020mlScratchPad.txt
+++ b/Notes/ML/2020mlScratchPad.txt
@@ -1,4 +1,45 @@
+20221110 - pydata nyc 2022 - workflow engines by sanjay from Akasa
+- akasa company
+  - number 1 cause of bankruptcy is medical bills
+  - $266 billion is wasteful admin medical work
+  - company tries to automate away this work
+
+- use ML to automate tasks
+
+
+20221110 - pydata nyc 2022 - stitchfix feature engineering framework
+- book
+  - marshall goldsmith - what got you here won't get you there
+
+- https://github.com/stitchfix/hamilton
+
+- hamilton framework
+  - has a visualization plugin to view the DAG it creates
+  - can run hamilton on spark, dask, ray
+
+
+
+20221109 - pydata nyc 2022 - shiny for python
+- libs
+  - astropy -> for astronomy watching
+
+- but what about streamlit?
+  - streamlit re-executes everything on each interaction
+    - some clever caching
+  - great for simple or small apps
+  - for moderately ambitious apps, becomes a liability
+
+- shiny for ambitious apps
+
+- https://shiny.rstudio.com/py
+
+
+- https://github.com/jcheng5/PyDataNYC2022-demos
+
+
+
+
 20221109 - pydata nyc 2022 - graph based ML ransomware detection
 - ransomware
From f3fa974b846b6611895c804eb41a4fd006d7f1b2 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Fri, 11 Nov 2022 07:51:05 +0000
Subject: [PATCH 4/4] build(deps): bump pyspark from 2.4.5 to 3.2.2 in /Notes/Requirements

Bumps [pyspark](https://github.com/apache/spark) from 2.4.5 to 3.2.2.
- [Release notes](https://github.com/apache/spark/releases)
- [Commits](https://github.com/apache/spark/compare/v2.4.5...v3.2.2)

---
updated-dependencies:
- dependency-name: pyspark
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot]
---
 Notes/Requirements/pySparkRequirements.txt | 2 +-
 Notes/Requirements/tf2_37.txt | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Notes/Requirements/pySparkRequirements.txt b/Notes/Requirements/pySparkRequirements.txt
index c02ce7c..f2291ae 100644
--- a/Notes/Requirements/pySparkRequirements.txt
+++ b/Notes/Requirements/pySparkRequirements.txt
@@ -1,3 +1,3 @@
 pkg-resources==0.0.0
 py4j==0.10.9
-pyspark==3.0.0
+pyspark==3.2.2
diff --git a/Notes/Requirements/tf2_37.txt b/Notes/Requirements/tf2_37.txt
index f91b1ea..bc803fd 100644
--- a/Notes/Requirements/tf2_37.txt
+++ b/Notes/Requirements/tf2_37.txt
@@ -63,7 +63,7 @@ pyasn1-modules==0.2.8
 Pygments==2.6.1
 pyparsing==2.4.7
 pyrsistent==0.15.7
-pyspark==2.4.5
+pyspark==3.2.2
 python-dateutil==2.8.1
 pytz==2019.3
 pyzmq==19.0.0
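
Reviewer sketch (not part of the patch series): the Shopify notes in patch 1/4 describe caching text and image embeddings so they are looked up rather than recomputed before the next layer. A minimal illustration of that idea might look like the following. Everything named here is an assumption for illustration only: `EmbeddingCache`, `toy_embed`, and `classify` are hypothetical, and `toy_embed` merely stands in for the BERT and MobileNet models, whose real interfaces the notes do not describe.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingCache:
    """Hypothetical cache: embeds each distinct payload once, keyed by content hash."""

    def __init__(self, embed_fn: Callable[[bytes], Vector]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, Vector] = {}
        self.hits = 0
        self.misses = 0

    def get(self, payload: bytes) -> Vector:
        key = hashlib.sha256(payload).hexdigest()
        if key in self._store:
            self.hits += 1  # cache lookup, no model call
        else:
            self.misses += 1  # first sighting: run the (expensive) embedding
            self._store[key] = self._embed_fn(payload)
        return self._store[key]


def toy_embed(payload: bytes) -> Vector:
    """Assumed stand-in for an expensive BERT / MobileNet forward pass."""
    return [len(payload) / 100.0, (sum(payload) % 97) / 97.0]


def classify(text_vec: Vector, image_vec: Vector) -> Vector:
    """Placeholder for the downstream perceptron layer: concatenates both inputs."""
    return text_vec + image_vec


text_cache = EmbeddingCache(toy_embed)
image_cache = EmbeddingCache(toy_embed)

# Two products share a title but have different images.
products = [(b"red shoes", b"\x01\x02"), (b"red shoes", b"\x03\x04")]
for title, image in products:
    features = classify(text_cache.get(title), image_cache.get(image))

print(text_cache.hits, text_cache.misses)    # the repeated title is served from cache
print(image_cache.hits, image_cache.misses)  # both images are new, so both are embedded
```

Keying on a content hash rather than a product id is one possible design choice: identical titles or images shared across products would then reuse one embedding, which is where the compute savings mentioned in the notes would come from.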