244 changes: 244 additions & 0 deletions Notes/ML/2020mlScratchPad.txt
@@ -1,4 +1,248 @@

20221110 - pydata nyc 2022 - workflow engines by sanjay from Akasa
- akasa company
- number 1 cause of bankruptcy is medical bills
- $266 billion is spent on wasteful administrative medical work
- company tries to automate away this work

- use ML to automate tasks


20221110 - pydata nyc 2022 - stitchfix feature engineering framework
- book
- marshall goldsmith - what got you here won't get you there

- https://github.com/stitchfix/hamilton

- hamilton framework (minimal sketch below)
- has a visualization plug-in to view the DAG it creates
- can run hamilton on spark, dask, ray
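- a minimal hamilton sketch (my own, not from the talk; assumes the hamilton package and made-up spend/signups inputs) - function names become DAG nodes, parameter names become the edges

    # transforms.py - each function defines one output column
    import pandas as pd

    def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
        # hamilton wires `spend` and `signups` in by parameter name
        return spend / signups

    # run.py - build the DAG from the module and execute it
    import pandas as pd
    from hamilton import driver
    import transforms

    dr = driver.Driver({}, transforms)  # {} is the (empty) config
    out = dr.execute(
        ["spend_per_signup"],
        inputs={"spend": pd.Series([10.0, 20.0]), "signups": pd.Series([1, 4])},
    )
    print(out)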



20221109 - pydata nyc 2022 - shiny for python
- libs
- astropy -> for astronomy

- but what about streamlit?
- streamlit re-executes everything on each interaction
- has some clever caching
- great for simple or small apps
- for moderately ambitious apps, this becomes a liability

- shiny for ambitious apps (minimal example below)

- https://shiny.rstudio.com/py


- https://github.com/jcheng5/PyDataNYC2022-demos
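- a minimal shiny for python sketch (my own, not from the demos repo; assumes the shiny package) - only the reactive function re-runs on interaction, not the whole script

    from shiny import App, render, ui

    app_ui = ui.page_fluid(
        ui.input_slider("n", "N", min=1, max=100, value=50),
        ui.output_text("doubled"),
    )

    def server(input, output, session):
        @output
        @render.text
        def doubled():
            # re-executes only when input.n changes
            return f"n * 2 = {input.n() * 2}"

    app = App(app_ui, server)  # run with: shiny run app.py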





20221109 - pydata nyc 2022 - graph based ML ransomware detection
- ransomware
- phishing
- software bugs
- brute force credential attack

- $1.2 billion in 2021 vs $416 million in 2020
- conti group biggest actor, $25 million per attack

- data
- use 32 threads to encrypt
- speed over stealth
- ransomware will typically sit in the network for about a month, so there are ~30 days to detect and respond before damage starts
- once on a computer, attackers can see what other accounts have been used on that computer
- msticpy - security tool developed by Microsoft
- attackers usually go after privileged accounts, e.g. ones with backup privileges
- can start to do damage from there

- examples
- irish healthcare service
- $100 million cost to get back up
- didn't pay ransom
- some people used pen and paper for 35 days

- approach
- connect events as graph
- calculate centrality

- code (cleaned up; assumes df is a pandas DataFrame of events with source/target columns)
    import networkx as nx
    import matplotlib.pyplot as plt

    # build an event graph, one edge per source/target pair
    g = nx.from_pandas_edgelist(df, source="source", target="target")

    # nx.draw renders via matplotlib, so show with plt, not nx
    nx.draw(g, with_labels=True, font_size=10)
    plt.show()

    # score each node by closeness centrality to find pivot points
    centrality = nx.closeness_centrality(g)

- Msticpy library
- productionize with a streaming graph db -> surrealdb

- new surrealdb product to look at

- recommendations
- multi factor auth
- identity analytics (acct creation, permissions)
- response process for security logging, monitoring, alerts
- patch management
- supply chain












- met william, ex LMT worker, air force vet, philly based
- met matt, speaker, texas based

20221109 - pydata nyc 2022 - weather impact models
- why predict weather
- crops, stores, transportation, aviation, predicting product mix
- reduce utility downtime
- can schedule crews ahead of time

- data collection
- weather observations
- infrastructure density
- seasonal foliage, vegetation

- feature engineering/feature creation

- for utilities
- need to predict multiple times as the storm path is updated
- model predicts the number of outages in the next 72 hours
- weather data is at 4x4 km granularity

- for creating training data
- outages rolled up per 24 hours
- aggregate weather for the same 24 hours
- use pyspark as the engine (rough sketch below)
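- a rough pyspark sketch of the 24-hour roll-up (my reconstruction; table and column names are assumptions)

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("outage-training-data").getOrCreate()

    # hypothetical hourly weather table, one row per 4x4 km grid cell per hour
    weather = spark.read.parquet("weather_hourly.parquet")

    daily_weather = (
        weather
        .withColumn("day", F.date_trunc("day", F.col("observed_at")))
        .groupBy("grid_cell", "day")
        .agg(
            F.max("wind_speed").alias("wind_max"),
            F.avg("wind_speed").alias("wind_avg"),
            F.max("wind_gust").alias("gust_max"),
            F.sum("precip").alias("precip_cum"),
            F.avg("temp").alias("temp_avg"),
        )
    )
    # join against outages rolled up over the same 24h windows to get labels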

- model features
- wind - max, avg, min, max wind gust, min wind gust
- precipitation, snow, ice - max, avg cumulative, ice, snow density
- temps - max, avg, avg wet bulb temp
- infrastructure - length of line, ...

- problems
- data quality - high outages recorded after the event
- due to maintenance needed after the storm
- lag in outage reports as people come back/get service back
- very common for big storms
- data is confusing to the model, so remove the long tail from training
- also cases with lots of outages that are not weather related
- not needed for the model, so remove those outage calls
- outage not weather dependent
- low outages despite significant weather
- removed from training data

- outage data challenges
- wide range and sparsity
- extreme weather rarely happens, so data is sparse

- models
- best model is random forest
- measure routine false-alarm days
- use pareto boundary analysis for model selection (sketch below)
- clients are more interested in medium-storm prediction than large-storm prediction
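- sketch of a pareto boundary for model selection (my own illustration, not the speaker's code): keep only models no other model beats on both metrics, e.g. false-alarm rate vs prediction error

    import numpy as np

    def pareto_front(points: np.ndarray) -> np.ndarray:
        # boolean mask of non-dominated rows; lower is better on both columns
        keep = np.ones(len(points), dtype=bool)
        for i, p in enumerate(points):
            if keep[i]:
                # drop rows worse-or-equal on both metrics, strictly worse on one
                worse = np.all(points >= p, axis=1) & np.any(points > p, axis=1)
                keep[worse] = False
        return keep

    models = np.array([[0.10, 0.30], [0.20, 0.25], [0.15, 0.40]])
    print(pareto_front(models))  # [ True  True False]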

- explaining predicted outages
- shapley values (shap) - see sketch after this list
- helps determine what each feature contributes to a prediction
- need to remove correlated features before training
- or else the SHAP values will be skewed
- local interpretable model-agnostic explanations (LIME)
- contrastive analysis
- sensitivity analysis
- tree analysis (to tune tree depth)
- helps identify the cause of overprediction
- remove the events/training data that created the bad tree predictions
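- minimal shap sketch for a random forest (my own, on synthetic data)

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))                      # stand-in weather features
    y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=500)   # stand-in outage counts

    model = RandomForestRegressor(n_estimators=100).fit(X, y)

    # TreeExplainer is exact and fast for tree ensembles
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # per-feature contribution to each prediction, summarized globally
    shap.summary_plot(shap_values, X)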

- scoring models using forecast weather data
- using pyspark to aggregate data for 24/48/72 hour windows
- PMML model (export/scoring sketch below)

- look up
- PMML model
- https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
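- hedged sketch of PMML export and scoring (my own; assumes the sklearn2pmml and pypmml packages, neither named in the talk)

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    X, y = make_regression(n_samples=200, n_features=5, random_state=0)
    pipeline = PMMLPipeline([("rf", RandomForestRegressor(n_estimators=50))])
    pipeline.fit(X, y)
    sklearn2pmml(pipeline, "outage_model.pmml")  # portable model artifact

    # scoring side (e.g. inside a spark job) could then use pypmml:
    # from pypmml import Model
    # model = Model.load("outage_model.pmml")
    # model.predict({"x1": ..., "x2": ...})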








20221109 - pydata nyc 2022 - ML system deployment with kubernetes
- prod architecture
- kubernetes is a docker container management system
- can use a clusterip service to ping individual pods
- even if they share a VM/IP
- how to expose kubernetes service to outside
- use a nodeport service (python client sketch below)
- can send http requests to any worker
- a loadbalancer service creates a load balancer outside kubernetes
- it balances across nodes/VMs, not pods as kubernetes does
- knative service
- does automatic scaling, creates pods as needed
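- sketch of creating a nodeport service with the official kubernetes python client (my own example; names and ports are made up)

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside a pod

    service = client.V1Service(
        metadata=client.V1ObjectMeta(name="model-service"),
        spec=client.V1ServiceSpec(
            type="NodePort",            # or "LoadBalancer" / "ClusterIP"
            selector={"app": "model"},  # routes to pods carrying this label
            ports=[client.V1ServicePort(port=80, target_port=8080)],
        ),
    )
    client.CoreV1Api().create_namespaced_service(namespace="default", body=service)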


- resources
- watch the kubernetes youtube video by honeypot
- alternatives to kubernetes
- mesos, swarm, dockerflow?



20221109 - pydata nyc shopify ML
- shopify system
- catalog management
- product categorization
- fraud protection
- customer acquisition
- inventory
- sales
- finance
- POS
- p

- models
- multilingual BERT for text
- MobileNet v2 for images

- model arch (pytorch-style sketch below)
- bert and mobilenet run in parallel
- get text embeddings and image embeddings
- feed both into a perceptron layer
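- rough pytorch sketch of that fusion head (my reconstruction; 768/1280 are the usual BERT/MobileNetV2 embedding sizes, everything else is assumed)

    import torch
    import torch.nn as nn

    class ProductClassifier(nn.Module):
        def __init__(self, text_dim=768, image_dim=1280, hidden=256, n_classes=5000):
            super().__init__()
            # perceptron head on top of the concatenated embeddings
            self.head = nn.Sequential(
                nn.Linear(text_dim + image_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_classes),
            )

        def forward(self, text_emb, image_emb):
            # text_emb from multilingual BERT, image_emb from MobileNetV2
            return self.head(torch.cat([text_emb, image_emb], dim=-1))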

- inference architecture (toy cache sketch below)
- cache the embeddings of text and images as they come through
- then send to the next layer
- that way, can look up the cache and don't need to recompute embeddings
- savings in compute time and number of VMs
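- toy sketch of the embedding cache idea (my own; embed_text is a stand-in for the expensive BERT forward pass)

    import hashlib

    _cache: dict[str, list[float]] = {}

    def embed_text(text: str) -> list[float]:
        # stand-in for the real model call
        return [float(len(text))]

    def cached_embedding(text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in _cache:               # only embed unseen text
            _cache[key] = embed_text(text)
        return _cache[key]                  # repeat lookups skip the model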

- how to measure performance and impact
- also measure how it is used by merchants

- ml for customer acquisition
- go from product categories -> buyer pool -> audience set from ML -> ad platform
- use FB and Google ads, but replace their targeting engines

- attribute extraction for products
- problem: attributes are specific to product categories
- had domain experts annotate product categories



20201223 - pipeline work
- wrote test pipeline script using CA housing data
2 changes: 1 addition & 1 deletion Notes/Requirements/pySparkRequirements.txt
@@ -1,3 +1,3 @@
 pkg-resources==0.0.0
 py4j==0.10.9
-pyspark==3.0.0
+pyspark==3.2.2
2 changes: 1 addition & 1 deletion Notes/Requirements/tf2_37.txt
@@ -63,7 +63,7 @@ pyasn1-modules==0.2.8
 Pygments==2.6.1
 pyparsing==2.4.7
 pyrsistent==0.15.7
-pyspark==2.4.5
+pyspark==3.2.2
 python-dateutil==2.8.1
 pytz==2019.3
 pyzmq==19.0.0