Start node-problem-detector on deployed instances to collect memory stats #1523

dconnolly · 2020-12-15T01:54:40Z

Motivation

The default metrics that gcloud provides for our zebrad nodes deployed on VMs don't include memory usage.

Solution

Add google-monitoring-enabled=true metadata to deployed instances

This enables the Node Problem Detector on Container-Optimized OS, which collects metrics on memory usage, open TCP connections, processes, CPU steal, and swap usage, on top of the existing host-collected metrics.
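For reference, a minimal sketch of how this metadata key might be set on a plain instance. The instance name `zebrad-test` is a placeholder, and the `cos-stable` / `cos-cloud` image qualifiers are assumptions for illustration, not values copied from this PR's diff. The function only prints the command so it can be inspected before anything is run.

```shell
#!/bin/sh
# Print (do not run) a gcloud command that creates a Container-Optimized OS
# instance with the monitoring metadata key set.
# Assumptions: the instance name is a placeholder, and cos-stable / cos-cloud
# are illustrative image qualifiers.
monitoring_create_cmd() {
  instance_name="$1"
  printf '%s\n' "gcloud compute instances create ${instance_name} --image-family cos-stable --image-project cos-cloud --metadata google-monitoring-enabled=true"
}

monitoring_create_cmd zebrad-test
```

Copying the printed line into a shell with an authenticated gcloud would create the instance; printing first keeps the sketch free of side effects.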

Review

Not urgent.

teor2345

Looks great, and it should help us monitor memory issues like #1486 and #1487

dconnolly · 2020-12-15T16:15:40Z

Unfortunately this metadata flag doesn't seem to get picked up when using any of the create-with-container instance deployment variants, though it works fine for plain ol' gcloud compute instances create. I'll play around with options like startup scripts or cloud-init.
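A rough sketch of what the startup-script option could look like with a create-with-container variant. The instance name, container image, and script path are hypothetical, and the `systemctl start node-problem-detector` line assumes that is the service name on Container-Optimized OS. The functions only print their output, and this PR was ultimately closed without getting memory metrics from these variants, so treat it as an illustration of the attempt rather than a working recipe.

```shell
#!/bin/sh
# Hypothetical startup script body: start the detector service at boot.
# Assumes the service is called node-problem-detector on Container-Optimized OS.
startup_script_body() {
  printf '%s\n' '#!/bin/bash' 'systemctl start node-problem-detector'
}

# Print (do not run) a create-with-container command that passes both the
# metadata key and a startup script file. All names and paths are placeholders.
container_create_cmd() {
  instance_name="$1"
  container_image="$2"
  script_path="$3"
  printf '%s\n' "gcloud compute instances create-with-container ${instance_name} --container-image ${container_image} --metadata google-monitoring-enabled=true --metadata-from-file startup-script=${script_path}"
}

startup_script_body
container_create_cmd zebrad-test gcr.io/example-project/zebrad ./start-npd.sh
```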

teor2345 · 2020-12-16T04:48:18Z

These look good, but I'm not sure if they actually work, and if they need another review.

Also I think there is a conflict with #1529.

dconnolly · 2020-12-17T22:26:37Z

Moving to draft while I keep trying options that work with the create-with-container command variants: #1523 (comment)

dconnolly · 2020-12-30T20:13:09Z

When deploying containers to containers I just cannot get these memory metrics out. Closing. :/

Add google-monitoring-enabled=true metadata to deployed instances

d923aed

zfnd-bot bot assigned dconnolly Dec 15, 2020

dconnolly added the A-infrastructure (Area: Infrastructure changes) and A-devops (Area: Pipelines, CI/CD and Dockerfiles) labels Dec 15, 2020

This was referenced Dec 15, 2020

Tune RocksDB memory usage #1486

Closed

Memory leaks #1487

Closed

teor2345 previously approved these changes Dec 15, 2020

Add image-family and image-project base image qualifiers

e72b26b

dconnolly dismissed teor2345’s stale review via e72b26b December 15, 2020 15:49

Try cos-dev instead of -stable

d01b2ea

dconnolly marked this pull request as draft December 17, 2020 22:25

dconnolly changed the title ~~Add google-monitoring-enabled=true metadata to deployed instances~~ Start node-problem-detector on deployed instances to collect memory stats Dec 30, 2020

Use startup script to start node-problem-detector

faff2de

dconnolly force-pushed the enable-node-problem-detector branch from 11b53f2 to faff2de Compare December 30, 2020 18:53

Try metadata key too

f1cebb8

dconnolly closed this Dec 30, 2020

dconnolly deleted the enable-node-problem-detector branch December 30, 2020 20:13
