Instability and ramping resource usage #337

Closed
benjimin opened this issue Nov 4, 2021 · 5 comments

benjimin commented Nov 4, 2021

Lately I have been observing some instability in this explorer deployment.

There is sawtooth ramping of the CPU utilisation, memory utilisation, and transmit bandwidth. That is, each new pod steadily increases both its CPU utilisation and its transmit bandwidth over the course of roughly 24 hours, and then the pods are collectively replaced. The stats peak at roughly 75% CPU utilisation, and 5MB/s egress per pod. The receive bandwidth is not ramping (maintaining a fairly consistent 0.1MB/s per pod).

[screenshot: pod CPU, memory, and network bandwidth graphs showing the sawtooth ramping]

Also (perhaps unrelated), there have been frequent failures from the front page (most commonly 502s, 503s, and timeouts) according to monitoring by Uptime Robot.


benjimin commented Nov 9, 2021

One evident problem is an unbounded set of labels applied to Prometheus metrics. (As noted in a recent datacube-wps issue, metrics are not for logs.)

flask_http_request_duration_seconds_sum{method="GET",path="/product/high_tide_comp_20p/regions/6_-41",status="404"} 0.052539687836542726
flask_http_request_duration_seconds_sum{method="GET",path="/products/wofs_annual_summary/datasets/17c40f41-65d4-4e43-968b-5cf5d953fed6",status="200"} 0.1782301519997418
flask_http_request_duration_seconds_sum{method="GET",path="/dataset/0dcdef87-b507-42dc-a69b-c570501068ca.odc-metadata.yaml",status="200"} 0.031694565899670124
flask_http_request_duration_seconds_sum{method="GET",path="/products/ls7_fc_albers/datasets/6b772d4a-7c00-46b9-8b1a-38e4d95c3380",status="200"} 0.14692931901663542
flask_http_request_duration_seconds_sum{method="GET",path="/products/ga_ls5t_ard_3/datasets/379642d8-6ebe-4b55-9c71-92e345567dd5",status="200"} 0.11339368112385273
flask_http_request_duration_seconds_sum{method="GET",path="/dataset/65f34855-7322-4cf2-9a93-cffd12639cf7.odc-metadata.yaml",status="200"} 0.017415384063497186
flask_http_request_duration_seconds_sum{method="GET",path="/products/ga_ls_wo_3/datasets/e826f451-85c5-5aa4-ab0f-8776ce908b88",status="200"} 0.06057134480215609

Currently (at a low point in the sawtooth, only about 10% CPU utilisation), curl localhost:8080/metrics returns 30,000+ metrics and takes ~0.7 seconds. As the container serves requests for information about more distinct datasets, this list will keep growing, so the incessant Prometheus scraping will get steadily slower and more demanding. Scrapes are initiated every 10s.

A few hours later: 70k+ metrics, taking ~1.6s per request, with an 11MB response body (i.e. ~1MB/s of uncompressed egress per pod). Prometheus graphs suggest the label count may be 5x worse in about 12hrs. This explains the bandwidth ramping.
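
As a rough illustration of that check, here is a minimal sketch (assuming the pod's metrics port is reachable on localhost:8080, e.g. via a port-forward): it counts the non-comment lines of the /metrics payload and reports its size.

import urllib.request

text = urllib.request.urlopen("http://localhost:8080/metrics", timeout=30).read().decode()
series = [line for line in text.splitlines() if line and not line.startswith("#")]
print(f"{len(series)} series, {len(text) / 1e6:.1f} MB payload")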

Confirmed that hammering the metric endpoint (with consecutive requests) can spike CPU usage toward 100%.

[screenshots: monitoring graphs]


benjimin commented Nov 9, 2021

After another several hours, those pods are at 70% CPU, with the metric endpoint taking over 10s to respond and returning 300k metrics (43MB).

Another deployment ("nci" rather than "prod") was not exhibiting obvious resource ramping, but was still failing completely in the end: liveness probe failures and hundreds of container restarts overnight, leading to constant 502/503 (bad gateway / service unavailable) errors from the aggregated endpoint until the pods were manually replaced each day. Its individual pods had accumulated ~330k metrics (46MB).

The Prometheus server status page also highlights the issue from the other side: e.g. ~800k time series for flask_http_request_duration_seconds_bucket (the histogram buckets), plus another 50k+ each for the overall request duration sum and request count, and over 3GB of memory consumed by the path label alone.
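
For completeness, the same cardinality can be checked from the Prometheus side through its standard HTTP query API; a sketch only, with a placeholder server URL:

import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.internal:9090"  # placeholder Prometheus server URL
query = "count(flask_http_request_duration_seconds_bucket)"
url = PROM + "/api/v1/query?query=" + urllib.parse.quote(query)
result = json.load(urllib.request.urlopen(url, timeout=30))
print("series:", result["data"]["result"][0]["value"][1])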


benjimin commented Nov 11, 2021

Note the pod consistently starts failing if the number of metrics reaches 320,000-340,000. (By then, scrape durations are also approaching the scraping interval, and the CPU is heavily consumed solely with outputting metrics data.) Generally the pod starts failing liveness probes and gets stuck in a cycle of container restarts, so the service starts rejecting requests (502/503 errors).

The service can be kept stable by running a rollout restart of the Kubernetes deployment every few hours (so that the pod volumes are recreated afresh, resetting the metric count before it grows too large).
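
A rough sketch of that interim workaround, using the official Kubernetes Python client to do the equivalent of kubectl rollout restart (the deployment and namespace names below are placeholders):

from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run from inside the cluster

# A rollout restart just stamps the pod template with a restartedAt annotation,
# which causes the deployment to roll out fresh pods.
patch = {"spec": {"template": {"metadata": {"annotations": {
    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()}}}}}
client.AppsV1Api().patch_namespaced_deployment(
    name="explorer", namespace="default", body=patch)  # placeholder names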

[screenshot: monitoring graphs]

Our other deployment was also failing with metric counts of 300k+ but did not exhibit ramping resource usage. The Prometheus server was not configured to scrape this other deployment (it only gathers general pod performance metrics from the cluster, not application-specific metrics). Presumably the scraping is what turns excessive metrics into over-utilisation of basic resources, but the accumulation will still cause the pod to fail even without scraping.


benjimin commented Nov 11, 2021

Cured by no longer letting the Flask exporter default to grouping labels by request path, i.e. by now initialising it like:

GunicornInternalPrometheusMetrics(app, group_by="endpoint")
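
In context, the initialisation looks roughly like the following sketch (assuming the usual Gunicorn multiprocess setup for the exporter, i.e. the Prometheus multiprocess directory environment variable, is already configured for the deployment):

from flask import Flask
from prometheus_flask_exporter.multiprocess import GunicornInternalPrometheusMetrics

app = Flask(__name__)

# Group request metrics by Flask endpoint (the view function) rather than by the
# default raw request path, so the label set stays bounded no matter how many
# distinct dataset URLs are visited.
metrics = GunicornInternalPrometheusMetrics(app, group_by="endpoint")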

Stats are now stable (CPU 3% per pod, transmit 20kB/s per pod, scrapes 20ms, asymptoting to 800 metrics per pod); memory consumption and packet rates in both directions are also holding low rather than ramping up.

benjimin commented

Instability also seems resolved.

[screenshot: uptime monitoring for prod and sandbox]

Note that prod and sandbox were different URLs pointing at the same deployment; the difference presumably corresponded to probes failing intermittently.
