Instability and ramping resource usage #337

Closed
benjimin opened this issue Nov 4, 2021 · 5 comments

benjimin commented Nov 4, 2021

Lately I have been observing some instability in this explorer deployment.

There is sawtooth ramping of the CPU utilisation, memory utilisation, and transmit bandwidth. That is, each new pod steadily increases both its CPU utilisation and its transmit bandwidth over the course of roughly 24 hours, and then the pods are collectively replaced. The stats peak at roughly 75% CPU utilisation, and 5MB/s egress per pod. The receive bandwidth is not ramping (maintaining a fairly consistent 0.1MB/s per pod).

[screenshot: pod CPU, memory, and network bandwidth graphs showing the sawtooth ramping]

Also (perhaps unrelated), there have been frequent failures from the front page (most commonly 502s, 503s, and timeouts) according to monitoring by Uptime Robot.


benjimin commented Nov 9, 2021

One evident problem is an unbounded set of labels applied to Prometheus metrics. (As noted in a recent datacube-wps issue, metrics are not for logs.)

flask_http_request_duration_seconds_sum{method="GET",path="/product/high_tide_comp_20p/regions/6_-41",status="404"} 0.052539687836542726
flask_http_request_duration_seconds_sum{method="GET",path="/products/wofs_annual_summary/datasets/17c40f41-65d4-4e43-968b-5cf5d953fed6",status="200"} 0.1782301519997418
flask_http_request_duration_seconds_sum{method="GET",path="/dataset/0dcdef87-b507-42dc-a69b-c570501068ca.odc-metadata.yaml",status="200"} 0.031694565899670124
flask_http_request_duration_seconds_sum{method="GET",path="/products/ls7_fc_albers/datasets/6b772d4a-7c00-46b9-8b1a-38e4d95c3380",status="200"} 0.14692931901663542
flask_http_request_duration_seconds_sum{method="GET",path="/products/ga_ls5t_ard_3/datasets/379642d8-6ebe-4b55-9c71-92e345567dd5",status="200"} 0.11339368112385273
flask_http_request_duration_seconds_sum{method="GET",path="/dataset/65f34855-7322-4cf2-9a93-cffd12639cf7.odc-metadata.yaml",status="200"} 0.017415384063497186
flask_http_request_duration_seconds_sum{method="GET",path="/products/ga_ls_wo_3/datasets/e826f451-85c5-5aa4-ab0f-8776ce908b88",status="200"} 0.06057134480215609

Currently (at a low point in the sawtooth, only about 10% CPU utilisation), curl localhost:8080/metrics returns 30,000+ metrics and takes ~0.7 seconds. As the container serves requests for information about more distinct datasets, this list will keep growing, so the incessant Prometheus scraping will get steadily slower and more demanding. Scrapes are initiated every 10s.

A few hours later: 70k+ metrics, taking ~1.6s per request, with an 11MB response body (i.e. ~1MB/s of uncompressed egress per pod). Prometheus graphs suggest the label count may be 5x worse in about 12hrs. This explains the bandwidth ramping.
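
As a rough illustration of that check, here is a minimal sketch (assuming the pod's metrics port is reachable on localhost:8080, e.g. via a port-forward): it counts the non-comment lines of the /metrics payload and reports its size.

import urllib.request

text = urllib.request.urlopen("http://localhost:8080/metrics", timeout=30).read().decode()
series = [line for line in text.splitlines() if line and not line.startswith("#")]
print(f"{len(series)} series, {len(text) / 1e6:.1f} MB payload")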

Confirmed that hammering the metric endpoint (with consecutive requests) can spike CPU usage toward 100%.

[screenshots: monitoring graphs]


benjimin commented Nov 9, 2021

After another several hours, those pods are at 70% CPU, with the metric endpoint taking over 10s to respond and returning 300k metrics (43MB).

Another deployment ("nci" rather than "prod") was not exhibiting obvious resource ramping, but was still failing completely in the end: liveness probe failures and hundreds of container restarts overnight, leading to constant 502/503 (bad gateway / service unavailable) errors from the aggregated endpoint until the pods were manually replaced each day. Its individual pods had accumulated ~330k metrics (46MB).

The Prometheus server status page also highlights the issue from the other side: e.g. ~800k time series for flask_http_request_duration_seconds_bucket (the histogram buckets), plus another 50k+ each for the overall request duration sum and request count, and over 3GB of memory consumed by the path label alone.
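
For completeness, the same cardinality can be checked from the Prometheus side through its standard HTTP query API; a sketch only, with a placeholder server URL:

import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.internal:9090"  # placeholder Prometheus server URL
query = "count(flask_http_request_duration_seconds_bucket)"
url = PROM + "/api/v1/query?query=" + urllib.parse.quote(query)
result = json.load(urllib.request.urlopen(url, timeout=30))
print("series:", result["data"]["result"][0]["value"][1])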


benjimin commented Nov 11, 2021

Note the pod consistently starts failing if the number of metrics reaches 320,000-340,000. (By then, scrape durations are also approaching the scraping interval, and the CPU is heavily consumed solely with outputting metrics data.) Generally the pod starts failing liveness probes and gets stuck in a cycle of container restarts, so the service starts rejecting requests (502/503 errors).

The service can be kept stable by running a rollout restart of the Kubernetes deployment every few hours (so that the pod volumes are recreated afresh, resetting the metric count before it grows too large).
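
A rough sketch of that interim workaround, using the official Kubernetes Python client to do the equivalent of kubectl rollout restart (the deployment and namespace names below are placeholders):

from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run from inside the cluster

# A rollout restart just stamps the pod template with a restartedAt annotation,
# which causes the deployment to roll out fresh pods.
patch = {"spec": {"template": {"metadata": {"annotations": {
    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()}}}}}
client.AppsV1Api().patch_namespaced_deployment(
    name="explorer", namespace="default", body=patch)  # placeholder names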

[screenshot: monitoring graphs]

Our other deployment was also failing with metric counts of 300k+ but did not exhibit ramping resource usage. The Prometheus server was not configured to scrape this other deployment (it only gathers general pod performance metrics from the cluster, not application-specific metrics). Presumably the scraping is what turns excessive metrics into over-utilisation of basic resources, but the accumulation will still cause the pod to fail even without scraping.


benjimin commented Nov 11, 2021

Cured by no longer letting the Flask exporter default to grouping labels by request path, i.e. by now initialising it like:

GunicornInternalPrometheusMetrics(app, group_by="endpoint")
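
In context, the initialisation looks roughly like the following sketch (assuming the usual Gunicorn multiprocess setup for the exporter, i.e. the Prometheus multiprocess directory environment variable, is already configured for the deployment):

from flask import Flask
from prometheus_flask_exporter.multiprocess import GunicornInternalPrometheusMetrics

app = Flask(__name__)

# Group request metrics by Flask endpoint (the view function) rather than by the
# default raw request path, so the label set stays bounded no matter how many
# distinct dataset URLs are visited.
metrics = GunicornInternalPrometheusMetrics(app, group_by="endpoint")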

Stats are now stable (CPU 3% per pod, transmit 20kB/s per pod, scrapes 20ms, asymptoting to 800 metrics per pod); memory consumption and packet rates in both directions are also holding low rather than ramping up.

benjimin commented

Instability also seems resolved.

[screenshot: uptime monitoring for prod and sandbox]

Note that prod and sandbox were different URLs pointing at the same deployment; the difference presumably corresponded to probes failing intermittently.
