Instability and ramping resource usage #337
One evident problem is an unbounded set of label values being applied to the Prometheus metrics. (As noted in a recent datacube-wps issue, metrics are not a substitute for logs.)
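To illustrate the failure mode (this is not code from this repository, just a minimal sketch using plain prometheus_client): any label whose values are unbounded, such as a raw request path containing dataset identifiers, mints a new time series for every distinct value it ever sees, and the registry only grows.

```python
# Hypothetical illustration of unbounded label cardinality (not Explorer code):
# each distinct "path" label value creates a new time series in the registry.
from prometheus_client import Counter, generate_latest

requests_total = Counter("http_requests_total", "HTTP requests served", ["path"])

# Every distinct dataset URL becomes another series; none are ever dropped.
for dataset_id in ("a1b2", "c3d4", "e5f6"):
    requests_total.labels(path=f"/dataset/{dataset_id}").inc()

# The exposition output grows with the number of distinct label values seen.
print(generate_latest().decode())
```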
Currently, at a low point in the sawtooth, CPU utilisation is only about 10%. A few hours later: 70k+ metrics, taking ~1.6s per request, with an 11MB response body (i.e. roughly 1MB/s of uncompressed output per pod). Prometheus graphs suggest the label count may be 5x worse again in about 12hrs, which explains the bandwidth ramping. Confirmed that hammering the metrics endpoint with consecutive requests can spike CPU usage toward 100%.
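That spike can be reproduced with something like the following sketch (the localhost URL and port are assumptions, e.g. a pod reached via a port-forward, not details from this issue):

```python
# Sketch: time consecutive requests to a pod's metrics endpoint and report
# response time and payload size. The URL/port are assumed, not from the issue.
import time
import urllib.request

METRICS_URL = "http://localhost:8080/metrics"  # hypothetical port-forwarded pod

for i in range(20):
    start = time.monotonic()
    body = urllib.request.urlopen(METRICS_URL, timeout=60).read()
    elapsed = time.monotonic() - start
    print(f"request {i}: {elapsed:.2f}s, {len(body) / 1e6:.1f} MB")
```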
After another several hours, those pods are at 70% CPU, with response times for the metrics endpoint exceeding 10s and returning 300k metrics (43MB). Another deployment ("nci" rather than "prod") was not exhibiting obvious resource ramping but was still eventually failing completely: liveness probe failures and hundreds of container restarts overnight, leading to constant 502/503 (bad gateway / service unavailable) errors from the aggregated endpoint until the pods are manually replaced each day. Its individual pods had accumulated ~330k metrics (46MB). The Prometheus server status page also highlights the issue from the other side, e.g. ~800k metric names.
Note the pod consistently starts failing once the number of metrics reaches 320,000-340,000. (By then scrape durations are also approaching the scrape interval, and the CPU is consumed almost entirely with serialising metrics output.) Typically the pod starts failing liveness probes and gets stuck in a cycle of container restarts, so the service starts rejecting requests (502/503 errors). The service can be kept stable by restarting the pods before the metric count gets that high. Our other deployment was also failing at metric counts of 300k+ but did not exhibit ramping resource usage; the Prometheus server was not configured to scrape that deployment (it only gathers general pod performance metrics from the cluster, not application-specific metrics). Presumably the scraping is what turns excessive metrics into over-utilisation of basic resources, but the accumulation still causes the pod to fail even without scraping.
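A rough way to watch a pod against that threshold (again a sketch with an assumed URL; every non-comment line of the Prometheus text exposition is one sample):

```python
# Sketch: count the series a pod currently exposes and warn as it approaches
# the ~320k level at which pods were observed to start failing.
import urllib.request

METRICS_URL = "http://localhost:8080/metrics"  # hypothetical pod address
FAILURE_THRESHOLD = 320_000

body = urllib.request.urlopen(METRICS_URL, timeout=60).read().decode()
series = sum(1 for line in body.splitlines() if line and not line.startswith("#"))
print(f"{series} series exposed")
if series > 0.8 * FAILURE_THRESHOLD:
    print("warning: approaching the level where pods started failing")
```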
Cured by no longer letting the Flask exporter default to grouping labels by the raw request path, i.e. by initialising it along the lines of the sketch below.
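The original snippet isn't reproduced here; assuming the prometheus_flask_exporter package, a minimal sketch of that kind of initialisation is to group request metrics by the Flask endpoint (view name) rather than the default per-path label:

```python
# Sketch (assuming prometheus_flask_exporter): group request metrics by the
# Flask endpoint name instead of the default per-path label, so the label set
# stays bounded by the number of routes rather than the number of URLs seen.
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app, group_by="endpoint")
```

With endpoint grouping, parameterised routes such as `/dataset/<id>` collapse into a single label value instead of one series per dataset URL.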
Stats are now stable: CPU 3% per pod, transmit 20kB/s per pod, scrapes ~20ms, asymptoting to 800 metrics per pod, with memory consumption and packet rates in both directions also holding low rather than ramping up.
Lately we have been observing some instability in this explorer deployment.
There is sawtooth ramping of the CPU utilisation, memory utilisation, and transmit bandwidth. That is, each new pod steadily increases both its CPU utilisation and its transmit bandwidth over the course of roughly 24 hours, and then the pods are collectively replaced. The stats peak at roughly 75% CPU utilisation, and 5MB/s egress per pod. The receive bandwidth is not ramping (maintaining a fairly consistent 0.1MB/s per pod).
Also (perhaps unrelated), there have been frequent failures of the front page (most commonly 502s, 503s, and timeouts) according to monitoring by Uptime Robot.