This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

Extremely inefficient metric querying can produce significant load on the monitored cluster #10

Open
michaelklishin opened this issue Sep 6, 2023 · 0 comments

Comments

@michaelklishin

FYI, the RabbitMQ core team routinely sees this monitoring tool cause unreasonably high load on nodes: it uses GET /api/queues to fetch every metric of every queue, which can produce very large payloads that saturate network links in short bursts.
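
For context, a minimal sketch of that pattern, assuming the management plugin's default port 15672, a hypothetical hostname, and placeholder guest:guest credentials:

```python
# Sketch of the pattern described above: a single request that returns
# all metrics of all queues in one JSON document.
# The hostname and guest:guest credentials are illustrative defaults;
# 15672 is the management plugin's default port.
import base64
import json
import urllib.request

url = "http://rabbit-1.example.com:15672/api/queues"
token = base64.b64encode(b"guest:guest").decode()
req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

with urllib.request.urlopen(req, timeout=30) as resp:
    body = resp.read()

queues = json.loads(body)
print(f"{len(queues)} queues returned, {len(body) / 1e6:.1f} MB in a single response")
```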

Consider 100K queues with 60 metrics each, all in a single JSON collection: that is 6M key-value pairs. If we assume that each pair is on average 30 bytes long, that's roughly 180 MB of data per response; fetching it every second would demand about 1.4 Gbit/s, enough to saturate a gigabit link on its own.
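
A quick back-of-envelope check of those numbers (the 30-byte average pair size and the one-second scrape interval are assumptions for illustration):

```python
# Back-of-envelope estimate of a full GET /api/queues scrape.
# The per-pair size and scrape interval are illustrative assumptions.
queues = 100_000
metrics_per_queue = 60
bytes_per_pair = 30          # assumed average size of one serialized key-value pair
scrape_interval_s = 1        # assumed scrape frequency

pairs = queues * metrics_per_queue              # 6,000,000 key-value pairs
payload_bytes = pairs * bytes_per_pair          # ~180 MB per response
throughput_gbit = payload_bytes * 8 / scrape_interval_s / 1e9

print(f"{pairs:,} pairs, {payload_bytes / 1e6:.0f} MB per scrape, "
      f"~{throughput_gbit:.1f} Gbit/s sustained")   # ~1.4 Gbit/s
```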

Add a frequent check interval on top and it is easy to see how this tool can wreak havoc on the system it monitors.

Consider using the Prometheus format.
Prometheus metrics are scraped from each node individually, and the endpoint supports an aggregated metrics mode designed specifically for this kind of problem.
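
As a rough sketch of what per-node, aggregated scraping looks like (assuming the rabbitmq_prometheus plugin is enabled and listening on its default port 15692, with hypothetical hostnames), the payload stays bounded no matter how many queues exist:

```python
# Sketch: pull the aggregated Prometheus endpoint from each node individually.
# Hostnames are hypothetical; 15692 is the rabbitmq_prometheus plugin's default port.
import urllib.request

nodes = ["rabbit-1.example.com", "rabbit-2.example.com", "rabbit-3.example.com"]

for node in nodes:
    url = f"http://{node}:15692/metrics"  # aggregated metrics; size does not grow with queue count
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
    print(f"{node}: {len(body) / 1024:.0f} KiB of aggregated metrics")
```

In a real deployment these endpoints would be scraped by a Prometheus server rather than polled ad hoc; per-object metrics can still be exposed as an explicit opt-in when they are genuinely needed.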

@michaelklishin michaelklishin changed the title from "Extremely inefficient metric querying" to "Extremely inefficient metric querying can produce significant load on the monitored cluster" on Sep 6, 2023