This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

Extremely inefficient metric querying can produce significant load on the monitored cluster #10

Open
michaelklishin opened this issue Sep 6, 2023 · 0 comments

Comments

@michaelklishin

FYI, the RabbitMQ core team routinely sees this monitoring tool cause unreasonably high load on nodes: it uses GET /api/queues to fetch every metric of every queue, which can produce very large payloads that saturate network links in short bursts.
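
For context, a minimal sketch of that pattern, assuming the management plugin's default port 15672, a hypothetical hostname, and placeholder guest:guest credentials:

```python
# Sketch of the pattern described above: a single request that returns
# all metrics of all queues in one JSON document.
# The hostname and guest:guest credentials are illustrative defaults;
# 15672 is the management plugin's default port.
import base64
import json
import urllib.request

url = "http://rabbit-1.example.com:15672/api/queues"
token = base64.b64encode(b"guest:guest").decode()
req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

with urllib.request.urlopen(req, timeout=30) as resp:
    body = resp.read()

queues = json.loads(body)
print(f"{len(queues)} queues returned, {len(body) / 1e6:.1f} MB in a single response")
```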

Consider 100K queues with 60 metrics each, all in a single JSON collection: that is 6M key-value pairs. If we assume that each pair is on average 30 bytes long, that's roughly 180 MB of data per response; fetching it every second would demand about 1.4 Gbit/s, enough to saturate a gigabit link on its own.
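
A quick back-of-envelope check of those numbers (the 30-byte average pair size and the one-second scrape interval are assumptions for illustration):

```python
# Back-of-envelope estimate of a full GET /api/queues scrape.
# The per-pair size and scrape interval are illustrative assumptions.
queues = 100_000
metrics_per_queue = 60
bytes_per_pair = 30          # assumed average size of one serialized key-value pair
scrape_interval_s = 1        # assumed scrape frequency

pairs = queues * metrics_per_queue              # 6,000,000 key-value pairs
payload_bytes = pairs * bytes_per_pair          # ~180 MB per response
throughput_gbit = payload_bytes * 8 / scrape_interval_s / 1e9

print(f"{pairs:,} pairs, {payload_bytes / 1e6:.0f} MB per scrape, "
      f"~{throughput_gbit:.1f} Gbit/s sustained")   # ~1.4 Gbit/s
```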

Add a frequent check interval on top and it is easy to see how this tool can wreak havoc on the system it monitors.

Consider using the Prometheus format.
Prometheus metrics are scraped from each node individually, and the endpoint supports an aggregated metrics mode designed specifically for this kind of problem.
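
As a rough sketch of what per-node, aggregated scraping looks like (assuming the rabbitmq_prometheus plugin is enabled and listening on its default port 15692, with hypothetical hostnames), the payload stays bounded no matter how many queues exist:

```python
# Sketch: pull the aggregated Prometheus endpoint from each node individually.
# Hostnames are hypothetical; 15692 is the rabbitmq_prometheus plugin's default port.
import urllib.request

nodes = ["rabbit-1.example.com", "rabbit-2.example.com", "rabbit-3.example.com"]

for node in nodes:
    url = f"http://{node}:15692/metrics"  # aggregated metrics; size does not grow with queue count
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
    print(f"{node}: {len(body) / 1024:.0f} KiB of aggregated metrics")
```

In a real deployment these endpoints would be scraped by a Prometheus server rather than polled ad hoc; per-object metrics can still be exposed as an explicit opt-in when they are genuinely needed.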

@michaelklishin michaelklishin changed the title from "Extremely inefficient metric querying" to "Extremely inefficient metric querying can produce significant load on the monitored cluster" on Sep 6, 2023