[target allocator] Scrape configuration hashing is resource intensive #1544
cc @open-telemetry/operator-ta-maintainers
Hey @rashmichandrashekar,
Thanks @matej-g!
This is particularly strange given that this code should only run on changes to the scrape configs, which means either reloading the config file or Prometheus CRs changing. And that really shouldn't happen very often. If it's a problem in some configurations, I'd prefer the same solution Prometheus' discovery manager uses, which is rate limiting notifications to 1 per 5 seconds.
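For illustration, a minimal sketch of that kind of rate limiting, coalescing change notifications and forwarding at most one per interval. Names and channel shapes here are illustrative, not the discovery manager's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// coalesceUpdates forwards at most one notification per interval,
// collapsing any updates that arrive before the ticker fires.
// This mirrors the "at most 1 notification per 5s" idea.
func coalesceUpdates(in <-chan struct{}, out chan<- struct{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	pending := false
	for {
		select {
		case _, ok := <-in:
			if !ok {
				return
			}
			pending = true // remember that something changed, but don't notify yet
		case <-ticker.C:
			if pending {
				out <- struct{}{} // one notification for all changes in this window
				pending = false
			}
		}
	}
}

func main() {
	in := make(chan struct{})
	out := make(chan struct{})
	go coalesceUpdates(in, out, 5*time.Second)

	// Simulate a burst of config-change events; only one downstream
	// notification is expected within each 5s window.
	go func() {
		for i := 0; i < 10; i++ {
			in <- struct{}{}
		}
	}()

	<-out
	fmt.Println("recomputed scrape configs once for the whole burst")
}
```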
Hey @swiatekm-sumo, thanks for keeping an eye on this. Unfortunately I never got to move forward with this issue myself. Regarding the code, I guess it's true that it should only run if a config change is detected. I can't recall now the exact conditions that were causing the spike in resource usage. Still:
Putting all of this aside though, it seems like until now no one else has reported this issue, so the resource usage I experienced might not be a common thing. Unfortunately this fell through the cracks and I never got to finish my investigation.
Alright, makes sense. I have a change prepared that adds the rate limiting, so we can keep this issue open, and I'll link to it once I'm ready. Then we can see if anyone else encounters this problem. I do have some fairly large production clusters where I have the target allocator running, and its resource usage actually comes predominantly from recalculating targets, rather than from scrape configs.
I'm trying out the target allocator in an environment with roughly 600 service monitors. What I'm consistently seeing is that the target allocator container is consuming an unexpectedly large amount of resources. I'm in the process of collecting more data, but even checking via `kubectl top` shows a lot of resource consumption. When consulting the profile, it seems a lot of this is coming from the hashing in the scrape configuration handler.
I assume this could be because of the large number of elements in the scrape config and the amount of walking / reflection through the structure it takes to construct a hash.
I'm considering whether hashing the configuration before it is marshalled could be more performant here, since:
I'll provide a draft PR with suggested changes as well, but wanted to collect thoughts.
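For a rough sense of the direction, here is a minimal sketch of hashing the serialized bytes in a single pass rather than reflecting over every field. The types are hypothetical stand-ins, not the operator's actual structs, and JSON is used only to keep the sketch dependency-free; presumably the real bytes would come from the YAML the config is already marshalled to elsewhere, so the serialization cost wouldn't be paid twice.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// scrapeJob is a stand-in for a real scrape config entry; the actual
// Prometheus config structs are much deeper, which is what makes
// reflection-based hashing expensive.
type scrapeJob struct {
	JobName       string            `json:"job_name"`
	MetricsPath   string            `json:"metrics_path"`
	StaticTargets []string          `json:"static_targets"`
	Labels        map[string]string `json:"labels"`
}

// hashConfigs hashes the serialized form of the configs in one pass,
// so the handler only needs to compare two digest strings to detect a change.
func hashConfigs(jobs []scrapeJob) (string, error) {
	b, err := json.Marshal(jobs)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return fmt.Sprintf("%x", sum), nil
}

func main() {
	jobs := []scrapeJob{
		{
			JobName:       "example",
			MetricsPath:   "/metrics",
			StaticTargets: []string{"10.0.0.1:8080"},
			Labels:        map[string]string{"team": "o11y"},
		},
	}

	before, _ := hashConfigs(jobs)

	// A change anywhere in the (potentially huge) config shows up as a
	// different digest.
	jobs[0].MetricsPath = "/federate"
	after, _ := hashConfigs(jobs)

	fmt.Println("changed:", before != after)
}
```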