
[target allocator] Scrape configuration hashing is resource intense #1544

Closed
matej-g opened this issue Mar 2, 2023 · 8 comments

Comments

@matej-g
Contributor

matej-g commented Mar 2, 2023

I'm trying out the target allocator in an environment with roughly 600 service monitors. What I'm constantly seeing is that the target allocator container consumes an unexpectedly large amount of resources. I'm in the process of collecting more data, but even checking via kubectl top shows a lot of resource consumption:

<target_allocator_pod_name>          3137m        6924Mi

When consulting the profile, it seems a lot of this is coming from the hashing in the scrape configuration handler:
[Screenshot 2023-03-01 at 18 11 36: profile showing a large share of resource usage in the scrape configuration hashing]

I assume this could be because of the large number of elements in the scrape config and the amount of walking / reflection through the structure it takes to construct a hash.

I'm considering whether hashing the configuration before it is marshalled could be more performant here, since:

  1. We could use a fast hashing algorithm (like https://github.com/zeebo/xxh3) directly on the configuration as a slice of bytes
  2. We'd avoid having to marshal the configuration upfront, if we can already compare hashes of the byte slices

I'll provide a draft PR with suggested changes as well, but wanted to collect thoughts.
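Roughly, the idea would look something like the sketch below (names and structure here are purely illustrative, not the operator's current code): hash the configuration while it is still a slice of bytes and only treat it as changed when the digest differs from the last one seen.

```go
// Illustrative sketch only, not the operator's actual code: compute an xxh3
// digest over the raw scrape configuration bytes and compare it against the
// previously observed digest, instead of walking the unmarshalled structure
// via reflection.
package sketch

import "github.com/zeebo/xxh3"

// configTracker is a hypothetical helper that remembers the last digest.
type configTracker struct {
	lastHash uint64
}

// Changed reports whether rawConfig differs from the previously observed
// configuration, using a single xxh3 pass over the bytes.
func (t *configTracker) Changed(rawConfig []byte) bool {
	h := xxh3.Hash(rawConfig)
	if h == t.lastHash {
		return false
	}
	t.lastHash = h
	return true
}
```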

@matej-g
Contributor Author

matej-g commented Mar 2, 2023

cc @open-telemetry/operator-ta-maintainers

@rashmichandrashekar
Contributor

The PR with the fix (#1545) fixes this issue as well - #1926
@matej-g - Are you planning to merge the PR sometime soon?

@matej-g
Contributor Author

matej-g commented Jul 18, 2023

Hey @rashmichandrashekar,
I unfortunately did not have enough time to compare whether my solution ends up being more performant than the existing hashing. But if it could solve your issue as well, I can take a look again and try to refresh the PR. I'll try to take a look this week.

@rashmichandrashekar
Contributor

Hey @rashmichandrashekar, I unfortunately did not have enough time to compare whether my solution ends up being more performant than the existing hashing. But if it could solve your issue as well, I can take a look again and try to refresh the PR. I'll try to take a look this week.

Thanks @matej-g!

@swiatekm
Contributor

This is particularly strange given that this code should only run on changes to the scrape configs, which means either the config file being reloaded or Prometheus CRs changing. And that really shouldn't happen very often.

If it's a problem in some configurations, I'd prefer the same solution Prometheus' discovery manager uses, which is rate limiting notifications to 1 per 5 seconds.
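Something along these lines (just a sketch of the pattern, not the actual Prometheus or operator code): coalesce change notifications and act on them at most once per interval.

```go
// Sketch of the rate-limiting idea: collect change notifications and apply
// them at most once every 5 seconds, similar in spirit to how Prometheus'
// discovery manager throttles its updates.
package sketch

import "time"

func rateLimitUpdates(updates <-chan struct{}, apply func()) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	pending := false
	for {
		select {
		case _, ok := <-updates:
			if !ok {
				return
			}
			pending = true // coalesce bursts of notifications
		case <-ticker.C:
			if pending {
				apply()
				pending = false
			}
		}
	}
}
```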

@matej-g
Contributor Author

matej-g commented Oct 2, 2023

Hey @swiatekm-sumo, thanks for keeping an eye on this. Unfortunately I never got around to moving forward with this issue myself.

Regarding the code, I guess it's true that it should only run when a config change is detected. I can't recall now the exact conditions that were causing the spike in resource usage. Still:

  1. We should not always assume that config changes are infrequent, especially if we're dealing with big clusters as in this case (600+ monitors), or if a user applies multiple configurations in quick succession for whatever reason. Here, however, I agree we could rate limit.
  2. Even in the case of singular config updates, I guess it's preferable to avoid any resource usage spikes (if a config change is indeed resource-intensive enough to cause a noticeable spike).

Putting all of this aside, though, it seems that no one else has reported this issue so far, so the resource usage I experienced might not be a common occurrence. Unfortunately this fell through the cracks and I never got to finish my investigation.

@swiatekm
Contributor

swiatekm commented Oct 2, 2023

Alright, makes sense. I have a change prepared that adds the rate limiting, so we can keep this issue open, and I'll link to it once I'm ready. Then we can see if anyone else encounters this problem.

I do have some fairly large production clusters with the target allocator running, and its resource usage actually comes predominantly from recalculating targets rather than from scrape configs.

@jaronoff97
Contributor

I believe the rate limiting from #2189 should resolve this. @matej-g please let me know if that helps :)
