[target allocator] Scrape configuration hashing is resource intense #1544

matej-g · 2023-03-02T10:03:43Z

I'm trying out the target collector in environment with roughly 600 service monitors. What I'm constantly seeing that the target allocator container is consuming unexpectedly large amount of resources. I'm in process of collecting more data, but even checking via kubectl top shows lot of resource consumption:

<target_allocator_pod_name>          3137m        6924Mi

When consulting the profile, it seems a lot of this is coming from the hashing in the scrape configuration handler:

I assume this could be because of the large number of elements in the scrape config and the amount of walking / reflection through the structure it takes to construct a hash.

I'm considering whether hashing the configuration before it is marshalled could be more performant here, since:

We could use a fast hashing algorithm (like https://github.com/zeebo/xxh3) directly on the configuration as slice of bytes
We'd avoid having to marshal the configuration upfront, if we already can compare hashes of the byte slices

I'll provide a draft PR with suggested changes as well, but wanted to collect thoughts.

The text was updated successfully, but these errors were encountered:

matej-g · 2023-03-02T10:06:31Z

cc @open-telemetry/operator-ta-maintainers

rashmichandrashekar · 2023-07-17T17:12:08Z

The PR with the fix(#1545), fixes this issue as well - #1926
@matej-g - Are you planning to merge the PR sometime soon?

matej-g · 2023-07-18T11:48:48Z

Hey @rashmichandrashekar,
I unfortunately did not have enough time to compare if my solution in the end is more performant than existing hashing. But if it could solve your issue as well, I can take a look again and try to refresh PR. I'll try to take a look this week,

rashmichandrashekar · 2023-07-18T19:08:41Z

Hey @rashmichandrashekar, I unfortunately did not have enough time to compare if my solution in the end is more performant than existing hashing. But if it could solve your issue as well, I can take a look again and try to refresh PR. I'll try to take a look this week,

Thanks @matej-g!

swiatekm · 2023-09-29T10:23:48Z

This is particularly strange given that this code should only run on changes to the scrape configs, which mean either reloading the config file or Prometheus CRs changing. And that really shouldn't happen very often.

If it's a problem in some configurations, I'd prefer the same solution Prometheus' discovery manager uses, which is rate limiting notifications to 1 per 5 seconds.

matej-g · 2023-10-02T15:11:54Z

Hey @swiatekm-sumo, thanks for keeping an eye on this. Unfortunately I never got to move with this issue myself.

Regarding the code, I guess it's true it should only run if config change is detected. I can't recall now the exact conditions that were causing the spike in resource usage. Still:

We should not always assume that config change is infrequent, especially if we're dealing with big clusters as in this case (600+ monitors) or if a user consecutively applies multiple configurations for whatever reason. However here I agree we could rate limit.
Even in case of singular config updates, I guess it's preferable to avoid any resource usage spikes (if indeed config change is so resource intensive to cause a noticeable spike).

Putting all of this aside though, it seems like until now no one else has reported this issue, so the resource usage I experienced might not be a common thing. Unfortunately this fell through the gaps and I never got to finish my investigation.

swiatekm · 2023-10-02T16:02:57Z

Allright, makes sense. I have a change prepared that adds the rate limiting, so we can keep this issue open, and I'll link to it once I'm ready. Then we can see if anyone else encounters this problem.

I do have some fairly large production clusters where I have target allocator running, and its resource usage actually comes predominantly from recalculating targets, rather than scrape configs.

jaronoff97 · 2023-10-24T17:00:19Z

I believe the rate limiting from #2189 should resolve this. @matej-g please let me know if that helps :)

matej-g mentioned this issue Mar 2, 2023

[WIP] [target allocator] Improvements to scrape config hashing #1545

Draft

swiatekm mentioned this issue Sep 28, 2023

[target allocator] Rebuild targets on scrape config regex-only changes #2171

Merged

swiatekm mentioned this issue Oct 3, 2023

Add rate limiting for scrape config updates #2189

Merged

jaronoff97 closed this as completed Oct 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[target allocator] Scrape configuration hashing is resource intense #1544

[target allocator] Scrape configuration hashing is resource intense #1544

matej-g commented Mar 2, 2023

matej-g commented Mar 2, 2023

rashmichandrashekar commented Jul 17, 2023

matej-g commented Jul 18, 2023

rashmichandrashekar commented Jul 18, 2023

swiatekm commented Sep 29, 2023

matej-g commented Oct 2, 2023

swiatekm commented Oct 2, 2023 •

edited

Loading

jaronoff97 commented Oct 24, 2023

[target allocator] Scrape configuration hashing is resource intense #1544

[target allocator] Scrape configuration hashing is resource intense #1544

Comments

matej-g commented Mar 2, 2023

matej-g commented Mar 2, 2023

rashmichandrashekar commented Jul 17, 2023

matej-g commented Jul 18, 2023

rashmichandrashekar commented Jul 18, 2023

swiatekm commented Sep 29, 2023

matej-g commented Oct 2, 2023

swiatekm commented Oct 2, 2023 • edited Loading

jaronoff97 commented Oct 24, 2023

swiatekm commented Oct 2, 2023 •

edited

Loading