
Huge amounts of memory usage when collecting data from Thanos/Prometheus #22

Open
4n4nd opened this issue Nov 11, 2020 · 5 comments

@4n4nd
Contributor

4n4nd commented Nov 11, 2020

When collecting data from Thanos, the tool uses a lot of memory. I think this is because all the metric data is downloaded and held in memory while it is being processed and written to the backend storage.

I know the Thanos Store API does not support streaming of data, but maybe we could chunk our queries somehow.
The same goes for Prometheus: the Remote Read API does support streaming, but that is not available in the upstream client yet.
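
For illustration, a minimal sketch of the chunking idea, assuming the read path goes through the Prometheus HTTP query_range API (which the Thanos Querier also exposes). The collectChunked helper, the hard-coded step, and the process callback are hypothetical placeholders, not the tool's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// collectChunked splits [start, end) into fixed-size windows and queries each
// window separately, so only one window's samples are held in memory at a time.
func collectChunked(ctx context.Context, addr, query string, start, end time.Time,
	chunk time.Duration, process func(model.Value) error) error {
	client, err := api.NewClient(api.Config{Address: addr})
	if err != nil {
		return err
	}
	promAPI := v1.NewAPI(client)

	for cur := start; cur.Before(end); cur = cur.Add(chunk) {
		chunkEnd := cur.Add(chunk)
		if chunkEnd.After(end) {
			chunkEnd = end
		}
		val, warns, err := promAPI.QueryRange(ctx, query, v1.Range{
			Start: cur,
			End:   chunkEnd,
			Step:  30 * time.Second, // placeholder resolution
		})
		if err != nil {
			return fmt.Errorf("query range %v-%v: %w", cur, chunkEnd, err)
		}
		if len(warns) > 0 {
			fmt.Println("warnings:", warns)
		}
		// Flush this window to backend storage before fetching the next one,
		// so peak memory is bounded by a single chunk rather than the full range.
		if err := process(val); err != nil {
			return err
		}
	}
	return nil
}
```

This keeps peak memory roughly proportional to the chunk size instead of the whole time range, at the cost of more (smaller) requests.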

@bwplotka
Member

What does "huge" mean? (:

Performance optimization is a never-ending story. We need to find out whether:

  • Perf improvements are worth it, and how to prioritize them. Is the tool unusable? How much memory is a "good" amount to use for the data fetched? (: Can you give some numbers or a repro? (:

This will help us figure out how much we can improve 🤗

@4n4nd
Contributor Author

4n4nd commented Nov 12, 2020

@gmfrasca do you have more specific numbers for the workflows?

@gmfrasca
Contributor

gmfrasca commented Nov 17, 2020

@4n4nd when running with 1-hour chunks, a few of the larger metrics (subscription_labels, for example) were getting OOMKilled even with 24GiB of RAM allocated.

We initially allocated 12GiB, which saw the majority of the metrics getting OOMKilled. We worked around that by halving our chunk size (2 runs per hour, 30 minutes per chunk), but still saw the larger metrics fail at 12GiB. Due to hardware limitations we unfortunately cannot sustain running pods with 24GiB, but we also can't reduce our chunk size much further, as we would risk a job 'collision' (for example, with 4 jobs/hour at a 15-minute cadence, the XX:30 job could end up running in parallel with the XX:15 job and compete for resources we may not have).

I hope this is helpful, but please let me know if there are any other details I can provide. Thanks!

@gmfrasca
Contributor

Hey @bwplotka! With the holiday season concluding, I just wanted to bump/signal-boost this issue. Would you have any insights on how we can approach alleviating the memory utilization here?

In the short term I don't think we have much flexibility to allocate more hardware resources, which leaves us in a tough spot for expanding the number of metrics we can retrieve or adding features to the current pipelines. It would be great if we could track down and optimize the memory-heavy areas in the code to get around that, if at all possible.

@bwplotka
Member

Yea, the way forward is to obtain profiles and figure out the problematic spot (:

Feel free to read more about it here: https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/

The pprof endpoint should already be available.

If this is a tool and not a long-running service, we could use simple code like this (you can just copy this function) or import https://github.com/efficientgo/tools/blob/main/performance/pkg/profiles/profile.go#L17
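
In case a copyable sketch helps, here is roughly what such a helper can look like for a one-shot CLI run, using only the standard library (the function name and file path are placeholders, not the code from the linked repo):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps a heap profile to path so it can be inspected offline
// with `go tool pprof`. Useful for short-lived CLI runs where scraping the
// /debug/pprof endpoint of a long-running process isn't an option.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // refresh allocation statistics before dumping.
	return pprof.WriteHeapProfile(f)
}
```

Call it right after the collection step finishes (e.g. writeHeapProfile("heap.pprof")), then open the result with `go tool pprof -http=:8080 heap.pprof` to browse where the allocations come from.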
