Huge amounts of memory usage when collecting data from Thanos/Prometheus #22
What does "huge" mean? (: Performance optimization is a never-ending story. We need to find out if:
This will help us find out how much we can improve 🤗
@gmfrasca do you have more specific numbers for the workflows?
@4n4nd when running on 1-hour chunks, we were getting OOMKilled on a few of the larger metrics (subscription_labels, for example) even with 24GiB RAM allocated. We initially allocated 12GiB, which saw the majority of the metrics getting OOMKilled. We've worked around that by halving our chunk size (2 runs per hour, 30 minutes per chunk), but still saw the larger metrics fail at 12GiB. Due to hardware limitations we unfortunately cannot sustain running pods with 24GiB, but we also can't reduce our chunk size much further, as we would risk a job 'collision' (for example, with 4 jobs/hour at a 15-minute cadence, the XX:30 job could end up running in parallel with the XX:15 job and compete for resources we may not have). I hope this is helpful, but please let me know if there are any other details I can provide. Thanks!
Hey @bwplotka! With the holiday season concluding, I just wanted to bump/signal-boost this issue. Would you have any insights on how we can approach alleviating the memory utilization here? In the short term I don't think we have a great deal of flexibility to allocate more hardware resources, which leaves us in a tough spot for expanding the number of metrics we can retrieve or adding features to the current pipelines. It would be great if we could track down and optimize the heavy memory-usage areas in code to get around that, if at all possible.
Yea, the way forward is to obtain profiles and figure out the problematic spot (: Feel free to read more about it here: https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/
If this is a tool and not a running service, we could use simple code like this (you can just copy this function) or import https://github.com/efficientgo/tools/blob/main/performance/pkg/profiles/profile.go#L17
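A minimal sketch of what such a copy-pasteable helper could look like, using only the standard library's `runtime/pprof` package; the function name, output file name, and call placement are illustrative assumptions, not the contents of the linked profile.go:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps the current heap profile to the given path so it can
// be inspected later with `go tool pprof`.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Force a GC first so the profile reflects live allocations only.
	runtime.GC()
	return pprof.WriteHeapProfile(f)
}

func main() {
	// Hypothetical placement: call this right after the data-collection step
	// that is suspected to be the memory hotspot.
	if err := writeHeapProfile("heap.pprof"); err != nil {
		log.Fatal(err)
	}
}
```

The resulting file can then be inspected with `go tool pprof -http=:8080 <binary> heap.pprof` to see which call sites hold the most live memory.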
When collecting data from Thanos, the tool uses a lot of memory. I think this is because all the metric data is downloaded and held in memory while it is being processed and written to the backend storage.
I know the Thanos Store API does not support streaming of data, but maybe we could chunk our queries somehow (see the sketch below).
Same for Prometheus: the Remote Read API does support streaming of data, but it is not available in the upstream client yet.
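For illustration, a minimal sketch of the chunking idea above, assuming the collection is driven by a start/end time range; the `timeRange` type and `splitRange` function are hypothetical names, not part of the tool:

```go
package main

import (
	"fmt"
	"time"
)

// timeRange is a hypothetical [start, end) window for a single range query.
type timeRange struct {
	Start, End time.Time
}

// splitRange breaks a large query window into smaller sub-ranges so that each
// query only materializes a fraction of the series data in memory at a time.
func splitRange(start, end time.Time, step time.Duration) []timeRange {
	var chunks []timeRange
	for cur := start; cur.Before(end); cur = cur.Add(step) {
		chunkEnd := cur.Add(step)
		if chunkEnd.After(end) {
			chunkEnd = end
		}
		chunks = append(chunks, timeRange{Start: cur, End: chunkEnd})
	}
	return chunks
}

func main() {
	end := time.Now()
	start := end.Add(-1 * time.Hour)
	// Split a 1-hour window into 15-minute chunks.
	for _, c := range splitRange(start, end, 15*time.Minute) {
		fmt.Println(c.Start.Format(time.RFC3339), "->", c.End.Format(time.RFC3339))
	}
}
```

Each sub-range would be queried, processed, and flushed to backend storage before the next one is fetched, so peak memory stays bounded by the chunk size rather than the full query window.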