
Huge amounts of memory usage when collecting data from Thanos/Prometheus #22

Open
4n4nd opened this issue Nov 11, 2020 · 5 comments

@4n4nd
Contributor

4n4nd commented Nov 11, 2020

When collecting data from Thanos, the tool uses a lot of memory. I think this is because all the metric data is downloaded and held in memory while it is being processed and written to the backend storage.

I know the Thanos Store API does not support streaming of data, but maybe we could chunk our queries somehow.
The same goes for Prometheus: the Remote Read API does support streaming, but that is not available in the upstream client yet.
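
For illustration, a minimal sketch of the chunking idea, assuming the read path goes through the Prometheus HTTP query_range API (which the Thanos Querier also exposes). The collectChunked helper, the hard-coded step, and the process callback are hypothetical placeholders, not the tool's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// collectChunked splits [start, end) into fixed-size windows and queries each
// window separately, so only one window's samples are held in memory at a time.
func collectChunked(ctx context.Context, addr, query string, start, end time.Time,
	chunk time.Duration, process func(model.Value) error) error {
	client, err := api.NewClient(api.Config{Address: addr})
	if err != nil {
		return err
	}
	promAPI := v1.NewAPI(client)

	for cur := start; cur.Before(end); cur = cur.Add(chunk) {
		chunkEnd := cur.Add(chunk)
		if chunkEnd.After(end) {
			chunkEnd = end
		}
		val, warns, err := promAPI.QueryRange(ctx, query, v1.Range{
			Start: cur,
			End:   chunkEnd,
			Step:  30 * time.Second, // placeholder resolution
		})
		if err != nil {
			return fmt.Errorf("query range %v-%v: %w", cur, chunkEnd, err)
		}
		if len(warns) > 0 {
			fmt.Println("warnings:", warns)
		}
		// Flush this window to backend storage before fetching the next one,
		// so peak memory is bounded by a single chunk rather than the full range.
		if err := process(val); err != nil {
			return err
		}
	}
	return nil
}
```

This keeps peak memory roughly proportional to the chunk size instead of the whole time range, at the cost of more (smaller) requests.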

@bwplotka
Member

What does "huge" mean? (:

Performance optimization is a never-ending story. We need to find out whether:

  • Perf improvements are worth it, and how to prioritize them. Is the tool unusable? How much memory is a "good" amount to use for the data fetched? (: Can you give some numbers or a repro? (:

This will help us figure out how much we can improve 🤗

@4n4nd
Contributor Author

4n4nd commented Nov 12, 2020

@gmfrasca do you have more specific numbers for the workflows?

@gmfrasca
Contributor

gmfrasca commented Nov 17, 2020

@4n4nd when running with 1-hour chunks, a few of the larger metrics (subscription_labels, for example) were getting OOMKilled even with 24GiB of RAM allocated.

We initially allocated 12GiB, which saw the majority of the metrics getting OOMKilled. We worked around that by halving our chunk size (2 runs per hour, 30 minutes per chunk), but still saw the larger metrics fail at 12GiB. Due to hardware limitations we unfortunately cannot sustain running pods with 24GiB, but we also can't reduce our chunk size much further, as we would risk a job 'collision' (for example, with 4 jobs/hour at a 15-minute cadence, the XX:30 job could end up running in parallel with the XX:15 job and compete for resources we may not have).

I hope this is helpful, but please let me know if there are any other details I can provide. Thanks!

@gmfrasca
Contributor

Hey @bwplotka! With the holiday season concluding, I just wanted to bump/signal-boost this issue. Would you have any insights on how we can approach alleviating the memory utilization here?

In the short term I don't think we have much flexibility to allocate more hardware resources, which leaves us in a tough spot for expanding the number of metrics we can retrieve or adding features to the current pipelines. It would be great if we could track down and optimize the memory-heavy areas in the code to get around that, if at all possible.

@bwplotka
Member

Yea, the way forward is to obtain profiles and figure out the problematic spot (:

Feel free to read more about it here: https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/

The pprof endpoint should already be available.

If this is a tool and not a long-running service, we could use simple code like this (you can just copy this function) or import https://github.com/efficientgo/tools/blob/main/performance/pkg/profiles/profile.go#L17
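
In case a copyable sketch helps, here is roughly what such a helper can look like for a one-shot CLI run, using only the standard library (the function name and file path are placeholders, not the code from the linked repo):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps a heap profile to path so it can be inspected offline
// with `go tool pprof`. Useful for short-lived CLI runs where scraping the
// /debug/pprof endpoint of a long-running process isn't an option.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // refresh allocation statistics before dumping.
	return pprof.WriteHeapProfile(f)
}
```

Call it right after the collection step finishes (e.g. writeHeapProfile("heap.pprof")), then open the result with `go tool pprof -http=:8080 heap.pprof` to browse where the allocations come from.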
