
Memory issues: objects used constantly increasing, shutdown appears not to be graceful #2083

Closed
conorevans opened this issue Nov 17, 2022 · 3 comments

Comments

@conorevans
Contributor

conorevans commented Nov 17, 2022

Firstly, really cool software!

Describe the bug

Phlare doesn't seem to free any of the objects it uses

Screenshot 2022-11-17 at 00 00 17

leading to a constantly growing memory profile (I've included some other StatefulSets like Prometheus/Loki within the test env for context) - value in MiB:

Screenshot 2022-11-17 at 10 14 16

You can also observe that every time I tried to stop the Phlare StatefulSet, the node it was running on died (gaps in metrics due to node exporter dying) -- I would find kubelet to no longer be responsive. There were no reported events of any kind, and the instance had plenty of free memory beforehand (~1GiB), even with this Phlare issue. I had to reboot the machine to resolve.

To Reproduce

Run Phlare with a standard set-up

Expected behaviour

Memory can fluctuate but is freed appropriately

Environment

  • Infrastructure: Deployed onto an EKS multi-node cluster, on a 4GiB RAM node. At peak it had ~40 pods to scrape. It used the default scrape_configs - the only values I passed were structuredConfig.storage.s3 to persist the data to S3 (see the sketch after this list).
  • Deployment tool: Helm
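
For context, the storage values I passed looked roughly like this (a minimal sketch from memory - the bucket/region are placeholders and the exact field names under s3 may differ from the chart's values.yaml):

```yaml
# values.yaml (sketch)
phlare:
  structuredConfig:
    storage:
      backend: s3
      s3:
        bucket_name: my-phlare-profiles      # placeholder
        region: eu-west-1                    # placeholder
        endpoint: s3.eu-west-1.amazonaws.com # placeholder
```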

Additional Context

In the image above (duplicated below), there was only one pod (in addition to Phlare itself) that Phlare was scraping during the first two or three lifecycles of the Phlare deployment; the final lifecycle had ~40 pods to scrape. So the problem existed even with just one pod. The last lifecycle ran for >1h, so I was waiting to see whether there was some sort of head block, à la Prometheus, that needed to be uploaded (by default once per hour) - I couldn't find docs on it - but that didn't seem to happen.

Screenshot 2022-11-17 at 10 14 16

Thankfully I have heard of this cool new software called Phlare which can help us debug 😉

Goroutines show there was no real fluctuation in the work Phlare had to do

Screenshot 2022-11-16 at 23 59 50

Alloc objects vs inuse objects:

Screenshot 2022-11-17 at 00 00 07

Screenshot 2022-11-17 at 00 00 17

As you can see, almost all of it is in convertSamples.

@simonswine
Contributor

Thanks for your feedback @conorevans. You have done a really great job laying out what is going wrong and providing all the needed context. (And a bonus point for investigating it using profiles in Phlare 🙂)

What you are seeing is mostly expected, given how Phlare v0.1 works right now:

Profiles are received and held in memory until either:

  • their estimated size in memory reaches 1GiB, or
  • the maximum block duration is reached (-phlaredb.max-block-duration=3h by default).

Then the profiles are written into a block on disk.

As the estimated size is hardcoded at present, you will need to assign a memory limit of at least 4-8 times that estimated size. We realize this is not good enough for everyone. We are planning to work on #2115 for the next release, which will significantly cut memory consumption.

As a workaround, you can lower -phlaredb.max-block-duration to 30m, which should limit the amount of memory required in your case by writing a block to disk every 30 minutes (rather than every 3 hours). This will reclaim most of the memory, and you should see a sawtooth pattern in in-use memory space/objects.
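
With the Helm chart, one way to pass that flag is via extra arguments - a rough sketch, assuming the chart exposes an extraArgs map (please double-check against the chart's values.yaml):

```yaml
# values.yaml (sketch) - the flag name matches the workaround above;
# the extraArgs key is an assumption about the chart
phlare:
  extraArgs:
    phlaredb.max-block-duration: "30m"
```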

Let me know how that goes, and I will keep you updated in #2115.

@conorevans
Contributor Author

conorevans commented Nov 22, 2022

Hey @simonswine

Ah, I see - thank you. Since I last looked, the docs have been fleshed out nicely and I can see the -phlaredb.max-block-duration flag. I had poked around in the codebase to see if I could spot something like it but couldn't - the docs are looking good now 🙂

It might be worth noting this above the resources stanza for Helm operators. I can submit a PR for that if it would be welcome.
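
Something along these lines, for example - the numbers are only a sketch derived from the 4-8x guidance above, not a recommendation from the Phlare docs:

```yaml
# values.yaml (sketch)
phlare:
  # NOTE: Phlare v0.1 holds received profiles in memory until a block is cut
  # (estimated block size is currently hardcoded at ~1GiB), so budget a
  # memory limit of roughly 4-8x that estimate.
  resources:
    requests:
      memory: 4Gi
    limits:
      memory: 8Gi
```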

Thanks!

@simonswine
Contributor

> It might be worth noting this above the resources stanza for Helm operators. I can submit a PR for that if it would be welcome.

That is a very good idea, and a PR from you for that would be very welcome 👍
