
Memory issues: objects used constantly increasing, shutdown appears not to be graceful #2083

Closed
conorevans opened this issue Nov 17, 2022 · 3 comments

Comments

@conorevans
Contributor

conorevans commented Nov 17, 2022

Firstly, really cool software!

Describe the bug

Phlare doesn't seem to free any of the objects it uses

Screenshot 2022-11-17 at 00 00 17

leading to a constantly growing memory profile (I've included some other StatefulSets like Prometheus/Loki within the test env for context) - value in MiB:

Screenshot 2022-11-17 at 10 14 16

You can also observe that every time I tried to stop the Phlare StatefulSet, the node it was running on died (gaps in metrics due to node exporter dying) -- I would find kubelet to no longer be responsive. There were no reported events of any kind, and the instance had plenty of free memory beforehand (~1GiB), even with this Phlare issue. I had to reboot the machine to resolve.

To Reproduce

Run Phlare with a standard set-up

Expected behaviour

Memory can fluctuate but is freed appropriately

Environment

  • Infrastructure: Deployed onto an EKS multi-node cluster, on a 4GiB RAM node. At peak it had ~40 pods to scrape. It used the default scrape_configs - the only values I passed were structuredConfig.storage.s3 to persist the data to S3 (see the sketch after this list).
  • Deployment tool: Helm
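
For context, the storage values I passed looked roughly like this (a minimal sketch from memory - the bucket/region are placeholders and the exact field names under s3 may differ from the chart's values.yaml):

```yaml
# values.yaml (sketch)
phlare:
  structuredConfig:
    storage:
      backend: s3
      s3:
        bucket_name: my-phlare-profiles      # placeholder
        region: eu-west-1                    # placeholder
        endpoint: s3.eu-west-1.amazonaws.com # placeholder
```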

Additional Context

In the image above (duplicated below), there was only one pod (in addition to Phlare itself) that Phlare was scraping during the first two or three lifecycles of the Phlare deployment; the final lifecycle had ~40 pods to scrape. So the problem existed even with just one pod. The last lifecycle ran for >1h, so I was waiting to see whether there was some sort of head block, à la Prometheus, that needed to be uploaded (by default once per hour) - I couldn't find docs on it - but that didn't seem to happen.

Screenshot 2022-11-17 at 10 14 16

Thankfully I have heard of this cool new software called Phlare which can help us debug 😉

Goroutines show there was no real fluctuation in the work Phlare had to do

Screenshot 2022-11-16 at 23 59 50

Alloc objects vs inuse objects:

Screenshot 2022-11-17 at 00 00 07

Screenshot 2022-11-17 at 00 00 17

As you can see, almost all of it is in convertSamples.

@simonswine
Contributor

Thanks for your feedback @conorevans. You have done a really great job laying out what is going wrong and providing all the needed context. (And a bonus point for investigating it using profiles in Phlare 🙂)

What you are seeing is mostly expected, given how Phlare v0.1 works right now:

Profiles are received and held in memory until either:

  • their estimated size in memory reaches 1GiB, or
  • the maximum block duration is reached (-phlaredb.max-block-duration=3h by default).

Then the profiles are written into a block on disk.

As the estimated size is hardcoded at present, you will need to assign a memory limit of at least 4-8 times that estimated size. We realize this is not good enough for everyone. We are planning to work on #2115 for the next release, which will significantly cut memory consumption.

As a workaround, you can lower -phlaredb.max-block-duration to 30m, which should limit the amount of memory required in your case by writing a block to disk every 30 minutes (rather than every 3 hours). This will reclaim most of the memory, and you should see a sawtooth pattern in in-use memory space/objects.
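
With the Helm chart, one way to pass that flag is via extra arguments - a rough sketch, assuming the chart exposes an extraArgs map (please double-check against the chart's values.yaml):

```yaml
# values.yaml (sketch) - the flag name matches the workaround above;
# the extraArgs key is an assumption about the chart
phlare:
  extraArgs:
    phlaredb.max-block-duration: "30m"
```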

Let me know how that goes, and I will keep you updated in #2115.

@conorevans
Contributor Author

conorevans commented Nov 22, 2022

Hey @simonswine

Ah, I see - thank you. Since I last looked, the docs have been fleshed out nicely and I can see the -phlaredb.max-block-duration flag. I had poked around in the codebase to see if I could spot something like it but couldn't - the docs are looking good now 🙂

It might be worth noting this above the resources stanza for Helm operators. I can submit a PR for that if it would be welcome.
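
Something along these lines, for example - the numbers are only a sketch derived from the 4-8x guidance above, not a recommendation from the Phlare docs:

```yaml
# values.yaml (sketch)
phlare:
  # NOTE: Phlare v0.1 holds received profiles in memory until a block is cut
  # (estimated block size is currently hardcoded at ~1GiB), so budget a
  # memory limit of roughly 4-8x that estimate.
  resources:
    requests:
      memory: 4Gi
    limits:
      memory: 8Gi
```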

Thanks!

@simonswine
Contributor

> It might be worth noting this above the resources stanza for Helm operators. I can submit a PR for that if it would be welcome.

That is a very good idea, and a PR from you for that would be very welcome 👍
