Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bank vat has a memory leak #5671

Closed
arirubinstein opened this issue Jun 26, 2022 · 9 comments
Closed

Bank vat has a memory leak #5671

arirubinstein opened this issue Jun 26, 2022 · 9 comments
Assignees
Labels
bug Something isn't working cosmic-swingset package: cosmic-swingset resource-exhaustion Threats to availability from resource exhaustion attacks SwingSet package: SwingSet
Milestone

Comments

@arirubinstein
Copy link
Contributor

arirubinstein commented Jun 26, 2022

When running the loadtest with 40 workers, the bringOutYourDead invocations for the bank vat seems to take longer and longer over time, suggesting that there is accumulated state growing in a way that is proportional to accumulating transaction volume.
Screen Shot 2022-06-26 at 12 30 50 PM

@arirubinstein arirubinstein added the bug Something isn't working label Jun 26, 2022
@arirubinstein
Copy link
Contributor Author

Screen Shot 2022-06-26 at 9 20 31 PM

more multi-day view

@turadg turadg added the SwingSet package: SwingSet label Jun 30, 2022
@warner
Copy link
Member

warner commented Jun 30, 2022

Next step is to get more direct evidence of a leak (which hopefully might point to what exactly is leaking).

First step is to use @mhofman 's improved telemetry to look specifically at the time spent doing xsnap GC in those BOYD deliveries (the time spent on the worker after receipt of the delivery, and before transmission of the first syscall / delivery result). That will rule out any messaging time and anything on the kernel process side. We should look at the number of syscalls emitted per BOYD (I'm assuming those are pretty low/constant, but if not, that's super interesting). If those are flat, then we can guess that most of the worker time is spent in the engineGC() call, and we know that should be O(live objects), which would then point to the number of live objects growing over time.

Then we should take a look at that vat's c-list at a couple of different points and see if its size is growing in a way that matches the GC time spent. We can randomly sample the c-list at the end of a run to find out if the outside world is holding onto things inside vat-bank (making it not exactly vat-bank's fault), or if vat-bank is holding onto a lot of imports (which might tell us what sort of object is getting stuck within vat-bank).

I know @michaelfig said there are some design improvements he has in mind for vat-bank .. Michael, should we investigate this vat's behavior first, to characterize the leak, or do you want to make code changes first?

@turadg
Copy link
Member

turadg commented Jul 6, 2022

@michaelfig is this ticket the leak you mentioned that #5709 might solve?

@warner
Copy link
Member

warner commented Jul 20, 2022

@erights can you add your current knowledge to this ticket?

@erights
Copy link
Member

erights commented Jul 21, 2022

What I think I remember is that mfig thought it was a hot vs cold notifier problem. That the hot notifier (or its underlying subscriber?) is eagerly notifying even when no one is interested. A cold notifier doesn't compute a value to notify until someone asks. Until then, it stays idle.

I do not have much confidence that I'm remembering this correctly.

@dckc dckc added the resource-exhaustion Threats to availability from resource exhaustion attacks label Jul 26, 2022
@warner warner self-assigned this Jul 27, 2022
@warner
Copy link
Member

warner commented Aug 4, 2022

In #5886 I have a demonstration of a memory leak (under Node.js) of 3.3kb per update, using a stripped down copy of the core "observeIteration() wrapped around a Notifier" pattern that vat-bank uses to distribute balance updates. I haven't tested under XS but I don't think there's anything engine-specific about this. @erights suggested experimenting with the way eventual-send tracks the debug information to emit "deep stacks", and indeed if I disable some of that mechanism, the leak goes away.

The next step is for @erights to help us find a way to disable that (perhaps selectively), or find a way to change observeIteration to not accumulate so much information. We'll do that work over on #5886.

@warner
Copy link
Member

warner commented Aug 5, 2022

#5892 has landed, which disables the deep-stack tracing. With luck, that will resolve the vat-bank memory leak (or at least the most obvious one).

The next step is to spin up a testnet and let it run for a day with the same load profile that was graphed at the top of this issue, and observe the new shape. If that is flat, we can close this, and move on to the next performance bottleneck.

@arirubinstein
Copy link
Contributor Author

It looks like at first glance this may be fixed:

I'll check back after the weekend to see if there is still growth.

Screen Shot 2022-08-06 at 11 23 21 PM

@warner
Copy link
Member

warner commented Aug 16, 2022

@arirubinstein are you satisfied that vat-bank is no longer the biggest problem? Could you close this issue if so?

@dckc dckc added pso cosmic-swingset package: cosmic-swingset labels Aug 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cosmic-swingset package: cosmic-swingset resource-exhaustion Threats to availability from resource exhaustion attacks SwingSet package: SwingSet
Projects
None yet
Development

No branches or pull requests

7 participants