Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential Memory Leak #45

Open
irond13 opened this issue Dec 7, 2015 · 5 comments
Open

Potential Memory Leak #45

irond13 opened this issue Dec 7, 2015 · 5 comments

Comments

@irond13
Copy link

irond13 commented Dec 7, 2015

We've been trying to track down a memory leak which has been killing our prod boxes. Eventually, with the help of the heapdump module, we've been led to believe that it has something to do with either trycatch, or the way we use it. See attached screenshot of profile.

memoryleak

The doCatch function is the closure we pass as the catchFn to trycatch, and onError and onMessageHandled are functions referenced from doCatch.

According to the profile, it seems that these functions can't get cleaned up due to persistent references from inside trycatch, in particular via the _trycatchOnError function. This function in turn creates a reference from the child domain to the parent domain. It does this via

d.on('error', _trycatchOnError)
and
handleException(err, parentDomain, catchFn)
. Seemingly this reference from child to parent gets cascaded and prevents the whole chain from being cleaned up.

To test this theory, I hacked trycatch locally to pass a callback to the tryFn (incidentally the catchFn doesn't need to get invoked to create to the leak). This callback would then remove the listener added at

d.on('error', _trycatchOnError)
which breaks the chain and gets rid of the leak.

I'm not sure whether this is a feasible route to go down though, as handling the error case, where the catchFn actually needs to be called, which itself could fail, would be challenging.

What is also confusing is that we do use trycatch elsewhere and I haven't seemed to notice any leaks there, though that could be due to us using it at much larger scale in this use case, which makes the leaks visible.

@CrabDude
Copy link
Owner

Oh fun. First, thanks for all the work you've done here. trycatch is complex, and memory leaks even more so.

So, this is basically an asynchronous stack overflow, and so isn't a memory-leak in the traditional sense, so much as increased memory retention/overhead from retained parent domains while child domains remain. Normally, this is not an issue, since mostly I use trycatch in a request oriented context (e.g., http 500s), and so inevitably, the child domains are released and the parent domains along with it.

However, it would cause a memory-leak equivalent in circumstances where you had an indefinitely growing asynchronous call stack (A calls B calls C calls D ... ad infinitum), which I suspect is what you're seeing. This is analogous to V8 historically limiting a call-stack to 200 frames for memory reasons.

So..... Since catchFn is run in the parent domain for nesting support, your "fix" (which I suspect does address the memory leak), will break catch support when errors are asynchronously thrown from catchFn while nesting trycatch calls.

This is all stupid in the sense that domain is deprecated and should really just be torn out, despite it theoretically providing better resource cleanup (probably no longer true since domains have been deprecated for a while) and some EventEmitter conveniences. Domains can be replaced with an object that stores the state of the current synchronous stack and the associated catchFn.

Not sure when I'll have a chance to get to this since it may be a decent amount of work. Then again, since we'd just be removing domains with some shallow EE instance, it may potentially be very easy.

TLDR; Domains should be removed. Your hack solves your issue, but breaks catching of async catchFn errors when nesting.

@CrabDude
Copy link
Owner

I'm honestly trying to think of a short-term solution for you, but there really isn't one. Your fix is the your best option since the async catch nesting isn't necessarily a big concern for you.

Also, domains could be kept in a WeakMap, on the catchFn / EE state object, so that the async error nesting would work, but allow domains to be GCd. This is maybe a more attractive solution, though I'm not sure it's any easier and so doesn't help you at the moment.

@CrabDude
Copy link
Owner

Looks like WeakMap is supported without a flag in v0.12, which is promising, because it precludes the need for the tradeoff (removing domains before an alternative exists).

@irond13
Copy link
Author

irond13 commented Dec 17, 2015

Thanks a lot for the feedback!

Your mention of WeakMap lead me to try this: irond13@14d364d. This didn't have the desired effect - it just shifted the memory usage elsewhere, the net usage remaining the same. It turns out that the semantics of WeakMap are that the keys are weakly referenced and not the values, contrary to my initial assumption. In light of this, the observed behaviour makes complete sense, and I suspect makes a solution using WeakMap impossible. If we had something like Java's WeakReference, then I could foresee a solution, but it seems that there is no such thing in the pure JS world at the moment. https://github.com/TooTallNate/node-weak could potentially work (I haven't tried it), but it feels too heavy for my liking.

I then turned my attention to what was arguably the actual source of this issue - the potentially infinite nesting of trycatch calls. You were correct in your assumption that this is what was happening (the Retainers pane in the heapdump I posted in the OP actually shows this pretty clearly - I just couldn't grok it at the time :)). In a nutshell what we had was the following (this is a queue listener of sorts):

function onError(err){/* handle error */}
var initiateMessageProcessing = function(){
  trycatch(function doTry(){
      fetchMessage(function messageReceived(err, msg){
        initiateMessageProcessing(); //Start processing/fetching of next message
        if (err) return onError(err);       
        //do actual (async) message processing here
      });
    }, 
    function doCatch(err){
      onError();
    }
  );
};
initiateMessageProcessing();

Looking at this, the infinite trycatch nesting seems obvious. Interestingly, this doesn't actually result in an infinite domain stack (I debugged domain.enter and the domain stack size never grew greater than around 5, mirroring the max concurrent - not cumulative - instances of doTry running at any one time).

In this case, the nesting of trycatch calls is just a consequence of the 'recursive' initiateMessageProcessing call - it isn't the intent (actually the opposite - ideally each initiateMessageProcessing run should be treated as an independant 'request'). With that in mind, I first tried to remove the nesting of domains by explicitly exiting and then re-entering the active domain around the inner initiateMessageProcessing call, e.g.

//Inside messageReceived
var activeDomain = domain.active;
activeDomain && activeDomain.exit();
initiateMessageProcessing()
activeDomain && activeDomain.enter();

That didn't work, because as it turned out, each time that code was hit, the active domain also happened to be the next domain on the stack, so calling activeDomain.exit() left domain.active unchanged. This happened consistently. I tracked this down to internal node code (inside the processing of the next tick) which called domain.enter() on a domain without checking if it was already active (trycatch correctly does this check, but it had already entered the domain by this stage).

My final port of call was to create my own root domain and bind initiateMessageProcessing to that, as follows

var rootDomain = domain.active || domain.create();
var initiateMessageProcessing = function(){/* as before */};
initiateMessageProcessing = rootDomain.bind(initiateMessageProcessing);
initiateMessageProcessing();

Fortunately, this worked! I'm relatively satisfied with this solution so am I'm going to go with it for now.

So, in closing, since it turned out that it was actually my code which caused the nesting of trycatch I'm going to close this issue. I'm not entirely satisfied with my solution to this particular use case though, as it relies on implementation details of trycatch (it was quite nice not having to know that trycatch uses domains under the hood). If for some reason an alternative to domains does materialise some day, I'll be more than happy to update my workaround - after this expedition I'm in agreement with those who say that domains should make an exit from Node sooner rather than later (excuse the pun!).

@irond13 irond13 closed this as completed Dec 17, 2015
@CrabDude
Copy link
Owner

I appreciate you working with me to reach a solution. You're a model user / contributor. ;)

I think there's definitely a solution here using WeakMap (I know you concluded the opposite), that would allow the parent domains to close, while maintaining a reference to the catch callback. This would completely eliminate all memory growth except the catch callback function, which would be relatively negligible. With long-stack-traces on, it would also keep a ~100B reference to a call stack string. Both would add up for long running processes.

There still needs to be a way to avoid these slow leaks, so I'm going to reopen this. Your hack solution inspires me to add a nesting opt-out, which I had never considered:

trycatch.noNest(tryFn, catchFn)
// or...
trycatch.root(tryFn, catchFn)

This will be relatively easy to add, as it would just skip the check for a currently active domain. Thanks!

@CrabDude CrabDude reopened this Dec 22, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants