Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TypeError: Cannot read property 'enter' of undefined #30122

Closed
emhagman opened this issue Oct 25, 2019 · 39 comments
Closed

TypeError: Cannot read property 'enter' of undefined #30122

emhagman opened this issue Oct 25, 2019 · 39 comments
Labels
domain Issues and PRs related to the domain subsystem.

Comments

@emhagman
Copy link

emhagman commented Oct 25, 2019

Version: 12.13.0
Platform: Alpine 3.9 (Linux x86_64) on AWS ECS
Subsystem: domain

TypeError: Cannot read property 'enter' of undefined
    at AsyncHook.before (domain.js:76:20)
    at emitHook (internal/async_hooks.js:164:38)

We use @sentry/node to log our errors in express, which is using the domain module here:
https://github.com/getsentry/sentry-javascript/blob/master/packages/node/src/handlers.ts#L271

So far it is extremely hard to reproduce but if we run our production server for around 10 minutes, our application will crash with the above error and no other stack trace. At first, we thought it was related to #28275 but it appears to be different as it looks like the fix for this landed around 12.8.0 and we can confirm that this happens in 12.13.0 still.

Versions 12.2.0 or below do not have this issue.

I believe the actual line is 78, not 76 which can be seen here:
https://github.com/nodejs/node/blob/master/lib/domain.js#L78

EDIT: Thanks @richardlau, line 79 in v12.13.0 https://github.com/nodejs/node/blob/v12.13.0/lib/domain.js#L79

@richardlau
Copy link
Member

In 12.13.0 the line would appear to be 79:
https://github.com/nodejs/node/blob/v12.13.0/lib/domain.js#L79

@addaleax addaleax added the domain Issues and PRs related to the domain subsystem. label Oct 25, 2019
@addaleax
Copy link
Member

@emhagman Is that all the stack trace you get? Is there any chance you could figure out what type of resource the hook is called for?

@emhagman
Copy link
Author

emhagman commented Oct 26, 2019

@addaleax This is the only stack trace that shows up, unfortunately. I can try to add a lot more logging and get some more context around when it happens but that is about it. Can I run node with any useful DEBUG= or command-line arguments that would help you out?

@addaleax
Copy link
Member

addaleax commented Oct 26, 2019

@emhagman Can you dump async hooks output to a file (à la https://github.com/addaleax/node/blob/ee4027d5bbabe87d4446a62ec0b7b60e7af59c30/test.js#L3-L18 except not /dev/stderr) and make sure that hook is installed and active before any code that accesses domains? That might already be very helpful.

@emhagman
Copy link
Author

emhagman commented Oct 26, 2019

@addaleax Sure thing I'll get it to you as soon as I can. Thanks for the help!

@addaleax
Copy link
Member

Did Github delete a comment here? Anyway; just thinking out loud…

The debug log shows that there is no destroy for the async id in question, so the C++ TLSWRAP object should still be alive, and thus keep the domain object alive through its .domain property; but the error message basically says that the domain has been garbage collected and is no longer accessible. The conclusion for me would be that the domain property on the native TLSWrap object is un-set at some point, but I really can’t think of a way that that would happen accidentally.

I’m not really sure how to best debug this further – one thing you could try is running node with --abort-on-uncaught-exception and share the core dump here, if that’s an option for you… that should at least enable verifying whether the property is set at time of the crash or not.

@emhagman
Copy link
Author

@addaleax Sorry, I deleted the comment because I left the resource out of the logs. I was trying to get a log for you that included the resource but I currently can't make the app crash again, it really is hard to reproduce. I am doing these tests in one of our staging environments so adding --abort-on-uncaught-exception won't be a problem.

I'm not sure if it helps but we can anecdotally confirm (other users in that thread) that not using the @sentry/node middleware removes the issue. I don't know if you saw the way they're using domain and got any insight. They use it throughout the @sentry/* ecosystem for Node.

@addaleax
Copy link
Member

@emhagman I’m not personally familiar with their code, although you can of course also report this as an issue to them if you haven’t already (and ideally add a link this one).

@emhagman
Copy link
Author

@addaleax I have and they have deemed it an issue with Node, unfortunately. Since this has to do with GC, are there any ways to force a situation where this is more likely to occur? As of now, I have no idea how to reproduce it other than to use our app until it breaks

@addaleax
Copy link
Member

@emhagman You could set the value of the --gc-interval=n flag to a low value, if you want to cause more frequent GC; that might help?

@emhagman
Copy link
Author

emhagman commented Oct 26, 2019

@addaleax I have the logs with the resources. Adding --gc-interval=5 sped up the error occurring, thanks for the tip.

ASYNC_HOOK_DEBUG: init {
  id: 47259,
  type: 'TLSWRAP',
  triggerId: 47235,
  resource: ReusedHandle {
    type: 43,
    handle: TLSWrap {
      _parent: [TCP],
      _parentWrap: undefined,
      _secureContext: [SecureContext],
      reading: true,
      onhandshakestart: [Function: noop],
      onhandshakedone: [Function],
      onocspresponse: [Function: onocspresponse],
      onnewsession: [Function: onnewsessionclient],
      onkeylog: [Function: onkeylogclient],
      onerror: [Function: onerror],
      [Symbol(owner)]: [TLSSocket]
    }
  }
}

ASYNC_HOOK_DEBUG: before { id: 47259 }
TypeError: Cannot read property 'enter' of undefined
    at AsyncHook.before (domain.js:79:20)
    at emitHook (internal/async_hooks.js:164:38)
Aborted (core dumped)

Not sure if that helps. Working on getting the dump from my container now

EDIT: We use ECS Fargate on AWS so that will be impossible. I'll have to try and reproduce this locally so I can get access to the dump.

@addaleax
Copy link
Member

@emhagman Alright, this makes a bit more sense now. The .domain property isn’t actually set on the real resource, it’s set on a fake resource object used by the HTTP agent that can be garbage collected independently.

I’m assuming that 3d9d1ad is responsible for this … any chance you could verify that the bug was introduced in Node v12.3.0?

Also, if it helps with reproducing: If i’m correct, this will mostly happen when the https.Agent re-uses sockets for multiple requests to the same host.

@emhagman
Copy link
Author

emhagman commented Oct 26, 2019

@addaleax I believe others have already reproduced this bug in 12.3.0 and mentioned reverting to 12.2.0 fixes it in the Sentry issue thread. If I can make this happen on demand I'll be able to be positive about that.

What you said about the https agent makes perfect sense, we have a client that uses the same https agent with keepAlive on to the same host to try to save on connections to one of our internal services. That definitely helps with me getting closer to a reproducible bug. I'll keep digging, thanks again for the help.

@addaleax
Copy link
Member

@emhagman If you’re in a position to try out patches to Node.js, could you try this one?

diff --git a/lib/_http_agent.js b/lib/_http_agent.js
index dcb5ed376de8..f4d9bb7fe36a 100644
--- a/lib/_http_agent.js
+++ b/lib/_http_agent.js
@@ -44,10 +44,15 @@ const {
 // ClientRequest.onSocket(). The Agent is now *strictly*
 // concerned with managing a connection pool.
 
+const kReusedHandle = Symbol('kReusedHandle');
+
 class ReusedHandle {
   constructor(type, handle) {
     this.type = type;
     this.handle = handle;
+    // Tie the lifetime of the two objects together, mostly for the 'domain'
+    // module.
+    handle[kReusedHandle] = this;
   }
 }
 

@emhagman
Copy link
Author

As soon as I can reproduce it easily I'll try the patch 👍

It appears the library in question also uses httpsAgent with keepAlive: true so I think you're onto something! https://github.com/getsentry/sentry-javascript/blob/edd08f9fa4e4c387ed9c3ba2500119c78e67be15/packages/raven-node/lib/transports.js#L10

@jmendiara

This comment has been minimized.

@addaleax

This comment has been minimized.

@emhagman
Copy link
Author

emhagman commented Oct 27, 2019

@addaleax I am still trying to reproduce it but I noticed I linked to old code. They currently use https://github.com/getsentry/sentry-javascript/blob/master/packages/node/src/transports/https.ts#L18 at Sentry. The keepAlive: false seems interesting as well as the timeout: 2000.

Does that help you at all?

@simllll
Copy link
Contributor

simllll commented Oct 27, 2019

I'm following this issue as well as I'm really interested in solving it, just one thing about the keepalive flag and the httpsagent. Ist the agent also responsible for regular requests in express? I'm asking because the domain module is used in the Middleware of express, and as far as I understand the sentry client is using the http agent also for transferring the results back the the sentry backend, but in this case the domain is not that relevant. It is in the middleware though ... Assuming this is correctl, the "right way" of reproducing the issue would be sending parallel requests to your server so that the agent gets reused after some calls.

Not sure though,just my 2 cents.

@emhagman
Copy link
Author

emhagman commented Oct 27, 2019

@simllll That particular agent is not responsible for regular requests in express. Correct me if I am wrong @addaleax, but this actually has nothing to do with Sentry using domain in their own codebase but actually has to do with their use of httpsAgent to send requests to their server.

Here is my current code I am using to try to reproduce the issue. Using --gc-interval=5 and still can't seem to get the error.

const https = require('https');
const fs = require('fs');
const util = require('util');
const crypto = require('crypto');
const async_hooks = require('async_hooks');

// async_hooks debug dump code...
function debug(...args) {
    fs.writeFileSync('/dev/stdout', `ASYNC_HOOK_DEBUG: ${util.format(...args)}\n`, { flag: 'a' });
}

async_hooks.createHook({
    init(id, type, triggerId, resource) {
        debug('init', { id, type, triggerId, resource });
    },
    before(id) {
        debug('before', { id });
    },
    after(id) {
        debug('after', { id });
    },
    destroy(id) {
        debug('destroy', { id });
    },
}).enable();

const httpsAgent = new https.Agent({ keepAlive: false, maxSockets: 5, timeout: 2000 });
const httpsOpts = {
    protocol: 'https:',
    method:   'POST',
    hostname: 'postman-echo.com',
    path:     '/post',
    port:     443,
    agent:    httpsAgent
};

function imitateSentrySend(httpModule, event) {
    return new Promise((resolve, reject) => {
        const opts = Object.assign(httpsOpts, event.options);
        const req = https.request(opts, function(res) {
            res.setEncoding('utf8');
            if (res.statusCode >= 200 && res.statusCode < 300) {
                resolve({ status: res.statusCode });
            } else {
                reject(new Error(`Request failed with statusCode: ${res.statusCode}`));
            }
            res.on('data', () => {});
            res.on('end', () => {});
        });
        req.on('error', reject);
        req.end(JSON.stringify(event));
    });
}

(async function main() {
    while (true) {
        const id = crypto.randomBytes(8).toString('hex');
        const event = { id, msg: 'Some random error has occurred, sending to Sentry...' };
        const fakeArray = Array(Math.ceil(Math.random() * 4)).fill(Math.random());
        const promises = fakeArray.map(value => {
            const delay = Math.round(Math.random() * 2);
            const opts =  delay ? { path: '/delay/' + delay, method: 'GET' } : {};
            return imitateSentrySend(https, { ...event, value, options: opts });
        });
        await Promise.all(promises);
    }
})();

@addaleax
Copy link
Member

Correct me if I am wrong @addaleax, but this actually has nothing to do with Sentry using domain in their own codebase but actually has to do with their use of httpsAgent to send requests to their server.

@emhagman I think it’s the combination of both that’s making this problematic.

Here is my current code I am using to try to reproduce the issue. Using --gc-interval=5 and still can't seem to get the error.

Yeah, I don’t think this is going to crash the way it does in your production code without using domains…

I'm following this issue as well as I'm really interested in solving it, just one thing about the keepalive flag and the httpsagent. Ist the agent also responsible for regular requests in express?

@simllll I don’t know for sure, but I wouldn’t think that it is being used for those.

@emhagman
Copy link
Author

emhagman commented Oct 27, 2019

@addaleax Gotcha. Do I need to bind anything in particular or just be running this inside my own domain?

const debugDomain = domain.create();
debugDomain.on('error', err => {
    console.log('Caught error in domain', err);
});

(async function main() {
    debugDomain.run(async () => {
        const id = crypto.randomBytes(8).toString('hex');
        const event = { id, msg: 'Some random error has occurred, sending to Sentry...' };
        const delay = Math.round(Math.random() * 4);
        const opts = delay ? { path: '/delay/' + delay, method: 'GET' } : {};
        await imitateSentrySend(https, { ...event, options: opts });
        await main(); // run-again
    });
})();

Sorry if I seem ignorant, I have never really worked with domains directly before. My basic understanding is they're used to hold context and to prevent errors from bubbling up to the main event loop.

@addaleax
Copy link
Member

Do I need to bind anything in particular or just be running this inside my own domain?

@emhagman I would think that that’s enough but that doesn’t really mean that I’m sure that it’ll work as a reproduction. (I’ll try to put one together myself, too – likely tomorrow though.)

My basic understanding is they're used to hold context and to prevent errors from bubbling up to the main event loop.

Yeah, that was kind of the original idea behind them when they were introduced.

@simllll
Copy link
Contributor

simllll commented Oct 27, 2019

@emhagman I'm only on my phone, but my thoughts would be to send the error (your imitatesentrysend) in the on error handler of the domain, and inside the run async method just throw a lot of errors with some context (e.g. create objects that you attach to the error object). They should get picked up by the error handler, where the agent will try to send the data somewhere. I also would think of a way to make the transfer to the backend as slow as possible so that gc can kick in while sending the data. Maybe try attach some megabytes of Data to the error object so that it will take a while ;)

@emhagman
Copy link
Author

emhagman commented Oct 27, 2019

@simllll Thanks for the tips. I am probably going to take a break for a bit. This is what I currently have if anyone wants to take a stab at it. I've set it up exactly as Sentry, with the request middleware and the error middleware and used a local domain just like they do in their request middleware. It then "sends the errors" to the echo service to imitate Sentry.

I've used the same maxSockets value, timeout value, and keepAlive value that they do in their code.

With all of this, and using --gc-interval=5, I still can't seem to reproduce the error. @addaleax If I can get a core dump to prove it, would that be enough? It may be painful but I can at least eventually reproduce it in our production environment and I can figure a way out to get the core dump.

Run the app with node --gc-interval=5 app.js and then run node request.js to test it out.

app.js

const express = require('express');
const https = require('https');
const fs = require('fs');
const util = require('util');
const crypto = require('crypto');
const async_hooks = require('async_hooks');
const domain = require('domain');

// async_hooks debug dump code...
function debug(...args) {
    fs.writeFileSync('/dev/stdout', `ASYNC_HOOK_DEBUG: ${util.format(...args)}\n`, { flag: 'a' });
}

async_hooks.createHook({
    init(id, type, triggerId, resource) {
        debug('init', { id, type, triggerId, resource });
    },
    before(id) {
        debug('before', { id });
    },
    after(id) {
        debug('after', { id });
    },
    destroy(id) {
        debug('destroy', { id });
    },
}).enable();

const httpsAgent = new https.Agent({ keepAlive: false, maxSockets: 30, timeout: 2000 });
const httpsOpts = {
    protocol: 'https:',
    method:   'POST',
    hostname: 'postman-echo.com',
    path:     '/post',
    port:     443,
    agent:    httpsAgent
};

function imitateSentrySend(httpModule, event) {
    return new Promise((resolve, reject) => {
        const opts = Object.assign(httpsOpts, event.options);
        const req = https.request(opts, function(res) {
            res.setEncoding('utf8');
            if (res.statusCode >= 200 && res.statusCode < 300) {
                resolve({ status: res.statusCode });
            } else {
                reject(new Error(`Request failed with statusCode: ${res.statusCode}`));
            }
            res.on('data', () => {});
            res.on('end', () => {});
        });
        req.on('error', reject);
        req.end(JSON.stringify(event));
    });
}

const app = express();

// Fake Sentry RequestHandler (binds domain)
app.use(function(req, res, next) {
    const local = domain.create();
    local.bind(req);
    local.bind(res);
    local.on('error', next);
    local.run(() => {
        next();
    });
});

// Normal express route
app.get('/debug-sentry', function(req, res) {
    const error = new Error('Random error');
    error.randomData = crypto.randomBytes(1024 * 1024); // 1mb
    throw error;
});

// Fake ErrorHandler, always sending event
app.use(function(err, req, res, next) {
    if (err) {
        const id = crypto.randomBytes(8).toString('hex');
        const event = { id, msg: 'Some random error has occurred, sending to Sentry...' };
        imitateSentrySend(https, { ...event, data: err.randomData }); // they don't await in their code either
        next(err);
    } else {
        next();
    }
});

app.listen(9999);

request.js

const http = require('http');

function r() {
    return new Promise((resolve, reject) => {
        const req = http.get('http://localhost:9999/debug-sentry', (resp) => resolve(resp));
        req.on('error', reject);
    });
}

(async function main() {
    while (true) {
        const promises = Array(100).fill(0).map(() => r());
        try {
            await Promise.all(promises);
        } catch (err) {
            console.log('Failed to resolve promises', err);
        }
    }
})();

@addaleax
Copy link
Member

@addaleax If I can get a core dump to prove it, would that be enough? It may be painful but I can at least eventually reproduce it in our production environment and I can figure a way out to get the core dump.

Fwiw, I’m somewhat optimistic about being able to put together a reproduction myself, and then using this as a regression test. In the worst case, we could still apply something like the patch above and hope for the best, although it would of course be nice to actually confirm that it does fix a real issue.

@emhagman
Copy link
Author

emhagman commented Oct 29, 2019

@addaleax Sounds good! I didn't know if you needed definitive proof to able to put it into the codebase which is why I was asking. I know I keep saying it but thanks for the help!

This is the only thing preventing us from upgrading to the higher number 12.X and we'd love to be able to use the cgroups memory setting that landed in 12.7.0 https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#2019-07-23-version-1270-current-targos since we use Docker extensively and also take advantage of the startup time improvements from the V8 snapshots.

@emhagman
Copy link
Author

@saniagh Ran into this issue without Sentry. Of note, we both use node:alpine on Docker.

@saniagh
Copy link

saniagh commented Oct 30, 2019

@saniagh Ran into this issue without Sentry. Of note, we both use node:alpine on Docker.

Indeed, I've also checked to see if any module uses the https.Agent as @emhagman suggested and I have found none.

I've upgraded our version of node to v13.0.1 in development to try to reproduce the bug. ( I chose to match our production version )

I'll be back if I get something.

addaleax added a commit to addaleax/node that referenced this issue Oct 31, 2019
If `asyncReset()` is used to specify an alternative resource object
to mark a re-used socket in the HTTP Agent implementation,
store that object and keep it alive, because domains rely on GC tracking
for resource objects to manage their own lifetimes, and previously that
resource object might have been garbage-collected too early, leading to
crashes.

Fixes: nodejs#30122
@addaleax
Copy link
Member

Alright! Put together a fix and a regression test: #30196

@emhagman Thanks for the bug report and working this out with me!

@emhagman
Copy link
Author

emhagman commented Oct 31, 2019

@addaleax Thanks so much 💯 I appreciate you taking the time to look into and fix it. Your test code is very helpful in learning how to debug/reproduce things like this in the future. Thanks!

EDIT: Any idea on how long this will take to get cut into a release for 12.X?

@addaleax
Copy link
Member

EDIT: Any idea on how long this will take to get cut into a release for 12.X?

@emhagman So, typically the way this works is that the PR takes at least 48 hours to land unless explicitly being fast-tracked (I don’t think this qualifies as a trivial change), plus the time to the next Current/13.x release (Tuesday according to nodejs/Release#487); and then the LTS rules says that commits need to have been released two weeks before being backported into LTS, but if there’s a strong case for doing so sooner, that’s usually not an issue for a low-risk patch like this one.

@emhagman
Copy link
Author

emhagman commented Nov 1, 2019

@addaleax Got it, thanks 👍

@kamilogorek
Copy link

Awesome job @addaleax, thank you! :)

MylesBorins pushed a commit that referenced this issue Nov 17, 2019
If `asyncReset()` is used to specify an alternative resource object
to mark a re-used socket in the HTTP Agent implementation,
store that object and keep it alive, because domains rely on GC tracking
for resource objects to manage their own lifetimes, and previously that
resource object might have been garbage-collected too early, leading to
crashes.

Fixes: #30122

PR-URL: #30196
Reviewed-By: Vladimir de Turckheim <vlad2t@hotmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
targos pushed a commit that referenced this issue Dec 1, 2019
If `asyncReset()` is used to specify an alternative resource object
to mark a re-used socket in the HTTP Agent implementation,
store that object and keep it alive, because domains rely on GC tracking
for resource objects to manage their own lifetimes, and previously that
resource object might have been garbage-collected too early, leading to
crashes.

Fixes: #30122

PR-URL: #30196
Reviewed-By: Vladimir de Turckheim <vlad2t@hotmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
MylesBorins pushed a commit that referenced this issue Dec 17, 2019
If `asyncReset()` is used to specify an alternative resource object
to mark a re-used socket in the HTTP Agent implementation,
store that object and keep it alive, because domains rely on GC tracking
for resource objects to manage their own lifetimes, and previously that
resource object might have been garbage-collected too early, leading to
crashes.

Fixes: #30122

PR-URL: #30196
Reviewed-By: Vladimir de Turckheim <vlad2t@hotmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
@derom
Copy link

derom commented Feb 21, 2020

We still see the issue with node:12.14.1 (and above), @sentry/node and express

TypeError: Cannot read property 'enter' of undefined
    at AsyncHook.before (domain.js:78:20)
    at emitHook (internal/async_hooks.js:168:38)

@pelzerim
Copy link

We still see this issue.

I don't understand enough about the node garbage collection, but this does not seem to work.

Keeping a strong reference myself to the domain makes the problem go away:

const pairing = new Map()

asyncHooks.createHook({
  init(id) {
    if (process.domain !== null && process.domain !== undefined) {
      pairing.set(id, process.domain)
    }
  },
  destroy(asyncId) {
    pairing.delete(asyncId)
  }
}).enable()

@Langstra
Copy link

I am also still seeing this issue. However I cannot always reproduce this result. Maybe it is a race condition or something? It occurs here: https://github.com/nodejs/node/blob/v14.18.1/lib/domain.js#L97

TypeError: Cannot read property 'enter' of undefined
    at AsyncHook.before (domain.js:97:20)
    at emitHook (internal/async_hooks.js:235:38)
    at emitBeforeScript (internal/async_hooks.js:501:5)
    at promiseBeforeHook (internal/async_hooks.js:345:3)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)

@lobsterkatie
Copy link

We ran into this in node 16 in our CI the other day:
image
https://github.com/getsentry/sentry-javascript/pull/4061/checks?check_run_id=3954472061#step:6:688

Specifically, it was in 16.12.0. Couldn't repro it locally, and restarting the job a few times fixed it, so agree that it's flaky.

@cemerick
Copy link

Can confirm that this or something like it remains in play. Same stack trace as those newer reports above, under both v17.1.0 and v16.13.0:

TypeError: Cannot read properties of undefined (reading 'enter')
    at AsyncHook.before (node:domain:97:20)
    at emitHook (node:internal/async_hooks:237:38)
    at emitBeforeScript (node:internal/async_hooks:505:5)
    at promiseBeforeHook (node:internal/async_hooks:347:3)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

However, running the same script under v12.22.7 does work as expected. I can very reliably reproduce the problem on newer node versions, so that's something.

I suppose I should open a new issue entirely......?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain Issues and PRs related to the domain subsystem.
Projects
None yet
Development

Successfully merging a pull request may close this issue.