workaround for Mutex::Lock not being signal-safe #24

AndreasMadsen · 2018-01-31T11:22:56Z

mafintosh · 2018-01-31T11:24:40Z

LGTM

AlanSl · 2018-01-31T11:49:44Z

Is this regarding https://github.com/nearform/node-clinic-bubbleprof/issues/22 ? If it's not and I've misunderstood the intention, ignore this comment :-)

If so, testing it against the issues there (node 9.4.0 and Windows 10):

After the 'make it less visible commit', behaviour is unfortunately still the same as described in that issue
Before that commit, first commit only:
- On low intensity runs that don't reach the buffer flush limit, the log file is now created in time, but the data isn't written before the rename and I get the same single empty cluster as before (screenshot below)
- For a run that exceeds the trace buffer, flushes, writes a log file, then has additional requests that don't cause another flush, the problem is the same: the renamed traceevent file is still exactly the same size as the log file was before the additional requests came in

Screenshot of example empty cluster node expanded in dev tools:

AlanSl · 2018-01-31T11:58:41Z

Also in my testing it seems like the trace buffer is properly flushed after the process exit triggers StopTracingAgent - the problem is, our 'exit' callback doesn't wait for everything in the inner process exit to have completed (in particular, the write to the log file).

This is why the setTimeout hack works - the trace buffer is flushing, and we do get the buffer data if we wait for it. The problem is how can we properly wait for it.

AndreasMadsen · 2018-01-31T12:11:40Z

After the 'make it less visible commit', behaviour is unfortunately still the same as described in that issue

What case are you running?

On low intensity runs that don't reach the buffer flush limit, the log file is now created in time, but the data isn't written before the rename and I get the same single empty cluster as before (screenshot below)

Did you verify this by removing the rename code? It makes no sense at all. My guess is that you are assuming the data is written but actually it isn't. Rename or not.

For a run that exceeds the trace buffer, flushes, writes a log file, then has additional requests that don't cause another flush, the problem is the same: the renamed traceevent file is still exactly the same size as the log file was before the additional requests came in

Yes, this is the flushing issue caused by "Mutex::Lock not being signal-safe".

Also in my testing it seems like the trace buffer is properly flushed after the process exit triggers StopTracingAgent - the problem is, our callback on the outer process exit doesn't wait for everything in the inner process exit to have completed (in particular, the write to the log file).

I seriously doubt that, because we wait until the process exits (meaning no code can be executed). It could be that on Windows the trace_events thread somehow continues to live after the parent process has exited and we are then giving the trace_events thread some time to finish too. However, the root problem here is a that the host process exists too early, not that we are calling rename too early.

If this is the case setTimeout() should only "fix" the issue on Windows, as Windows has a different policy for subprocess detachment.

AlanSl · 2018-01-31T12:19:44Z

Test cases:

Buffer not exceeded, no data: node-clinic-examples/slow-io and autocannon -c 4 -d 5 http://localhost:3000
Data not copied from buffer before rename: node-clinic-examples/slow-io and autocannon -c 28 -d 5 http://localhost:3000, note the log file size, autocannon -c 2 -d 5 http://localhost:3000, wait, then ctrl-c the process

I'll run the same tests in my Ubuntu VM, see if it's an OS thing.

AlanSl · 2018-01-31T12:54:03Z

Works perfectly on my Ubuntu VM.

So, LGTM and I can work around the issue still existing in Windows for now.

Buffer not exceeded, no data: Fixed, generates profiles with normal looking data even with just one request
Data not copied from buffer before rename: Fixed, traceevents file size is larger than log file was if additional requests came in without causing a buffer flush

AndreasMadsen · 2018-01-31T13:18:56Z

Thanks. Then it properly have something to do with signals on windows ( or lack of that ;) )

AlanSl · 2018-01-31T13:29:28Z

Based on your suggestion of removing the rename code I ran some quick tests in Windows 10 / Node 9.4.0 with the fs.rename in index.js commented out, in the no-buffer-flush case (node-clinic-examples/slow-io and C:\Users\Alan>autocannon -c 4 -d 5 http://localhost:3000):

1: On both branches (sigint-hack and Master) with setTimeout:

a log file of around the expected size is created (2,776 kb and 2,948 kb)

2: On both branches (sigint-hack and Master) without setTimeout:

no log file is created for either branch

3: With only the first commit from this branch applied and without setTimeout:

a log file is created but no data is written to it

4: On this branch (sigint-hack) with proc.kill('SIGINT') commented out completely:

a log file of around the expected size is created (2,264 kb)
if the rename isn't commented out, the profile generates as normal, with normal-looking clusters

So shall I add a commit to this branch like below (will make this workaround fix both Windows and Linux/Mac)? Or as a seperate PR after this one is merged?

    // relay SIGINT to process unless this is Windows
    /* istanbul ignore else: windows hack */
    if (os.platform() !== 'win32') {
      process.once('SIGINT', () => proc.kill('SIGINT'))
    }

[EDIT] Curiously, that change causes nothing to happen at all to happen in Windows, but this below works perfectly. No idea why having an event handler that doesn't do anything would make all the difference (maybe the handler intercepts it from passing through automatically):

    // relay SIGINT to process unless this is Windows
    process.once('SIGINT', () => {
      /* istanbul ignore else: windows hack */
      if (os.platform() !== 'win32') {
        proc.kill('SIGINT')
      }
    })

AndreasMadsen · 2018-01-31T13:48:54Z

4: On this branch (sigint-hack) with proc.kill('SIGINT') commented out completely.

Right. We forgot to apply that a fix for that in bubbleprof too. I added a commit for that.

AndreasMadsen force-pushed the sigint-hack branch from e84a006 to eebf9b5 Compare January 31, 2018 11:56

AndreasMadsen added 2 commits January 31, 2018 14:50

workaround for Mutex::Lock not being signal-safe

f3c8404

rely on Windows Ctrl+C process-group propergation

163d9ba

AndreasMadsen force-pushed the sigint-hack branch from dbc47c8 to 163d9ba Compare January 31, 2018 13:50

Merge branch 'master' into sigint-hack

e153798

AndreasMadsen merged commit bd79029 into master Feb 1, 2018

AlanSl deleted the sigint-hack branch May 4, 2018 12:19

AlanSl restored the sigint-hack branch May 4, 2018 12:19

AlanSl deleted the sigint-hack branch May 4, 2018 12:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workaround for Mutex::Lock not being signal-safe #24

workaround for Mutex::Lock not being signal-safe #24

AndreasMadsen commented Jan 31, 2018 •

edited

Loading

mafintosh commented Jan 31, 2018

AlanSl commented Jan 31, 2018 •

edited

Loading

AlanSl commented Jan 31, 2018 •

edited

Loading

AndreasMadsen commented Jan 31, 2018 •

edited

Loading

AlanSl commented Jan 31, 2018

AlanSl commented Jan 31, 2018 •

edited

Loading

AndreasMadsen commented Jan 31, 2018

AlanSl commented Jan 31, 2018 •

edited

Loading

AndreasMadsen commented Jan 31, 2018 •

edited

Loading

workaround for Mutex::Lock not being signal-safe #24

workaround for Mutex::Lock not being signal-safe #24

Conversation

AndreasMadsen commented Jan 31, 2018 • edited Loading

mafintosh commented Jan 31, 2018

AlanSl commented Jan 31, 2018 • edited Loading

AlanSl commented Jan 31, 2018 • edited Loading

AndreasMadsen commented Jan 31, 2018 • edited Loading

AlanSl commented Jan 31, 2018

AlanSl commented Jan 31, 2018 • edited Loading

AndreasMadsen commented Jan 31, 2018

AlanSl commented Jan 31, 2018 • edited Loading

AndreasMadsen commented Jan 31, 2018 • edited Loading

AndreasMadsen commented Jan 31, 2018 •

edited

Loading

AlanSl commented Jan 31, 2018 •

edited

Loading

AlanSl commented Jan 31, 2018 •

edited

Loading

AndreasMadsen commented Jan 31, 2018 •

edited

Loading

AlanSl commented Jan 31, 2018 •

edited

Loading

AlanSl commented Jan 31, 2018 •

edited

Loading

AndreasMadsen commented Jan 31, 2018 •

edited

Loading