Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workaround for Mutex::Lock not being signal-safe #24

Merged
merged 3 commits into from
Feb 1, 2018

Conversation

AndreasMadsen
Copy link
Contributor

@AndreasMadsen AndreasMadsen commented Jan 31, 2018

@mafintosh
Copy link
Contributor

LGTM

@AlanSl
Copy link
Contributor

AlanSl commented Jan 31, 2018

Is this regarding https://github.com/nearform/node-clinic-bubbleprof/issues/22 ? If it's not and I've misunderstood the intention, ignore this comment :-)

If so, testing it against the issues there (node 9.4.0 and Windows 10):

  • After the 'make it less visible commit', behaviour is unfortunately still the same as described in that issue
  • Before that commit, first commit only:
    • On low intensity runs that don't reach the buffer flush limit, the log file is now created in time, but the data isn't written before the rename and I get the same single empty cluster as before (screenshot below)
    • For a run that exceeds the trace buffer, flushes, writes a log file, then has additional requests that don't cause another flush, the problem is the same: the renamed traceevent file is still exactly the same size as the log file was before the additional requests came in

Screenshot of example empty cluster node expanded in dev tools:

@AlanSl
Copy link
Contributor

AlanSl commented Jan 31, 2018

Also in my testing it seems like the trace buffer is properly flushed after the process exit triggers StopTracingAgent - the problem is, our 'exit' callback doesn't wait for everything in the inner process exit to have completed (in particular, the write to the log file).

This is why the setTimeout hack works - the trace buffer is flushing, and we do get the buffer data if we wait for it. The problem is how can we properly wait for it.

@AndreasMadsen
Copy link
Contributor Author

AndreasMadsen commented Jan 31, 2018

After the 'make it less visible commit', behaviour is unfortunately still the same as described in that issue

What case are you running?

On low intensity runs that don't reach the buffer flush limit, the log file is now created in time, but the data isn't written before the rename and I get the same single empty cluster as before (screenshot below)

Did you verify this by removing the rename code? It makes no sense at all. My guess is that you are assuming the data is written but actually it isn't. Rename or not.

For a run that exceeds the trace buffer, flushes, writes a log file, then has additional requests that don't cause another flush, the problem is the same: the renamed traceevent file is still exactly the same size as the log file was before the additional requests came in

Yes, this is the flushing issue caused by "Mutex::Lock not being signal-safe".

Also in my testing it seems like the trace buffer is properly flushed after the process exit triggers StopTracingAgent - the problem is, our callback on the outer process exit doesn't wait for everything in the inner process exit to have completed (in particular, the write to the log file).

I seriously doubt that, because we wait until the process exits (meaning no code can be executed). It could be that on Windows the trace_events thread somehow continues to live after the parent process has exited and we are then giving the trace_events thread some time to finish too. However, the root problem here is a that the host process exists too early, not that we are calling rename too early.

If this is the case setTimeout() should only "fix" the issue on Windows, as Windows has a different policy for subprocess detachment.

@AlanSl
Copy link
Contributor

AlanSl commented Jan 31, 2018

Test cases:

  • Buffer not exceeded, no data: node-clinic-examples/slow-io and autocannon -c 4 -d 5 http://localhost:3000
  • Data not copied from buffer before rename: node-clinic-examples/slow-io and autocannon -c 28 -d 5 http://localhost:3000, note the log file size, autocannon -c 2 -d 5 http://localhost:3000, wait, then ctrl-c the process

I'll run the same tests in my Ubuntu VM, see if it's an OS thing.

@AlanSl
Copy link
Contributor

AlanSl commented Jan 31, 2018

Works perfectly on my Ubuntu VM.

So, LGTM and I can work around the issue still existing in Windows for now.

  • Buffer not exceeded, no data: Fixed, generates profiles with normal looking data even with just one request
  • Data not copied from buffer before rename: Fixed, traceevents file size is larger than log file was if additional requests came in without causing a buffer flush

@AndreasMadsen
Copy link
Contributor Author

Thanks. Then it properly have something to do with signals on windows ( or lack of that ;) )

@AlanSl
Copy link
Contributor

AlanSl commented Jan 31, 2018

Based on your suggestion of removing the rename code I ran some quick tests in Windows 10 / Node 9.4.0 with the fs.rename in index.js commented out, in the no-buffer-flush case (node-clinic-examples/slow-io and C:\Users\Alan>autocannon -c 4 -d 5 http://localhost:3000):

1: On both branches (sigint-hack and Master) with setTimeout:

  • a log file of around the expected size is created (2,776 kb and 2,948 kb)

2: On both branches (sigint-hack and Master) without setTimeout:

  • no log file is created for either branch

3: With only the first commit from this branch applied and without setTimeout:

  • a log file is created but no data is written to it

4: On this branch (sigint-hack) with proc.kill('SIGINT') commented out completely:

  • a log file of around the expected size is created (2,264 kb)
  • if the rename isn't commented out, the profile generates as normal, with normal-looking clusters

So shall I add a commit to this branch like below (will make this workaround fix both Windows and Linux/Mac)? Or as a seperate PR after this one is merged?

    // relay SIGINT to process unless this is Windows
    /* istanbul ignore else: windows hack */
    if (os.platform() !== 'win32') {
      process.once('SIGINT', () => proc.kill('SIGINT'))
    }

[EDIT] Curiously, that change causes nothing to happen at all to happen in Windows, but this below works perfectly. No idea why having an event handler that doesn't do anything would make all the difference (maybe the handler intercepts it from passing through automatically):

    // relay SIGINT to process unless this is Windows
    process.once('SIGINT', () => {
      /* istanbul ignore else: windows hack */
      if (os.platform() !== 'win32') {
        proc.kill('SIGINT')
      }
    })

@AndreasMadsen
Copy link
Contributor Author

AndreasMadsen commented Jan 31, 2018

4: On this branch (sigint-hack) with proc.kill('SIGINT') commented out completely.

Right. We forgot to apply that a fix for that in bubbleprof too. I added a commit for that.

@AndreasMadsen AndreasMadsen merged commit bd79029 into master Feb 1, 2018
@AlanSl AlanSl deleted the sigint-hack branch May 4, 2018 12:19
@AlanSl AlanSl restored the sigint-hack branch May 4, 2018 12:19
@AlanSl AlanSl deleted the sigint-hack branch May 4, 2018 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants