-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
child_process: emit .send() errors in next tick #4277
Conversation
@@ -510,7 +510,7 @@ function setupChannel(target, channel) { | |||
if (typeof callback === 'function') { | |||
process.nextTick(callback, ex); | |||
} else { | |||
this.emit('error', ex); // FIXME(bnoordhuis) Defer to next tick. | |||
process.nextTick(emitErrorNT, this, ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if it's worthwhile to create the separate emitErrorNT function.
This ought to work:
process.nextTick(()=>this.emit('error', ex));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, more explicit 👍 Saw this in a lot of other modules, so thought it was the preferred pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're starting to move in the direction of using arrow functions where possible to improve readability and, in some cases, performance. Either approach would work here but the arrow function just seems more explicit
c52a93b
to
1c56c5d
Compare
Yep. |
@@ -672,7 +672,6 @@ function setupChannel(target, channel) { | |||
return control; | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary whitespace change.
One comment but LGTM otherwise if CI is happy. |
1c56c5d
to
6091ae6
Compare
Updated, thanks for reviewing! CI didn't pass, I couldn't find any hints in the Jenkins console output.. Any other places to look for hints? |
@phillipj Here's one of the errors (from https://ci.nodejs.org/job/node-test-binary-windows/416/RUN_SUBSET=0,VS_VERSION=vs2015,label=win10/tapTestReport/test.tap-15/)
I checked previous CI runs and this wasn't failing in those. So I'm going to assume it's related, but not sure why. |
Thanks @trevnorris. I'll get a win10 machine setup and see if I can reproduce that failure. |
Alright, so luckily no problem in reproducing the windows failure. Other than banging my head against the wall, I haven't made much progress. I've tried to figure out what exactly the Me being a C/C++ rookie in combination with not knowing Visual Studio much, doesn't help either. I've got VS set up building node from source, but I haven't figured out how to run Would appreciate any pointers in the right direction here! E.g. what I should pass as the |
@phillipj ... any luck on this so far? @trevnorris @bnoordhuis ... any ideas? |
6091ae6
to
f57d87e
Compare
@jasnell afraid not. Just pushed an update to clarify what I was trying to do with |
What |
Thanks @saghul! I'll give it a shot |
f57d87e
to
285768d
Compare
Finally got the tests passing locally on windows 🎆 After @saghul's comment and code in libuv hinting about a parent process had to exist in windows to get this to pass, I did a test refactor to actually get valid IPCs created. Long story short, the tests now use Could you do a CI run again, if it looks good @trevnorris? |
285768d
to
6342838
Compare
6342838
to
d5bf283
Compare
Linting errors, that's just embarrassing - sorry about that. Synced against master and fixed style issues against updated rulesset. The ARM test which failed (https://ci.nodejs.org/job/node-test-binary-arm/680/RUN_SUBSET=4,nodes=pi1-raspbian-wheezy/tapTestReport/test.tap-55/) seems to have a flaky history; #4055. @Trott have you seen this test being flaky after your PR got merged? |
@phillipj No recent failures that I'm aware of. I'm going to go back and look at the change that test was written for and see if there's a way to test it that doesn't involve the "if it takes longer than one second, the bug is there" test which is going to be inherently flaky. |
@phillipj Honestly, I think that test isn't actually valuable and should probably just be removed. But I'll take that up in a separate issue and/or PR. |
test-http-exit-delay has a flaky history. Examination of the bug it was written to find suggests that the test should be removed. * The test is trying to find a delay of up to 1 second, but the delay may also be much smaller than that. So the test will not catch the bug all the time. It is therefore flaky when the bug exists. * Experience has shown that the test is flaky as well when the bug is absent in the code because it can sometimes take slower devices (such as the Raspberry Pi devices that we have in continuous integration) more than the allowed one second to run. Increasing the timeout for those devices will make the test pass but will also mean that the test isn't really testing anything because the delay it is trying to catch was a delay of up to one second. I don't think this is an appropriate test to run once on CI. If there is to be a test for the issue in question, it should be a benchmark test that is run a large number of times. We don't really have such tests in CI yet I would argue that this test is actively a problem. It does not reliably catch the issue it is supposed to catch, nor can it likely be made to do so. (To do so would likely require converting it to a benchmarking test as previously described. We don't run those in CI, at least not at this time.) Because this test will have both false positives and false negatives, especially on the slower devices, it contributes to a culture of dismissing failed tests. It does not reliably identify an issue nor does it reliably pass on a working code base. This test should be removed. Ref: nodejs#4277
Thanks for opening a PR deleting the test in lightning speed @Trott 👍 |
test-http-exit-delay has a flaky history. Examination of the bug it was written to find suggests that the test should be removed. * The test is trying to find a delay of up to 1 second, but the delay may also be much smaller than that. So the test will not catch the bug all the time. It is therefore flaky when the bug exists. * Experience has shown that the test is flaky as well when the bug is absent in the code because it can sometimes take slower devices (such as the Raspberry Pi devices that we have in continuous integration) more than the allowed one second to run. Increasing the timeout for those devices will make the test pass but will also mean that the test isn't really testing anything because the delay it is trying to catch was a delay of up to one second. I don't think this is an appropriate test to run once on CI. If there is to be a test for the issue in question, it should be a benchmark test that is run a large number of times. We don't really have such tests in CI yet I would argue that this test is actively a problem. It does not reliably catch the issue it is supposed to catch, nor can it likely be made to do so. (To do so would likely require converting it to a benchmarking test as previously described. We don't run those in CI, at least not at this time.) Because this test will have both false positives and false negatives, especially on the slower devices, it contributes to a culture of dismissing failed tests. It does not reliably identify an issue nor does it reliably pass on a working code base. This test should be removed. Ref: #4277 PR-URL: #4786 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Brian White <mscdex@mscdex.net>
test-http-exit-delay has a flaky history. Examination of the bug it was written to find suggests that the test should be removed. * The test is trying to find a delay of up to 1 second, but the delay may also be much smaller than that. So the test will not catch the bug all the time. It is therefore flaky when the bug exists. * Experience has shown that the test is flaky as well when the bug is absent in the code because it can sometimes take slower devices (such as the Raspberry Pi devices that we have in continuous integration) more than the allowed one second to run. Increasing the timeout for those devices will make the test pass but will also mean that the test isn't really testing anything because the delay it is trying to catch was a delay of up to one second. I don't think this is an appropriate test to run once on CI. If there is to be a test for the issue in question, it should be a benchmark test that is run a large number of times. We don't really have such tests in CI yet I would argue that this test is actively a problem. It does not reliably catch the issue it is supposed to catch, nor can it likely be made to do so. (To do so would likely require converting it to a benchmarking test as previously described. We don't run those in CI, at least not at this time.) Because this test will have both false positives and false negatives, especially on the slower devices, it contributes to a culture of dismissing failed tests. It does not reliably identify an issue nor does it reliably pass on a working code base. This test should be removed. Ref: #4277 PR-URL: #4786 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Brian White <mscdex@mscdex.net>
test-http-exit-delay has a flaky history. Examination of the bug it was written to find suggests that the test should be removed. * The test is trying to find a delay of up to 1 second, but the delay may also be much smaller than that. So the test will not catch the bug all the time. It is therefore flaky when the bug exists. * Experience has shown that the test is flaky as well when the bug is absent in the code because it can sometimes take slower devices (such as the Raspberry Pi devices that we have in continuous integration) more than the allowed one second to run. Increasing the timeout for those devices will make the test pass but will also mean that the test isn't really testing anything because the delay it is trying to catch was a delay of up to one second. I don't think this is an appropriate test to run once on CI. If there is to be a test for the issue in question, it should be a benchmark test that is run a large number of times. We don't really have such tests in CI yet I would argue that this test is actively a problem. It does not reliably catch the issue it is supposed to catch, nor can it likely be made to do so. (To do so would likely require converting it to a benchmarking test as previously described. We don't run those in CI, at least not at this time.) Because this test will have both false positives and false negatives, especially on the slower devices, it contributes to a culture of dismissing failed tests. It does not reliably identify an issue nor does it reliably pass on a working code base. This test should be removed. Ref: #4277 PR-URL: #4786 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Brian White <mscdex@mscdex.net>
What's your best advice on trigging flaky tests @Trott? Finally got a native win10 setup which made things alot easier to work with, but haven't been able to reproduce that previous cluster failure yet. So far I've repeated these two quite alot by hand:
but I'm sure there's better ways of doing it.. |
@phillipj On Windows, I've done two different inelegant techniques:
It may be worthwhile to run a stress test on CI against your branch to confirm that the test is still failing in CI. |
LGTM |
Thanks @Trott! Was not able to reproduce any failures with the script below :/ Would appreciate a CI run to see if it pops out again. @echo off
for /l %%x in (1, 1, 100) do (
Release\node.exe test\parallel\test-cluster-shared-leak.js
echo Run #%%x: %errorlevel%
) |
@phillipj Three stress tests now running on three different varieties of Windows machines: |
@phillipj So far, 2 failures in 590 runs at https://ci.nodejs.org/job/node-stress-single-test/521/nodes=win10-1p/console. Key piece of info that might explain why you're not seeing the issue on your own machine: That's a single-processor host. Same exact thing on a multiprocessor host has 1580 runs so far with 0 failures. Failures look like this:
I'm stress testing the code from your branch in this PR. I suppose I could/should run a stress test against master too. If the issue already exists there, then maybe we mark it as flaky (again, insert heavy sigh here) and move on. If the issue does not exist in master, then OMG, please don't land this. :-) |
@Trott nice! I don't have access to CI cause of the security lockdown If these failures doesn't show up on CI / master, I could ofc try to On Monday, 22 February 2016, Rich Trott notifications@github.com wrote:
|
Now stress testing against master: https://ci.nodejs.org/job/node-stress-single-test/522/nodes=win10-1p/console |
@phillipj 0 failures on master branch so far. 1500 runs and counting. (I set it for 9999 total.) In the meantime, I'm not sure how you might best emulate the stress test in your local environment. @joaocgreis may have some ideas or suggestions. |
6197 runs without a failure on current master. Either this PR makes the test flaky or something odd is going on (like there's an issue on one Azure host that isn't on the another Azure host). |
Sigh, rough nut to crack this one. Will have to assume it's this branch making the test flaky. I'll try to get my hands on a single CPU machine and hopefully get to reproduce the failure. |
@phillipj the best way I found so far was with a virtual machine. You can set the number of processors, and with VirtualBox you can even set a CPU cap, while it's running. I've had a test running in loop and made it fail and pass by setting the cap to 10% and back. I could give you the exact script |
I confirmed on my Windows 10 VM (running single core, no cap) that this makes the test flaky. It is not a problem with the slaves. |
@joaocgreis much appreciated! |
@joaocgreis You confirmed that @phillipj's branch makes it flaky? If it's not too much trouble and if you didn't already do it, can you confirm that it's not flaky on master? Don't want @phillipj to have to do a bunch of work if it's not his code causing the issue... |
It is not flaky on master, run 10k times without failing. For this PR's head, it failed twice in 1000 times. |
test-http-exit-delay has a flaky history. Examination of the bug it was written to find suggests that the test should be removed. * The test is trying to find a delay of up to 1 second, but the delay may also be much smaller than that. So the test will not catch the bug all the time. It is therefore flaky when the bug exists. * Experience has shown that the test is flaky as well when the bug is absent in the code because it can sometimes take slower devices (such as the Raspberry Pi devices that we have in continuous integration) more than the allowed one second to run. Increasing the timeout for those devices will make the test pass but will also mean that the test isn't really testing anything because the delay it is trying to catch was a delay of up to one second. I don't think this is an appropriate test to run once on CI. If there is to be a test for the issue in question, it should be a benchmark test that is run a large number of times. We don't really have such tests in CI yet I would argue that this test is actively a problem. It does not reliably catch the issue it is supposed to catch, nor can it likely be made to do so. (To do so would likely require converting it to a benchmarking test as previously described. We don't run those in CI, at least not at this time.) Because this test will have both false positives and false negatives, especially on the slower devices, it contributes to a culture of dismissing failed tests. It does not reliably identify an issue nor does it reliably pass on a working code base. This test should be removed. Ref: #4277 PR-URL: #4786 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Brian White <mscdex@mscdex.net>
This changes the behaviour of error events emitted when process.send() fails, from synchronously to asynchronously.
ab5589e
to
3c38e87
Compare
I assume this PR should be closed in favor of #5251 which fixes a lot more of these inconsistent emits. Though as I mentioned in #5251 (comment) it would be nice to ensure this exact change doesn't merge flakyness into master. Capping the CPU as @joaocgreis mentioned worked like a charm. It was easy to reproduce while being in sync with master around 21th january with 922d538 as parent commit. Being in sync with master today seems to fix the issue though, as I've not been able to reproduce it when rebasing on top of master. I've updated the PR and would appreciate a new stress test to verify it's okey now, as a means of removing the uncertainty from #5251. @Trott would you be so kind? |
@phillipj Are you asking for a stress test for this PR on Windows? |
@Trott Yes. On second thought, I could start the stress test myself now. I'll bug you on IRC for build details if that's okey. |
Stress test passed: https://ci.nodejs.org/job/node-stress-single-test/nodes=win10-1p/553/ Closing this in favor of #5251. Will reopen if that PR doesn't land. |
@phillipj Cool. Thanks for confirming that this indeed works as expected :-) |
test-http-exit-delay has a flaky history. Examination of the bug it was written to find suggests that the test should be removed. * The test is trying to find a delay of up to 1 second, but the delay may also be much smaller than that. So the test will not catch the bug all the time. It is therefore flaky when the bug exists. * Experience has shown that the test is flaky as well when the bug is absent in the code because it can sometimes take slower devices (such as the Raspberry Pi devices that we have in continuous integration) more than the allowed one second to run. Increasing the timeout for those devices will make the test pass but will also mean that the test isn't really testing anything because the delay it is trying to catch was a delay of up to one second. I don't think this is an appropriate test to run once on CI. If there is to be a test for the issue in question, it should be a benchmark test that is run a large number of times. We don't really have such tests in CI yet I would argue that this test is actively a problem. It does not reliably catch the issue it is supposed to catch, nor can it likely be made to do so. (To do so would likely require converting it to a benchmarking test as previously described. We don't run those in CI, at least not at this time.) Because this test will have both false positives and false negatives, especially on the slower devices, it contributes to a culture of dismissing failed tests. It does not reliably identify an issue nor does it reliably pass on a working code base. This test should be removed. Ref: nodejs#4277 PR-URL: nodejs#4786 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Brian White <mscdex@mscdex.net>
Triaging a couple of
FIXME
s from @bnoordhuis in internal/child_process.Should this be considered a fix for the next major release as it changes from emitting errors synchronously, to async in the next tick?