test: test-cluster-send-handle-large-payload intermittent failures #14747

Closed
matthewloring opened this issue Aug 10, 2017 · 18 comments
Labels
cluster: Issues and PRs related to the cluster subsystem.
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
macos: Issues and PRs related to the macOS platform / OSX.

Comments

@matthewloring

  • Version: master
  • Platform: OS X
  • Subsystem: Test

This test seems to be timing out sometimes. Introduced by #14588.

\cc @addaleax

matthewloring added the flaky-test label Aug 10, 2017
mscdex added the cluster label Aug 10, 2017
@addaleax
Member

Hm, I ran a stress test on the original PR because it had been flaky before, but that came back green, and a local stress test worked fine for me too, so I assume this is something platform-specific. So, uh, here are two stress tests to narrow that down a bit:

OS X: https://ci.nodejs.org/job/node-stress-single-test/1367/
Linux: https://ci.nodejs.org/job/node-stress-single-test/1368/

@matthewloring
Author

I think @bnoordhuis has experienced this too (#14730 (review)). He may have suggestions on reproducing it. For me it happens fairly infrequently (~1 in 20 runs).

@addaleax
Member

@nodejs/platform-macos … I don’t have access to OS X myself, but I am available to anybody who wants me to help debug my mistakes ;)

addaleax added the macos label Aug 10, 2017
@Trott
Member

Trott commented Aug 10, 2017

I can replicate this locally. Running the following might replicate it on whatever your local OS is as well:

$ tools/test.py -j 96 --repeat 192 test/parallel/test-cluster-send-handle-large-payload.js 
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
[... 12 more identical TIMEOUT entries ...]
[02:11|% 100|+ 179|-  13]: Done 
$

First 96 tests pass in about 11 seconds, and then it pauses for two minutes as some of the tests (13 above) time out.

Easiest/fastest solution would be to move the test to sequential, but of course that might not be the best solution.
¯\_(ツ)_/¯

@addaleax
Member

Easiest/fastest solution would be to move the test to sequential, but of course that might not be the best solution.

Every time somebody proposes that, I think either the test is broken or Node is broken. ;) I don’t think there’s a reason to assume this test should be flaky under load.

@addaleax
Member

Also, no, no luck reproducing under Linux, no matter how many parallel jobs.

@Trott
Member

Trott commented Aug 10, 2017

Also, no, no luck reproducing under Linux, no matter how many parallel jobs.

Looks like -j doesn't even have that much to do with it other than making it happen more reliably. I'm seeing it with just tools/test.py -j 1 --repeat 192 test/parallel/test-cluster-send-handle-large-payload.js

@addaleax
Member

@Trott Any chance you could do some digging and figure out whether it’s the parent or the child process that’s hanging? And which handles are keeping it open (assuming it’s stuck with 0 % CPU, not 100 % CPU)?

@Trott
Member

Trott commented Aug 10, 2017

@addaleax Yes, looking right now...might have to stop for a few meetings, but will pick up again later if so...

@Trott
Member

Trott commented Aug 10, 2017

When this times out, the process.send() in the subprocess is getting called, but the message event on worker is not being emitted (or at least the listener is not being invoked).

If I add a callback to process.send(), it never indicates an error in this test, whether the test succeeds or times out.

Judging from #6767, process.send() may be fire-and-forget. Maybe the message gets received and maybe not.

I don't know why macOS would be more susceptible to missing the message than anything else. If this is not-a-bug behavior, I guess we can add a retry. If this is a bug... ¯\_(ツ)_/¯

The .send() from the parent process to the subprocess never seems to fail.
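
For reference, the callback experiment described above might look roughly like this in the subprocess; this is a sketch only, echoing the incoming message back rather than using the test's exact payload:

// process.send() accepts an optional completion callback; in the failing
// runs it reported no error even when the parent never saw the message.
process.on('message', (message, handle) => {
  process.send(message, handle, (err) => {
    if (err) {
      console.error('process.send() reported an error:', err);
    } else {
      console.log('process.send() completed without error');
    }
  });
});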

@addaleax
Member

Okay, so this might be fix-able by adding an extra round-trip. I can try that on the weekend if nobody beats me to it.
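
For illustration, the "extra round-trip" idea could be sketched like this (illustrative message names, not the eventual patch): the parent does not treat the handle transfer as done until the worker explicitly acknowledges it.

const cluster = require('cluster');
const net = require('net');

if (cluster.isMaster) {
  const worker = cluster.fork();
  const server = net.createServer();
  server.listen(0, () => {
    // Send the server handle to the worker once it is listening.
    worker.send('handle', server);
  });
  worker.on('message', (msg) => {
    if (msg === 'ack') {
      // Worker confirmed receipt; only now tear everything down.
      server.close();
      worker.kill();
    }
  });
} else {
  process.on('message', (msg, handle) => {
    if (msg === 'handle')
      process.send('ack');
  });
}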

@Trott
Member

Trott commented Aug 10, 2017

If the process.send() message getting dropped from time to time is a case of working-as-expected, I have a fix for the test ready to go. If, on the other hand, that's a bug in process.send() that might even be specific to macOS, then uh, yeah, that's gonna take some deeper looking.

@addaleax
Member

@Trott I think the discussion in the issue you linked means that, if the problem actually is “the process exits before process.send() can finish”, then yes, that may be things working as expected.

(I’m not convinced that issue shouldn’t be re-opened, but that’s another story.)

@Trott
Member

Trott commented Aug 11, 2017

I think the discussion in the issue you linked means that, if the problem actually is “the process exits before process.send() can finish”,

@addaleax Unfortunately, that doesn't appear to be what's going on here. :-( If I keep the event loop open in the child process, the message is still sometimes never received and the test times out.

(Interestingly, if I add a useless console.log() in the right place, I can get the test to be reliable. I've seen this troubling behavior manifest before. Someone once gave me an explanation for it, but I can't remember what it was right now. But maybe that hints in the right direction?)

EDIT: (Yeah, the "I have a fix ready to go" above...not so much.)
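
The "keep the event loop open" experiment mentioned above could look roughly like this in the child process (a sketch, not the actual test code):

// An active timer keeps the child alive, ruling out "the process exits
// before process.send() can finish" as the explanation for the lost reply.
const keepAlive = setInterval(() => {}, 1000);

process.on('message', (message, handle) => {
  process.send(message, handle, () => {
    // Clear the keep-alive only after the send callback has fired.
    clearInterval(keepAlive);
  });
});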

@addaleax
Member

Interestingly, if I add a useless console.log() in the right place, I can get the test to be reliable.

That sounds weird, I don’t think I have heard of that and I’m not sure what that means.

dtruss -f output for the test might be nice, but I’d be too tired to really dig into it right now anyway. Have fun playing around with the test if you like :)

@Trott
Member

Trott commented Aug 11, 2017

Just so I don't forget what it was: Put a console.log() statement as the last line of the process.on('message',...) handler and the test is no longer flaky.
(╯°□°)╯︵ ┻━┻
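
Roughly, the observation is that a handler shaped like this stopped timing out locally (the echoed reply is illustrative, not the test's exact code):

process.on('message', (message, handle) => {
  process.send(message, handle);
  // With this line as the last statement of the handler, the test was no
  // longer flaky locally; without it, the parent sometimes never received
  // the reply.
  console.log('reply sent');
});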

@Trott
Member

Trott commented Aug 11, 2017

OK, good news is I have a simple fix.

Bad news is that I suspect it covers up a legitimate bug. Not sure, though.

Basically, if the child process waits a bit before sending the payload back, the message is always received, and the test passes reliably.

That seems like a code smell to me. I thought I was pretty thorough in checking for events I might listen to that might seem like a less code-smelly approach, but I'm going to go look again.
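
The "wait a bit before sending" workaround might be sketched like this in the child process (the 100 ms delay is illustrative):

process.on('message', (message, handle) => {
  // Delaying the reply made the parent receive it reliably in local runs.
  setTimeout(() => {
    process.send(message, handle);
  }, 100);
});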

Trott added a commit to Trott/io.js that referenced this issue Aug 12, 2017
On macOS, the parent process might not receive a message if it
is sent too soon, and then subsequent messages are also sometimes not
received.

(Is this a bug or expected operating system behavior like the
way a file watcher is returned before it's actually watching the file
system on macOS?)

Send a second message after a delay on macOS.

While at it, minor refactoring to the test:

* Blank line after loading `common` module per test-writing guide
* Wrap arrow function in braces where implicit return is not needed
* Remove unnecessary unref in subprocess

Fixes: nodejs#14747
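
A rough sketch of the approach the commit message describes, not the exact patch in nodejs#14780: reply immediately, and on macOS send the reply a second time after a delay in case the first copy is dropped.

process.on('message', (message, handle) => {
  process.send(message, handle);
  if (process.platform === 'darwin') {
    // Resend after a delay on macOS; the delay value here is illustrative.
    setTimeout(() => process.send(message, handle), 1000);
  }
});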
@Trott
Member

Trott commented Aug 12, 2017

Fix (if it is indeed an OS quirk and not a bug) in #14780.

Trott closed this as completed in 28a47aa Aug 14, 2017
MylesBorins pushed a commit that referenced this issue Sep 10, 2017
On macOS, the parent process might not receive a message if it
is sent too soon, and then subsequent messages are also sometimes not
received.

(Is this a bug or expected operating system behavior like the
way a file watcher is returned before it's actually watching the file
system on macOS?)

Send a second message after a delay on macOS.

While at it, minor refactoring to the test:

* Blank line after loading `common` module per test-writing guide
* Wrap arrow function in braces where implicit return is not needed
* Remove unnecessary unref in subprocess

PR-URL: #14780
Fixes: #14747
Reviewed-By: Anna Henningsen <anna@addaleax.net>