Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test-child-process-exec-timeout on Linux #20414

Closed
Trott opened this issue Apr 30, 2018 · 10 comments
Closed

Investigate flaky test-child-process-exec-timeout on Linux #20414

Trott opened this issue Apr 30, 2018 · 10 comments

Comments

@Trott
Copy link
Member

Trott commented Apr 30, 2018

  • Version: v11.0.0-pre (master)
  • Platform: various Linux in CI
  • Subsystem: test

https://ci.nodejs.org/job/node-test-commit-linuxone/857/nodes=rhel72-s390x/consoleFull

23:45:13 not ok 192 parallel/test-child-process-exec-timeout
23:45:13   ---
23:45:13   duration_ms: 0.647
23:45:13   severity: fail
23:45:13   exitcode: 1
23:45:13   stack: |-
23:45:13     assert.js:77
23:45:13       throw new AssertionError(obj);
23:45:13       ^
23:45:13     
23:45:13     AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
23:45:13     + expected - actual
23:45:13     
23:45:13     - 'SIGSEGV'
23:45:13     + 'SIGTERM'
23:45:13         at cp.exec.common.mustCall (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-child-process-exec-timeout.js:34:12)
23:45:13         at /data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/common/index.js:474:15
23:45:13         at ChildProcess.exithandler (child_process.js:289:5)
23:45:13         at ChildProcess.emit (events.js:182:13)
23:45:13         at maybeClose (internal/child_process.js:957:16)
23:45:13         at Process.ChildProcess._handle.onexit (internal/child_process.js:246:5)
23:45:13   ...
@Trott
Copy link
Member Author

Trott commented May 5, 2018

Again today: https://ci.nodejs.org/job/node-test-commit-linuxone/1000/nodes=rhel72-s390x/testReport/

assert.js:77
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
+ expected - actual

- 'SIGSEGV'
+ 'SIGTERM'
    at cp.exec.common.mustCall (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-child-process-exec-timeout.js:34:12)
    at /data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/common/index.js:474:15
    at ChildProcess.exithandler (child_process.js:289:5)
    at ChildProcess.emit (events.js:182:13)
    at maybeClose (internal/child_process.js:957:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:246:5)

@nodejs/testing @nodejs/build @nodejs/child_process

@gireeshpunathil
Copy link
Member

reviewing the type of failures and the test case, my opinion is that the test case logic is vulnerable for race conditions and arbitrary behaviors that are platform / load dependant.

cp.exec(cmd, {timeout: 1}...
assert.strictEqual(err.signal, ...

The kill signal may be delivered:

  • while the child has not yet execed into the new process (post fork phase)
  • while the child has not yet loaded fully (pre main phase)
  • while the child has completed its task and is in the exit routine
  • while child has vanished, having completed everything!

Which of this is going to happen will depend on:

  • who gets the current CPU post fork: parent or child?
  • how loaded are the system? are there spare CPUs for the child / parent?
  • does one of parent / child gets de-scheduled immediately after the fork?
cp.exec(cmd, {timeout: 2 ** 30} ...
assert.strictEqual(stdout.trim(), 'child stdout');

As the child process exits immediately after writing to the console, and as the console writing is non-blocking, there is no guarentee that the data will be flushed to the stream, whatever tiny the data element is, even if the parent waits for a long time. ref: #19218

In summary, I believe there is nothing LinuxONE specific, and the reason rooted in the test case itself.

for 1 I propose the child to have a timer that is at least more than 1000 and the parent to maintain a timeout that is between 1000 and the child's timeout.

for 2 I propose to end the standard streams in the child before exiting in expectation of draining the buffered data.

@Trott
Copy link
Member Author

Trott commented May 7, 2018

Another one today:

https://ci.nodejs.org/job/node-test-commit-linux/18540/nodes=fedora-last-latest-x64/consoleText

ot ok 193 parallel/test-child-process-exec-timeout
  ---
  duration_ms: 0.336
  severity: fail
  exitcode: 1
  stack: |-
    assert.js:77
      throw new AssertionError(obj);
      ^
    
    AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
    + expected - actual
    
    - 'SIGSEGV'
    + 'SIGTERM'
        at cp.exec.common.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-last-latest-x64/test/parallel/test-child-process-exec-timeout.js:34:12)
        at /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-last-latest-x64/test/common/index.js:474:15
        at ChildProcess.exithandler (child_process.js:289:5)
        at ChildProcess.emit (events.js:182:13)
        at maybeClose (internal/child_process.js:957:16)
        at Process.ChildProcess._handle.onexit (internal/child_process.js:246:5)
  ...

Will see if I can implement fixes for the issues identified by @gireeshpunathil unless he or someone else already has something in the works or even submitted.

@Trott Trott changed the title Investigate flaky test-child-process-exec-timeout on LInuxONE Investigate flaky test-child-process-exec-timeout on LInux May 7, 2018
@Trott
Copy link
Member Author

Trott commented May 7, 2018

Is the fact that we can get SIGSEGV instead of SIGTERM from this code a bug or a fact of life? (If a bug, is it in this code or in Node.js/libuv?)

'use strict';

if (process.argv[2] === 'child') {
  setTimeout(() => {}, 1000);
  return;
}

const cmd = `"${process.execPath}" "${__filename}" child`;

const assert = require('assert');
const cp = require('child_process');

cp.exec(cmd, { timeout: 1 }, (err, stdout, stderr) => {
 assert.strictEqual(err.signal, 'SIGTERM');
});
  • If it's not a bug, we should allow SIGSEGV in the test.
  • If it's a bug in libuv, we should mark the test as flaky (putting the flaky and not-flaky test cases, if any, in separate files) and leave a comment in parallel.status with the relevant issue in the libuv tracker (opening it if it doesn't already exist).
  • If it's a bug in Node.js, uh, let's fix it. :-D
  • If it's a bug in the test logic, let's fix that.

I propose the child to have a timer that is at least more than 1000 and the parent to maintain a timeout that is between 1000 and the child's timeout.

That may mitigate things, but I don't think it will fix it entirely because it's still dependent on timing in two independent processes. If we're going to go that route, we should definitely move this test to sequential so that the processes aren't competing with other tests for resources.

@Trott
Copy link
Member Author

Trott commented May 7, 2018

Update: I was able to trivially cause the problem using the 15 lines of code in the previous comment on test-rackspace-fedora26-x64-1:

root@test-rackspace-fedora26-x64-1 ~]# /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-last-latest-x64/node /var/tmp/test.js &  /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-last-latest-x64/node /var/tmp/test.js & 
[1] 5212
[2] 5213
[root@test-rackspace-fedora26-x64-1 ~]# assert.js:77
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
+ expected - actual

- 'SIGSEGV'
+ 'SIGTERM'
    at cp.exec (/var/tmp/test.js:14:9)
    at ChildProcess.exithandler (child_process.js:289:5)
    at ChildProcess.emit (events.js:182:13)
    at maybeClose (internal/child_process.js:957:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:246:5)

Bonus: The process hung there too. I had to end it with cntl-C.

@Trott
Copy link
Member Author

Trott commented May 7, 2018

(The above reproduction doesn't happen every time. Like, not even close to every time. Although I certainly suspect that running 4 or 8 or 16 simultaneous instances would increase the reproducibility.)

@Trott
Copy link
Member Author

Trott commented May 7, 2018

Update: Reproduces pretty reliably at 64 simultaneous instances.

@Trott
Copy link
Member Author

Trott commented May 7, 2018

@nodejs/libuv

@gireeshpunathil
Copy link
Member

link this with #11052 as that has new findings.

@gireeshpunathil
Copy link
Member

As #11052 has revealed an unsafe singal stack issue, it explains one type of failure in this test: expected SIGTERM but got SIGSEGV so I would park my theories on the test case logic. The other type: expected child stdout but got ' ' is unrelated to this issue, and may require a rework.

@Trott Trott changed the title Investigate flaky test-child-process-exec-timeout on LInux Investigate flaky test-child-process-exec-timeout on Linux Jun 3, 2018
@Trott Trott closed this as completed Jun 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants