test: investigate flaky test-child-process-stdio-big-write-end #13603

Closed

refack opened this issue Jun 10, 2017 · 18 comments
Labels
arm Issues and PRs related to the ARM platform. child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests.

Comments

@refack
Contributor

refack commented Jun 10, 2017

  • Version: master
  • Platform: arm
  • Subsystem: test
1528 parallel/test-child-process-stdio-big-write-end
  duration_ms: 120.161
  severity: fail
  stack: timeout

https://ci.nodejs.org/job/node-test-commit-arm/10256/nodes=ubuntu1604-arm64/

refack added the arm Issues and PRs related to the ARM platform. and test Issues and PRs related to the tests. labels Jun 10, 2017
mscdex added the child_process Issues and PRs related to the child_process subsystem. label Jun 10, 2017
@Trott
Member

Trott commented Jun 10, 2017

Also, just to make it clear this isn't a one-time fluke:

https://ci.nodejs.org/job/node-test-commit-arm/10251/nodes=ubuntu1604-arm64/console

not ok 1516 parallel/test-child-process-stdio-big-write-end
  ---
  duration_ms: 120.600
  severity: fail
  stack: |-
    timeout

@refack
Contributor Author

refack commented Jun 11, 2017

@Trott
Member

Trott commented Jun 11, 2017

Looks to me like this corresponds precisely with switching from mininodes to packetnet (in the hostname for the test server), but I don't know the first thing about when/why/how that happened or what the implications are or if there are significant differences in memory/CPU/etc. /cc @nodejs/build

@Trott
Member

Trott commented Jun 11, 2017

Given the "big" in the test name and looking at the test, it sure seems like it could fail if the test runner is insufficiently provisioned, but unlike a lot of tests like that, it doesn't check for available memory or anything before running, and I don't think we've ever had a problem on Raspberry Pi.

Which makes me wonder if it's all about network or something like that?

Also makes me wonder if it belongs in pummel but it sure runs nice and fast on my MacBook, so probably not?
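For reference, the kind of check mentioned above looks roughly like the sketch below. It assumes the common.enoughTestMem flag and common.skip() helper that some resource-heavy tests in the tree use; the placement and wording are illustrative only, not part of this test:

  'use strict';
  const common = require('../common');

  // Hypothetical guard: bail out early on hosts that report too little
  // memory, the way some other resource-heavy tests do.
  if (!common.enoughTestMem) {
    common.skip('insufficient memory for a big-write test');
    return;
  }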

@tniessen
Member

@Trott
Member

Trott commented Jun 11, 2017

I think this is officially a broken test/platform and not merely flaky/unreliable...

@refack
Contributor Author

refack commented Jun 11, 2017

Tried to bisect, but even 8.0.0 fails:
https://ci.nodejs.org/job/node-test-commit-arm/10270/nodes=ubuntu1604-arm64/

@Trott
Member

Trott commented Jun 11, 2017

Tried to bisect, but even 8.0.0 fails:

@refack Yeah, not too surprising. I think it's the new server/host and not the code. It seemed to break right when things changed. I think @rvagg handled the change and might be able to shed more light.

@Trott
Member

Trott commented Jun 11, 2017

Got a login to test-packetnet-ubuntu1604-arm64-2. The test hangs when run from the command line. No hung processes beforehand, no output...

Probably appropriate at this point to loop in @nodejs/testing even though it seems host-configuration specific. Maybe someone will have an idea of what might be up....

@Trott
Member

Trott commented Jun 12, 2017

I did a little more investigating on the problematic host and the issue is that this bit of the test is an infinite loop (or infinite-seeming in any event) on the host for whatever reason:

  // Write until the buffer fills up.
  let buf;
  do {
    buf = Buffer.alloc(BUFSIZE, '.');
    sent += BUFSIZE;
  } while (child.stdin.write(buf));

@Trott
Member

Trott commented Jun 12, 2017

Proposed fix coming in another minute or four....

Trott added a commit to Trott/io.js that referenced this issue Jun 12, 2017
test-child-process-stdio-big-write-end was failing on ubuntu1604-arm64
because the while loop that was supposed to fill up the buffer ended up
being an infinite loop.

This increases the size of the writes in the loop by 1K until the buffer
fills up.

Fixes: nodejs#13603
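Presumably the new, much faster hosts drain the child's stdin as quickly as the parent writes fixed-size chunks, so child.stdin.write() never returns false and the loop never exits. A rough sketch of the approach the commit message describes, growing the chunk by 1K per iteration so the pipe eventually backs up (the actual patch is in #13626; bufsize is an illustrative name, and child and sent are the test's existing variables from the snippet quoted earlier):

  // Write until the buffer fills up, growing the chunk by 1K each pass so
  // that even a fast host eventually sees backpressure and write() returns
  // false.
  let bufsize = 0;
  let buf;
  do {
    bufsize += 1024;
    buf = Buffer.alloc(bufsize, '.');
    sent += bufsize;
  } while (child.stdin.write(buf));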
@Trott
Member

Trott commented Jun 12, 2017

Proposed fix: #13626

@rvagg
Member

rvagg commented Jun 12, 2017

@nodejs/testing for context: the mininodes hosts are still active, they have just been relabelled to ubuntu1604-arm64_odroid_c2. The new packet.net hosts are proper server-class ARM machines, not these repurposed mobile chips that we've had access to until now.

The major difference you'll find on these new packet.net machines: they have 96 cores and 48G of RAM. We have not virtualized or containerized anything so they are running on bare metal. At the moment we are running at ~ JOBS=50 (doing some experimentation on that front). We also have CentOS 7 machines running at packet.net, our first non-Debian ARM machines.

They have access to very fast SSDs, so the bottlenecks appear pretty late in the parallelization. My assumption when I first saw this error was that they were too heavily parallelized, so I was reducing JOBS, but it's only been appearing on the Ubuntu 16.04 machines, so I guess this is some kind of system problem.

The hosts are all accessible to everyone who has nodejs_build_test ssh access; the configs are in the new ansible setup (the ansible directory of the build repo; look in inventory.yml if you want IPs, and there is a way to dump everything to your .ssh/config if you want, I just can't tell you how off the top of my head). There are 2 x Ubuntu 16.04 and 2 x CentOS 7.

@Trott
Member

Trott commented Jun 12, 2017

@rvagg Thanks for the info. The problem in this case appears to be that the machines are too performant for the test. To fix it, I rewrote the test to increase the amount of data written until the situation required by the test was achieved. #13626

@rvagg
Member

rvagg commented Jun 12, 2017

Extra context over at nodejs/build#755 for those interested in the introduction of the packet.net resources.

@gibfahn
Member

gibfahn commented Jun 12, 2017

For sshing in: clone the build repo, update your inventory.yml with nodejs/build#754, make sure your ~/.ssh/config has:

# begin: node.js template

# end: node.js template

and then run:

cd ansible
ansible-playbook playbooks/write-ssh-config.yml
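Once that playbook has written the host entries between those markers, connecting should be as simple as something like ssh test-packetnet-ubuntu1604-arm64-2 (assuming the generated alias matches the machine name mentioned above).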

Trott closed this as completed in b71d677 Jun 12, 2017
@tniessen
Member

The test might still be flaky; see node-test-commit-arm/10353, where it fails on centos7-arm64 and ubuntu1604-arm64:

not ok 81 parallel/test-child-process-stdio-big-write-end
  ---
  duration_ms: 120.89
  severity: fail
  stack: |-
    timeout
  ...

addaleax pushed a commit that referenced this issue Jun 17, 2017
test-child-process-stdio-big-write-end was failing on ubuntu1604-arm64
because the while loop that was supposed to fill up the buffer ended up
being an infinite loop.

This increases the size of the writes in the loop by 1K until the buffer
fills up.

PR-URL: #13626
Fixes: #13603
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Alexey Orlenko <eaglexrlnk@gmail.com>
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
addaleax pushed a commit that referenced this issue Jun 21, 2017
MylesBorins pushed a commit that referenced this issue Jul 17, 2017