Investigate flaky test-http-agent #6133

Closed
mscdex opened this issue Apr 9, 2016 · 42 comments
Labels: arm (Issues and PRs related to the ARM platform), http (Issues or PRs related to the http subsystem), test (Issues and PRs related to the tests)

Comments

mscdex (Contributor) commented Apr 9, 2016

Example failure on pi2-raspbian-wheezy #1
Example failure on pi2-raspbian-wheezy #2

Similar issue here as in #5938?

Output:

# TIMEOUT
#0 200
#1 200
#2 200
#3 200
#4 200
#5 200
#6 200
#7 200
#8 200
#9 200
#10 200
#11 200
#12 200
#13 200
#14 200
mscdex added the http, test, and arm labels on Apr 9, 2016
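
For context, the flaky test drives a number of HTTP requests through a shared http.Agent and logs each response's status code, which is what the numbered 200 lines above are. The sketch below is a simplified illustration of that pattern, not the actual test file; the request count, maxSockets value, and use of an ephemeral port are assumptions for the example.

```js
// Simplified illustration of the test-http-agent pattern (not the real test).
const http = require('http');

const N = 16;                                    // hypothetical request count
const agent = new http.Agent({ maxSockets: 4 }); // hypothetical socket limit

const server = http.createServer((req, res) => {
  res.writeHead(200);
  res.end('hello world\n');
});

server.listen(0, () => {                         // ephemeral port for the example
  const port = server.address().port;
  let remaining = N;
  for (let i = 0; i < N; i++) {
    http.get({ port, path: `/${i}`, agent }, (res) => {
      console.log(`#${i} ${res.statusCode}`);    // mirrors the "#0 200" style output
      res.resume();                              // drain the response body
      if (--remaining === 0) server.close();
    });
  }
});
```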
mscdex (Contributor, Author) commented Apr 9, 2016

Ref: #5346

Trott (Member) commented Apr 10, 2016

While looking into this test's flakiness in February, we found that it always failed on a connection in the low 20s. In other words, if it failed, it failed on connection 22 or 23 or 24, but never on 16 or 36 or 46 or 59 or...

Which was very odd...

And now that number seems to be creeping lower, resulting in increased flakiness on Pi 2 devices. I have no explanation. It's really weird.

/cc @nodejs/build @rvagg

Trott (Member) commented Apr 11, 2016

Again: https://ci.nodejs.org/job/node-test-binary-arm/1681/RUN_SUBSET=3,nodes=pi2-raspbian-wheezy/tapTestReport/test.tap-53/

Ugh. Not sure what to do. Bringing it down to 12 or so would probably fix it, but it seems likely that in another month we'd be facing the same issue again.

Trott (Member) commented Apr 11, 2016

cc @nodejs/testing

santigimeno (Member) commented:

> While looking into this test's flakiness in February, we found that it always failed on a connection in the low 20s. In other words, if it failed, it failed on connection 22 or 23 or 24, but never on 16 or 36 or 46 or 59 or...

@Trott Would it be worth stress-testing the test with the source code we had in February, just to check whether the problem is in the code or in the Raspberry Pi bots?

Trott (Member) commented Apr 15, 2016

And another: https://ci.nodejs.org/job/node-test-binary-arm/1723/RUN_SUBSET=2,nodes=pi2-raspbian-wheezy/console

I suppose it's superfluous to document them all here as the test is now failing so often.

Trott (Member) commented Apr 15, 2016

Confounding results: the previous version of the test (which used 100 connections) failed, of course.

But there were no failures on any of the other stress tests, including current master.

Maybe its flakiness depends on other things going on on the network or something? I mean, it shouldn't, right? Maybe confirm that the test is using localhost and not something odd like the machine's networked IP or (would this next one even work?) a broadcast address or something...

Trott (Member) commented Apr 16, 2016

Continues to fail with alarming frequency. Running CI stress test against master. https://ci.nodejs.org/job/node-stress-single-test/596/nodes=pi2-raspbian-wheezy/console

Trott (Member) commented Apr 16, 2016

Stress test on master is showing failures, so that's at least what we would expect.

Running a stress test on the code base as it existed in February right after the flakiness for this test was fixed the last time around: https://ci.nodejs.org/job/node-stress-single-test/597/nodes=pi2-raspbian-wheezy/console

In theory at least, if that shows lots of failures, then something is up with the devices/CI. If that stress test does not show failures, then that suggests something changed in the code base.

santigimeno (Member) commented:

If master fails and February passes, we should probably start bisecting.

Trott (Member) commented Apr 16, 2016

Bisecting begun. I will log each step here so someone else can pick it up if they want to, as I definitely don't expect to get it done all in one sitting. (Running the CI stress test can take a loooong time, depending on whether a full build needs to happen or not...)

Current master is bad.

bbf4621 is good.

Stress test for 08085c4: https://ci.nodejs.org/job/node-stress-single-test/601/nodes=pi2-raspbian-wheezy/console

jbergstroem (Member) commented:

@Trott I did a restart of Jenkins in here somewhere; just wanted to give you a heads-up in case the restart killed a job.

Trott (Member) commented Apr 17, 2016

b85a50b is good.

According to this process, the first bad commit is 3de9bc9.

That seems extremely unlikely. (The only things in that commit are a doc update and a readline test.)

This raises the frustrating specter that perhaps the CI flakiness itself is unpredictable. Like, maybe one time you run the test 100 times and it comes up fine every time, but you do it again and it fails 50 of those 100 times. Maybe it depends on which device runs the test?

Trott (Member) commented Apr 17, 2016

Looks like the stress test will sometimes come up clean (100 successful runs, no failures) but fail frequently at other times. It never seems to fail just once or twice; it's either all successes or a bunch of failures. Argh.

So b85a50b is bad after all. It comes up clean in some stress test runs (see https://ci.nodejs.org/job/node-stress-single-test/612/nodes=pi2-raspbian-wheezy/console and https://ci.nodejs.org/job/node-stress-single-test/615/nodes=pi2-raspbian-wheezy/console) but fails in others (see https://ci.nodejs.org/job/node-stress-single-test/613/nodes=pi2-raspbian-wheezy/console and https://ci.nodejs.org/job/node-stress-single-test/616/nodes=pi2-raspbian-wheezy/console).

I'm inclined to go back to the beginning with bbf4621 and confirm that it isn't flaky in the current CI by running the stress test a half dozen times or so.

Trott (Member) commented Apr 17, 2016

Tests confirm that 757fbac is the last good commit and b85a50b is the first bad one. Now the questions are:

  • Is the problem one of configuration with the Pi2 devices?
  • Or is the problem with the code change running on Pi2 (which is just Linux, right? and not that different from other Pi devices on which we're not seeing problems, right?)
  • Or is the problem in the test somehow?

/cc @rvagg @cjihrig

cjihrig (Contributor) commented Apr 18, 2016

Does this only happen on a specific subset of the Pi2s?

I'd think that the change in b85a50b would cause something to break completely, not in a flaky manner. That test also doesn't have a great track record.

Trott (Member) commented Apr 19, 2016

@cjihrig asked:

> Does this only happen on a specific subset of the Pi2s?

Nope, I'm afraid not, at least not as far as I've been able to tell.

cjihrig (Contributor) commented Apr 19, 2016

Trying a partial revert of b85a50b. Stress test at https://ci.nodejs.org/job/node-stress-single-test/655/ with the ADDRCONFIG flag added back.
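
For reference, a minimal sketch of what the ADDRCONFIG hint changes at lookup time; the hostname is just an example, and this is illustrative rather than the actual lib/net.js change:

```js
// Illustrative only: the effect of the dns.ADDRCONFIG hint on dns.lookup().
// ADDRCONFIG limits results to address families that are actually configured
// on the local machine. dns.V4MAPPED (not restored by the partial revert)
// would additionally return IPv4-mapped IPv6 addresses when the IPv6 family
// is requested but no IPv6 addresses are found.
const dns = require('dns');

dns.lookup('localhost', { hints: dns.ADDRCONFIG }, (err, address, family) => {
  if (err) throw err;
  console.log(`resolved to ${address} (IPv${family})`);
});
```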

santigimeno (Member) commented:

@cjihrig, it might be worth running multiple smaller (~100-run) jobs so they pick different Pi2s, just to be sure it's fixed; we have seen runs without failures before, and maybe it's somehow dependent on the bot it's running on.

Trott (Member) commented Apr 20, 2016

They all came back green!

joelostrowski pushed a commit to joelostrowski/node that referenced this issue Apr 25, 2016
b85a50b removed the implicit
setting of DNS hints when creating a connection. This caused some
of the pi2 machines to become flaky. This commit restores the
implicit dns.ADDRCONFIG hint, but not dns.V4MAPPED.

Fixes: nodejs#6133
PR-URL: nodejs#6281
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Claudio Rodriguez <cjrodr@yahoo.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Saúl Ibarra Corretgé <saghul@gmail.com>
jasnell pushed a commit that referenced this issue Apr 26, 2016
b85a50b removed the implicit
setting of DNS hints when creating a connection. This caused some
of the pi2 machines to become flaky. This commit restores the
implicit dns.ADDRCONFIG hint, but not dns.V4MAPPED.

Fixes: #6133
PR-URL: #6281
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Claudio Rodriguez <cjrodr@yahoo.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Saúl Ibarra Corretgé <saghul@gmail.com>
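
As a usage note, callers who want specific lookup behavior regardless of the implicit default can pass the hints explicitly; the sketch below assumes the documented `hints` option for net.connect, and the host and port are made up for illustration:

```js
// Sketch: passing dns.lookup() hints explicitly on connect (hypothetical host/port).
const net = require('net');
const dns = require('dns');

const socket = net.connect({
  host: 'localhost',
  port: 8080,                 // hypothetical port
  hints: dns.ADDRCONFIG       // only address families configured locally
}, () => {
  console.log('connected to', socket.remoteAddress);
  socket.end();
});

socket.on('error', (err) => {
  console.error('connection failed:', err.message);
});
```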