Investigate flaky test-http-agent #6133
Ref: #5346
While looking at this test being flaky in February, it was discovered that it always failed on a connection numbered in the low 20s. In other words, if it failed, it failed on connection 22 or 23 or 24, but never on 16 or 36 or 46 or 59 or... Which was very odd. And now it seems that we're seeing that number creep lower, resulting in increased flakiness on Pi 2 devices. I have no explanation. It's really weird. /cc @nodejs/build @rvagg
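For readers who haven't looked at the test: here is a rough, hypothetical sketch of its general shape (a local server plus many requests funneled through a shared http.Agent, so connections get numbered and a failure can be pinned to a particular one). The connection count, host string, and error handling below are illustrative, not the actual test code:

```js
'use strict';
const http = require('http');

// Illustrative stand-in for test-http-agent's shape, not the real test.
const TOTAL = 100; // an earlier version of the test used ~100 connections
const agent = new http.Agent({ maxSockets: 4 });

const server = http.createServer((req, res) => res.end('ok'));

server.listen(0, () => {
  let completed = 0;
  for (let i = 0; i < TOTAL; i++) {
    http.get({
      host: 'localhost',
      port: server.address().port,
      path: `/${i}`,
      agent,
    }, (res) => {
      res.resume();
      res.on('end', () => {
        if (++completed === TOTAL) server.close();
      });
    }).on('error', (err) => {
      // On the flaky Pi 2 runs, a request somewhere in the low 20s errors out.
      console.error(`request ${i} failed: ${err.message}`);
      process.exitCode = 1;
      if (++completed === TOTAL) server.close();
    });
  }
});
```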
Ugh. Not sure what to do. Bringing the connection count down to 12 or so would probably fix it, but it would seem that in another month we'd likely be facing the same issue again.
cc @nodejs/testing
@Trott Would it be worth stress testing the test with the source code we had in February, just to check whether it's a problem in the code or in the raspberry bots?
@santigimeno I suppose that's worth a shot. And, another one: https://ci.nodejs.org/job/node-test-binary-arm/1728/RUN_SUBSET=2,nodes=pi2-raspbian-wheezy/tapTestReport/test.tap-53/
And another: https://ci.nodejs.org/job/node-test-binary-arm/1723/RUN_SUBSET=2,nodes=pi2-raspbian-wheezy/console I suppose it's superfluous to document them all here as the test is now failing so often.
Per @santigimeno's suggestion, here are three CI stress tests:
And one more, also at @santigimeno's suggestion:
Confounding results: the previous version of the test (which used 100 connections) failed, of course. But there were no failures on any of the other stress tests, including current master. Maybe its flakiness is dependent on other things going on on the network or something? I mean, it shouldn't be, right? Maybe confirm that the test is using localhost and not something odd like the machine's networked IP or (would that one even work?) a broadcast address or something...
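One hypothetical way to check that (not part of the test itself): make a single request to the same host string the test uses and log both ends of the socket, which shows whether 'localhost' is actually landing on 127.0.0.1/::1 or somewhere odd:

```js
'use strict';
const http = require('http');

// Hypothetical debugging snippet: see which address 'localhost' actually
// resolves to and which endpoints the request ends up connected between.
const server = http.createServer((req, res) => res.end('ok'));

server.listen(0, () => {
  http.get({ host: 'localhost', port: server.address().port }, (res) => {
    console.log('connected to', res.socket.remoteAddress,  // e.g. 127.0.0.1 or ::1
                'from', res.socket.localAddress);
    res.resume();
    res.on('end', () => server.close());
  }).on('error', (err) => {
    console.error('request failed:', err.message);
    server.close();
  });
});
```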
Continues to fail with alarming frequency. Running CI stress test against master. https://ci.nodejs.org/job/node-stress-single-test/596/nodes=pi2-raspbian-wheezy/console |
Stress test on master is showing failures, so that's at least what we would expect. Running a stress test on the code base as it existed in February, right after the flakiness for this test was fixed the last time around: https://ci.nodejs.org/job/node-stress-single-test/597/nodes=pi2-raspbian-wheezy/console
In theory at least, if that shows lots of failures, then something is up with the devices/CI. If that stress test does not show failures, then that suggests something changed in the code base.
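For anyone following along without CI access, the stress job is conceptually just "run the one test many times and count failures." A minimal local approximation, assuming a built checkout and the usual test path (both assumptions on my part, not taken from the CI job config):

```js
'use strict';
const { spawnSync } = require('child_process');

// Hypothetical local stand-in for the CI stress job: run the test N times
// with the currently running node binary and tally failures.
const RUNS = 100;
const TEST = 'test/parallel/test-http-agent.js'; // assumed path in a node checkout
let failures = 0;

for (let i = 0; i < RUNS; i++) {
  const result = spawnSync(process.execPath, [TEST], { stdio: 'ignore' });
  if (result.status !== 0) {
    failures++;
    console.log(`run ${i + 1}: FAILED`);
  }
}

console.log(`${failures}/${RUNS} runs failed`);
```

On a machine that doesn't reproduce the Pi 2 conditions this will likely come back clean, which is exactly why the CI job targets the pi2-raspbian-wheezy workers.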
Hmmm... the February run is not failing. Repeating for confirmation:
Master fails and February passes, so we should probably start bisecting.
Bisecting begun. Will log each step here so someone else can pick it up if they want to, as I definitely don't expect to get it done all in one sitting. (Running the CI stress test can take a loooong time depending on whether a full build needs to happen or not...) Current master is bad. bbf4621 is good. Stress test for 08085c4: https://ci.nodejs.org/job/node-stress-single-test/601/nodes=pi2-raspbian-wheezy/console
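In principle this manual loop could also be driven by `git bisect run` with a small helper that rebuilds each candidate revision and repeats the test, exiting 0 for good and 1 for bad. A rough sketch, assuming a node checkout that has already been configured and a machine that can actually reproduce the failure (run count and paths are illustrative):

```js
#!/usr/bin/env node
'use strict';
// Hypothetical helper for `git bisect run`: exit 0 => good, exit 1 => bad.
const { execSync, spawnSync } = require('child_process');

execSync('make -j4', { stdio: 'inherit' }); // rebuild the checked-out revision

const RUNS = 100;
for (let i = 0; i < RUNS; i++) {
  const r = spawnSync('./out/Release/node',
                      ['test/parallel/test-http-agent.js'],
                      { stdio: 'ignore' });
  if (r.status !== 0) process.exit(1); // any failure marks this commit bad
}
process.exit(0); // all runs passed
```

Something like `git bisect start <bad-sha> bbf4621 && git bisect run ./bisect-stress.js` would then walk the range automatically; in practice the CI stress jobs were used instead because the failure only reproduces on the Pi 2 workers.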
08085c4 is good. Stress test for 54a5287: https://ci.nodejs.org/job/node-stress-single-test/604/
08085c4 is good. Now running a stress test on ef6c4c6: https://ci.nodejs.org/job/node-stress-single-test/605/nodes=pi2-raspbian-wheezy/console
ef6c4c6 is good, as is 54a5287. Now stress testing b743d82: https://ci.nodejs.org/job/node-stress-single-test/606/nodes=pi2-raspbian-wheezy/console
b743d82 is bad. Now stress testing ba0b769: https://ci.nodejs.org/job/node-stress-single-test/607/nodes=pi2-raspbian-wheezy/console
ba0b769 is good. Now stress testing ae2be27: https://ci.nodejs.org/job/node-stress-single-test/608/nodes=pi2-raspbian-wheezy/console
ae2be27 is good. Now stress testing 757fbac: https://ci.nodejs.org/job/node-stress-single-test/609/nodes=pi2-raspbian-wheezy/console
757fbac is good. Now stress testing 0a62f92: https://ci.nodejs.org/job/node-stress-single-test/610/nodes=pi2-raspbian-wheezy/console
0a62f92 is bad. Now stress testing 3de9bc9: https://ci.nodejs.org/job/node-stress-single-test/611/nodes=pi2-raspbian-wheezy/console
@Trott: Did a restart of Jenkins in here somewhere; just want to give you a heads up in case the restart killed a job.
b85a50b is good. According to this process, the first bad commit is 3de9bc9. That seems extremely unlikely. (The only things in that commit are a doc update and a readline test.) This raises the frustrating specter that perhaps the CI flakiness itself is unpredictable. Like, maybe one time you run the test 100 times and it comes up fine every time, but you do it again and it fails 50 of those 100 times. Maybe depending on what device runs the test?
To test the "frustrating specter" possibility mentioned above, I'm running four more stress tests against b85a50b (which came up good when I ran just one stress test):
And four more against 3de9bc9 (which came up bad):
Looks like the stress test will sometimes come up clean (100 successful runs, no failures) but fail frequently at other times. It never seems to fail just once or twice: either every run succeeds or a bunch of them fail. Argh. So b85a50b is bad after all. It comes up clean in some stress test runs (see https://ci.nodejs.org/job/node-stress-single-test/612/nodes=pi2-raspbian-wheezy/console and https://ci.nodejs.org/job/node-stress-single-test/615/nodes=pi2-raspbian-wheezy/console) but fails in others (see https://ci.nodejs.org/job/node-stress-single-test/613/nodes=pi2-raspbian-wheezy/console and https://ci.nodejs.org/job/node-stress-single-test/616/nodes=pi2-raspbian-wheezy/console). I'm inclined to go back to the beginning with bbf4621 and confirm that it isn't flaky in the current CI by running the stress test a half dozen times or so.
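This is also a reminder that one clean stress run is weak evidence on its own: even assuming independent runs (which the all-or-nothing clustering above suggests is only roughly true), a commit with a modest per-run failure rate can still produce 100 straight passes fairly often. A quick illustrative calculation with made-up failure rates:

```js
// Probability of a fully clean stress job (zero failures in `runs` attempts)
// if each run independently fails with probability p.
function cleanJobProbability(p, runs = 100) {
  return Math.pow(1 - p, runs);
}

// Made-up failure rates, purely to illustrate the scale.
for (const p of [0.005, 0.01, 0.02, 0.05]) {
  const pct = (cleanJobProbability(p) * 100).toFixed(1);
  console.log(`per-run failure rate ${p}: clean 100-run job ~${pct}% of the time`);
}
```

Hence the repeated stress runs against the same commits below.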
Repeating the last good commit (757fbac) 4 more times and confirming it is good.
Looks like @santigimeno also kicked off two jobs that confirm that 0d41463
This would seem to flag b85a50b as the culprit, which is at least a lot more believable. Still, let's confirm by stress testing the immediate prior commit (757fbac) several times:
Tests confirm that 757fbac is the last good commit and b85a50b is the first bad one. Now the questions are:
Does this only happen on a specific subset of the Pi2s? I'd think that the change in b85a50b would cause something to break completely, not in a flaky manner. That test also doesn't have a great track record. |
FWIW, I ran 4 stress test jobs on current master with the suspect commit reverted, and they all passed:
@cjihrig asked:
Nope, I'm afraid not, at least not as far as I've been able to tell. |
Trying a partial revert of b85a50b. Stress test at https://ci.nodejs.org/job/node-stress-single-test/655/ with the dns.ADDRCONFIG hint restored but dns.V4MAPPED left out.
@cjihrig, it might be worth running multiple smaller (~100 run) jobs so it picks different Pi2s, just to be sure it's fixed, as we have seen runs without failures before and maybe it's somehow dependent on the bot it's running on.
They all came back green! |
b85a50b removed the implicit setting of DNS hints when creating a connection. This caused some of the pi2 machines to become flaky. This commit restores the implicit dns.ADDRCONFIG hint, but not dns.V4MAPPED.

Fixes: #6133
PR-URL: #6281
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Claudio Rodriguez <cjrodr@yahoo.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Saúl Ibarra Corretgé <saghul@gmail.com>
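For context on what the restored hint does: dns.ADDRCONFIG limits lookup results to address families for which the system actually has a non-loopback address configured, which matters when 'localhost' can resolve to both 127.0.0.1 and ::1. A hedged sketch of the difference using the public dns and net APIs (the option names are real Node.js APIs; the scenario is illustrative):

```js
'use strict';
const dns = require('dns');
const net = require('net');

// Without hints, 'localhost' may resolve to an address family the host
// isn't really set up to use (e.g. ::1 on a box with no usable IPv6).
dns.lookup('localhost', { all: true }, (err, addresses) => {
  if (err) throw err;
  console.log('no hints:', addresses);
});

// With ADDRCONFIG, only families with a configured non-loopback address
// are considered; this is the filtering the fix restores for connections.
dns.lookup('localhost', { hints: dns.ADDRCONFIG, all: true }, (err, addresses) => {
  if (err) throw err;
  console.log('ADDRCONFIG:', addresses);
});

// net.connect() forwards a `hints` option to dns.lookup(), so a connection
// can opt into the same filtering explicitly:
const server = net.createServer((socket) => socket.end()).listen(0, () => {
  const socket = net.connect({
    host: 'localhost',
    port: server.address().port,
    hints: dns.ADDRCONFIG,
  }, () => {
    console.log('connected via', socket.remoteAddress);
    socket.end();
    server.close();
  });
});
```

On a host with, say, an IPv6 loopback but no usable IPv6 configuration, dropping ADDRCONFIG can change which address family 'localhost' connections try first, which is a plausible way for a DNS-hint change to show up as intermittent connection failures on some Pi 2 boards and not others.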
Example failure on pi2-raspbian-wheezy #1
Example failure on pi2-raspbian-wheezy #2
Similar issue here as in #5938?
Output: