testJITServer_0 testServerUnreachableForAWhile The process is still alive after waiting for 30000ms #14594
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_s390x_linux_Release_testList_0/62
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.functional_ppc64le_linux_OpenJDK21_testList_0/2
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.functional_ppc64le_linux_Nightly_testList_1/3
https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_aarch64_linux_Nightly_testList_1/691/consoleFull - ub20-aarch64-osu-2
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_1/686 - ub20-aarch64-osu-2
testJITServer_2
https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_aarch64_linux_Nightly_testList_0/3
https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_aarch64_linux_Nightly_testList_1/704
https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_aarch64_linux_Nightly_testList_1/44 - ub20-aarch64-osu-6
These aarch64 Linux OSU machines may be too slow. We used to have other, faster machines, but they are gone now.
@mpirvu if there is nothing wrong, please consider changing the test timeout so we don't keep getting these failures.
@cjjdespres Could you please look into this issue? Thanks
These tests can fail because either the client or the server does not shut down in response to SIGTERM in sufficient time. In these failures, it's always the client failing to shut down at the end of the test, sometimes with the server still active and sometimes with the server already shut down. I've tried reproducing this in grinder on aarch64 and locally on x86 with no failures. If it is just an issue of these particular test machines being slow enough that an orderly shutdown takes more than 30s, then it's not too surprising that I couldn't observe any failures. Maybe it's related to networking? The default socket timeouts are 30s as well, but I wouldn't have thought that would be an issue. We could increase the timeout and see if that makes the issue go away - maybe to a minute to start.
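For context, the failure message amounts to the harness sending SIGTERM and then giving up after 30 seconds. A minimal C-style sketch of that pattern (not the actual test harness code; the function name and polling interval here are made up for illustration):

```cpp
#include <csignal>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: ask `child` to shut down and poll for up to `timeoutSeconds`.
// Returning false corresponds to "The process is still alive after waiting for ...ms".
bool terminateWithTimeout(pid_t child, int timeoutSeconds)
   {
   kill(child, SIGTERM);                          // request an orderly shutdown
   for (int i = 0; i < timeoutSeconds * 10; ++i)
      {
      int status = 0;
      if (waitpid(child, &status, WNOHANG) == child)
         return true;                             // child exited within the window
      usleep(100 * 1000);                         // poll every 100 ms
      }
   return false;                                  // still alive after the timeout
   }
```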
Now that I look, I apparently have access to the public test machines, so I could try running the tests (possibly modified) directly on them and see what I get.
A small update - I couldn't see anything obviously wrong in the logs from some failing runs. The client appears to shut down all the compilation threads and attempt to send the final termination message to the server at the time it ought to. Application activity still goes on for a bit after that - I can see from the log that the class loader table is still having entries added and removed - but that's true of the successful runs as well. The failing ones just appear to go on for longer. Once I finish testing the linked PR and get it merged, this should hopefully go away.

EDIT: I should also mention that I did a bit of testing and the socket timeouts do not seem to be the cause of the delay. The failing run had this:
right after all the compilation threads had shut down and before that final test activity. That means the client finished its attempt to alert the JITServer that it was shutting down almost exactly when it should have, if you add up the wait times in that test. That's the last message that would have been sent to the server.
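For reference, the 30-second socket timeout mentioned earlier is the kind of limit typically configured with `SO_RCVTIMEO`. A generic POSIX sketch of that mechanism (not the actual JITServer communication-stream code; the helper name is made up):

```cpp
#include <sys/socket.h>
#include <sys/time.h>

// Give a blocking socket a receive timeout, e.g. 30 seconds. After this, a recv()
// that sees no data within the window fails with errno set to EAGAIN/EWOULDBLOCK
// instead of blocking indefinitely.
int setRecvTimeout(int sockfd, long seconds)
   {
   struct timeval tv;
   tv.tv_sec = seconds;
   tv.tv_usec = 0;
   return setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
   }
```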
https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_aarch64_linux_Nightly_testList_0/70
https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_aarch64_linux_Nightly_testList_1/78
I've looked at it again, and I think I've found the problem. Here in the communication stream read functions, we ignore EINTR. If we're already using ...
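A minimal sketch of the retry pattern being described, assuming the stream read ultimately comes down to a blocking read() on the connection's file descriptor; the shutdown flag and helper name are hypothetical, not the actual OpenJ9 code:

```cpp
#include <atomic>
#include <cerrno>
#include <unistd.h>

std::atomic<bool> shuttingDown{false};   // hypothetical flag a SIGTERM handler might set

// Read exactly `len` bytes, retrying on EINTR. Retrying unconditionally is what
// "ignoring EINTR" means here; checking a shutdown flag before retrying is one way
// to avoid stalling the shutdown on a slow or dead connection.
ssize_t readFully(int fd, char *buf, size_t len)
   {
   size_t total = 0;
   while (total < len)
      {
      ssize_t n = read(fd, buf + total, len - total);
      if (n < 0 && errno == EINTR)
         {
         if (shuttingDown.load())
            return -1;                   // give up promptly once shutdown has started
         continue;                       // otherwise retry the interrupted read
         }
      if (n <= 0)
         return -1;                      // real error or EOF
      total += n;
      }
   return (ssize_t)total;
   }
```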
The aarch64 machines are fast, but we do tend to have more network issues with them.
I see. I've been trying to test whether my solution in #14594 (comment) works, and I'm not entirely sure that what I outlined is actually a problem. It could have been an artifact of how I was running my simpler test in GDB, among other things. In particular, I don't think reads or writes will result in EINTR indefinitely after the SIGTERM. In this test, the SIGTERM should in fact be delivered to the main thread alone, which will determine that it's a termination signal and spawn the SIGTERM handler thread. That handler thread will then coordinate the JVM shutdown. The network calls in the compilation threads in this test don't appear to notice that a SIGTERM was issued at all.

I've done more testing, and with a very slow (but not too slow) network I can get the shutdown process to take over 60s, sometimes close to 90s. That's because the compilation threads need to progress until they reach a point where they notice that they need to abort the compilation, and that might involve a couple of slow network operations.

That being said, the logs of these latest failures do not match what I get locally. There aren't any "Stopping compilation thread..." messages in the logs, and at least some of the compilation threads are not compiling anything, which should make them easy to stop. So, I modified the test to try to generate a java core if the test timed out, and I got three failures with the modified test. Two of those did not result in a core - indeed, there is no indication that the client reacted to the signal at all - and one did. It appears from the log that the client was paused the entire time the test was waiting for it to shut down (or at least it was not producing the verbose log messages that it probably should have been producing). The logging only started again when it received the SIGQUIT. I'll have to look into it more, because these recent failures don't seem to be due to networking.
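As a generic illustration of the delivery behaviour described above (not OpenJ9's actual signal-handling code): if worker threads are created with SIGTERM blocked, the kernel delivers a process-directed SIGTERM to the main thread, which can then hand off to a dedicated handler thread.

```cpp
#include <csignal>
#include <pthread.h>

// Placeholder worker body; with SIGTERM blocked (mask inherited from the creator),
// its blocking network calls won't be interrupted by the termination signal.
static void *workerEntry(void *arg)
   {
   return NULL;
   }

// Spawn a worker thread that never receives SIGTERM directly.
void spawnWorkerWithSigtermBlocked(pthread_t *tid)
   {
   sigset_t blocked, saved;
   sigemptyset(&blocked);
   sigaddset(&blocked, SIGTERM);
   pthread_sigmask(SIG_BLOCK, &blocked, &saved);   // block SIGTERM in the creating thread...
   pthread_create(tid, NULL, workerEntry, NULL);   // ...so the new thread inherits that mask
   pthread_sigmask(SIG_SETMASK, &saved, NULL);     // restore the creator's original mask
   }
```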
I think I've narrowed down where the problem is happening, though I don't know exactly what's going wrong. Notably, I've been able to reproduce this without JITServer enabled at all in this test, which I ran with some extra options.

These tests work by having the client run its workload in a loop. Because there is so much class loader/class compilation activity (possibly also aided by the fact that a known deficiency in the persistent class loader table causes a lot of AOT load failures in these tests), a global GC cycle will often kick off right at the start of one of these loops. It happens here in the verbose logs:
Under normal circumstances, after unloading ~48 class loaders and their classes, there will then be a compilation request. The end of the GC log is:
This line was logged in the console when the test harness tried to stop the client:
In the class loader unloading phase, I think the only active thread will be the main thread. I'm not sure what's going wrong at this point. It doesn't look like the main thread finishes GC completely, and no more compilation activity is logged. It's odd that there is a delay of ~12 seconds between the last logged GC cycle and the "Stopping client" message, but in the last GC cycle that unloaded these test loaders, I see:
so there can be a delay of about that long between the start of the previous GC operation and the end of the unload operation.
Someone with more familiarity with these GC cycles might be better suited to look into this.
@dmitripivkine please take a look.
Can ~12s of inactivity itself cause client termination? There is an inefficient implementation behaviour, known for years, in how the GC handles class unloading. If such an extended delay can cause this failure, you can try to force class unloading to be executed more often, to keep the number of loaded classes lower. @amicic FYI
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_x86-64_linux_aot_Personal/40 - cent7-x64-5
testJITServer_0
-Xshareclasses:name=test_aot -Xscmx400M -Xscmaxaot256m -Xcompressedrefs -Xjit -Xgcpolicy:gencon
@mpirvu