testJITServer_0 testServerUnreachableForAWhile The process is still alive after waiting for 30000ms #14594

Open
pshipton opened this issue Feb 24, 2022 · 24 comments
Labels
comp:jitserver Artifacts related to JIT-as-a-Service project test failure

Comments

@pshipton
Member

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_x86-64_linux_aot_Personal/40 - cent7-x64-5
testJITServer_0 -Xshareclasses:name=test_aot -Xscmx400M -Xscmaxaot256m -Xcompressedrefs -Xjit -Xgcpolicy:gencon

FAILED: testServerUnreachableForAWhile
java.lang.AssertionError: The process is still alive after waiting for 30000ms.
	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:236)
	at jit.test.jitserver.JITServerTest.testServerUnreachableForAWhile(JITServerTest.java:434)

@mpirvu

@pshipton
Member Author

pshipton commented May 1, 2023

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_s390x_linux_Release_testList_0/62
testJITServer_0

FAILED: testServerUnreachableForAWhile
java.lang.AssertionError: The process is still alive after waiting for 30000ms.
	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:346)
	at jit.test.jitserver.JITServerTest.testServerUnreachableForAWhile(JITServerTest.java:544)

@pshipton
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.functional_ppc64le_linux_OpenJDK21_testList_0/2
testJITServer_1

FAILED: testServerComesUpAfterClientAndGoesDownAgain
java.lang.AssertionError: The process is still alive after waiting for 30000ms.
	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:346)
	at jit.test.jitserver.JITServerTest.testServerComesUpAfterClientAndGoesDownAgain(JITServerTest.java:601)

@mpirvu mpirvu added the comp:jitserver Artifacts related to JIT-as-a-Service project label Jun 30, 2023
@pshipton
Member Author

pshipton commented Jul 6, 2023

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.functional_ppc64le_linux_Nightly_testList_1/3
testJITServer_1

FAILED: testServerComesUpAfterClient
java.lang.AssertionError: The process is still alive after waiting for 30000ms.
	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:346)
	at jit.test.jitserver.JITServerTest.testServerComesUpAfterClient(JITServerTest.java:568)

@pshipton
Member Author

pshipton commented Jun 10, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_aarch64_linux_Nightly_testList_1/691/consoleFull - ub20-aarch64-osu-2
testJITServer_0

22:36:49  FAILED: testServerUnreachableForAWhile
22:36:49  java.lang.AssertionError: The process is still alive after waiting for 30000ms.
22:36:49  	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
22:36:49  	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:347)
22:36:49  	at jit.test.jitserver.JITServerTest.testServerUnreachableForAWhile(JITServerTest.java:545)

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_1/686 - ub20-aarch64-osu-2
testJITServer_0

00:28:57  FAILED: testServerComesUpAfterClientAndGoesDownAgain
00:28:57  java.lang.AssertionError: The process is still alive after waiting for 30000ms.
00:28:57  	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
00:28:57  	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:347)
00:28:57  	at jit.test.jitserver.JITServerTest.testServerComesUpAfterClientAndGoesDownAgain(JITServerTest.java:602)

testJITServer_2

00:43:49  FAILED: testServerComesUpAfterClient
00:43:49  java.lang.AssertionError: The process is still alive after waiting for 30000ms.
00:43:49  	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
00:43:49  	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:347)
00:43:49  	at jit.test.jitserver.JITServerTest.testServerComesUpAfterClient(JITServerTest.java:569)

@pshipton
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_aarch64_linux_Nightly_testList_0/3
testJITServer_1

22:53:25  FAILED: testServerGoesDownAnotherComesUp
22:53:25  java.lang.AssertionError: The process is still alive after waiting for 30000ms.
22:53:25  	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
22:53:25  	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:347)
22:53:25  	at jit.test.jitserver.JITServerTest.testServerGoesDownAnotherComesUp(JITServerTest.java:501)

@pshipton
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_aarch64_linux_Nightly_testList_1/704
testJITServer_2

21:59:19  FAILED: testServerComesUpAfterClient
21:59:19  java.lang.AssertionError: The process is still alive after waiting for 30000ms.
21:59:19  	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
21:59:19  	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:347)
21:59:19  	at jit.test.jitserver.JITServerTest.testServerComesUpAfterClient(JITServerTest.java:569)

@pshipton
Member Author

@mpirvu if there is nothing wrong, pls consider changing the test timeout so we don't keep getting these failures.

@mpirvu
Contributor

mpirvu commented Aug 19, 2024

@cjjdespres Could you please look into this issue? Thanks

@cjjdespres
Contributor

These tests can fail because either the client or the server does not shut down in response to SIGTERM in sufficient time. In these failures, it's always the client failing to shut down at the end of the test, sometimes with the server still active and sometimes with the server already shut down. I've tried reproducing this in grinder on aarch64 and locally on x86 with no failures.

If it is just an issue of these particular test machines being slow enough that an orderly shutdown takes more than 30s, then it's not too surprising that I couldn't observe any failures. Maybe it's related to networking? The default socket timeouts are 30s as well, but I wouldn't have thought that would be an issue.

We could increase the timeout and see if that makes the issue go away; raising the wait time to a minute might be a reasonable start.
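For reference, the 30-second figure mentioned above is the kind of receive timeout typically configured on a blocking socket, and a thread sitting in a blocking read can stall for up to that long before it returns. A minimal POSIX sketch of such a configuration follows (the helper function is hypothetical and only illustrates the mechanism, it is not the JITServer stream code):

#include <sys/socket.h>
#include <sys/time.h>

// Sketch: give a connected socket a 30-second receive timeout. A thread
// blocked in recv()/read() on this socket can take up to this long to return,
// which is the same order of magnitude as the test's 30s shutdown wait.
bool setReceiveTimeout(int fd, long seconds)
   {
   struct timeval tv;
   tv.tv_sec = seconds;
   tv.tv_usec = 0;
   return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
   }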

@cjjdespres
Contributor

Now that I look, I apparently have access to the public test machines, so I could try running the tests (possibly modified) directly on them and see what I get.

@cjjdespres
Contributor

cjjdespres commented Aug 30, 2024

A small update - I couldn't see anything obviously wrong in the logs from some failing runs. The client appears to shut down all the compilation threads and attempt to send the final termination message to the server at the time it ought to. Application activity still goes on for a bit after that - I can see from the log that the class loader table is still having entries added and removed - but that's true of the successful runs as well. The failing ones just appear to go on for longer. Once I finish testing the linked PR and get it merged, this should hopefully go away.

EDIT: I should also mention that I did a bit of testing and the socket timeouts do not seem to be the cause of the delay. The failing run had this:

#JITServer: t=144070 Resetting activation policy to AGGRESSIVE because client has lost connection to server
#JITServer: JITServer StreamFailure (server unreachable before the termination message was sent): Connect failed: Connection refused

right after all the compilation threads had shut down and before that final test activity. That means the client finished its attempt to alert the JITServer that it was shutting down almost exactly when it should have, if you add up the wait times in that test. That's the last message that would have been sent to the server.

@pshipton
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_aarch64_linux_Nightly_testList_0/70
testJITServer_2

02:30:00  FAILED: testServerUnreachableForAWhile
02:30:00  java.lang.AssertionError: The process is still alive after waiting for 60000ms.
02:30:00  	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
02:30:00  	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:347)
02:30:00  	at jit.test.jitserver.JITServerTest.testServerUnreachableForAWhile(JITServerTest.java:545)

@pshipton
Member Author

pshipton commented Oct 4, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk23_j9_sanity.functional_aarch64_linux_Nightly_testList_1/78
testJITServer_0

01:00:50  FAILED: testServerComesUpAfterClientAndGoesDownAgain
01:00:50  java.lang.AssertionError: The process is still alive after waiting for 60000ms.
01:00:50  	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
01:00:50  	at jit.test.jitserver.JITServerTest.destroyAndCheckProcess(JITServerTest.java:347)
01:00:50  	at jit.test.jitserver.JITServerTest.testServerComesUpAfterClientAndGoesDownAgain(JITServerTest.java:602)

@cjjdespres
Contributor

I've looked at it again, and I think I've found the problem. Here

https://github.com/eclipse-openj9/openj9/blob/master/runtime/compiler/net/CommunicationStream.hpp

in the communication stream read functions, we ignore EINTR errors and continue trying to read from the file descriptor. I believe this was done to fix #15394. That means that if the Java process receives a SIGTERM in a small window before or during one of those read calls on a particular compilation thread, the read will fail with EINTR and, I believe, the compilation thread will block the JVM from exiting. I suppose these aarch64 machines are slow enough that this window can be hit somewhat regularly.

If we're already using _numConsecutiveReadErrorsOfSameType in readOnceBlocking, we could make sure that EINTR is registered as a read error and check that _numConsecutiveReadErrorsOfSameType isn't too large before retrying the read.
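To illustrate the suggestion, here is a minimal sketch of the proposed behaviour. It is not the actual CommunicationStream code: the helper name readWithBoundedRetry and the bound kMaxConsecutiveEintr are made up, standing in for the existing _numConsecutiveReadErrorsOfSameType accounting.

#include <cerrno>
#include <cstddef>
#include <unistd.h>
#include <stdexcept>

// Sketch of the suggested fix: retry a read that was interrupted by a signal
// (EINTR), but give up once too many consecutive interruptions have been seen,
// so a compilation thread cannot spin here indefinitely during JVM shutdown.
static const int kMaxConsecutiveEintr = 10; // hypothetical bound

ssize_t readWithBoundedRetry(int fd, void *buf, size_t len)
   {
   int consecutiveEintr = 0;
   for (;;)
      {
      ssize_t n = read(fd, buf, len);
      if (n >= 0)
         return n; // success, or orderly EOF when n == 0
      if (errno == EINTR && ++consecutiveEintr < kMaxConsecutiveEintr)
         continue; // interrupted by a signal: retry, but only a bounded number of times
      throw std::runtime_error("stream read failed"); // surface the error to the caller
      }
   }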

@pshipton
Member Author

The aarch64 machines are fast, but we do tend to have more network issues with them.

@cjjdespres
Contributor

The aarch64 machines are fast, but we do tend to have more network issues with them.

I see. I've been trying to test if my solution in #14594 (comment) works, and I'm not entirely sure if what I outlined is actually a problem. It could have been an artifact of how I was running my simpler test in GDB, among other things. In particular, I don't think reads or writes will result in EINTR indefinitely after the SIGTERM.

In this test, the SIGTERM should in fact be delivered to the main thread alone, which will determine that it's a termination signal and spawn the SIGTERM handler thread. That handler thread will then coordinate the JVM shutdown. The network calls in the compilation threads in this test don't appear to notice that a SIGTERM was issued at all.

I've done more testing, and with a very slow (but not too slow) network I can get the shutdown process to take over 60s, sometimes close to 90s. That's because the compilation threads need to progress until they reach a point where they notice that they need to abort the compilation, and that might involve a couple of slow read() calls. The shutdown routine also has to try to connect() to the server to send the termination message. The logs from my local tests seem to match those of the earlier failures, so I think that slow networking might have been the cause of them.

That being said, the logs of these latest failures do not match what I get locally. There aren't any "Stopping compilation thread..." messages in the logs, and at least some of the compilation threads are not compiling anything, which should make them easy to stop.

So, I modified the test to try to generate a java core if the test timed out, and I got three failures with the modified test. Two of those did not result in a core - indeed, there is no indication that the client reacted to the signal at all - and one did. It appears from the log that the client was paused the entire time the test was waiting for the client to shut down (or at least not producing verbose log messages that it probably should have been producing). The logging only started again when it received the SIGQUIT.

I'll have to look into it more, because these recent failures don't seem to be due to networking.
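As background for the shutdown sequence described above, the general pattern of delivering the signal to one thread and deferring the actual shutdown coordination to a dedicated handler thread looks roughly like the sketch below. This is a simplified, generic sigwait-based illustration only; OpenJ9's actual mechanism installs an asynchronous handler on the main thread that then spawns the SIGTERM handler thread.

#include <signal.h>
#include <pthread.h>
#include <stdio.h>

// Generic sketch: SIGTERM is blocked in every thread, one dedicated thread
// waits for it with sigwait(), and only that thread runs the (potentially
// slow) shutdown coordination while the rest of the process keeps working.
static sigset_t termSet;

static void *shutdownCoordinator(void *)
   {
   int sig = 0;
   sigwait(&termSet, &sig); // park until SIGTERM arrives
   fprintf(stderr, "received signal %d, coordinating shutdown\n", sig);
   // ... stop worker threads, send any termination messages, exit ...
   return NULL;
   }

int main()
   {
   sigemptyset(&termSet);
   sigaddset(&termSet, SIGTERM);
   pthread_sigmask(SIG_BLOCK, &termSet, NULL); // inherited by threads created later

   pthread_t handler;
   pthread_create(&handler, NULL, shutdownCoordinator, NULL);
   // ... application / test workload would run here ...
   pthread_join(handler, NULL);
   return 0;
   }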

@cjjdespres
Contributor

I think I've narrowed down where the problem is happening, though I don't know exactly what's going wrong. Notably, I've been able to reproduce this without JITServer enabled at all in this test, which I ran with the extra options -Xjit:verbose={compilePerformance|compileRequest|hookDetailsClassUnloading} -Xaot:forceAOT -XX:-UseJITServer -Xshareclasses:name=fresh_scc -verbose:gc. (A JITServer instance is started in the tests, but is ignored by the client.) JITServer's role in these failures seems mainly to be that it fairly reliably sets up the failing scenario, in which a global garbage collection cycle coincides with the SIGTERM sent to the client. Some of the failures look different from what I'll describe, but that might just be variation in how the problem manifests.

These tests work by having the client run JarTester with -loopforever on jvmtest/functional/JIT_Test/jitt.jar. The jar tester will load each class in jitt.jar, and run Java_java_lang_Compiler_compileClassImpl on each class. The clients in these JITServer tests often run for a while (around 1.5 minutes in some cases), so this loop can be run a few times. Some of these classes have a lot of methods - the first class in jitt.jar seems to be jit/test/tr/decomposition/IntegerMultiplyDecomposer, and that has methods test00001 through test16340.

Because there is so much class loader/class compilation activity (possibly also aided by the fact that a known deficiency in the persistent class loader table causes a lot of AOT load failures in these tests), a global GC cycle will often kick off right at the start of one of these loops. It happens here in the verbose logs:

#CR:  000000000000D800   Compile request OrdinaryMethod j9method=0000000000598C38 jit/test/recognizedMethod/TestJavaLangStringLatin1.test_java_lang_StringLatin1_inflate()V optLevel=2
#CR:  000000000000D800   Added entry 0000FFFF9414C130 of weight 1 to comp queue. Now Q_SZ=1 weight=1
 (warm) Compiling jit/test/recognizedMethod/TestJavaLangStringLatin1.test_java_lang_StringLatin1_inflate()V  OrdinaryMethod j9m=0000000000598C38  t=141427 compThreadID=1 memLimit=262144 KB freePhysicalMemory=7122 MB
+ (AOT load) jit/test/recognizedMethod/TestJavaLangStringLatin1.test_java_lang_StringLatin1_inflate()V @ 0000FFFF864ADC90-0000FFFF864B4300 Q_SZ=0 Q_SZI=0 QW=1 j9m=0000000000598C38 bcsz=7807 time=256us compThreadID=1 queueTime=313us
#HD:  Class unloading for class=0x00000000004B0300
...
#HD:  Class unloading for classLoader=0x0000FFFF940A5FB8
...

Under normal circumstances, after unloading ~48 class loaders and their classes, there will then be a compilation request for jit/test/tr/decomposition/IntegerMultiplyDecomposer.<init>()V, which should mean that main has entered the compileClassImpl loop. It looks like the hang occurs between the time the class loaders are being removed (in MM_GlobalCollectorDelegate::unloadDeadClassLoaders) and the end of the GC cycle. In a failure, different things can happen, but in the failure I got without JITServer the log stops after 20 "Class unloading" lines.

The end of the GC log is:

<gc-start id="494" type="global" contextid="493" timestamp="2024-11-03T13:05:24.369">
  <mem-info id="495" free="3037928" total="8519680" percent="35">
    <mem type="nursery" free="1088584" total="2228224" percent="48">
      <mem type="allocate" free="1088584" total="1441792" percent="75" />
      <mem type="survivor" free="0" total="786432" percent="0" />
    </mem>
    <mem type="tenure" free="1949344" total="6291456" percent="30">
      <mem type="soa" free="1319584" total="5661696" percent="23" />
      <mem type="loa" free="629760" total="629760" percent="100" />
    </mem>
    <pending-finalizers system="104" default="0" reference="55" classloader="0" />
    <remembered-set count="280" />
  </mem-info>
</gc-start>
<allocation-stats totalBytes="0" discardedBytes="0" >
  <allocated-bytes non-tlh="0" tlh="0" />
</allocation-stats>
<heap-resize id="496" type="loa expand" space="tenure" amount="62464" count="1" timems="0.000" reason="expand on failed allocate" timestamp="2024-11-03T13:05:24.369" />
<gc-op id="497" type="mark" timems="2.748" contextid="493" timestamp="2024-11-03T13:05:24.372">
  <trace-info objectcount="25535" scancount="19795" scanbytes="683276" />
  <finalization candidates="83" enqueued="1" />
  <ownableSynchronizers candidates="3" cleared="0" />
  <references type="soft" candidates="421" cleared="0" enqueued="0" dynamicThreshold="15" maxThreshold="32" />
  <references type="weak" candidates="62" cleared="0" enqueued="0" />
  <references type="phantom" candidates="8" cleared="0" enqueued="0" />
  <stringconstants candidates="2334" cleared="0"  />
</gc-op>

This line was logged in the console when the test harness tried to stop the client:

[2024-11-03T13:05:36.454Z] [JITServerTest] [INFO] Stopping client...

In the class loader unloading phase, I think the only active thread will be main, which will be running the garbage collection with exclusive VM access. The other threads should be waiting on VM access or on other monitors for work to do. When the SIGTERM comes in, the installed handler should try to spawn the SIGTERM handler via the signal handler thread. That spawning does require VM access, so there can be a delay during which nothing else has been notified of a shutdown.

I'm not sure what's going wrong at this point. It doesn't look like the main thread finishes GC completely, and no more compilation activity is logged. It's odd that there is a delay of ~12 seconds between the last logged GC cycle and the "Stopping client" message, but in the last GC cycle that unloaded these test loaders, I see:

<gc-op id="426" type="mark" timems="2.803" contextid="422" timestamp="2024-11-03T13:04:27.493">
  <trace-info objectcount="25397" scancount="19714" scanbytes="679292" />
  <finalization candidates="57" enqueued="3" />
  <ownableSynchronizers candidates="3" cleared="0" />
  <references type="soft" candidates="421" cleared="0" enqueued="0" dynamicThreshold="15" maxThreshold="32" />
  <references type="weak" candidates="62" cleared="0" enqueued="0" />
  <references type="phantom" candidates="8" cleared="0" enqueued="0" />
  <stringconstants candidates="2334" cleared="0"  />
</gc-op>
<gc-op id="427" type="classunload" timems="10833.131" contextid="422" timestamp="2024-11-03T13:04:38.326">
  <classunload-info classloadercandidates="99" classloadersunloaded="90" classesunloaded="1178" anonymousclassesunloaded="0" quiescems="0.000" setupms="67.528" scanms="10765.473" postms="0.129" />
</gc-op>

so there can be a delay of about that long between the start of the previous gc operation and the end of the unload operation.

@cjjdespres
Contributor

Someone with more familiarity with these GC cycles might be better suited to look into this.

@pshipton
Member Author

pshipton commented Nov 4, 2024

@dmitripivkine pls take a look.

@dmitripivkine
Contributor

Can ~12s of inactivity itself cause client termination?

There is a known, long-standing inefficiency in the implementation: during class unloading, the GC triggers J9HOOK_VM_CLASS_UNLOAD for each class being unloaded, and the subscribed JIT-side code walks the JIT tables from the beginning for each one. This creates N**2 behaviour. If the number of loaded classes is large (e.g. tens of thousands), class unloading (which runs single-threaded under STW exclusive access) can take tens of seconds.
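Schematically, the quadratic pattern described here looks like the following sketch. It is illustrative only; the types and function names are made up and this is not the actual J9HOOK_VM_CLASS_UNLOAD subscriber code.

#include <vector>
#include <cstddef>

// Illustration of the N**2 behaviour: the GC reports one unload event per
// dead class, and the subscriber rescans the whole JIT-side table from the
// beginning on every event, so unloading N classes costs roughly N * N work.
struct TableEntry { void *clazz; bool stale; };

static void onClassUnloadEvent(std::vector<TableEntry> &jitTable, void *unloadedClass)
   {
   for (std::size_t i = 0; i < jitTable.size(); ++i) // full table walk per event
      {
      if (jitTable[i].clazz == unloadedClass)
         jitTable[i].stale = true;
      }
   }

static void unloadDeadClasses(std::vector<TableEntry> &jitTable,
                              const std::vector<void *> &deadClasses)
   {
   for (std::size_t i = 0; i < deadClasses.size(); ++i) // one event per dead class
      onClassUnloadEvent(jitTable, deadClasses[i]);     // => N full table scans
   }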

If such an extended delay can cause this failure, you can try forcing class unloading to run more often, to keep the number of loaded classes lower, using -Xgc:classUnloadingThreshold=100 (for example). Unfortunately this option triggers a concurrent kickoff event rather than an STW GC directly, so there can be a significant delay between the concurrent kickoff and the STW GC. If the application loads classes aggressively, it is still possible to end up with a large number of classes to unload even with a low threshold.
As an experiment you could also try -Xgcpolicy:optthruput or, alternatively, disable concurrent mark in gencon using -Xgc:noConcurrentMark.

@amicic FYI
