Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK19 java/lang/Thread/virtual/stress/TimedGet.java timed out #17163

Closed
JasonFengJ9 opened this issue Apr 12, 2023 · 12 comments · Fixed by #17190
Closed

JDK19 java/lang/Thread/virtual/stress/TimedGet.java timed out #17163

JasonFengJ9 opened this issue Apr 12, 2023 · 12 comments · Fixed by #17190

Comments

@JasonFengJ9
Copy link
Member

JasonFengJ9 commented Apr 12, 2023

Failure link

From an internal build(paix822):

openjdk version "19.0.2" 2023-01-17
IBM Semeru Runtime Open Edition 19.0.2.0-rc1 (build 19.0.2+7)
Eclipse OpenJ9 VM 19.0.2.0-rc1 (build v0.37.0-release-5db9c94a0, JRE 19 AIX ppc64-64-Bit Compressed References 20230117_86 (JIT enabled, AOT enabled)
OpenJ9   - 5db9c94a0
OMR      - d0b42675f
JCL      - a95c00a432d based on jdk-19.0.2+7)

Rerun in Grinder - Change TARGET to run only the failed test targets.

Optional info

Failure output (captured from console output)

[2023-04-11T21:01:59.434Z] variation: -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage Mode650
[2023-04-11T21:01:59.434Z] JVM_OPTIONS:  -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage -XX:-UseCompressedOops 

[2023-04-11T22:09:20.038Z] TEST: java/lang/Thread/virtual/stress/TimedGet.java

[2023-04-11T22:09:20.038Z] Note: /home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_testList_1/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/lang/Thread/virtual/stress/TimedGet.java uses preview features of Java SE 19.
[2023-04-11T22:09:20.038Z] Note: Recompile with -Xlint:preview for details.
[2023-04-11T22:09:20.038Z] 
[2023-04-11T22:09:20.038Z] ACTION: build -- Passed. All files up to date
[2023-04-11T22:09:20.038Z] REASON: Named class compiled on demand
[2023-04-11T22:09:20.038Z] TIME:   0.0 seconds
[2023-04-11T22:09:20.038Z] messages:
[2023-04-11T22:09:20.038Z] command: build TimedGet
[2023-04-11T22:09:20.038Z] reason: Named class compiled on demand
[2023-04-11T22:09:20.038Z] elapsed time (seconds): 0.0
[2023-04-11T22:09:20.038Z] 
[2023-04-11T22:09:20.038Z] ACTION: main -- Error. Program `/home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_testList_1/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1337250ms).
[2023-04-11T22:09:20.038Z] REASON: User specified action: run main/othervm -Xmx1g --enable-preview TimedGet 
[2023-04-11T22:09:20.038Z] TIME:   1337.273 seconds
[2023-04-11T22:09:20.038Z] messages:
[2023-04-11T22:09:20.038Z] command: main -Xmx1g --enable-preview TimedGet
[2023-04-11T22:09:20.038Z] reason: User specified action: run main/othervm -Xmx1g --enable-preview TimedGet 
[2023-04-11T22:09:20.038Z] Mode: othervm [/othervm specified]
[2023-04-11T22:09:20.038Z] Timeout information:
[2023-04-11T22:09:20.038Z] Running jcmd on process 10224058
[2023-04-11T22:09:20.038Z] Timeout refired 300 times
[2023-04-11T22:09:20.038Z] Dump written to /home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_testList_1/aqa-tests/TKG/output_16812469189102/jdk_lang_1/work/scratch/core.20230411.180648.10224058.0001.dmp
[2023-04-11T22:09:20.038Z] Timeout handler interrupted: 
[2023-04-11T22:09:20.038Z] java.lang.InterruptedException
[2023-04-11T22:09:20.038Z] 	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1611)
[2023-04-11T22:09:20.038Z] 	at java.base/java.lang.ProcessImpl.waitFor(ProcessImpl.java:447)
[2023-04-11T22:09:20.038Z] 	at jtreg.openj9.CoreDumpTimeoutHandler.runJcmdCommand(CoreDumpTimeoutHandler.java:117)
[2023-04-11T22:09:20.038Z] 	at jtreg.openj9.CoreDumpTimeoutHandler.runJcmd(CoreDumpTimeoutHandler.java:137)
[2023-04-11T22:09:20.038Z] 	at jtreg.openj9.CoreDumpTimeoutHandler.runActions(CoreDumpTimeoutHandler.java:55)
[2023-04-11T22:09:20.038Z] 	at com.sun.javatest.regtest.TimeoutHandler.handleTimeout(TimeoutHandler.java:120)
[2023-04-11T22:09:20.038Z] 	at com.sun.javatest.regtest.exec.ProcessCommand$2.run(ProcessCommand.java:317)
[2023-04-11T22:09:20.038Z] --- Timeout information end.
[2023-04-11T22:09:20.038Z] elapsed time (seconds): 1337.273
[2023-04-11T22:09:20.038Z] configuration:
[2023-04-11T22:09:20.038Z] STDOUT:
[2023-04-11T22:09:20.038Z] Timeout refired 960 times
[2023-04-11T22:09:20.038Z] STDERR:
[2023-04-11T22:09:20.066Z] JVMDUMP034I User requested System dump using '/home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_testList_1/aqa-tests/TKG/output_16812469189102/jdk_lang_1/work/scratch/core.20230411.180648.10224058.0001.dmp' through com.ibm.jvm.Dump.systemDumpToFile
[2023-04-11T22:09:20.066Z] JVMDUMP010I System dump written to /home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_testList_1/aqa-tests/TKG/output_16812469189102/jdk_lang_1/work/scratch/core.20230411.180648.10224058.0001.dmp

[2023-04-11T22:09:20.067Z] TEST RESULT: Error. Program `/home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_testList_1/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1337250ms).
[2023-04-11T22:09:20.067Z] --------------------------------------------------
[2023-04-11T22:16:00.765Z] Test results: passed: 853; error: 1
[2023-04-11T22:16:09.820Z] Report written to /home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_testList_1/aqa-tests/TKG/output_16812469189102/jdk_lang_1/report/html/report.html
[2023-04-11T22:16:09.820Z] Results written to /home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_testList_1/aqa-tests/TKG/output_16812469189102/jdk_lang_1/work
[2023-04-11T22:16:09.820Z] Error: Some tests failed or other problems occurred.
[2023-04-11T22:16:09.820Z] -----------------------------------
[2023-04-11T22:16:09.820Z] jdk_lang_1_FAILED

50x internal grinder - 5 failures

@pshipton
Copy link
Member

@babsingh @fengxue-IS pls take a look.

@babsingh
Copy link
Contributor

TimedGet is a stress/perf test. It's a timeout. Most likely due to poor perf during sub-4G allocation.

Grinder w/ -Xnocompressedrefs

Potential duplicate of #15184 if no timeouts are seen in the above grinder.

@pshipton
Copy link
Member

It's not sub-4G allocation since it's failing in jdk_lang_1 mode.
-Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage -XX:-UseCompressedOops

The test is also running with -Xmx1g.

@pshipton
Copy link
Member

pshipton commented Apr 13, 2023

Results from Jason's grinder.

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_3/1831
TimedGet - Invalid JIT return address
and
18:54:01 22:52:19.743 0x100238df600 j9vm.224 * ** ASSERTION FAILED ** at /home/jenkins/workspace/build-scripts/jobs/jdk19u/jdk19u-aix-ppc64-openj9/workspace/build/src/openj9/runtime/vm/ContinuationHelpers.cpp:276: ((!VM_VMHelpers::isConcurrentlyScanned(continuationState) && ((0) == VM_VMHelpers::getCarrierThread(continuationState))))

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_4/1732/
TimedGet - timed out but also passed?
Skynet -
18:30:36 22:30:08.815 0x10023d6fb00 j9vm.224 * ** ASSERTION FAILED ** at /home/jenkins/workspace/build-scripts/jobs/jdk19u/jdk19u-aix-ppc64-openj9/workspace/build/src/openj9/runtime/vm/ContinuationHelpers.cpp:276: ((!VM_VMHelpers::isConcurrentlyScanned(continuationState) && ((0) == VM_VMHelpers::getCarrierThread(continuationState))))

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_0/2414
Skynet - 2 x Invalid JIT return address

@pshipton
Copy link
Member

Results from Babneet's grinder.

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_0/2418/
TimedGet - timed out but also passed?

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_4/1736/
TimedGet - 2 x timed out but also passed?

https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_1/2247/
TimedGet - 2 x timed out but also passed?

@fengxue-IS
Copy link
Contributor

fengxue-IS commented Apr 13, 2023

Look at the corefile from Babneet's grinder, the test seem to be stuck waiting on the last 2 vthread to complete.

I've traced to the continuation object which was still alive at time of core generation.

> !fj9object 0xa00000008b47b28
!J9Object 0x0A00000008B47B28 {
	struct J9Class* clazz = !j9class 0x10023352B00 // java/lang/VirtualThread$VThreadContinuation
	Object flags = 0x00000000;
	J lockword = 0x0000000000000008 (offset = 0) (java/lang/Object) <hidden>
	J vmRef = 0x000001003A753630 (offset = 8) (jdk/internal/vm/Continuation)
	Ljava/lang/Thread; vthread = !fj9object 0xa00000008b47a28 (offset = 32) (jdk/internal/vm/Continuation)
	Ljdk/internal/vm/ContinuationScope; scope = !fj9object 0xa0000000018e9e0 (offset = 40) (jdk/internal/vm/Continuation)
	Ljava/lang/Runnable; runnable = !fj9object 0xa00000008b47bd8 (offset = 48) (jdk/internal/vm/Continuation)
	Ljdk/internal/vm/Continuation; parent = !fj9object 0x0 (offset = 56) (jdk/internal/vm/Continuation)
	Z started = 0x00000001 (offset = 64) (jdk/internal/vm/Continuation)
	Z finished = 0x00000000 (offset = 68) (jdk/internal/vm/Continuation)
	J state = 0x000001002351E105 (offset = 16) (jdk/internal/vm/Continuation)
	Z isAccessible = 0x00000000 (offset = 72) (jdk/internal/vm/Continuation)
	J continuationLink = 0x0A00000008BA1870 (offset = 24) (jdk/internal/vm/Continuation) <hidden>

The state of the continuation looks to have J9_GC_CONTINUATION_STATE_PENDING_TO_BE_MOUNTED set.

@LinHu2016 I looked through the code where this flag is set and one question that I have is
in the wait code https://github.com/eclipse-openj9/openj9/blob/master/runtime/vm/ContinuationHelpers.cpp#L118-L121

		while (VM_VMHelpers::isConcurrentlyScanned(returnContinuationState)) {
			PUSH_OBJECT_IN_SPECIAL_FRAME(currentThread, continuationObject);
			internalReleaseVMAccess(currentThread);

			omrthread_monitor_enter(currentThread->publicFlagsMutex);
			/* Wait for GC thread to notify us when it's done. */
			omrthread_monitor_wait(currentThread->publicFlagsMutex);
			omrthread_monitor_exit(currentThread->publicFlagsMutex);

			internalAcquireVMAccess(currentThread);

if the concurrent scan ended and notification is sent after the while check but before the wait on public flag, this would become a hang until some other event triggers a notify right?

This would explain why the test passed after timeout signal is triggered? I assume this triggered some signal handling (after releasing EVMA) code which notified all thread hence unblocking.

I wasn't able to get the native stack trace on my local machine so this is unconfirmed. will have to look at the core on the grinder machine to verify

@pshipton
Copy link
Member

pshipton commented Apr 13, 2023

xlinux grinder -Xnocompressedrefs - passed 100 runs of TimedGet

plinux grinder 100x
2x timeout but passed in https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_0/2423

xlinux grinder 200x
2x timeout but passed in https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_2/2046/

@pshipton
Copy link
Member

We'll continue to investigate this for jdk20.

@pshipton
Copy link
Member

Created #17181 for the assert.

@amicic
Copy link
Contributor

amicic commented Apr 13, 2023

if the concurrent scan ended and notification is sent after the while check but before the wait on public flag, this would become a hang until some other event triggers a notify right?

@fengxue-IS Nice catch - yes, this seems like a real problem.

A dirty fix is to put a timed wait, but ATM I think a real fix would require a new public flag and atomically (using publicFlagMutex) set/check/reset it along with operations on Continuation state (setting/checking/reseting pending and concurrent scan bits). Will think more if this will work or if there are other solutions.... @LinHu2016

@fengxue-IS
Copy link
Contributor

would it be sufficient to move the mutex acquire before while loop condition?

@amicic
Copy link
Contributor

amicic commented Apr 13, 2023

considering that too... that would be simpler and more performant, too (would require some changes on notification side - to remove the scan flag under the mutex)

mikezhang1234567890 pushed a commit to mikezhang1234567890/openj9 that referenced this issue Apr 17, 2023
thallium added a commit to thallium/aqa-tests that referenced this issue Jun 8, 2023
TimedGet: eclipse-openj9/openj9#17163
TracePinnedThreads: eclipse-openj9/openj9#15936
ClassUnloading: eclipse-openj9/openj9#16053
HumongousStack: eclipse-openj9/openj9#15189

Signed-off-by: Gengchen Tuo <gengchen.tuo@ibm.com>
LongyuZhang pushed a commit to adoptium/aqa-tests that referenced this issue Jun 9, 2023
TimedGet: eclipse-openj9/openj9#17163
TracePinnedThreads: eclipse-openj9/openj9#15936
ClassUnloading: eclipse-openj9/openj9#16053
HumongousStack: eclipse-openj9/openj9#15189

Signed-off-by: Gengchen Tuo <gengchen.tuo@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants