Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AArch64 macOS: Failure in JDK 11 ASSERTION FAILED ParallelScavengeTask.cpp:120 ((false && ((endTime - startTime) >= _syncCriticalSectionDuration))) #14636

Closed
knn-k opened this issue Mar 3, 2022 · 28 comments

Comments

@knn-k
Copy link
Contributor

knn-k commented Mar 3, 2022

Failure link

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.openjdk_aarch64_mac_Nightly/2/ - mac11-aarch64-4

Optional info

  • intermittent failure: yes
  • regression or new test: regression

Failure output (captured from console output)

No output available: Only Unexpected exit from test [exit code: 255]
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.openjdk_aarch64_mac_Nightly/2/testReport/junit/java_lang_invoke_VarHandles_VarHandleTestByteArrayAsInt/java/VarHandleTestByteArrayAsInt/

@pshipton
Copy link
Member

pshipton commented Mar 3, 2022

May be related to #13368

@pshipton
Copy link
Member

pshipton commented Mar 3, 2022

Ignore the previous comment, there is additional output

23:22:37  04:22:16.257 0x12606f100    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Nightly/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))

@pshipton
Copy link
Member

pshipton commented Mar 3, 2022

@knn-k
Copy link
Contributor Author

knn-k commented Mar 3, 2022

I ran Grinder jobs below, and they were successful.

@pshipton
Copy link
Member

pshipton commented Mar 7, 2022

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.system_aarch64_mac_Nightly_testList_0/3 - mac11-aarch64-8
LambdaLoadTest_J9_5m_0 -Xjit -Xgcpolicy:gencon -Xnocompressedrefs

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk11_j9_sanity.system_aarch64_mac_Nightly_testList_0/3/system_test_output.tar.gz

LT  stderr 05:52:26.178 0x142820700    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Nightly/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
LT  stderr 05:52:26.178 0x142837700    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Nightly/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
LT  stderr 05:52:26.178 0x14280af00    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Nightly/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
LT  stderr 05:52:26.178 0x153193500    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Nightly/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
LT  stderr 05:52:26.178 0x142814f00    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Nightly/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
LT  stderr 05:52:26.178 0x1530c0900    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Nightly/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
LT  stderr 05:52:26.178 0x1530c9900    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Nightly/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))

@pshipton pshipton changed the title AArch64 macOS: Failure in JDK 11 java/lang/invoke/VarHandles/VarHandleTestByteArrayAsInt AArch64 macOS: Failure in JDK 11 ASSERTION FAILED ParallelScavengeTask.cpp:120 ((false && ((endTime - startTime) >= _syncCriticalSectionDuration))) Mar 7, 2022
@knn-k
Copy link
Contributor Author

knn-k commented Mar 9, 2022

20x Grinder jobs for LambdaLoadTest_J9_5m_0 did not fail: internal job/Grinder/21639/ , job/Grinder/21641/ , job/Grinder/21643/ , job/Grinder/21644/

@knn-k
Copy link
Contributor Author

knn-k commented Mar 9, 2022

Possible causes:

  • MM_ParallelTask::synchronizeGCThreadsAndReleaseSingleThread() finishes too early
  • Values returned from omrtime_hires_clock() are wrong or imprecise
  • _syncCriticalSectionDuration has unexpected value: It is set in MM_ParallelTask::releaseSynchronizedGCThreads(), otherwise it is 0.

@knn-k
Copy link
Contributor Author

knn-k commented Mar 10, 2022

Grinder jobs for jdk_lang_1 did not fail: internal job/Grinder/21661/, job/Grinder/21665/, job/Grinder/21666/

@knn-k
Copy link
Contributor Author

knn-k commented Apr 5, 2022

Reproduced in job/Test_openjdk18_j9_special.system_aarch64_mac_Personal_testList_4/1/consoleText in running DaaLoadTest_all_special_5m_9.

[2022-04-05T05:07:52.959Z] DLT 01:07:52.688 - Starting thread. Suite=0 thread=5
[2022-04-05T05:08:01.245Z] DLT stderr 05:07:59.887 0x15996ab00    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK18_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
[2022-04-05T05:08:01.245Z] DLT stderr 05:07:59.887 0x159999500    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK18_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))

@knn-k
Copy link
Contributor Author

knn-k commented Apr 7, 2022

I ran a 20x Grinder job for DaaLoadTest_all_special_5m_9 in job/Grinder/22364/, and got no failures.

@pshipton
Copy link
Member

pshipton commented Apr 18, 2022

https://openj9-jenkins.osuosl.org/job/Test_openjdk18_j9_special.system_aarch64_mac_Personal_testList_0/2
DaaLoadTest_daa2_special_5m_29 -Xgcpolicy:gencon -Xgc:dynamicBreadthFirstScanOrdering

DLT 06:03:08.407 - Completed 26.7%. Number of tests started=35820 (+11485)
DLT stderr 20:03:13.757 0x14e106900    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK18_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_special.system_aarch64_mac_Release_testList_4/2
DaaLoadTest_daa2_special_5m_5 -Xgcpolicy:gencon -Xjit:count=0,optlevel=warm,gcOnResolve,rtResolve -Xmn512k -Xnocompressedrefs

@dmitripivkine
Copy link
Contributor

dmitripivkine commented Apr 18, 2022

Sorry, I missed there is assertion in GC code

This is snap traces:

20:03:13.757515000        *0x14e106900       j9mm.542  Entry      >Trc_MM_SynchronizeGCThreadsAndReleaseSingleThread Entry from: /Users/jenkins/workspace/Build_JDK18_aarch64_mac_Personal/openj9/runtime/gc_glue_java/ScavengerRootScanner.hpp:168
20:03:13.757528000        *0x14e91e900       j9mm.542  Entry      >Trc_MM_SynchronizeGCThreadsAndReleaseSingleThread Entry from: /Users/jenkins/workspace/Build_JDK18_aarch64_mac_Personal/openj9/runtime/gc_glue_java/ScavengerRootScanner.hpp:168
20:03:13.757529000         0x14e91e900       j9mm.543  Exit       <Trc_MM_SynchronizeGCThreadsAndReleaseSingleThread Exit
20:03:13.757555000         0x14e91e900       j9mm.122  Entry      >Trc_MM_SynchronizeGCThreads Entry from: /Users/jenkins/workspace/Build_JDK18_aarch64_mac_Personal/openj9/runtime/gc_glue_java/ScavengerRootClearer.hpp:118
20:03:13.757561000        *0x14e106900       j9mm.107  Assert    * ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK18_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
20:03:13.757561000         0x14e106900       j9mm.543  Exit       <Trc_MM_SynchronizeGCThreadsAndReleaseSingleThread Exit

I believe last two lines are in incorrect order due both have the same time stamp

There are two threads performed GC operation:

  • Thread 0x14e106900 reached ScavengerRootScanner.hpp:168 and waited for 0x14e91e900 in MM_ParallelTask::synchronizeGCThreadsAndReleaseSingleThread().
  • Thread 0x14e91e900 arrived, saved _syncCriticalSectionStartTime and execute settings of flags in MM_ScavengerRootScanner::scanClearable() single threaded. After this it called releaseSynchronizedGCThreads() where _syncCriticalSectionDuration is calculated.
  • Thread 0x14e106900 was waiting on monitor all this time and was released at the end of releaseSynchronizedGCThreads() by thread 0x14e91e900. So, time between start and end in thread 0x14e106900 must be larger or equal that _syncCriticalSectionDuration.
  • Thread 0x14e91e900 meanwhile reached next sync point at ScavengerRootClearer.hpp:118 and was waiting for thread 0x14e106900 there.

As was stated correctly in the comment #14636 (comment)
the reason for assertion might be different flow of operations (monitor does not work properly?) or inconsistent readings of wall clock (?). When we will have DDR support available we can check _syncCriticalSectionStartTime and _syncCriticalSectionDuration. Also in theory it is possible there is might be some compiler side effect (_syncCriticalSectionDuration is not declared as volatile - I think it should not be, but who knows). We can check generated sequence of instructions (but if it is a case I would expect larger number of failures)

@knn-k
Copy link
Contributor Author

knn-k commented Apr 19, 2022

DDR implementation for AArch64 macOS is in progress in eclipse-omr/omr#6471 and #14921.

@pshipton
Copy link
Member

pshipton commented May 24, 2022

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_special.system_aarch64_mac_Personal_testList_4/8
DaaLoadTest_daa1_special_5m_5 -Xgcpolicy:gencon -Xjit:count=0,optlevel=warm,gcOnResolve,rtResolve -Xmn512k -Xnocompressedrefs

DLT 06:40:14.504 - Completed 40.1%. Number of tests started=1193 (+278)
DLT stderr 20:40:19.773 0x14a1a8100    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
DLT stderr 20:40:19.773 0x14a0f2300    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))

DaaLoadTest_all_special_5m_9 -Xaggressive -Xgcpolicy:gencon -Xjit -Xnocompressedrefs

DLT stderr 21:23:05.199 0x1328b6700    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
DLT stderr 21:23:05.199 0x1328cf300    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
DLT stderr 21:23:05.199 0x123821100    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
DLT stderr 21:23:05.199 0x1328c6300    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))
DLT stderr 21:23:05.199 0x12382b700    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK11_aarch64_mac_Personal/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk11_j9_special.system_aarch64_mac_Personal_testList_4/8/system_test_output.tar.gz

More in
https://openj9-jenkins.osuosl.org/job/Test_openjdk18_j9_special.system_aarch64_mac_Personal_testList_4/6
DaaLoadTest_daa2_special_5m_5 -Xgcpolicy:gencon -Xjit:count=0,optlevel=warm,gcOnResolve,rtResolve -Xmn512k -Xnocompressedrefs
DaaLoadTest_daa2_special_5m_25 -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -Xgcpolicy:gencon

@dmitripivkine
Copy link
Contributor

@pshipton Do you know, is DDR available for this build?

@pshipton
Copy link
Member

No, there is no DDR support for amac yet. Perhaps tomorrow.

@pshipton
Copy link
Member

amac DDR is now enabled in future builds.

@pshipton
Copy link
Member

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_special.system_aarch64_mac_Release_testList_4/3 - mac11-aarch64-6
DaaLoadTest_daa3_special_5m_5 -Xgcpolicy:gencon -Xjit:count=0,optlevel=warm,gcOnResolve,rtResolve -Xmn512k -Xnocompressedrefs

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_special.system_aarch64_mac_Release_testList_4/3/system_test_output.tar.gz

DLT 23:13:00.903 - Completed 80.1%. Number of tests started=12005 (+1002)
DLT stderr 13:13:03.186 0x13901b300    j9mm.107    *   ** ASSERTION FAILED ** at /Users/jenkins/workspace/Build_JDK17_aarch64_mac_Release/omr/gc/base/standard/ParallelScavengeTask.cpp:120: ((false && ((endTime - startTime) >= _syncCriticalSectionDuration)))

@dmitripivkine fyi - DDR is enabled now.

@dmitripivkine
Copy link
Contributor

dmitripivkine commented Jun 13, 2022

This assertion does check that waiting time in the synchronizeGCThreadsAndReleaseSingleThread() for blocked thread should be larger or equal than total execution time recorded by thread performed single threaded operation. However total execution time recorded by thread performed single threaded operation is negative (-25 ticks). And treated as unsigned 0xFFFFFFFFFFFFFFE7 it happen to be larger than waiting time of blocked thread

@dmitripivkine
Copy link
Contributor

The execution time for thread performed single threaded operation was very short in this case. I have not looked in c-stack/registers for values in ticks but Snap traces shows 0us difference, so it was under 1us. And somehow EndTime (taken at the end) is smaller by 25 ticks as StartTime(taken at the beginning). Can actually wall clock adjustment (backwards) be visible by one thread and missed by another?

@dmitripivkine
Copy link
Contributor

dmitripivkine commented Jun 14, 2022

In order to get time GC code uses omrtime_hires_clock(). OSX uses general OMR Unix implementation based on gettimeofday(). This is reading of wall clock and it is not monotonic. So, it can jump back and force.
There is old but I believe still valid blog on topic:
https://web.archive.org/web/20100517095152/http://www.wand.net.nz/~smr26/wordpress/2009/01/19/monotonic-time-in-mac-os-x/comment-page-1/

@knn-k
Copy link
Contributor Author

knn-k commented Jun 15, 2022

I wonder why the failure happens only on AArch64 macOS if it is caused by lacking the monotonicity of the clock.
Potentially x86 macOS has the same issue, but M1 chip is so fast that it reveals the issue?

@knn-k
Copy link
Contributor Author

knn-k commented Jun 15, 2022

I opened eclipse-omr/omr#6568 for monotonic omrtime_hires_clock() implementation on macOS.
I ran sanity.functional and sanity.openjdk with it, and the tests pass.

  • sanity.functional: job/Test_openjdk11_j9_sanity.functional_aarch64_mac_Personal/21/
  • sanity.openjdk: job/Grinder/24972/

@knn-k
Copy link
Contributor Author

knn-k commented Jun 16, 2022

eclipse-omr/omr#6568 was superseded by eclipse-omr/omr#6573.
I use nanoseconds as the unit for omrtime_hires_clock() on macOS in it.

@knn-k
Copy link
Contributor Author

knn-k commented Jun 16, 2022

OMR PR 6573 was merged. Let's see if the assertion failure is reproduced or not with the change.

@knn-k
Copy link
Contributor Author

knn-k commented Jun 23, 2022

It is a week since OMR PR 6573 was merged, and I don't think we have seen a new failure.
Shall I open another PR for v0.33.0?

@knn-k
Copy link
Contributor Author

knn-k commented Jun 23, 2022

I opened eclipse-openj9/openj9-omr#150 for v0.33.0. It is in the draft state.

@pshipton
Copy link
Member

I've merged it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants