Fix halted thread being allowed to continue execution #15201

Merged — 1 commit merged Jun 3, 2022

Conversation

gacholio
Contributor

@gacholio gacholio commented Jun 2, 2022

Primary issue: It was possible for a thread which is halted for inspection to resume execution after a GC, resulting in the stack walk crashing in the inspecting thread. This was introduced in #12257.

Secondary issue: If instrumentable object allocate is hooked, it was possible to HCR at JIT object allocation points.

New restriction: The JVMTI extension event for instrumentable object allocate may now only be acquired at startup, not during late attach. As far as I know, there are no active users of this event.

Note that NOT_AT_SAFE_POINT has two distinct, if related, meanings:

  • Normal exclusive VM access has priority over safe point

Setting NOT_AT_SAFE_POINT around all GCing operations accomplishes this.

  • JIT optimization requires that there be no possibility of HCR at object allocation points

Extending the range of NOT_AT_SAFE_POINT to cover the entire allocation path and disabling safe point if the instrumentable object allocate event is hooked ensures this.

This fix has several parts:

  • Pause after GC if halt has been requested

After any possibly-GCing path, check to see if the thread should halt before resuming mutation. This fixes the reported problem of inspected threads continuing to run. This also requires that NOT_AT_SAFE_POINT be set across the entirety of the allocation path to prevent safe point from being acquired by the new halting code.

  • Fix object allocation event reporting

If the instrumentable object allocation extension event was hooked, there was a timing hole where HCR could occur at an object allocation from the JIT, which is specifically what safe point HCR is meant to avoid.

This is fixed by marking the thread NOT_AT_SAFE_POINT for the duration of the allocate. If the instrumentable object allocate event is hooked, disable safe point HCR as there is no way to safely report the event.

  • Mark all possibly GCing paths NOT_AT_SAFE_POINT

This ensures that the GC has priority over safe point HCR.

haltThreadForInspection also now marks the inspecting thread NOT_SAFE for the duration of the halt (see timing below).

Signed-off-by: Graham Chapman graham_chapman@ca.ibm.com

@gacholio
Contributor Author

gacholio commented Jun 2, 2022

@amicic @tajila Opening this for review/discussion while we get the first OMR PR merged.

@gacholio gacholio force-pushed the safe branch 5 times, most recently from 613d833 to 2adb4df on June 2, 2022 17:22
@amicic
Contributor

amicic commented Jun 2, 2022

The changes look good to me, at least the fix part that prevents the inspected thread from proceeding after inspection was requested. Previously, if the inspected thread went through releasing and reacquiring JNI critical access, it could mis-observe that the HALT_INSPECTION flag had been raised and was incorrectly allowed to proceed. Now, we will properly check that flag after the reacquire, and block if inspection is pending.

But please elaborate more in the commit comment for future readers on why NOT_SAFE had to move to outer points and why this does not compromise the deadlock prevention from #12257.

@dmitripivkine
Contributor

Can we potentially get nested cases of the "set J9_PUBLIC_FLAGS_NOT_AT_SAFE_POINT / do / clear J9_PUBLIC_FLAGS_NOT_AT_SAFE_POINT" pattern? Should we assert if the flag is already set?

@gacholio
Contributor Author

gacholio commented Jun 2, 2022

Can we potentially get nested cases of the "set J9_PUBLIC_FLAGS_NOT_AT_SAFE_POINT / do / clear J9_PUBLIC_FLAGS_NOT_AT_SAFE_POINT" pattern? Should we assert if the flag is already set?

I explicitly handle the only nested case I believe we need to handle: exclusive VM access sets NOT_AT_SAFE_POINT only if it was not already set, and clears it only if exclusive set it. I don't believe there are any other nested cases in this PR, since we're putting the set/clear as far out as possible.

@gacholio gacholio requested review from amicic and tajila June 2, 2022 19:57
@gacholio
Contributor Author

gacholio commented Jun 3, 2022

@amicic and I believe this code reintroduces the original hang.

Original issue:

Initial state
   1 VM_ACCESS
   2 VM_ACCESS
   3 VM_ACCESS
1 attempt to halt 2
   2 VM_ACCESS | HALT_INSPECT
1 release VM access
   1 NONE
1 wait for 2 to have HALT_INSPECT and no VM_ACCESS         <- 1 BLOCKED
3 acquire safe point exclusive
3 release VM access
   3 NONE
3 halt threads
   1 HALTED_SAFE
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE
3 wait for response from 2 to have neither VM_ACCESS nor NOT_SAFE <- 3 BLOCKED
2 out of line allocate causing a GC which has to back off exclusive
2 release critical heap access
   2 NOT_SAFE | HALT_INSPECT | REQUEST_SAFE
2 reacquire critical heap access blocks on HALT_INSPECT       <- 2 BLOCKED

Current code:

Initial state
   1 VM_ACCESS
   2 VM_ACCESS
   3 VM_ACCESS
1 attempt to halt 2
   2 VM_ACCESS | HALT_INSPECT
1 release VM access
   1 NONE
1 wait for 2 to have HALT_INSPECT and no VM_ACCESS         <- 1 BLOCKED on HALT_INSPECT
3 acquire safe point exclusive
3 release VM access
   3 NONE
3 halt threads
   1 HALTED_SAFE
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE
3 wait for response from 2 to have neither VM_ACCESS nor NOT_SAFE <- 3 BLOCKED
2 out of line allocate causing a GC which has to back off exclusive
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 release critical heap access
   2 NOT_SAFE | HALT_INSPECT | REQUEST_SAFE | HALT_EXCLUSIVE
1 wakes up, attempt to acquire VM access <- 1 BLOCKED on HALTED_SAFE
2 reacquire critical heap access
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 complete GC, release VM access
   2 HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 reacquire VM access <- 2 BLOCKED on HALT_INSPECT

I believe this may be fixed by making haltThreadForInspection mark the inspecting thread as NOT_SAFE:

Initial state
   1 VM_ACCESS
   2 VM_ACCESS
   3 VM_ACCESS
1 attempt to halt 2
   2 VM_ACCESS | HALT_INSPECT
1 mark unsafe, release VM access
   1 NOT_SAFE
1 wait for 2 to have HALT_INSPECT and no VM_ACCESS         <- 1 BLOCKED on HALT_INSPECT
3 acquire safe point exclusive
3 release VM access
   3 NONE
3 halt threads
   1 NOT_SAFE | REQUEST_SAFE
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE
3 wait for response from 1 and 2 to have neither VM_ACCESS nor NOT_SAFE <- 3 BLOCKED
2 out of line allocate causing a GC which has to back off exclusive
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 release critical heap access
   2 NOT_SAFE | HALT_INSPECT | REQUEST_SAFE | HALT_EXCLUSIVE
1 wakes up, reacquires VM access
   1 VM_ACCESS | NOT_SAFE | REQUEST_SAFE
2 reacquire critical heap access
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 complete GC, release VM access
   2 HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 reacquire VM access <- 2 BLOCKED on HALT_INSPECT
1 completes inspection
   1 VM_ACCESS | REQUEST_SAFE
   2 REQUEST_SAFE | NOT_SAFE
2 wakes up
   2 VM_ACCESS | REQUEST_SAFE | NOT_SAFE
2 completes allocation
   2 VM_ACCESS | REQUEST_SAFE

1 and 2 eventually run to safe point allowing 3 to run

@gacholio
Contributor Author

gacholio commented Jun 3, 2022

@jdekonin So you can track this.

See details in eclipse-openj9#15201

Signed-off-by: Graham Chapman <graham_chapman@ca.ibm.com>
@gacholio gacholio changed the title from "Move set/clear of NOT_AT_SAFE_POINT" to "Fix halted thread being allowed to continue execution" Jun 3, 2022
@gacholio gacholio marked this pull request as ready for review June 3, 2022 17:25
@amicic
Contributor

amicic commented Jun 3, 2022

Jenkins test sanity all jdk11

@pshipton
Member

pshipton commented Jun 3, 2022

jenkins compile win32 jdk8

@gacholio
Contributor Author

gacholio commented Jun 3, 2022

All approvals and testing done, so I'm going to bend the rules and merge this to facilitate the final part of the change.

@gacholio gacholio merged commit abfb530 into eclipse-openj9:master Jun 3, 2022
@gacholio gacholio deleted the safe branch June 3, 2022 22:35
@pshipton
Member

pshipton commented Jun 4, 2022

There is an assertion in /job_output.php?id=35652328, which reproduced 2/50 in a grinder, but 0/200 on the previous build.

02:49:29.389 0x94f00    j9mm.107    *   ** ASSERTION FAILED ** at EnvironmentBase.cpp:534: ((false && ((heapBase <= objectPtr) && (heapTop > objectPtr))))

Also seen in the following:
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/180/
cmdLineTester_jvmtitests_1

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/180/functional_test_output.tar.gz

Testing: re001
Test start time: 2022/06/04 04:48:58 Coordinated Universal Time
Running command: "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/openjdkbinary/j2sdk-image/bin/java"  -Xjit -Xgcpolicy:gencon -Xshareclasses:name=bcienabled,enableBCI -Xnocompressedrefs  -Xdump    -agentlib:jvmtitest=test:re001 -Xdisableexcessivegc -Dcom.ibm.tools.attach.enable=no -Dcom.ibm.tools.attach.enable=no -Xdump:none -Xmx128M -cp "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar" com.ibm.jvmti.tests.util.TestRunner
Time spent starting: 2 milliseconds
Time spent executing: 3074 milliseconds
Test result: FAILED
Output from test:
 [OUT] *** Testing [1/2]:	testOutOfMemoryObject
 [OUT] *** Test took 2455 milliseconds
 [OUT] OK
 [OUT] 
 [OUT] *** Testing [2/2]:	testOutOfMemoryString
 [ERR] 04:49:01.628 0xffff8c107a00    j9mm.107    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK11_aarch64_linux_Nightly/omr/gc/base/EnvironmentBase.cpp:534: ((false && ((heapBase <= objectPtr) && (heapTop > objectPtr))))
>> Success condition was not found: [Return code: 0]

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/237
java/util/StringJoiner/StringJoinerTest.java

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/237/openjdk_test_output.tar.gz

21:54:50  STDOUT:
21:54:50  test StringJoinerTest.OOM1(): success
21:54:50  test StringJoinerTest.OOM2(): success
21:54:50  STDERR:
21:54:50  JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2022/06/03 21:53:18 - please wait.
21:54:50  JVMDUMP032I JVM requested Java dump using '/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/aqa-tests/TKG/output_16543029739716/jdk_util_1/work/scratch/2/javacore.20220603.215318.19019.0001.txt' in response to an event
21:54:50  JVMDUMP010I Java dump written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/aqa-tests/TKG/output_16543029739716/jdk_util_1/work/scratch/2/javacore.20220603.215318.19019.0001.txt
21:54:50  JVMDUMP032I JVM requested Snap dump using '/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/aqa-tests/TKG/output_16543029739716/jdk_util_1/work/scratch/2/Snap.20220603.215318.19019.0002.trc' in response to an event
21:54:50  JVMDUMP010I Snap dump written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/aqa-tests/TKG/output_16543029739716/jdk_util_1/work/scratch/2/Snap.20220603.215318.19019.0002.trc
21:54:50  JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
21:54:50  01:53:20.706 0x3fffa84c5d00    j9mm.107    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK17_ppc64le_linux_Nightly/omr/gc/base/EnvironmentBase.cpp:534: ((false && ((heapBase <= objectPtr) && (heapTop > objectPtr))))

There may be more; there are a number of other failures I didn't look at yet. I'll finish going through the builds later. For now, I'm reverting this.

@pshipton
Member

pshipton commented Jun 4, 2022

Reverted via #15226

@amicic
Contributor

amicic commented Jun 4, 2022

saveObjects() expects a non-NULL argument, so any test causing an OOM will fail.
EDIT: well, the allocating thread also has to have a pending HALT (for inspection or similar), so it's not just any OOM that will cause the problem.

amicic added a commit to amicic/openj9 that referenced this pull request Jun 4, 2022
This PR is a variant of eclipse-openj9#15201 with handling of the saveObjects NULL case.


Signed-off-by: Aleksandar Micic <amicic@ca.ibm.com>