Fix halted thread being allowed to continue execution #15201

Merged — 1 commit merged Jun 3, 2022

Conversation

gacholio
Contributor

@gacholio gacholio commented Jun 2, 2022

Primary issue: It was possible for a thread which is halted for inspection to resume execution after a GC, resulting in the stack walk crashing in the inspecting thread. This was introduced in #12257.

Secondary issue: If instrumentable object allocate is hooked, it was possible to HCR at JIT object allocation points.

New restriction: The JVMTI extension event for instrumentable object allocate may now only be acquired at startup, not during late attach. As far as I know, there are no active users of this event.

Note that NOT_AT_SAFE_POINT has two distinct, if related, meanings:

  • Normal exclusive VM access has priority over safe point

Setting NOT_AT_SAFE_POINT around all GCing operations accomplishes this.

  • JIT optimization requires that there be no possibility of HCR at object allocation points

Extending the range of NOT_AT_SAFE_POINT to cover the entire allocation path and disabling safe point if the instrumentable object allocate event is hooked ensures this.

This fix has several parts:

  • Pause after GC if halt has been requested

After any possibly-GCing path, check to see if the thread should halt before resuming mutation. This fixes the reported problem of inspected threads continuing to run. This also requires that NOT_AT_SAFE_POINT be set across the entirety of the allocation path to prevent safe point from being acquired by the new halting code.

  • Fix object allocation event reporting

If the instrumentable object allocation extension event was hooked, there was a timing hole where HCR could occur at an object allocation from the JIT, which is specifically what safe point HCR is meant to avoid.

This is fixed by marking the thread NOT_AT_SAFE_POINT for the duration of the allocate. If the instrumentable object allocate event is hooked, disable safe point HCR as there is no way to safely report the event.

  • Mark all possibly GCing paths NOT_AT_SAFE_POINT

This ensures that the GC has priority over safe point HCR.

haltThreadForInspection also now marks the inspecting thread NOT_SAFE for the duration of the halt (see timing below).

Signed-off-by: Graham Chapman graham_chapman@ca.ibm.com

@gacholio
Contributor Author

gacholio commented Jun 2, 2022

@amicic @tajila Opening this for review/discussion while we get the first OMR PR merged.

@gacholio gacholio force-pushed the safe branch 5 times, most recently from 613d833 to 2adb4df on June 2, 2022 17:22
@amicic
Contributor

amicic commented Jun 2, 2022

The changes look good to me, at least the fix part that prevents the inspected thread from proceeding after inspection was requested. Previously, if the inspected thread went through releasing and reacquiring JNI critical access, it could mis-observe that the HALT_INSPECTION flag had been raised and was incorrectly allowed to proceed. Now, we will properly check that flag after the reacquire, and block if inspection is pending.

But please elaborate more in the commit comment for future readers on why NOT_SAFE had to move to outer points and why this does not compromise the deadlock prevention from #12257.

@dmitripivkine
Contributor

Can we potentially get nested cases of the "set J9_PUBLIC_FLAGS_NOT_AT_SAFE_POINT / do / clear J9_PUBLIC_FLAGS_NOT_AT_SAFE_POINT" pattern? Should we assert if the flag is already set?

@gacholio
Contributor Author

gacholio commented Jun 2, 2022

Can we potentially get nested cases of the "set J9_PUBLIC_FLAGS_NOT_AT_SAFE_POINT / do / clear J9_PUBLIC_FLAGS_NOT_AT_SAFE_POINT" pattern? Should we assert if the flag is already set?

I explicitly handle the only nested case I believe we need to handle: exclusive VM access sets NOT_AT_SAFE_POINT only if it was not already set, and clears it only if exclusive set it. I don't believe there are any other nested cases in this PR, since we're putting the set/clear as far out as possible.

@gacholio gacholio requested review from amicic and tajila June 2, 2022 19:57
@gacholio
Contributor Author

gacholio commented Jun 3, 2022

@amicic and I believe this code reintroduces the original hang.

Original issue:

Initial state
   1 VM_ACCESS
   2 VM_ACCESS
   3 VM_ACCESS
1 attempt to halt 2
   2 VM_ACCESS | HALT_INSPECT
1 release VM access
   1 NONE
1 wait for 2 to have HALT_INSPECT and no VM_ACCESS         <- 1 BLOCKED
3 acquire safe point exclusive
3 release VM access
   3 NONE
3 halt threads
   1 HALTED_SAFE
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE
3 wait for response from 2 to have neither VM_ACCESS nor NOT_SAFE <- 3 BLOCKED
2 out of line allocate causing a GC which has to back off exclusive
2 release critical heap access
   2 NOT_SAFE | HALT_INSPECT | REQUEST_SAFE
2 reacquire critical heap access blocks on HALT_INSPECT       <- 2 BLOCKED

Current code:

Initial state
   1 VM_ACCESS
   2 VM_ACCESS
   3 VM_ACCESS
1 attempt to halt 2
   2 VM_ACCESS | HALT_INSPECT
1 release VM access
   1 NONE
1 wait for 2 to have HALT_INSPECT and no VM_ACCESS         <- 1 BLOCKED on HALT_INSPECT
3 acquire safe point exclusive
3 release VM access
   3 NONE
3 halt threads
   1 HALTED_SAFE
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE
3 wait for response from 2 to have neither VM_ACCESS nor NOT_SAFE <- 3 BLOCKED
2 out of line allocate causing a GC which has to back off exclusive
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 release critical heap access
   2 NOT_SAFE | HALT_INSPECT | REQUEST_SAFE | HALT_EXCLUSIVE
1 wakes up, attempt to acquire VM access <- 1 BLOCKED on HALTED_SAFE
2 reacquire critical heap access
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 complete GC, release VM access
   2 HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 reacquire VM access <- 2 BLOCKED on HALT_INSPECT

I believe this may be fixed by making haltThreadForInspection mark the inspecting thread as NOT_SAFE:

Initial state
   1 VM_ACCESS
   2 VM_ACCESS
   3 VM_ACCESS
1 attempt to halt 2
   2 VM_ACCESS | HALT_INSPECT
1 mark unsafe, release VM access
   1 NOT_SAFE
1 wait for 2 to have HALT_INSPECT and no VM_ACCESS         <- 1 BLOCKED on HALT_INSPECT
3 acquire safe point exclusive
3 release VM access
   3 NONE
3 halt threads
   1 NOT_SAFE | REQUEST_SAFE
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE
3 wait for response from 1 and 2 to have neither VM_ACCESS nor NOT_SAFE <- 3 BLOCKED
2 out of line allocate causing a GC which has to back off exclusive
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 release critical heap access
   2 NOT_SAFE | HALT_INSPECT | REQUEST_SAFE | HALT_EXCLUSIVE
1 wakes up, reacquires VM access
   1 VM_ACCESS | NOT_SAFE | REQUEST_SAFE
2 reacquire critical heap access
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 complete GC, release VM access
   2 HALT_INSPECT | REQUEST_SAFE | NOT_SAFE
2 reacquire VM access <- 2 BLOCKED on HALT_INSPECT
1 completes inspection
   1 VM_ACCESS | REQUEST_SAFE
   2 REQUEST_SAFE | NOT_SAFE
2 wakes up
   2 VM_ACCESS | REQUEST_SAFE | NOT_SAFE
2 completes allocation
   2 VM_ACCESS | REQUEST_SAFE

1 and 2 eventually run to safe point allowing 3 to run

@gacholio
Contributor Author

gacholio commented Jun 3, 2022

@jdekonin So you can track this.

See details in eclipse-openj9#15201

Signed-off-by: Graham Chapman <graham_chapman@ca.ibm.com>
@gacholio gacholio changed the title from "Move set/clear of NOT_AT_SAFE_POINT" to "Fix halted thread being allowed to continue execution" Jun 3, 2022
@gacholio gacholio marked this pull request as ready for review June 3, 2022 17:25
@amicic
Contributor

amicic commented Jun 3, 2022

Jenkins test sanity all jdk11

@pshipton
Member

pshipton commented Jun 3, 2022

jenkins compile win32 jdk8

@gacholio
Contributor Author

gacholio commented Jun 3, 2022

All approvals and testing done, so I'm going to bend the rules and merge this to facilitate the final part of the change.

@gacholio gacholio merged commit abfb530 into eclipse-openj9:master Jun 3, 2022
@gacholio gacholio deleted the safe branch June 3, 2022 22:35
@pshipton
Member

pshipton commented Jun 4, 2022

There is an assertion in /job_output.php?id=35652328, which reproduced 2/50 in a grinder, but 0/200 on the previous build.

02:49:29.389 0x94f00    j9mm.107    *   ** ASSERTION FAILED ** at EnvironmentBase.cpp:534: ((false && ((heapBase <= objectPtr) && (heapTop > objectPtr))))

Also seen in the following:
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/180/
cmdLineTester_jvmtitests_1

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/180/functional_test_output.tar.gz

Testing: re001
Test start time: 2022/06/04 04:48:58 Coordinated Universal Time
Running command: "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/openjdkbinary/j2sdk-image/bin/java"  -Xjit -Xgcpolicy:gencon -Xshareclasses:name=bcienabled,enableBCI -Xnocompressedrefs  -Xdump    -agentlib:jvmtitest=test:re001 -Xdisableexcessivegc -Dcom.ibm.tools.attach.enable=no -Dcom.ibm.tools.attach.enable=no -Xdump:none -Xmx128M -cp "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar" com.ibm.jvmti.tests.util.TestRunner
Time spent starting: 2 milliseconds
Time spent executing: 3074 milliseconds
Test result: FAILED
Output from test:
 [OUT] *** Testing [1/2]:	testOutOfMemoryObject
 [OUT] *** Test took 2455 milliseconds
 [OUT] OK
 [OUT] 
 [OUT] *** Testing [2/2]:	testOutOfMemoryString
 [ERR] 04:49:01.628 0xffff8c107a00    j9mm.107    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK11_aarch64_linux_Nightly/omr/gc/base/EnvironmentBase.cpp:534: ((false && ((heapBase <= objectPtr) && (heapTop > objectPtr))))
>> Success condition was not found: [Return code: 0]

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/237
java/util/StringJoiner/StringJoinerTest.java

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/237/openjdk_test_output.tar.gz

21:54:50  STDOUT:
21:54:50  test StringJoinerTest.OOM1(): success
21:54:50  test StringJoinerTest.OOM2(): success
21:54:50  STDERR:
21:54:50  JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2022/06/03 21:53:18 - please wait.
21:54:50  JVMDUMP032I JVM requested Java dump using '/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/aqa-tests/TKG/output_16543029739716/jdk_util_1/work/scratch/2/javacore.20220603.215318.19019.0001.txt' in response to an event
21:54:50  JVMDUMP010I Java dump written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/aqa-tests/TKG/output_16543029739716/jdk_util_1/work/scratch/2/javacore.20220603.215318.19019.0001.txt
21:54:50  JVMDUMP032I JVM requested Snap dump using '/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/aqa-tests/TKG/output_16543029739716/jdk_util_1/work/scratch/2/Snap.20220603.215318.19019.0002.trc' in response to an event
21:54:50  JVMDUMP010I Snap dump written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/aqa-tests/TKG/output_16543029739716/jdk_util_1/work/scratch/2/Snap.20220603.215318.19019.0002.trc
21:54:50  JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
21:54:50  01:53:20.706 0x3fffa84c5d00    j9mm.107    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK17_ppc64le_linux_Nightly/omr/gc/base/EnvironmentBase.cpp:534: ((false && ((heapBase <= objectPtr) && (heapTop > objectPtr))))

There may be more; there are a number of other failures I didn't look at yet. I'll finish going through the builds later. For now, I'm reverting this.

@pshipton
Member

pshipton commented Jun 4, 2022

Reverted via #15226

@amicic
Contributor

amicic commented Jun 4, 2022

saveObjects() expects a non-NULL argument, so any test causing an OOM will fail.
EDIT: well, the allocating thread also has to have a pending HALT (for inspection or similar), so it's not just any OOM that will cause the problem.

amicic added a commit to amicic/openj9 that referenced this pull request Jun 4, 2022
This PR is a variant of eclipse-openj9#15201 with handling of the saveObjects NULL case.


Signed-off-by: Aleksandar Micic <amicic@ca.ibm.com>