Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8255452: Doing GC during JVMTI MethodExit event posting breaks return oop #930

Closed
wants to merge 4 commits into from

Conversation

fisk
Copy link
Contributor

@fisk fisk commented Oct 29, 2020

The imasm::remove_activation() call does not deal with safepoints very well. However, when the MethodExit JVMTI event is being called, we call into the runtime in the middle of remove_activation(). If the value being returned is an object type, then the top-of-stack contains the oop. However, the GC does not traverse said oop in any oop map, because it is simply not expected that we safepoint in the middle of remove_activation().

The JvmtiExport::post_method_exit() function we end up calling, reads the top-of-stack oop, and puts it in a handle. Then it calls JVMTI callbacks, that eventually call Java and a bunch of stuff that safepoints. So after the JVMTI callback, we can expect the top-of-stack oop to be broken. Unfortunately, when we continue, we therefore end up returning a broken oop.

Notably, the fact that InterpreterRuntime::post_method_exit is a JRT_ENTRY, is wrong, as we can safepoint on the way back to Java, which will break the return oop in a similar way. So this patch makes it a JRT_BLOCK_ENTRY, moving the transition to VM and back, into a block of code that is protected against GC. Before the JRT_BLOCK is called, we stash away the return oop, and after the JRT_BLOCK_END, we restore the top-of-stack oop. In the path when InterpreterRuntime::post_method_exit is called when throwing an exception, we don't have the same problem of retaining an oop result, and hence the JRT_BLOCK/JRT_BLOCK_END section is not performed in this case; the logic is the same as before for this path.

This is a JVMTI bug that has probably been around for a long time. It crashes with all GCs, but was discovered recently after concurrent stack processing, as StefanK has been running better GC stressing code in JVMTI, and the bug reproduced more easily with concurrent stack processing, as the timings were a bit different. The following reproducer failed pretty much 100% of the time:
while true; do make test JTREG="RETAIN=all" TEST=test/hotspot/jtreg/vmTestbase/nsk/jdi/MethodExitEvent/returnValue/returnValue003/returnValue003.java TEST_OPTS_JAVA_OPTIONS="-XX:+UseZGC -Xmx2g -XX:ZCollectionInterval=0.0001 -XX:ZFragmentationLimit=0.01 -XX:+VerifyOops -XX:+ZVerifyViews -Xint" ; done

With my fix I can run this repeatedly without any more failures. I have also sanity checked the patch by running tier 1-5, so that it does not introduces any new issues on its own. I have also used Stefan's nice external GC stressing with jcmd technique that was used to trigger crashes with other GCs, to make sure said crashes no longer reproduce either.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Testing

Linux x64 Linux x86 Windows x64 macOS x64
Build ✔️ (5/5 passed) ✔️ (2/2 passed) ❌ (1/2 failed) ✔️ (2/2 passed)
Test (tier1) ✔️ (9/9 passed) ✔️ (9/9 passed) ✔️ (9/9 passed)

Failed test task

Issue

  • JDK-8255452: Doing GC during JVMTI MethodExit event posting breaks return oop

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/930/head:pull/930
$ git checkout pull/930

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 29, 2020

👋 Welcome back eosterlund! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 29, 2020
@openjdk
Copy link

openjdk bot commented Oct 29, 2020

@fisk The following labels will be automatically applied to this pull request:

  • hotspot
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added serviceability serviceability-dev@openjdk.org hotspot hotspot-dev@openjdk.org labels Oct 29, 2020
@mlbridge
Copy link

mlbridge bot commented Oct 29, 2020

Webrevs

}

if (exception_exit) {
post_method_exit_inner(thread, mh, state, exception_exit, current_frame, result, value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for exception exit, you also need JRT_BLOCK because you want the transition to thread_in_VM for this code, since JRT_BLOCK_ENTRY doesn't do the transition. It should be safe for exception exit and retain the old behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for having a look coleen. In fact, not doing the JRT_BLOCK for the exception entry is intentional, because that entry goes through a different JRT_ENTRY (not JRT_BLOCK_ENTRY), that already transitions. So if I do the JRT_BLOCK for the exception path, it asserts saying hey you are already in VM.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's actually horrible. I wonder if it's possible to hoist saving the result oop into the InterpreterRuntime entry. And pass the Handle into JvmtiExport::post_method_exit().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that first, and ended up with a bunch of non-trivial code duplication instead, as reading the oop is done in both paths but for different reasons. One to preserve/restore it (interpreter remove_activation entry), but also inside of JvmtiExport::post_method_exit() so that it can be passed into the MethodExit. I will give it another shot and see if it is possible to refactor it in a better way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I uploaded a CR that does pretty much what you suggested, ish. Hope you like it!

@reinrich
Copy link
Member

Hi Erik,

is it possible for GC to mistake a primitive value for a reference when posting the exit event?

My understanding is: we are at a random bci of a method that is forced to return early. The expression stack is emptied and the return value is pushed on the expression stack then we call into the interpreter runtime to post the JVMTI method exit event during which we come to a safepoint for GC. The oop map for the bci does not cover this forced early return and if the return value is an object then the reference pushed on the expression stack before is not updated by GC. With your fix the value is updated if it is a reference.

If this is correct then to me it appears as if GC can also crash because the oop map for the random bci tells there has to be a reference at the stack position of the return value if it actually is a primitive value.

@dean-long
Copy link
Member

dean-long commented Oct 30, 2020

@openjdk
Copy link

openjdk bot commented Oct 30, 2020

@fisk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8255452: Doing GC during JVMTI MethodExit event posting breaks return oop

Reviewed-by: coleenp, dlong, rrich, sspitsyn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 125 new commits pushed to the master branch:

  • 97a81ce: 8253385: annotation processors remove varargs information from record components
  • 166c728: 8255858: Add debug agent support for storing thread names
  • a0ade22: 8255900: x86: Reduce impact when VerifyOops is disabled
  • 29db1dc: 8255886: Shenandoah: Resolve cset address truncation and register clash in interpreter LRB
  • 26e7ef7: 8252870: Finalize (remove "incubator" from) jpackage
  • 804bd72: 8255128: linux x86 build failure with libJNIPoint.c
  • 160759c: 8255838: Use 32-bit immediate movslq in macro assembler if 64-bit value fits in 32 bits on x86_64
  • 94ace03: 8255863: Clean up test/jdk/java/lang/invoke/defineHiddenClass/BasicTest.java
  • 2da9981: 8255862: Remove @SuppressWarnings from sun.misc.Unsafe
  • 5348298: 8255855: appcds/jigsaw/NewModuleFinderTest.java test failed due to unexpected NPE
  • ... and 115 more: https://git.openjdk.java.net/jdk/compare/38574d51692b91bf46e578b8434b50de825b7b52...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 30, 2020
@fisk
Copy link
Contributor Author

fisk commented Oct 30, 2020

I think you've discovered JDK-6449023. And you fix looks like the workaround I tried:
https://bugs.openjdk.java.net/browse/JDK-6449023?focusedCommentId=14206078&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14206078

Oh wow. I had no idea people have been having issues with this since 2009! Thanks for the pointer. Well, let's hope we can finally close it now after marinating the bug for 11 years.

@fisk
Copy link
Contributor Author

fisk commented Oct 30, 2020

Hi Erik,

is it possible for GC to mistake a primitive value for a reference when posting the exit event?

My understanding is: we are at a random bci of a method that is forced to return early. The expression stack is emptied and the return value is pushed on the expression stack then we call into the interpreter runtime to post the JVMTI method exit event during which we come to a safepoint for GC. The oop map for the bci does not cover this forced early return and if the return value is an object then the reference pushed on the expression stack before is not updated by GC. With your fix the value is updated if it is a reference.

If this is correct then to me it appears as if GC can also crash because the oop map for the random bci tells there has to be a reference at the stack position of the return value if it actually is a primitive value.

I think what you are saying is true. Note though that the return value of ForceEarlyReturn is installed with a handshake. The handshake polls of the interpreter are emitted in loop backedges and returns. At loop backedges, the expression stack is empty (required by OSR), and at returns the types match correctly. However, if an arbitrary bytecode performs a runtime call with call_VM() while the bottom of the expression stack is an oop, then I think there is an issue. At that call_VM, the early return value could get installed, and when the C++ function returns, we check for early returns, further dispatching to an unwind routine that posts the MethodExit notification. If we GC during this MethodExit notification, then I think you can crash the GC. The GC code generates an oop map for the frame, checking what the types in the expression stack should be. The early return int is pushed on the slot intersecting with the bottom entry in the expression stack. That bottom entry could be an oop, and the early return value could be an int. Then the early return int will be passed to the oop closure, which should result in a crash.

So I suspect that almost always, the handshake installing the ForceEarlyReturn value is installed with a handshake in a bytecode backedge or at a return, where the interpreter safepoint polls for the fast path code. Then you won't notice the issue. But in the rare scenario that the ForceEarlyReturn value is installed in a slow path call from a random bytecode... I can't see how that would work correctly.

@fisk
Copy link
Contributor Author

fisk commented Oct 30, 2020

Hi Erik,
is it possible for GC to mistake a primitive value for a reference when posting the exit event?
My understanding is: we are at a random bci of a method that is forced to return early. The expression stack is emptied and the return value is pushed on the expression stack then we call into the interpreter runtime to post the JVMTI method exit event during which we come to a safepoint for GC. The oop map for the bci does not cover this forced early return and if the return value is an object then the reference pushed on the expression stack before is not updated by GC. With your fix the value is updated if it is a reference.
If this is correct then to me it appears as if GC can also crash because the oop map for the random bci tells there has to be a reference at the stack position of the return value if it actually is a primitive value.

I think what you are saying is true. Note though that the return value of ForceEarlyReturn is installed with a handshake. The handshake polls of the interpreter are emitted in loop backedges and returns. At loop backedges, the expression stack is empty (required by OSR), and at returns the types match correctly. However, if an arbitrary bytecode performs a runtime call with call_VM() while the bottom of the expression stack is an oop, then I think there is an issue. At that call_VM, the early return value could get installed, and when the C++ function returns, we check for early returns, further dispatching to an unwind routine that posts the MethodExit notification. If we GC during this MethodExit notification, then I think you can crash the GC. The GC code generates an oop map for the frame, checking what the types in the expression stack should be. The early return int is pushed on the slot intersecting with the bottom entry in the expression stack. That bottom entry could be an oop, and the early return value could be an int. Then the early return int will be passed to the oop closure, which should result in a crash.

So I suspect that almost always, the handshake installing the ForceEarlyReturn value is installed with a handshake in a bytecode backedge or at a return, where the interpreter safepoint polls for the fast path code. Then you won't notice the issue. But in the rare scenario that the ForceEarlyReturn value is installed in a slow path call from a random bytecode... I can't see how that would work correctly.

Looking more closely, I gotta say I have no idea why there is a call to clear the expression stack at all in TemplateInterpreterGenerator::generate_earlyret_entry_for(). It is unclear to me what problem if any that solves. It does however seem like it introduces this problem of having a forced int return value intersect with an oop in the expression stack for a BCI performing slow-path calls into the VM. Simply removing the code that clears the expression stack, removes the issue, and I can't see that it introduces any other issue. Anyway, I think this is a separate bug. Do you mind if I push the fix for that bug, as a different RFR? It will likely involve poking around at more platform dependent code.

@reinrich
Copy link
Member

I think what you are saying is true. Note though that the return value of
ForceEarlyReturn is installed with a handshake. The handshake polls of the
interpreter are emitted in loop backedges and returns

Yes right, I wasn't really aware of this. The thread has to be suspended though
before the forced return (with a vm operation).

I gotta say I have no idea why there is a call to clear the expression stack
at all in TemplateInterpreterGenerator::generate_earlyret_entry_for(). It is
unclear to me what problem if any that solves.

Don't really know either. Only thing I currently can think of is that the
expression stack could overflow, not on x86 though.

Anyway, I think this is a separate bug. Do you mind if I push the fix for that
bug, as a different RFR?

Not at all.

It will likely involve poking around at more platform dependent code.

Likely. Another solution might be to delay loading the return value until the
activation is removed.

Copy link
Contributor

@sspitsyn sspitsyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Erik,

Nice discovery! Indeed, this is a long standing issue. It looks good in general.
I agree with Coleen, it would be nice if there is an elegant way to move the oop_result saving/restoring code to InterpreterRuntime::post_method_exit. Otherwise, I'm okay with what you have now.
It is also nice discovery of the issue with clearing the expression stack. I think, it was my mistake in the initial implementation of the ForceEarlyReturn when I followed the PopFrame implementation pattern. It is good to separate it from the current fix.

Thanks,
Serguei

@fisk
Copy link
Contributor Author

fisk commented Nov 2, 2020

I uploaded a new commit to perform some refactoring as requested by Coleen and Serguei. I made the oop save/restore + JRT_BLOCK logic belong only to the path taken from InterpreterRuntime::post_method_exit. An inner posting method is called both from that path and from JvmtiExport::notice_unwind_due_to_exception. I think the result is an improvement in terms of how clear it is. I didn't want to move logic all the way back to InterpreterRuntime::post_method_exit though, as I don't think it looks pretty to have large chunks of JVMTI implementation details in the interpreterRuntime.cpp file. So I did basically what you suggested, with the slight difference of moving all the JVMTI implementation into the JVMTI file instead, which is just called from InterpreterRuntime::post_method_exit. Hope you are okay with this!

@fisk
Copy link
Contributor Author

fisk commented Nov 2, 2020

Hi Erik,

Nice discovery! Indeed, this is a long standing issue. It looks good in general.
I agree with Coleen, it would be nice if there is an elegant way to move the oop_result saving/restoring code to InterpreterRuntime::post_method_exit. Otherwise, I'm okay with what you have now.
It is also nice discovery of the issue with clearing the expression stack. I think, it was my mistake in the initial implementation of the ForceEarlyReturn when I followed the PopFrame implementation pattern. It is good to separate it from the current fix.

Thanks,
Serguei

Thanks for reviewing this Serguei. And thanks for confirming our suspicions regarding clearing of the expression stack. I wasn't sure if anyone would be around that knew how it ended up there!
I made the refactoring that you and Coleen wanted, I think. Hope you like it!

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better. Just to have the JRT_BLOCK be unconditional is an improvement.

// for any thread that actually wants method exit, interp_only_mode is set
return;
}

// return a flag when a method terminates by throwing an exception
// i.e. if an exception is thrown and it's not caught by the current method
bool exception_exit = state->is_exception_detected() && !state->is_exception_caught();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this only applies to the case where the post_method_exit comes from remove_activation? Good to have it only on this path in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. There might be other cases, when remove_activation is called by the exception code. That's why I didn't want to change it to just true in this path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The post_method_exit can come from Zero interpreter:

src/hotspot/share/interpreter/zero/bytecodeInterpreter.cpp:            
     CALL_VM_NOCHECK(InterpreterRuntime::post_method_exit(THREAD));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the case of exception_exit is a condition from the call from notice_unwind_due_to_exception() and not the calls from InterpreterRuntime::post_method_exit() for both callers. But maybe if the exception is set and not caught, this post_method_exit() is called when unwinding to the exception handler. I can't tell, so leave it to be safe.

@sspitsyn
Copy link
Contributor

sspitsyn commented Nov 2, 2020

Erik,

Thank you for the update! It looks more elegant.

One concern is that after the move of this fragment to the post_method_exit_inner:

1614   if (state == NULL || !state->is_interp_only_mode()) {
1615     // for any thread that actually wants method exit, interp_only_mode is set
1616     return;
1617   }

there is no guarantee that the current frame is interpreted below:

1580     if (!exception_exit) {
1581       oop oop_result;
1582       BasicType type = current_frame.interpreter_frame_result(&oop_result, &value);
 . . .
1597   if (result.not_null() && !mh->is_native()) {
1598     // We have to restore the oop on the stack for interpreter frames
1599     *(oop*)current_frame.interpreter_frame_tos_address() = result();
1600   }

Probably, extra checks for current_frame.is_interpreted_frame() in these fragments will be sufficient.

@fisk
Copy link
Contributor Author

fisk commented Nov 3, 2020

Erik,

Thank you for the update! It looks more elegant.

One concern is that after the move of this fragment to the post_method_exit_inner:

1614   if (state == NULL || !state->is_interp_only_mode()) {
1615     // for any thread that actually wants method exit, interp_only_mode is set
1616     return;
1617   }

there is no guarantee that the current frame is interpreted below:

1580     if (!exception_exit) {
1581       oop oop_result;
1582       BasicType type = current_frame.interpreter_frame_result(&oop_result, &value);
 . . .
1597   if (result.not_null() && !mh->is_native()) {
1598     // We have to restore the oop on the stack for interpreter frames
1599     *(oop*)current_frame.interpreter_frame_tos_address() = result();
1600   }

Probably, extra checks for current_frame.is_interpreted_frame() in these fragments will be sufficient.

That makes sense. Added a check in the latest version that we are in interp only mode.

@sspitsyn
Copy link
Contributor

sspitsyn commented Nov 3, 2020

Hi Erik,

I'm not sure, if this fragment is still needed:

1620   if (state == NULL || !state->is_interp_only_mode()) {
1621     // for any thread that actually wants method exit, interp_only_mode is set
1622     return;
1623   }

Also, can it be that this condition is true:
(state == NULL || !state->is_interp_only_mode())
but the top frame is interpreted?
If so, then should we still safe/restore the result oop over a possible safepoint?

Thanks,
Serguei

@fisk
Copy link
Contributor Author

fisk commented Nov 4, 2020

Hi Erik,

I'm not sure, if this fragment is still needed:

1620   if (state == NULL || !state->is_interp_only_mode()) {
1621     // for any thread that actually wants method exit, interp_only_mode is set
1622     return;
1623   }

Seems like it is not needed. I removed it.

Also, can it be that this condition is true:
(state == NULL || !state->is_interp_only_mode())
but the top frame is interpreted?
If so, then should we still safe/restore the result oop over a possible safepoint?

It could definitely be that the top frame is interpreted even though that condition is true. However, we now enter InterpreterRuntime::post_method_exit as a JRT_BLOCK_ENTRY call, which performs no transition (similar to JRT_LEAF). So if so we should just return back to the caller without doing anything, and no GC will happen in this path then. It is only when we perform the JRT_BLOCK and JRT_BLOCK_END that we allow GCs to happen, and we save/restore the result across that section.

Thanks,
/Erik

Thanks,
Serguei

@openjdk
Copy link

openjdk bot commented Nov 4, 2020

@fisk Unknown command erik - for a list of valid commands use /help.

Copy link
Contributor

@sspitsyn sspitsyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erik,

Thank you for the explanation!
I agree with you, so the fix is good.

Thanks,
Serguei

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good.

// for any thread that actually wants method exit, interp_only_mode is set
return;
}

// return a flag when a method terminates by throwing an exception
// i.e. if an exception is thrown and it's not caught by the current method
bool exception_exit = state->is_exception_detected() && !state->is_exception_caught();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the case of exception_exit is a condition from the call from notice_unwind_due_to_exception() and not the calls from InterpreterRuntime::post_method_exit() for both callers. But maybe if the exception is set and not caught, this post_method_exit() is called when unwinding to the exception handler. I can't tell, so leave it to be safe.

@fisk
Copy link
Contributor Author

fisk commented Nov 5, 2020

/integrate

@openjdk openjdk bot closed this Nov 5, 2020
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 5, 2020
@openjdk
Copy link

openjdk bot commented Nov 5, 2020

@fisk Since your change was applied there have been 139 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

Pushed as commit 3a02578.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk-notifier bot referenced this pull request Nov 5, 2020
… oop

Reviewed-by: coleenp, dlong, rrich, sspitsyn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

5 participants