Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory Corruption in JNI interface #2472

Closed
revans2 opened this issue Jun 20, 2017 · 7 comments
Closed

Memory Corruption in JNI interface #2472

revans2 opened this issue Jun 20, 2017 · 7 comments

Comments

@revans2
Copy link

revans2 commented Jun 20, 2017

I noticed when trying to upgrade to v5.3.6 that when I ran the jtest that it failed 100% of the time with memory corruption errors.

*** Error in `java': malloc(): memory corruption (fast): 0x00002afb244a0200 ***

Aborted (core dumped)
*** glibc detected *** /libexec64/jdk64-1.8.0/jre/bin/java: malloc(): memory corruption (fast): 0x00007f29f44a0190 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75f3e)[0x7f29fa996f3e]
/lib64/libc.so.6(+0x7a608)[0x7f29fa99b608]
/lib64/libc.so.6(__libc_malloc+0x5c)[0x7f29fa99bbfc]
/libexec64/jdk64-1.8.0/jre/lib/amd64/server/libjvm.so(+0x9138c5)[0x7f29fa25a8c5]
/libexec64/jdk64-1.8.0/jre/lib/amd64/server/libjvm.so(+0x6dd2ca)[0x7f29fa0242ca]
/libexec64/jdk64-1.8.0/jre/lib/amd64/libjava.so(Java_java_lang_ClassLoader_00024NativeLibrary_find+0x57)[0x7f29f908f017]
[0x7f29e5335fbc]
======= Memory map: ========
...

or other times

Run: org.rocksdb.ReadOptionsTest testing now -> totalOrderSeek
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f39e0da6bba, pid=16282, tid=139886574049024
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x79bba]

I found that it only happens when debug is disabled DEBUG=0 because of that travis does not appear to be detecting it.

I setup my own travis run https://travis-ci.org/revans2/rocksdb/builds/245099099 with this enabled and it is failing on both clang and gcc for ubuntu.

I have reproduced the issue using:

g++ (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)
Red Hat Enterprise Linux Server release 6.8 (Santiago)
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

and also:

g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9)
Red Hat Enterprise Linux Server release 7.1 (Maipo)
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

The exact command I ran was

gmake V=1 PORTABLE=1 DEBUG_LEVEL=0 clean jclean rocksdbjava jtest

Removing DEBUG_LEVEL=0 make everything work smoothly.

Using git bisect I traced it down to #1890 specifically version c6d464a.

For some reason master is not showing this issue. Using git bisect again #2366 (git version c2c62ad) appears to have "fixed" the issue, but unit tests are failing for jtest so I don't know if it really fixed it or if it is just somehow masking the failures.

I don't know how the official 5.4.5, 5.3.6, or 5.3.4 releases or rocksdbjni were built but if they were built with DEBUG=0 anyone using them runs the risk of memory corruption.

@merlimat
Copy link
Contributor

@adamretter do you have any idea or suspicion on what could be the trigger of this issue?

@sagar0
Copy link
Contributor

sagar0 commented Jun 21, 2017

The official release jars are not typically built with DEBUG=0 flag at our end. They are built with something like PORTABLE=1 make rocksdbjavastatic (substitute with either rocksdbjavastaticrelease or rocksdbjavastaticreleasedocker based on the need). @adamretter please confirm.

I am not yet sure on what caused the corruption, but it is interesting to note in the above description that ReadOptionsTest which was failing occasionally surprisingly seems to have gotten fixed by reordering the fields in ReadOptions in #2366. How?!?

@revans2
Copy link
Author

revans2 commented Jun 21, 2017

@sagar0 I am at a loss for this too because DEBUG=0 only appears to add -DNDEBUG to the command line and remove -Werror

# if we're compiling for release, compile without debug code (-DNDEBUG) and
# don't treat warnings as errors
ifeq ($(DEBUG_LEVEL),0)
OPT += -DNDEBUG
DISABLE_WARNING_AS_ERROR=1
else
$(warning Warning: Compiling in debug mode. Don't use the resulting binary in production)
endif

but that warning scared me into thinking what else might be happening.

In the java Makefile it also removes -ea -Xcheck:jni from the command line. It could be something related to the assertions not being enabled and we are triggering the issue in one of the assertions, or it could be something with how java is checking the JNI.

I'll try to play around with those and see what I can come up with.

@revans2
Copy link
Author

revans2 commented Jun 21, 2017

It has something to do with assertions. -ea on the java command line. The patch

index 5725c99..3b2ba0f 100644
--- a/java/Makefile
+++ b/java/Makefile
@@ -142,8 +142,11 @@ JAVA_ARGS? =
 # When debugging add -Xcheck:jni to the java args
 ifneq ($(DEBUG_LEVEL),0)
        JAVA_ARGS = -ea -Xcheck:jni
+else
+       JAVA_ARGS = -ea
 endif

+
 clean:
        $(AM_V_at)rm -rf include/*
        $(AM_V_at)rm -rf test-libs/

Fixes the problem.

My guess right now is that there is an assertion protecting the API from some very bad behavior, and there is a test that is doing that bad behavior. With assertions disabled it lets the bad call through and memory is corrupted. I feel much better now, but it would be nice to know what that assertion is, but it is not that critical.

@adamretter adamretter self-assigned this Jun 23, 2017
@adamretter
Copy link
Collaborator

adamretter commented Jul 7, 2017

Okay do I think I can see what is happening here. If we study for example ReadOptionsTest#failsetSnapshotUninitialized.

We can see that it explicitly calls ReadOptions#close() in ReadOptionsTest#setupUninitializedReadOptions. After which it calls ReadOptions#setSnapshot(Snapshot), the code of which looks like:

  public ReadOptions setSnapshot(final Snapshot snapshot) {
    assert(isOwningHandle());
    if (snapshot != null) {
      setSnapshot(nativeHandle_, snapshot.nativeHandle_);
    } else {
      setSnapshot(nativeHandle_, 0l);
    }
    return this;
  }

So the problem is that, when you have Java assertions disabled, i.e. when running the tests in non-debug mode, some of the tests in ReadOptionsTest, first close the ReadOptions instance (which deletes the underlying C++ object), they then try and call ReadOptions#setSnapshot(Snapshot) which via JNI executes the C++:

reinterpret_cast<rocksdb::ReadOptions*>(jhandle)->snapshot =
    reinterpret_cast<rocksdb::Snapshot*>(jsnapshot);

but that causes a use-after-free leading to a SIGSEGV as the jhandle is now a nullptr from the previous close call.

You don't see this when Java assertions are enabled because assert(isOwningHandle()); causes a java.lang.AssertionError which means that the native method setSnapshot(long, long) is never called. The unit tests are coded in such a way, that they expect to run with assertions enabled and they test that an AssertionError was thrown.

I think we have a few different options here:

  1. We decide that if a Java object has already been closed, i.e. (the underlying C++ object was deleted or ownership was transferred), and ownership is required, then we throw a checked exception instead of using an assert(). This would mean the tests would work in both debug and non-debug modes. This would have the advantage of helping Java developers avoid accidents which lead to SIGSEGV. It would add a very small overhead, where an additional Atomic CAS would be required on each function call.

  2. We don't change the behaviour as such, but we modify the tests that check for AssertionError in either one of two ways:

  3. They are not executed if assertions are disabled. We can check for the assertion enabled status at runtime.

  4. We modify the tests, so that they check by means other than an AssertionError. This might not be easy, or might require (1).

  5. We do nothing. We just modify the Makefile so that tests are always run with assertions enabled.

My preferred approach is probably (1) as it prevents Java developers from causing SIGSEGV, which they should never have to worry about. However the overhead could be an issue.

@sagar0 @siying your comments please on the approach to take...

@sagar0
Copy link
Contributor

sagar0 commented Jul 10, 2017

@adamretter Thanks for investigating the issue, and providing a detailed explanation with various options!

I agree with you that it might be better to go with option (1) i.e. throw checked exceptions when dealing with use-after-free cases; I agree that it is a lot more JVM-developer-friendly way. With regard to the overhead, I believe that people using RocksJava already accept the larger overhead of running their service on a JVM, and the additional overhead of the CAS you mention in this particular issue may not be significant. Lets try to get some performance numbers to quantify the additional overhead, if any. We can never be certain without benchmarking the overhead.

I also think (3) should be the default. Since all our unit tests expect assertions to be enabled, there is no point in running tests without asserts enabled. I am willing to be wrong on this ... and open to discussion, if others in the community have a different view.

@gfosco
Copy link
Contributor

gfosco commented Jan 10, 2018

Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants