Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Object is cleaned too soon during unit testing #93

Open
pedrolamarao opened this issue Sep 25, 2024 · 21 comments
Open

Object is cleaned too soon during unit testing #93

pedrolamarao opened this issue Sep 25, 2024 · 21 comments

Comments

@pedrolamarao
Copy link

pedrolamarao commented Sep 25, 2024

We have a modular system with a service provider scheme.

One of our providers is gRPC-based.
We use a Cleaner to shutdown the managed channel when the provided service is collected.
This is how the provider builds the service:

final var builder = Grpc.newChannelBuilder(target,credentials);
builder.defaultServiceConfig(serviceConfig);
final var channel = builder.build();
final var service = new GrpcFoo( getLogger(GrpcFoo.class), newBlockingStub(channel) );
cleaner.register(service,channel::shutdownNow);
return service;

In our unit testing, there is a test like this:

@Test
void smoke ()
{
  final var foo = new GrpcProvider().open(locator);
  assertNotNull( foo );
  assertTrue( foo.exists("AES128") );
}

This test passes in Adoptium and Liberica 11 and 17 JDKs, but fails in Semeru 11 and 17 JDKs with a Channel shutdownNow invoked exception. Debugging confirms that this is happening because the Cleaner is calling the Runnable while the test is still running.

If we refactor the test to something like this:

volatile GrpcFoo foo;

@Test
void smoke ()
{
  foo= new GrpcProvider().open(locator);
  assertNotNull( foo );
  assertTrue( foo.exists("AES128") );
}

the test passes.

It seems that the object is being collected too soon.

@pshipton
Copy link
Member

@dmitripivkine @tajila fyi, pls take a look.

@dmitripivkine
Copy link

Is this test failing with -Xint too?

@dmitripivkine
Copy link

This is unlikely GC issue. Sounds as VM (or even JIT potentially). GC does not control set of the roots.

@pedrolamarao
Copy link
Author

-Xint did not change the result with semeru-17.0.9 for Windows 64 bits.

@pedrolamarao
Copy link
Author

Let me give you code a little less "redacted" to consider.

The Cleaner is defined here.

public interface KeysProvider
{
    Cleaner cleaner = Cleaner.create();

    // etc.
}

The channel is registered with the Cleaner in this method:

public final class GrpcProvider implements KeysProvider
{
    // other stuff

    @Override
    public Keys open (URI locator)
    {
       try 
       {
            // lots of peculiar configuration parsing

            final var builder = Grpc.newChannelBuilder(target,credentials);
            builder.defaultLoadBalancingPolicy("round_robin");
            builder.defaultServiceConfig(serviceConfig);

            final var channel = builder.build();
            final var service = new GrpcKeys( getLogger(GrpcKeys.class), newBlockingStub(channel) );
            cleaner.register(service,channel::shutdownNow);
            return service;
        }
        catch (Exception e)
        {
            throw new KeysException("not expected",e);
        }
    }
}

and the JUnit tests as above.

@tajila
Copy link
Member

tajila commented Sep 27, 2024

@babsingh Please take a look at this

@babsingh
Copy link
Member

babsingh commented Sep 27, 2024

@pedrolamarao Can you provide a micro-test that I can run locally to reproduce the failure? It's needed to generate artifacts for introspecting JVM behavior.

@babsingh
Copy link
Member

Below is the testcase that I derived from the above information. I am unable to reproduce the issue.

// Filename: Test2.java

import java.lang.ref.Cleaner;
import java.lang.ref.Reference;

public class Test2 {
        public Object getObject() {
                Cleaner cleaner = Cleaner.create();
                var obj = new String("a new string");
                Cleaner.Cleanable c = cleaner.register(obj, () -> { throw new RuntimeException("object cleaned"); } );
                return obj;
        }

        public static void main(String[] args) throws Throwable {
                Test2 test = new Test2();

                final var obj = test.getObject();

                System.out.println(obj);

                /* Invoke GC. */
                for (int i = 0; i < 100; i++) {
                        System.gc();
                        Thread.sleep(100);
                }

                System.out.println(obj);
        }
}

@babsingh
Copy link
Member

babsingh commented Sep 27, 2024

Assuming that KeysException is thrown when the test fails.
JVM option: -Xdump:java+system+heap:events=throw,filter=*/KeysException will generate the required artifacts.
Here is the doc for -Xdump for troubleshooting.

@pedrolamarao

  • Can you generate and provide us the artifacts with -Xdump in case it is not feasible to develop a micro-test?
  • Also, can you provide the java -version output of the JDK being employed in the test environment?
  • Did this failure happen during first time usage of OpenJ9? or Is this a regression i.e. OpenJ9 was running fine, then an upgrade led to the failure? If this is a regression, can you let us know the old version of OpenJ9 that was being used?

@pedrolamarao
Copy link
Author

The test is new; the Semeru JDK was not recently updated.

openjdk version "17.0.9" 2023-10-17
IBM Semeru Runtime Open Edition 17.0.9.0 (build 17.0.9+9)
Eclipse OpenJ9 VM 17.0.9.0 (build openj9-0.41.0, JRE 17 Windows 11 amd64-64-Bit Compressed References 20231017_568 (JIT enabled, AOT enabled)
OpenJ9   - 461bf3c70
OMR      - 5eee6ad9d
JCL      - 3699725139c based on jdk-17.0.9+9)

I have attached the -Xdump for gRPC's exception StatusRuntimeException which seems to me the interesting time point. I have attached only the small text files. If you require the large binary files, please advise how to share them.

javacore.20241001.151523.11480.0003.txt
javacore.20241001.151527.11480.0006.txt
javacore.20241001.151528.11480.0009.txt

@babsingh
Copy link
Member

babsingh commented Oct 1, 2024

@pedrolamarao

@pedrolamarao
Copy link
Author

Yes, it reproduces with version 17.0.12+7.
I have uploaded -Xdump files to the upload link provided.

Is -Xdump capable of breaking on arbitrary program points?
Could I set it to dump on a specific file/line location?

I may find time to try a minimal reproducer, but not soon.

@babsingh
Copy link
Member

babsingh commented Oct 1, 2024

Thanks for the core files. I will see if they pin-point the root cause.

-Xdump will need to be used in conjunction with jdb to produce core files at specific file/line locations. The below steps highlight this process.

Could I set it to dump on a specific file/line location?

You can set breakpoints through jdb.

Is -Xdump capable of breaking on arbitrary program points?

  • The application in jdb will need to be run with the following JVM option: export OPENJ9_JAVA_OPTIONS="-Xdump:java+system+heap:events=user".
  • jdb won't accept the above JVM option through the cmdline, and passing the JVM option through the OPENJ9_JAVA_OPTIONS env var should act as a workaround.
  • While the application is halted at a breakpoint via jdb, you can send a user signal to the Java process (e.g. kill -3 <PID>, SIGQUIT = 3).

@babsingh
Copy link
Member

babsingh commented Oct 1, 2024

@pedrolamarao Does the failure go away if we change the GC policy to -Xgcpolicy:nogc and -Xgcpolicy:optthruput? The default GC policy is gencon. More GC policies are here: https://eclipse.dev/openj9/docs/xgcpolicy.

@pedrolamarao
Copy link
Author

Test results with various gcpolicy:

  • gencon: fails
  • balanced: fails
  • optavgpause: fails
  • optthruput: fails
  • nogc: passes

@babsingh
Copy link
Member

babsingh commented Oct 3, 2024

@pedrolamarao Would it be possible for you to invoke System.gc() both before and after ClientCalls.getUnchecked (ClientCalls.java:249) and run the tests using Adoptium and Liberica JDKs? OpenJ9's garbage collection is known to be more aggressive compared to OpenJDK/Hotspot, particularly when it comes to object collection.

Since there might not be any active references to the object while the instance method is still executing, the object could potentially be eligible for collection by the GC. If manually invoking the garbage collector (via System.gc()) with Adoptium and Liberica JDKs results in the same issue, it could indicate that the object is indeed unreachable.

However, if this issue does not occur with these JDKs, we may need to delve deeper into OpenJ9. The core files only show the final state and don't provide insight into how the JVM reached that state. Further investigation through more active debugging (e.g., logging) may be required, though this approach is typically more tedious and time-consuming. That said, having a micro test that consistently reproduces the issue would significantly ease the process and help narrow down the root cause more quickly.

4XESTACKTRACE                at io/grpc/stub/ClientCalls.getUnchecked(ClientCalls.java:249) <--- Objects are UNREACHABLE
4XESTACKTRACE                at io/grpc/stub/ClientCalls.blockingUnaryCall(ClientCalls.java:167)
4XESTACKTRACE                at dev/prodist/keys/services/KeysGrpc$KeysBlockingStub.exists(KeysGrpc.java:879)
4XESTACKTRACE                at dev/prodist/keys/grpc/GrpcKeys.exists(GrpcKeys.java:73)
4XESTACKTRACE                at dev/prodist/keys/grpc/GrpcConfigurationTest.smoke(GrpcConfigurationTest.java:32)

@pedrolamarao
Copy link
Author

pedrolamarao commented Oct 7, 2024

We have a reproducer here: https://github.com/pedrolamarao/semeru-issue-93

With JAVA_HOME set to this:

PS P:\dev\semeru-issue-93> P:\jdk-17.0.12+7\bin\java -version
openjdk version "17.0.12" 2024-07-16
IBM Semeru Runtime Open Edition 17.0.12.1 (build 17.0.12+7)
Eclipse OpenJ9 VM 17.0.12.1 (build openj9-0.46.1, JRE 17 Windows 11 amd64-64-Bit Compressed References 20240716_793 (JIT enabled, AOT enabled)
OpenJ9   - 4760d5d320
OMR      - 840a9adba
JCL      - e70fd1fd64a based on jdk-17.0.12+7)

running ./gradlew test on this project fails.

@babsingh
Copy link
Member

babsingh commented Oct 8, 2024

@pedrolamarao I am getting the below errors. Am I missing steps or dependencies on my end?

Starting a Gradle Daemon, 1 incompatible and 1 stopped Daemons could not be reused, use --status for details
> Task :grpc:extractIncludeProto FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':grpc:extractIncludeProto'.
> Could not resolve all files for configuration ':grpc:compileProtoPath'.
   > Cannot resolve external dependency javax.annotation:javax.annotation-api:1.3.2 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency com.google.protobuf:protobuf-java:4.26.1 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency io.grpc:grpc-core:1.63.0 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency io.grpc:grpc-protobuf:1.63.0 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency io.grpc:grpc-services:1.63.0 because no repositories are defined.
     Required by:
         project :grpc
   > Cannot resolve external dependency io.grpc:grpc-stub:1.63.0 because no repositories are defined.
     Required by:
         project :grpc

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
> Get more help at https://help.gradle.org.

BUILD FAILED in 12s
3 actionable tasks: 3 executed

@pedrolamarao
Copy link
Author

Sorry, repositories must be coming from environment configuration.
I have added the necessary repository configuration to Gradle scripts, and validated in a separate environment.

babsingh added a commit to babsingh/openj9 that referenced this issue Oct 18, 2024
Marking the receiver object of an instance method ensures that it
is not collected by the garbage collector (GC) while the instance
method is executing.

If the receiver object is not marked, there is a risk that the GC
might collect it. In such cases, finalization mechanisms, such as
PhantomReferences and Cleaners, could indicate that the object has
been collected while it is still in use. This issue is resolved if
the receiver object is marked during the execution of its instance
method.

Fixes: ibmruntimes/Semeru-Runtimes#93

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
@babsingh
Copy link
Member

I have experimented with the RI to see if it collects the object early similar to OpenJ9. The RI never collected the object even after applying JVM options to make the RI's GC policy aggressive. This issue seems like a bug in OpenJ9 where the receiver object is collected while its instance method is executing. This happens because the receiver object is tagged as an I-slot. eclipse-openj9/openj9#20384 has been opened as a potential fix for this issue.

babsingh added a commit to babsingh/openj9 that referenced this issue Oct 31, 2024
Related: ibmruntimes/Semeru-Runtimes#93

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
babsingh added a commit to babsingh/openj9 that referenced this issue Oct 31, 2024
Related: ibmruntimes/Semeru-Runtimes#93

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
babsingh added a commit to babsingh/openj9 that referenced this issue Oct 31, 2024
Related: ibmruntimes/Semeru-Runtimes#93

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
babsingh added a commit to babsingh/openj9 that referenced this issue Oct 31, 2024
Related: ibmruntimes/Semeru-Runtimes#93

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
babsingh added a commit to babsingh/openj9 that referenced this issue Nov 6, 2024
Related: ibmruntimes/Semeru-Runtimes#93

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
babsingh added a commit to babsingh/openj9 that referenced this issue Nov 6, 2024
Related: ibmruntimes/Semeru-Runtimes#93

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
babsingh added a commit to babsingh/openj9 that referenced this issue Nov 12, 2024
j9localmap_LocalBitsForPC is more aggressive than
j9localmap_DebugLocalBitsForPC in the way it analyzes and marks local
variables as object references. Aggressiveness relates to determining
whether a local object is being used at a specific point in the
bytecode execution, and this helps optimize garbage collection and
runtime performance. In contrast, j9localmap_DebugLocalBitsForPC
adopts a conservative approach; it overestimates and plays safe by
marking objects alive even if they are not being used.

j9localmap_DebugLocalBitsForPC resolves the issue seen in
ibmruntimes/Semeru-Runtimes#93.

We have run the following benchmarks: Liberty DT8/DT10 Startup &
Footprint, JRuby and Nashorn. No performance difference is seen between
j9localmap_DebugLocalBitsForPC and j9localmap_LocalBitsForPC.

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
babsingh added a commit to babsingh/openj9 that referenced this issue Nov 12, 2024
j9localmap_LocalBitsForPC is more aggressive than
j9localmap_DebugLocalBitsForPC in the way it analyzes and marks local
variables as object references. Aggressiveness relates to determining
whether a local object is being used at a specific point in the
bytecode execution, and this helps optimize garbage collection and
runtime performance. In contrast, j9localmap_DebugLocalBitsForPC
adopts a conservative approach; it overestimates and plays safe by
marking objects alive even if they are not being used.

j9localmap_DebugLocalBitsForPC resolves the issue seen in
ibmruntimes/Semeru-Runtimes#93.

We have run the following benchmarks: Liberty DT8/DT10 Startup &
Footprint, JRuby and Nashorn. No performance difference is seen between
j9localmap_DebugLocalBitsForPC and j9localmap_LocalBitsForPC.

Since there are no side-effects of j9localmap_DebugLocalBitsForPC,
this PR makes it OpenJ9's default mapper.

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
@pshipton
Copy link
Member

pshipton commented Nov 12, 2024

The change caused problems and was reverted.
eclipse-openj9/openj9#20579

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants