grpc-java libio_grpc_netty_shaded_netty_transport_native_epoll_x86_64 jvm crash after update to 1.45.1 #9083

Closed
lmcdasi opened this issue Apr 14, 2022 · 24 comments

@lmcdasi

lmcdasi commented Apr 14, 2022

I'm having a weird issue when I execute a JUnit integration test after upgrading grpc versions. The JVM crashes on Linux only; the same test case runs without issue on Windows, and I do not understand why.
I would expect the same behavior on both, since it is the same mvnw clean verify -Djacoco.skip=false command on both OSes.
Attached is the hs file.

At runtime it does not seem to pose any issue on a Linux OS. The junit is using 'grpc inProcess for both client/server'.

When using grpc version 1.40.1, my test passes successfully:

[INFO] |  +- io.opentracing.contrib:opentracing-grpc:jar:0.2.3:compile
[INFO] |  +- net.devh:grpc-spring-boot-starter:jar:2.13.0.RELEASE:compile
[INFO] |  |  \- net.devh:grpc-server-spring-boot-starter:jar:2.13.0.RELEASE:compile
[INFO] |  |     \- net.devh:grpc-server-spring-boot-autoconfigure:jar:2.13.0.RELEASE:compile
[INFO] |  |        \- io.grpc:grpc-services:jar:1.42.1:compile
[INFO] |  +- io.grpc:grpc-protobuf:jar:1.43.2:compile
[INFO] |  |  +- com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile
[INFO] |  |  \- io.grpc:grpc-protobuf-lite:jar:1.43.2:compile
[INFO] |  +- io.grpc:grpc-stub:jar:1.43.2:compile
[INFO] +- net.devh:grpc-client-spring-boot-starter:jar:2.13.0.RELEASE:compile
[INFO] |  \- net.devh:grpc-client-spring-boot-autoconfigure:jar:2.13.0.RELEASE:compile
[INFO] |     \- net.devh:grpc-common-spring-boot:jar:2.13.0.RELEASE:compile
[INFO] +- io.grpc:grpc-netty-shaded:jar:1.40.1:compile
[INFO] +- io.grpc:grpc-core:jar:1.40.1:compile
[INFO] |  +- io.grpc:grpc-api:jar:1.40.1:compile (version selected from constraint [1.40.1,1.40.1])
[INFO] |  |  \- io.grpc:grpc-context:jar:1.40.1:compile

After upgrade:
hs_err_pid1033.log

$ ./mvnw dependency:tree | grep grpc
[INFO] |  +- io.opentracing.contrib:opentracing-grpc:jar:0.2.3:compile
[INFO] |  +- net.devh:grpc-spring-boot-starter:jar:2.13.0.RELEASE:compile
[INFO] |  |  \- net.devh:grpc-server-spring-boot-starter:jar:2.13.0.RELEASE:compile
[INFO] |  |     \- net.devh:grpc-server-spring-boot-autoconfigure:jar:2.13.0.RELEASE:compile
[INFO] |  |        \- io.grpc:grpc-services:jar:1.42.1:compile
[INFO] |  +- io.grpc:grpc-protobuf:jar:1.43.2:compile
[INFO] |  |  +- com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile
[INFO] |  |  \- io.grpc:grpc-protobuf-lite:jar:1.43.2:compile
[INFO] |  +- io.grpc:grpc-stub:jar:1.43.2:compile
[INFO] +- net.devh:grpc-client-spring-boot-starter:jar:2.13.0.RELEASE:compile
[INFO] |  \- net.devh:grpc-client-spring-boot-autoconfigure:jar:2.13.0.RELEASE:compile
[INFO] |     \- net.devh:grpc-common-spring-boot:jar:2.13.0.RELEASE:compile
[INFO] +- io.grpc:grpc-netty-shaded:jar:1.45.1:compile
[INFO] +- io.grpc:grpc-core:jar:1.45.1:compile
[INFO] |  +- io.grpc:grpc-api:jar:1.45.1:compile (version selected from constraint [1.45.1,1.45.1])
[INFO] |  |  \- io.grpc:grpc-context:jar:1.45.1:compile

Any idea what is wrong?

@ejona86
Member

ejona86 commented Apr 14, 2022

This seems likely caused by tcnative or epoll, as those are both JNI components so can cause crashes when things go awry. It looks like your tests do use Netty; I see InProcessOrAlternativeChannelFactory creating a Netty channel. It is unlikely to be tcnative because I don't see it in the logs and you'd most likely use plaintext in a test. Importantly, you are using grpc-netty-shaded and I don't see any other Netty usages (at least no other binary components), which removes multiple classes of potential problems.

Nothing jumps out immediately, but I've only glanced. I'll need to look deeper tomorrow.

@lmcdasi
Author

lmcdasi commented Apr 14, 2022

This is the test case that fails on Linux but not on Windows. I stripped the assert sections ... For now I added @DisabledOnOs(value={OS.LINUX}). I used plaintext for the junit.

@ActiveProfiles("test")
@SpringBootTest(properties = {
        "grpc.server.inProcessName=test",
        "grpc.server.port=-1",
        "grpc.client.inProcess.address=in-process:test"
})
@ExtendWith(SpringExtension.class)
@TestPropertySource(locations={"classpath:application-test.yml"})
@ContextConfiguration(initializers={ConfigDataApplicationContextInitializer.class},
    classes = {CacheManager.class, GrpcClientAutoConfiguration.class, GrpcChannelConfigurer.class,
               GrpcChannelFactory.class}
)
@EnableConfigurationProperties(value = {GrpcConfigurer.class, ConnectorProperties.class})
@DirtiesContext
@DisabledOnOs(value={OS.LINUX})
@SuppressWarnings({"PMD.CommentDefaultAccessModifier", "PMD.DefaultPackage",
    "PMD.UnusedPrivateField", "PMD.UnnecessaryAnnotationValueElement"})
class GrpcConfigurerITTest {
    private static final String SINGLE_SERVICE_DOUBLE_METHOD = "single-service-double-method";
    private static final String SINGLE_SERVICE_SINGLE_METHOD = "single-service-single-method";
    private static final String SINGLE_SERVICE_EMPTY_METHOD = "single-service-empty-method";

    @GrpcClient(SINGLE_SERVICE_DOUBLE_METHOD)
    private ConnectorServiceGrpc.ConnectorServiceBlockingStub connectorServiceBlockingStub;

    @GrpcClient(SINGLE_SERVICE_SINGLE_METHOD)
    private ConnectorServiceGrpc.ConnectorServiceBlockingStub connectorServiceBlockingStubTwo;

    @GrpcClient(SINGLE_SERVICE_EMPTY_METHOD)
    private ConnectorServiceGrpc.ConnectorServiceBlockingStub connectorServiceBlockingStubThree;

    ......
    }

@lmcdasi
Author

lmcdasi commented Apr 14, 2022

I checked at runtime whether the .so libio_grpc_netty_shaded_netty_transport_native_epoll_x86 is loaded, and it looks like it is. But my runtime is an Ubuntu image.

7f06b1800000-7f06b1810000 r-xp 00000000 fd:00 655529 /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_645818328392020533487.so (deleted)
7f06b1810000-7f06b1a0f000 ---p 00010000 fd:00 655529 /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_645818328392020533487.so (deleted)
7f06b1a0f000-7f06b1a11000 rw-p 0000f000 fd:00 655529 /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_645818328392020533487.so (deleted)
maps.txt
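
The same check can be scripted. The following is a minimal sketch, assuming the standard Linux /proc/self/maps layout (address, permissions, offset, device, inode, optional path); the class name MappedLibraries and the method find are hypothetical and not part of gRPC or Netty.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: lists mapped files whose path contains a given fragment.
final class MappedLibraries {
    /** Returns the distinct mapped file paths that contain the given fragment. */
    static Set<String> find(List<String> mapsLines, String fragment) {
        Set<String> paths = new LinkedHashSet<>();
        for (String line : mapsLines) {
            // Fields: address, perms, offset, device, inode, optional path.
            String[] fields = line.trim().split("\\s+", 6);
            if (fields.length == 6 && fields[5].contains(fragment)) {
                // Drop the " (deleted)" suffix shown for files unlinked after mapping.
                paths.add(fields[5].replace(" (deleted)", ""));
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        Path maps = Paths.get("/proc/self/maps"); // exists on Linux only
        if (Files.isReadable(maps)) {
            find(Files.readAllLines(maps), "netty_transport_native_epoll")
                    .forEach(System.out::println);
        }
    }
}
```

Run inside the JVM under test, this prints nothing unless the epoll library has actually been mapped into the process.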

I used that Ubuntu image for the build and the JUnit test passes.

On the Alpine image, however, I noticed:
bash-5.0# ldd /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_644589436771579381810.so
/lib/ld-musl-x86_64.so.1 (0x7fadd0b13000)
librt.so.1 => /lib/ld-musl-x86_64.so.1 (0x7fadd0b13000)
libdl.so.2 => /lib/ld-musl-x86_64.so.1 (0x7fadd0b13000)
libc.so.6 => /lib/ld-musl-x86_64.so.1 (0x7fadd0b13000)
Error relocating /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_644589436771579381810.so: __strdup: symbol not found
Error relocating /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_644589436771579381810.so: __strndup: symbol not found
bash-5.0#

Thus I think the issue is with the alpine image.
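
To spot this condition automatically, the ldd output can be scanned for unresolved symbols. This is a minimal sketch, assuming the "Error relocating <file>: <symbol>: symbol not found" wording shown in the output above; the class LddRelocationErrors and its method are hypothetical helpers, not an existing tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts unresolved symbol names from musl ldd output.
final class LddRelocationErrors {
    // Matches lines of the form "Error relocating <file>: <symbol>: symbol not found".
    private static final Pattern MISSING =
            Pattern.compile("^Error relocating (.+): (\\S+): symbol not found$");

    /** Returns the symbol names reported as missing, in order of appearance. */
    static List<String> missingSymbols(List<String> lddOutput) {
        List<String> symbols = new ArrayList<>();
        for (String line : lddOutput) {
            Matcher m = MISSING.matcher(line.trim());
            if (m.matches()) {
                symbols.add(m.group(2));
            }
        }
        return symbols;
    }

    public static void main(String[] args) {
        // Sample lines with a shortened, made-up library path.
        List<String> sample = Arrays.asList(
                "libc.so.6 => /lib/ld-musl-x86_64.so.1 (0x7fadd0b13000)",
                "Error relocating /tmp/libexample.so: __strdup: symbol not found",
                "Error relocating /tmp/libexample.so: __strndup: symbol not found");
        System.out.println(missingSymbols(sample));
    }
}
```

A non-empty result means the library references symbols that the musl loader cannot resolve, which is the situation reported here.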

@lmcdasi
Author

lmcdasi commented Apr 14, 2022

My issue is exactly the same as: #8751

grpc netty version 1.40 was running ok under musl but at least from 1.42 that compatibility has been lost.

@ejona86
Member

ejona86 commented Apr 14, 2022

Ah, yeah, musl is being used. So you use musl during your test execution, but glibc for production? Note there was a solution/workaround for the musl issue: #8751 (comment)

@lmcdasi
Author

lmcdasi commented Apr 14, 2022

I actually have two runtime environments: one using Ubuntu, which has no issue, and one using Alpine. The build is done on an Alpine image for both environments. I mixed the two up for a few moments.

Since version 1.40.1 was working with musl ... it would be 'nice' to keep that backward compatibility ...

@ejona86
Member

ejona86 commented Apr 14, 2022

Well, we never officially supported musl to begin with. It used to just be broken and there were some community repositories that would build musl .so's. And then musl did some glibc compat stuff that made it work. And we were okay with that. And then glibc upgrades broke it again (although there is the environment variable workaround). Given what I learned about the musl linker in #8751, I think it is now dead to me (personally speaking); it was too time-consuming to figure out the cause of a mundane issue.

I saw the email notification where you mention -Dio.netty.transport.noNative=true. That very well might work, but you need to use the shaded name: -Dio.grpc.netty.shaded.io.netty.transport.noNative=true
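
For tests where passing a -D flag is awkward, the same property can be set programmatically. This is a minimal sketch, assuming it runs before any gRPC or Netty class is initialized; the class ShadedNettyProperties and its methods are hypothetical, and only the one property name given in the comment above is taken from this thread.

```java
// Hypothetical helper: sets the shaded variant of a Netty system property.
final class ShadedNettyProperties {
    // Package prefix used by grpc-netty-shaded, as given in the comment above.
    static final String SHADED_PREFIX = "io.grpc.netty.shaded.";

    /** Prepends the shaded package prefix to a plain Netty property name. */
    static String shaded(String nettyProperty) {
        return SHADED_PREFIX + nettyProperty;
    }

    /**
     * Asks the shaded Netty copy not to use native transports.
     * Assumption: this must run before any gRPC/Netty class is loaded.
     */
    static void disableNativeTransport() {
        System.setProperty(shaded("io.netty.transport.noNative"), "true");
    }

    public static void main(String[] args) {
        disableNativeTransport();
        System.out.println(shaded("io.netty.transport.noNative") + "="
                + System.getProperty(shaded("io.netty.transport.noNative")));
    }
}
```

Whether this is enough to avoid the crash on Alpine is discussed further down the thread, so treat it as something to try rather than a confirmed fix.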

@ejona86
Member

ejona86 commented Apr 14, 2022

I guess also I learned that the libc-compat glue in Alpine just won't help us, because it only helps if the main binary was built for glibc but in these cases people are trying to use a native-musl java with a native-glibc .so and that combination doesn't work with the compat stuff automatically.

@briceburg

I can confirm that using -Dio.grpc.netty.shaded.io.netty.transport.noNative=true avoids the segfault. Relying on glibc-compat is not a safe or wise thing to do... and thankfully Java 17 has a native musl port. Combine that with this workaround, and I think we're OK for those forced to use an Alpine base.

@lmcdasi
Author

lmcdasi commented Apr 15, 2022

I'm not sure about '-Dio.grpc.netty.shaded.io.netty.transport.noNative=true'. As you noticed, I have deleted my comments regarding this setting.

The crash occurs in a static initializer in Java, so I'm not sure that setting will work. Initially, because of my mixed environments, I thought that it worked, but after retrying I realized that it actually does not. So I'll have to retry it once more and make sure that I use Alpine ):

io/grpc/netty/Utils.java
static {
    // Decide default channel types and EventLoopGroup based on Epoll availability
    if (isEpollAvailable()) {

where:

Class
    .forName("io.netty.channel.epoll.Epoll")
    .getDeclaredMethod("isAvailable")
    .invoke(null);

So how can I disable Epoll and fall back to NioSocketChannel?
In mvn, what do I need to exclude in the pom file so that io.netty.channel.epoll.Epoll is missing?

Am I missing something?
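
The reflective lookup quoted above can be sketched in isolation. This is a minimal illustration, not the grpc-java source: ReflectiveProbe and AlwaysAvailable are hypothetical names, and the nested class merely stands in for a class with a static isAvailable() method.

```java
import java.lang.reflect.Method;

// Hypothetical sketch of probing for an optional class via reflection.
final class ReflectiveProbe {
    /** Returns true only if the class exists and its static isAvailable() returns true. */
    static boolean isAvailable(String className) {
        try {
            Method m = Class.forName(className).getDeclaredMethod("isAvailable");
            return Boolean.TRUE.equals(m.invoke(null));
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class or method missing, or class initialization failed at the Java level.
            return false;
        }
    }

    // Stand-in for a transport class exposing a static availability check.
    public static final class AlwaysAvailable {
        public static boolean isAvailable() {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAvailable(AlwaysAvailable.class.getName()));
        System.out.println(isAvailable("io.netty.channel.epoll.Epoll")); // false unless Netty is on the classpath
    }
}
```

Note the limit of this pattern: the catch block handles Java exceptions and errors only. If loading the class ends in a native crash, as reported in this issue, the JVM dies before any catch block can run.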

@ejona86
Member

ejona86 commented Apr 18, 2022

> In mvn what do I need to exclude in the pom file in order to have the io.netty.channel.epoll.Epoll missing ?!?

That's not an option with grpc-netty-shaded. Epoll is included directly in that artifact. You would need to swap to using grpc-netty instead.

@lmcdasi
Author

lmcdasi commented Apr 18, 2022

OK - thank you.

@lmcdasi
Author

lmcdasi commented Apr 21, 2022

Something does not add up. In practice I can do:

        if (channelBuilder instanceof NettyChannelBuilder) {
            LOGGER.info("Setting NettyChannelBuilder");
            final ThreadFactory eventLoopGroupThreadFactory = new DefaultThreadFactory("cc-grpc-nio-worker-ELG", true);
            final EventLoopGroup eventLoopGroup = new NioEventLoopGroup(0, eventLoopGroupThreadFactory);

            ((NettyChannelBuilder) channelBuilder).channelType(NioSocketChannel.class)
                    .eventLoopGroup(eventLoopGroup);
        }

I do this in my grpc-java setup using a GrpcChannelConfigurer.

That should let me avoid having the Utils static block kick in, which forces Epoll.

So even though I should have a choice between Nio and Epoll, I actually do not.

That sounds like a defect to me, no?

@ejona86
Member

ejona86 commented Apr 21, 2022

> So even though I should have a choice between Nio/Epoll, actually I do not have.

I agreed with everything up until this part. It seems you forgot to say what broke?

@lmcdasi
Author

lmcdasi commented Apr 21, 2022

Well, on Alpine, even if I set channelType(NioSocketChannel.class), the Utils class with its static blocks somehow triggers a VM crash because Epoll is not available.

So, to me, it looks like I cannot bypass Epoll even though I set Nio.

Am I missing something?

@lmcdasi
Author

lmcdasi commented Apr 21, 2022

What is the point of having ".channelType(NioSocketChannel.class)" if I cannot actually use it?

@lmcdasi
Author

lmcdasi commented Apr 21, 2022

I do not think NettyChannelBuilder is using the builder pattern properly.

The default Epoll/Nio event loop groups should be created ONLY if no channelType method has been called.

They should not be loaded into memory by default.

So, say on Ubuntu, I will get Epoll by default, and then if I want to use Nio, I will end up with both classes, Epoll and Nio, loaded.

I do not think I'm wrong.
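
The behaviour requested above can be illustrated with a small sketch. It is not the NettyChannelBuilder code and makes no claim about how grpc-java is or should be implemented: LazyDefaultsSketch, Builder and the string-valued "transport" are all hypothetical, and the Supplier stands in for the expensive Epoll probe.

```java
import java.util.function.Supplier;

// Hypothetical sketch: the default is computed only when the caller made no explicit choice.
final class LazyDefaultsSketch {
    static final class Builder {
        private final Supplier<String> defaultTransport; // stands in for the Epoll probe
        private String transport;                        // set only on an explicit choice

        Builder(Supplier<String> defaultTransport) {
            this.defaultTransport = defaultTransport;
        }

        Builder transport(String transport) {
            this.transport = transport;
            return this;
        }

        String build() {
            // The supplier is consulted only if nothing was chosen explicitly.
            return transport != null ? transport : defaultTransport.get();
        }
    }

    public static void main(String[] args) {
        Supplier<String> probe = () -> {
            System.out.println("probing for epoll");
            return "epoll";
        };
        System.out.println(new Builder(probe).transport("nio").build()); // probe is not invoked
        System.out.println(new Builder(probe).build());                  // probe is invoked here
    }
}
```

With this shape, a caller who picks a transport explicitly never triggers the probe, while a caller who picks nothing still pays for it at build time.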

@ngrigoriev

I second @lmcdasi's observation. NettyChannelBuilder statically refers to the Utils class. The Utils class has a static initializer that calls the isEpollAvailable() method. This method loads the Epoll class (io.netty.channel.epoll.Epoll) through reflection and invokes its isAvailable() method. This triggers the static initializer in Epoll, which attempts to load the native library. And this is where the JVM crashes, because the library crashes. There is not a single flag or system property that breaks this chain, nor a way to provide a configurable name for that Epoll class or a factory. This chain is static.

@ejona86
Member

ejona86 commented Apr 21, 2022

I see. There is a choice between Nio and Epoll threads and polling, but there isn't a way to avoid Epoll native library loading, except using grpc-netty instead of grpc-netty-shaded. io.grpc.netty.shaded.io.netty.transport.noNative gets you further because it avoids Epoll triggering the initialization of io.netty.channel.epoll.Native, as long as you don't call any other methods within io.netty.channel.epoll.

It would be possible to delay initialization of the builder fields to avoid the epoll initialization, but it doesn't seem to buy us too much as it would require all Alpine users to manage their own loops, lest they get a runtime crash. A crash is too horrible of a failure mode.

I think netty/netty#12272 will be the real fix here, as it does -Wl,-z,now which disables the delayed loading of symbols. That should cause dlopen to fail and let grpc handle this gracefully by falling back to NIO. With that in place it wouldn't be bad to avoid loading epoll when the application specifies the event loop, but the gains will be more minor at that point.
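
The graceful path described above relies on a load failure being reported to Java instead of crashing the process. A failed System.load surfaces as an UnsatisfiedLinkError, which can be caught. The following is a minimal sketch of that fallback; NativeLoadFallback, chooseTransport and the library path are hypothetical, and the "epoll"/"nio" strings only label the two outcomes.

```java
// Hypothetical sketch: fall back when a native library cannot be loaded.
final class NativeLoadFallback {
    /** Returns "epoll" if the library loads, otherwise "nio". */
    static String chooseTransport(String absoluteLibraryPath) {
        try {
            System.load(absoluteLibraryPath);
            return "epoll";
        } catch (UnsatisfiedLinkError e) {
            // The load was rejected up front, so the JVM is still alive and can fall back.
            return "nio";
        }
    }

    public static void main(String[] args) {
        // A made-up path that does not exist, so the load is expected to fail.
        System.out.println(chooseTransport("/nonexistent/libhypothetical_epoll.so"));
    }
}
```

This only helps when the failure happens at load time. A library that loads successfully and fails later on a lazily resolved symbol is the crash scenario from this issue, which no catch block can intercept.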

@sergiitk sergiitk assigned temawi and unassigned temawi May 3, 2022
@ejona86
Member

ejona86 commented May 4, 2022

Seems like there's nothing more to do here. There will be a netty release containing the -z,now change and gRPC will upgrade to it in normal course. If there's something remaining, comment, and it can be reopened.

@ejona86 ejona86 closed this as completed May 4, 2022
@varpa89

varpa89 commented May 5, 2022

@ejona86 could you please clarify: will it just be an exception instead of a JVM crash, or will we be able to work without glibc (but with musl)?

@ejona86
Member

ejona86 commented May 5, 2022

With the change to Netty, gRPC will fall back to other options. If you are running on OpenJDK 8u252 or later, then it will still work, although performance may be lower compared to a glibc system, especially on Java 8.
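
To check a runtime against the 8u252 threshold mentioned above, the java.version system property can be parsed. This is a minimal sketch, assuming the usual version string shapes ("1.8.0_252" for Java 8, "11.0.2" or "17" for later releases); JavaVersionCheck and atLeast8u252 are hypothetical names.

```java
// Hypothetical helper: compares a java.version string against 8u252.
final class JavaVersionCheck {
    /** Returns true for Java 8 update 252 or later, and for any Java 9+ release. */
    static boolean atLeast8u252(String javaVersion) {
        // Drop build or pre-release suffixes such as "-ea" or "+9".
        String core = javaVersion.split("[-+]", 2)[0];
        if (core.startsWith("1.")) {
            // Legacy scheme: 1.<major>.<minor>_<update>, for example 1.8.0_252.
            String[] parts = core.split("[._]");
            int major = Integer.parseInt(parts[1]);
            if (major != 8) {
                return major > 8;
            }
            int update = parts.length > 3 ? Integer.parseInt(parts[3]) : 0;
            return update >= 252;
        }
        // Modern scheme: the first component is the major version, for example 11.0.2.
        return Integer.parseInt(core.split("\\.")[0]) >= 9;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println(version + " -> " + atLeast8u252(version));
    }
}
```

Vendor-specific version strings may not follow these shapes, so the parsing here is an assumption to verify against the JDKs actually in use.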

@varpa89

varpa89 commented May 30, 2022

@ejona86 when can we expect a new grpc release with updated netty please?

@ejona86
Member

ejona86 commented May 31, 2022

Right now we are blocked from upgrading at least because of netty/netty-tcnative#716 . There was some java module stuff as well that we noticed, but I don't recall if that has already been fixed. #9027 is where we were going through iterations trying to upgrade.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 30, 2022
6 participants