2.13.0.RELEASE or later crashes JVM during deployment to k8s #625

Open
savbiz opened this issue Jan 17, 2022 · 11 comments
Labels: bug (Something does not work as expected), feedback required (Information are missing or feedback for suggestions is requested)

@savbiz

savbiz commented Jan 17, 2022

The context

We would like to upgrade the dependency version of net.devh:grpc-server-spring-boot-starter in our Gradle build script for our microservice.

The bug

The application does not start if using version 2.13.0.RELEASE or later. That makes the deployment fail.

Stacktrace and logs

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000003efe, pid=1, tid=7
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.1.12.1 (17.0.1+12) (build 17.0.1+12-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.1.12.1 (17.0.1+12-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x0000000000003efe
#
# Core dump will be written. Default location: //core.1
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Steps to Reproduce

It happens right after deploying the application to our AWS QA K8s environment. Build, unit tests, and integration tests work fine. Running the application locally (with Docker or through IntelliJ) also works.

The application's environment

Which versions do you use?

  • Spring (boot): 2.6.2
  • grpc-spring-boot-starter: 2.13.0.RELEASE, 2.13.1.RELEASE
  • kotlin: 1.6.10
  • JRE version: OpenJDK Runtime Environment Corretto-17.0.1.12.1 (17.0.1+12) (build 17.0.1+12-LTS)
  • Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.1.12.1

Additional context
With version 2.12.0.RELEASE everything works fine.

  • Did it ever work before?
    Yes, up to version 2.12.0.RELEASE.

  • Do you have a demo?
    No

@savbiz savbiz added the bug Something does not work as expected label Jan 17, 2022
@ST-DDT
Collaborator

ST-DDT commented Jan 17, 2022

I'm not familiar with Corretto, so can you please try with a different JVM too?
Which OS are you using underneath your Corretto JRE?
Does the error also happen if you start the application locally?

# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log

Could you please include that log file or at least check it for relevant information?

Sorry for the trouble, but this error really isn't descriptive at all.

Since both 2.13.0 and 2.13.1 are affected but 2.12.0 is not, the issue might have been introduced somewhere between 2.12.0 and 2.13.0. Unfortunately, that's 116 commits/141 files changed, which makes it nearly impossible to identify without further information.

Another (but potentially time-consuming) alternative is to build this project yourself and verify whether a given commit has the problem. E.g. build 2.13.0 yourself and try that; if the error persists, try a commit somewhere in the middle between 2.12 and 2.13, thus limiting the range of potential changes. Repeat until you have a single (or close to a single) commit that causes the issue.
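
The halving procedure described above is what `git bisect` automates. The following is only a minimal sketch: `find_first_bad` is a hypothetical helper name (not part of this project), and it assumes you have a command that exits with 0 when a checked-out commit works and with a non-zero status (other than 125) when the crash reproduces.

```shell
#!/bin/sh
# Hypothetical helper: find_first_bad <good-ref> <bad-ref> <test-command...>
# Prints the hash of the first commit for which <test-command> fails.
find_first_bad() {
    good_ref=$1
    bad_ref=$2
    shift 2

    # "git bisect start <bad> <good>" marks both ends of the range at once.
    git bisect start "$bad_ref" "$good_ref" >/dev/null 2>&1 || return 1

    # "git bisect run" checks out candidate commits and runs the command on each:
    # exit 0 = good, 125 = skip this commit, 1-127 otherwise = bad.
    if ! git bisect run "$@" >/dev/null 2>&1; then
        git bisect reset >/dev/null 2>&1
        return 1
    fi

    # After a successful run, refs/bisect/bad points at the first bad commit.
    first_bad=$(git rev-parse refs/bisect/bad)

    # Return the working tree to where it was before bisecting.
    git bisect reset >/dev/null 2>&1
    printf '%s\n' "$first_bad"
}
```

A call could look like `find_first_bad v2.12.0.RELEASE v2.13.0.RELEASE ./check.sh`. Both tag names are guesses (check `git tag` for the real ones), and `check.sh` stands for whatever script you write to build the starter, deploy it, and report whether the JVM crashed.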

@ST-DDT ST-DDT added the feedback required Information are missing or feedback for suggestions is requested label Jan 17, 2022
@savbiz
Author

savbiz commented Jan 18, 2022

Thanks for looking into this.

  • I'm not familiar with Corretto, so can you please try with a different JVM too?
    Which JVM would you prefer us to try? Would a different base Docker image (openjdk:17-alpine, for example) be fine? Right now we are using amazoncorretto:17-alpine.

  • Which OS are you using underneath your Corretto JRE?
    Just Kubernetes (version 1.21)
    Platform version (AWS EKS - Amazon Elastic Kubernetes Service): eks.2

  • Does the error also happen if you start the application locally?
    No, locally it works fine, whether we start the application from IntelliJ or from a local Docker image (which uses the same Dockerfile as our deployment tool does).

  • Could you please include that log file or at least check it for relevant information?
    Unfortunately, this is hard to retrieve since it is stored on our container which is ephemeral (meaning that when the deployment fails, it gets deleted right away). We'll have to find a way to intercept this file and print it to standard output.

@ST-DDT
Collaborator

ST-DDT commented Jan 18, 2022

I'm not familiar with Corretto, so can you please try with a different JVM too?

Which JVM would you prefer us to try? Would a different base Docker image (openjdk:17-alpine, for example) be fine? Right now we are using amazoncorretto:17-alpine.

OpenJDK or Eclipse Temurin would be good. This test is intended to identify whether this is specific to Corretto or the JVM in general.

Which OS are you using underneath your Corretto JRE?

Just Kubernetes (version 1.21)
Platform version (AWS EKS - Amazon Elastic Kubernetes Service): eks.2

I was actually referring to the OS inside the image, but you already answered that in the first block (alpine).
Please try also the plain Corretto base image amazoncorretto:17 (or similar) to rule out that this is a musl vs glibc issue.
I know these are much larger, but without additional information I don't know how to narrow it down.
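
To confirm which C library a given image actually ships, something like the following could be run inside the container. It is a rough sketch, not a definitive check: `classify_libc` and `detect_libc` are hypothetical helper names, and the heuristic simply looks at what `ldd --version` prints (musl identifies itself as "musl libc", glibc builds mention "GNU libc" or "GLIBC").

```shell
#!/bin/sh
# Hypothetical helper: classify the text printed by "ldd --version".
classify_libc() {
    case "$1" in
        *musl*) echo musl ;;
        *GLIBC*|*"GNU libc"*|*"GNU C Library"*) echo glibc ;;
        *) echo unknown ;;
    esac
}

# Hypothetical helper: inspect the libc of the current system.
# musl's ldd writes its banner to stderr, hence the 2>&1.
detect_libc() {
    classify_libc "$(ldd --version 2>&1)"
}
```

Running `detect_libc` in an Alpine-based image should print `musl`, while a Debian- or Amazon-Linux-based image should print `glibc`; `unknown` means the heuristic did not recognise the output.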

Does the error also happen if you start the application locally?

No, locally it works fine, whether we start the application from IntelliJ or from a local Docker image (which uses the same Dockerfile as our deployment tool does).

That is very strange. Might this be caused by differences in the config/properties?

Could you please include that log file or at least check it for relevant information?

  Unfortunately, this is hard to retrieve since it is stored on our container which is ephemeral (meaning that when the deployment fails, it gets deleted right away). We'll have to find a way to intercept this file and print it to standard output.

You might be able to do the following:

docker pull image
docker save image > image.tar
# open the tar and search for the entrypoint in the config
# append something like `|| cat /tmp/hs_err_pid1.log` to the command
docker load < image.tar
# maybe re-tag the image
docker push image
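
If you can change the entrypoint, the `|| cat` idea above can also be written as a small wrapper script. This is only a sketch under assumptions: `run_with_crash_log` is a hypothetical helper name, and it assumes the JVM writes its `hs_err_pid*.log` files into the directory you pass in (`/tmp` in the log above).

```shell
#!/bin/sh
# Hypothetical helper: run_with_crash_log <log-dir> <command...>
# Runs the command; if it fails, copies any HotSpot error reports found in
# <log-dir> to stderr so they end up in the container log.
run_with_crash_log() {
    log_dir=$1
    shift

    "$@"
    status=$?

    if [ "$status" -ne 0 ]; then
        for f in "$log_dir"/hs_err_pid*.log; do
            # Skip the unexpanded glob pattern when no report exists.
            [ -f "$f" ] || continue
            echo "===== $f =====" >&2
            cat "$f" >&2
        done
    fi

    # Preserve the original exit status so Kubernetes still sees the failure.
    return "$status"
}
```

The entrypoint would then be something like `run_with_crash_log /tmp java -jar /app.jar`, where `/app.jar` is a placeholder for your actual jar path.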

@hnxiaoyuan

hnxiaoyuan commented Feb 14, 2022

I ran into a similar problem. I checked the hs_err_pid.log, and it seems that libio_grpc_netty_shaded_netty_transport_native_epoll.so could not be found. The relevant part of the error log is attached below.

Instructions: (pc=0x0000000000003efe)
0x0000000000003ede:   
[error occurred during error reporting (printing registers, top of stack, instructions near pc), id 0xb]

Register to memory mapping:

RAX=0x0000000000000000 is an unknown value
RBX=0x000055c0c24f2d60 is an unknown value
RCX=0x0000000000000000 is an unknown value
RDX=0x0000000000000003 is an unknown value
RSP=0x00007f948129a018 is pointing into the stack for thread: 0x000055c0beddb800
RBP=0x00007f948129a050 is pointing into the stack for thread: 0x000055c0beddb800
RSI=0x0000000000000015 is an unknown value
RDI=0x000055c0c24f2d68 is an unknown value
R8 =0x00007f94698fc7d9: _fini+0x11c1 in /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_645492642195867189808.so at 0x00007f94698f0000
R9 =0x8080808080808080 is an unknown value
R10=0x0000000000000000 is an unknown value
R11=0x0000000000000406 is an unknown value
R12=0x00007f94698fb794: _fini+0x17c in /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_645492642195867189808.so at 0x00007f94698f0000
R13=0x0000000000000015 is an unknown value
R14=0x00007f948129a08c is pointing into the stack for thread: 0x000055c0beddb800
R15=0x00007f948129a2c0 is pointing into the stack for thread: 0x000055c0beddb800


Stack: [0x00007f948119f000,0x00007f948129fad0],  sp=0x00007f948129a018,  free space=1004k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x0000000000003efe
C  [libio_grpc_netty_shaded_netty_transport_native_epoll_x86_645492642195867189808.so+0xb487]  netty_jni_util_JNI_OnLoad+0x67
C  [libjava.so+0xeb24]  Java_java_lang_ClassLoader_00024NativeLibrary_load+0xb4
C  0x00000000f66afd78

@ST-DDT
Collaborator

ST-DDT commented Feb 14, 2022

I got a similar problem

Also inside of aws or outside?

Do you actually use epoll or not?

@hnxiaoyuan

  1. My application runs in a Docker container. The error happened during tests: when running "mvn clean test", the JVM crashed. After I excluded grpc-client-spring-boot-starter in my pom.xml, it works fine.
  2. No, I do not use epoll.

@ST-DDT
Collaborator

ST-DDT commented Feb 14, 2022

J 1307  java.lang.Class.forName0(Ljava/lang/String;ZLjava/lang/ClassLoader;Ljava/lang/Class;)Ljava/lang/Class; (0 bytes) @ 0x00007f94725faeba [0x00007f94725fae40+0x7a]
J 4261 C1 java.lang.Class.forName(Ljava/lang/String;)Ljava/lang/Class; (15 bytes) @ 0x00007f9472f944cc [0x00007f9472f94320+0x1ac]
j  io.grpc.netty.shaded.io.grpc.netty.Utils.isEpollAvailable()Z+12
j  io.grpc.netty.shaded.io.grpc.netty.Utils.<clinit>()V+226
v  ~StubRoutines::call_stub
j  io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.<clinit>()V+26

I found this in the error log. And it looks like the io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder class initialization causes this.
https://github.com/grpc/grpc-java/blob/58a7ace6ac2041d3c9989a33fd46188ed56fed6a/netty/src/main/java/io/grpc/netty/Utils.java#L113

Could you update to grpc-java 1.44.0 and test again?

If the error persists, we have to open an issue upstream for help.
Please include the following information in that case:

  • grpc-java version
  • java version
  • os flavor and version
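
The OS and Java details in the list above can be collected inside the container with a few standard commands. A minimal sketch follows; `os_flavor` and `report_environment` are hypothetical helper names, and the sketch assumes the image provides `/etc/os-release` (most current distributions, including Alpine, do).

```shell
#!/bin/sh
# Hypothetical helper: os_flavor [os-release-file]
# Prints the human-readable distribution name, or "unknown".
os_flavor() {
    file=${1:-/etc/os-release}
    if [ -r "$file" ]; then
        # os-release consists of shell-compatible KEY=value lines, so it can
        # be sourced; the subshell keeps its variables out of our environment.
        ( . "$file"; printf '%s\n' "${PRETTY_NAME:-unknown}" )
    else
        echo unknown
    fi
}

# Hypothetical helper: print the details requested for an upstream report.
report_environment() {
    echo "os: $(os_flavor)"
    echo "kernel: $(uname -sr)"
    # "java -version" writes to stderr; the first line holds the version.
    echo "java: $(java -version 2>&1 | head -n 1)"
}
```

The grpc-java version is not visible from the shell this way. In a Gradle build it can usually be read from `./gradlew dependencyInsight --dependency grpc-netty-shaded --configuration runtimeClasspath`, assuming the shaded Netty transport is the one on your classpath.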

@fireXtract

@ST-DDT
I find that the latest 2.13.1.RELEASE depends on DomainSocketAddress, regardless of whether you use unix:// sockets in your config. Changing the grpc-java version to 1.44.0 does not resolve it.

Caused by: java.lang.NoClassDefFoundError: io/netty/channel/unix/DomainSocketAddress
	at net.devh.boot.grpc.client.autoconfigure.GrpcClientAutoConfiguration.nettyGrpcChannelFactory(GrpcClientAutoConfiguration.java:171)

@ST-DDT
Collaborator

ST-DDT commented Feb 18, 2022

@fireXtract Could you please include the complete stacktrace?

There is no DomainSocketAddress in GrpcClientAutoConfiguration. It only appears in ShadedNettyChannelFactory, and there only on a code path that is not used unless specifically configured.

If I have the entire stacktrace I can analyze whether I have to split it into two separate classes or just two different methods.

Do you use netty or netty shaded?
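
One way to answer the netty vs. netty shaded question is to look at which grpc-java transport jars are packaged with the application. The sketch below assumes the standard artifact names `grpc-netty-shaded` and `grpc-netty`; `netty_flavor` is a hypothetical helper name that classifies a list of file names read from stdin.

```shell
#!/bin/sh
# Hypothetical helper: reads file names (one per line) on stdin and prints
# "shaded", "netty", "both", or "none" depending on which grpc-java
# transport jars appear in the list.
netty_flavor() {
    shaded=no
    plain=no
    while IFS= read -r line; do
        case "$line" in
            *grpc-netty-shaded-*) shaded=yes ;;
            # A digit right after "grpc-netty-" means the version follows,
            # i.e. this is the non-shaded artifact.
            *grpc-netty-[0-9]*) plain=yes ;;
        esac
    done
    case "$shaded/$plain" in
        yes/yes) echo both ;;
        yes/no)  echo shaded ;;
        no/yes)  echo netty ;;
        *)       echo none ;;
    esac
}
```

For a Spring Boot fat jar, `unzip -l build/libs/app.jar | netty_flavor` would be one way to feed it (the jar path is a placeholder; dependencies normally sit under `BOOT-INF/lib/`).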

@ST-DDT
Collaborator

ST-DDT commented Feb 19, 2022

I guess I have to ramp up my testExamples.sh once more to also test for this scenario.

@fireXtract From what I can tell, your error is not related to the original error.

@savbiz
Author

savbiz commented Feb 28, 2022

Sorry for the late reply.
We have actually solved the original issue by following this workaround. It does indeed seem to be a problem with Alpine images and k8s environments.
We are now able to use the latest version of your library.
Thanks for helping out though!
