grpc-java upgrade from 1.53.0 to 1.54.0 crashes JVM on Alpine #10096

Closed
svenrienstra opened this issue Apr 24, 2023 · 6 comments
@svenrienstra

svenrienstra commented Apr 24, 2023

An attempt to upgrade from grpc-java 1.53.0 to 1.54.0 ends with a JVM crash. I see this issue only when running in Docker using an Alpine image: eclipse-temurin:17-alpine. On my local machine (OSX, ARM) I can't reproduce it.

I'm getting this error:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000000000207d6, pid=1, tid=7
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x00000000000207d6
#
# Core dump will be written. Default location: /core.%e.1.%t
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Full error log: java_error1.log
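A minimal sketch for pulling the problematic frame out of a HotSpot fatal error log like the one above. The function name `problematic_frame` is hypothetical; it only assumes the log contains a `# Problematic frame:` line followed by the frame itself, as in the excerpt.

```shell
#!/bin/sh
# Print the frame reported under "# Problematic frame:" in a HotSpot
# fatal error log read from stdin. (Helper name is illustrative.)
problematic_frame() {
    awk '
        /^# Problematic frame:/ { want = 1; next }     # marker line found
        want { sub(/^#[ \t]*/, ""); print; exit }      # next line is the frame
    '
}

# Usage, with the log file attached to this issue:
#   problematic_frame < java_error1.log
```

A frame that is a bare address with no library name, as here, usually means the crash happened in native code the JVM could not attribute to a loaded library.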

@ejona86
Member

ejona86 commented Apr 24, 2023

Guessing, I'd say it is a linking problem. musl refuses to tell you "symbol can't be found" and instead just crashes. With glibc these issues are caught and handled. But I've fought it before, in #8751 (comment), so now I know.

From the log, yeah, that's a very early failure.

C  0x00000000000207d6
C  [libio_grpc_netty_shaded_netty_tcnative_linux_x86_649515801029351396935.so+0x2500a]  _init+0x4cc2
C  [libio_grpc_netty_shaded_netty_tcnative_linux_x86_649515801029351396935.so+0x29b44]  netty_jni_util_JNI_OnLoad+0x74

Yep. Linking. Friends don't let friends use musl outside the embedded space, because it can't be bothered to give a useful error:

# apk add gcompat
# wget https://repo1.maven.org/maven2/io/netty/netty-tcnative-boringssl-static/2.0.56.Final/netty-tcnative-boringssl-static-2.0.56.Final-linux-x86_64.jar
# unzip netty-tcnative-boringssl-static-2.0.56.Final-linux-x86_64.jar
# LD_PRELOAD=/lib/libgcompat.so.0 ldd META-INF/native/libnetty_tcnative_linux_x86_64.so
	/lib/ld-musl-x86_64.so.1 (0x7f27ec69d000)
	/lib/libgcompat.so.0 => /lib/libgcompat.so.0 (0x7f27ec68a000)
	librt.so.1 => /lib/ld-musl-x86_64.so.1 (0x7f27ec69d000)
	libpthread.so.0 => /lib/ld-musl-x86_64.so.1 (0x7f27ec69d000)
	libdl.so.2 => /lib/ld-musl-x86_64.so.1 (0x7f27ec69d000)
	libc.so.6 => /lib/ld-musl-x86_64.so.1 (0x7f27ec69d000)
	libucontext.so.1 => /lib/libucontext.so.1 (0x7f27ec685000)
	libobstack.so.1 => /usr/lib/libobstack.so.1 (0x7f27ec680000)
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: sys_siglist: symbol not found

Newer versions of netty-tcnative (2.0.57 looks the same) have that fixed, but have a new issue:

# wget https://repo1.maven.org/maven2/io/netty/netty-tcnative-boringssl-static/2.0.60.Final/netty-tcnative-boringssl-static-2.0.60.Final-linux-x86_64.jar
# unzip netty-tcnative-boringssl-static-2.0.60.Final-linux-x86_64.jar
# LD_PRELOAD=/lib/libgcompat.so.0 ldd META-INF/native/libnetty_tcnative_linux_x86_64.so
	/lib/ld-musl-x86_64.so.1 (0x7fc8f1335000)
	/lib/libgcompat.so.0 => /lib/libgcompat.so.0 (0x7fc8f1322000)
	libm.so.6 => /lib/ld-musl-x86_64.so.1 (0x7fc8f1335000)
	libc.so.6 => /lib/ld-musl-x86_64.so.1 (0x7fc8f1335000)
	ld-linux-x86-64.so.2 => /lib/ld-linux-x86-64.so.2 (0x7fc8f131c000)
	libucontext.so.1 => /lib/libucontext.so.1 (0x7fc8f1317000)
	libobstack.so.1 => /usr/lib/libobstack.so.1 (0x7fc8f1312000)
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_GetRegionStart: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_RaiseException: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_SetIP: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_GetLanguageSpecificData: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_GetTextRelBase: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_Resume_or_Rethrow: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_GetIPInfo: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_Resume: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_SetGR: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_DeleteException: symbol not found
Error relocating META-INF/native/libnetty_tcnative_linux_x86_64.so: _Unwind_GetDataRelBase: symbol not found

But it looks like you can work around it, just like glibc.

# apk add libunwind
# LD_PRELOAD=/lib/libgcompat.so.0:/usr/lib/libunwind.so.8 ldd META-INF/native/libnetty_tcnative_linux_x86_64.so
	/lib/ld-musl-x86_64.so.1 (0x7fb688ab4000)
	/lib/libgcompat.so.0 => /lib/libgcompat.so.0 (0x7fb688aa1000)
	/usr/lib/libunwind.so.8 => /usr/lib/libunwind.so.8 (0x7fb688a89000)
	libm.so.6 => /lib/ld-musl-x86_64.so.1 (0x7fb688ab4000)
	libc.so.6 => /lib/ld-musl-x86_64.so.1 (0x7fb688ab4000)
	ld-linux-x86-64.so.2 => /lib/ld-linux-x86-64.so.2 (0x7fb688a83000)
	libucontext.so.1 => /lib/libucontext.so.1 (0x7fb688a7e000)
	libobstack.so.1 => /usr/lib/libobstack.so.1 (0x7fb688a79000)
	liblzma.so.5 => /usr/lib/liblzma.so.5 (0x7fb6885d6000)

So it looks like a regular netty-tcnative upgrade will fix this.
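The ldd checks above can be condensed into a small filter. This is a sketch only: `missing_symbols` is a hypothetical helper name, and it assumes the `Error relocating <lib>: <symbol>: symbol not found` line format shown in the musl ldd output in this comment.

```shell
#!/bin/sh
# Read musl ldd output on stdin and print each unresolved symbol once.
# Matches lines of the form:
#   Error relocating <lib>: <symbol>: symbol not found
missing_symbols() {
    sed -n 's/^Error relocating .*: \([^:[:space:]]*\): symbol not found$/\1/p' |
        LC_ALL=C sort -u
}

# Usage inside the Alpine container (2>&1 so the errors are captured
# whichever stream ldd writes them to):
#   LD_PRELOAD=/lib/libgcompat.so.0 \
#       ldd META-INF/native/libnetty_tcnative_linux_x86_64.so 2>&1 | missing_symbols
```

An empty result suggests every symbol resolved with the given LD_PRELOAD; any names printed point at the library still missing from it.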

@svenrienstra
Author

Thanks for looking into it, @ejona86! So if I understand you correctly, I should just wait for the next release and it should be fixed (or use the workaround, of course)? No need to open a ticket over at Netty?

@ejona86
Member

ejona86 commented Apr 25, 2023

Since you are using grpc-netty-shaded, yeah, just stay on 1.53. 1.55 won't have the Netty upgrade, but I expect 1.56 will. I don't think it will be too hard for us to upgrade netty-tcnative, but I don't want to do it just before the 1.55 release (a week away). When you do get on 1.56, I think you will need that workaround for libunwind.

Obviously, your other option is to stop using Alpine. The non-Alpine Temurin image is "just" 28% larger. Or if you are serious about container size you might be able to use distroless, which is Debian-based but 35% smaller than Temurin's Alpine image (that comparison is apples and oranges, as distroless is JRE-only; a better comparison would be against eclipse-temurin:17-jre-alpine, and distroless is larger than that image).
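The version guidance in this comment can be turned into a guard for an Alpine image build. This is a rough sketch: `grpc_needs_tcnative_fix` is a hypothetical name, and treating 1.55.x as affected is an assumption based on the statement that 1.55 won't have the Netty upgrade.

```shell
#!/bin/sh
# Succeeds (exit 0) if the given grpc-java version is 1.54.x or 1.55.x,
# the releases this thread describes as crashing (1.54) or as not yet
# carrying the netty-tcnative upgrade (1.55, assumed affected).
grpc_needs_tcnative_fix() {
    case "$1" in
        1.54.*|1.55.*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage in a build script:
#   if grpc_needs_tcnative_fix "$GRPC_VERSION"; then
#       echo "grpc-java $GRPC_VERSION: stay on 1.53 or wait for 1.56 on Alpine" >&2
#   fi
```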

@ejona86
Member

ejona86 commented Jun 20, 2023

netty-tcnative-boringssl-static was upgraded to 2.0.61.Final in 1.56.0. That should fix the linking problem. (But you'll still need all that LD_PRELOAD for gcompat and unwind.)
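One way the LD_PRELOAD requirement might be wired into a container entrypoint is sketched below. `build_ld_preload` is a hypothetical helper and `app.jar` is a placeholder; the two library paths are the ones from the ldd output earlier in this thread and may differ on other Alpine releases.

```shell
#!/bin/sh
# Join the libraries that exist on disk into a colon-separated list
# suitable for LD_PRELOAD; paths that do not exist are skipped.
build_ld_preload() {
    result=""
    for lib in "$@"; do
        [ -e "$lib" ] || continue
        result="${result:+$result:}$lib"
    done
    printf '%s\n' "$result"
}

# Example entrypoint usage (app.jar is a placeholder):
#   LD_PRELOAD=$(build_ld_preload /lib/libgcompat.so.0 /usr/lib/libunwind.so.8)
#   export LD_PRELOAD
#   exec java -jar app.jar
```

Skipping missing paths keeps the entrypoint from failing outright, at the cost of hiding a forgotten `apk add gcompat libunwind`; a stricter variant could exit with an error instead.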

ejona86 closed this as completed Jun 20, 2023
@vhscom

vhscom commented Jun 26, 2023

Hit this issue on an eclipse-temurin:17-alpine image last week when upgrading:

com.google.cloud:google-cloud-bigquery:2.24.3 -> 2.24.4 (c)

Debugging shows any version of io.grpc:grpc-netty-shaded:1.54.+ will cause the JVM to segfault. I was under the impression that LD_PRELOAD was a workaround for corner cases. Is it now required for use of grpc-java on Alpine?

@ejona86
Member

ejona86 commented Jun 26, 2023

LD_PRELOAD has been needed for a while for gcompat. Now it is also needed for libunwind.

The problem is that gcompat doesn't automatically trigger when using a musl-compiled Java, because it is triggered by the linker name, and the linker for the java process is the musl-named linker. gcompat is basically just an LD_PRELOAD shim, so the explicit LD_PRELOAD is a bit ugly, but no more hacky than trying to have glibc compat on Alpine in general.
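To see which linker a given java binary names, something like the following sketch could be used. Both function names are hypothetical, `elf_interpreter` assumes `readelf` (from binutils) is installed, and the loader file names are the ones that appear in the ldd output earlier in this thread.

```shell
#!/bin/sh
# Classify a dynamic loader path by its file name.
loader_flavor() {
    case "${1##*/}" in
        ld-musl-*) echo musl ;;
        ld-linux*) echo glibc ;;
        *)         echo unknown ;;
    esac
}

# Print the program interpreter recorded in an ELF binary.
# Assumes readelf (binutils) is available in the image.
elf_interpreter() {
    readelf -l "$1" |
        sed -n 's/.*Requesting program interpreter: \(.*\)\]$/\1/p'
}

# Usage:
#   loader_flavor "$(elf_interpreter "$(command -v java)")"
```

If this reports `musl` for the java binary, that matches the explanation above: the process starts under the musl-named linker, so gcompat is not pulled in unless it is listed in LD_PRELOAD explicitly.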

github-actions bot locked as resolved and limited conversation to collaborators Oct 5, 2023