grpc-netty breaks with Netty 4.1.111.Final #11284
Comments
You wrote a lot, but very little about what we actually need to be told. The tests in grpc-java run fine with 4.1.111. We could run the wider tests that we do when upgrading Netty, but right now this report is essentially useless to us. Could you at least tell us what you were doing when it hung, whether the hang was reliable or random, and how long it lasted (1s, 100s?)? Since you are controlling the Netty version, it'd be helpful to know if 4.1.110 had the issue as well. As a hint, in the past when we've seen random hangs on upgrade, it generally meant you needed a newer netty-tcnative or it was a netty-tcnative bug. |
Note: I was testing on Linux, so maybe this is a Mac problem. |
Sure, it doesn't have relevant details. You can ignore it in the meantime. :) I created the issue also to see if others are running into similar issues in the upgrade. I'll share more relevant details when I get back to debugging the issue. |
@ejona86 Here's the reproducer with more details: https://github.com/lhotari/pulsar/blob/lh-grpc-netty-4.1.111/README.md There are 2 cases which fail:
The libraries that are used under the covers are jetcd 0.8.2, vertx-grpc 4.5.8, grpc-java 1.64.0 (versions aligned with grpc-bom) and Netty 4.1.111.Final (versions aligned with netty-bom, which also covers other libs such as tcnative, at 2.0.65.Final). I also tested on my Linux laptop (x86_64) and the problem reproduces in the same way as on Mac OS (arm64). |
another datapoint: in our maven build we have this in the root pom:
when upgrading to
and the tests eventually time out... upgrading to |
DO NOT USE NETTY 4.1.111 WITH GRPC. It appears there is corruption. There is no guarantee the corruption would be detected. @lhotari, thanks for the reproducer. Unfortunately I can't easily run it because it requires docker. I've got podman and don't have time to fight getting rootless docker or spinning up a new VM at the moment. Based on the logs (thanks!), I'm worried this is corruption.
|
Given the recent changes to ByteBufs and the corruption we're now seeing, I'm strongly suspicious of gRPC's NettyAdaptiveCumulator being a contributing factor. Once I can reproduce this, the first thing to try is to disable the cumulator and see if the problem resolves. |
I've posted a warning to grpc-io, updated the 1.64.0 release notes, and notified Netty folks. |
I suspect NettyAdaptiveCumulator got broken by netty/netty#14093. grpc-java/netty/src/main/java/io/grpc/netty/NettyAdaptiveCumulator.java Lines 158 to 174 in 889893d
Without this fix, we saw a rare, but persistent HTTP2 frame corruption in certain situations. Similar to this one. We should consider how to make it work with pre-4.1.111 and post-4.1.111 Netty versions. The NettyAdaptiveCumulator itself was introduced in #9558, and released in v1.51.0. At the moment unshaded Netty 4.1.111.Final can't be used with gRPC >= v1.51.0. |
netty/netty#12844 has very detailed description on what's wrong with grpc-java/netty/src/main/java/io/grpc/netty/NettyAdaptiveCumulator.java Lines 172 to 176 in 889893d
netty/netty#14093 changes that, and apparently the indexes we're correcting from the slice duplicate are now wrong. Probably because unwrapping was removed, but it's too early for conclusions. I won't be surprised if the root cause was actually fixed by that PR. Relevant excerpt from netty/netty#12844 - read the issue description for the full context. Let's explore the result of
Note that the component still contains the bytes discarded by the
What's even more interesting, if we call
|
@larry-safran informed me unit tests did fail with 4.1.111. I re-tried (twice; third time was the charm) and now I'm seeing the failures. And they are all in NettyAdaptiveCumulatorTest. I've also been able to see a So this really seems triggered by netty/netty#14093, but only because of the workaround for netty/netty#12844. We're probably going to need Netty version-detection logic to avoid going through a corrupting code path. And it seems well past time to fix netty/netty#12844. |
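The version-detection idea mentioned above could look roughly like the following. This is only a sketch under stated assumptions, not necessarily the approach grpc-java shipped: `NettyVersionCheck` and its methods are hypothetical names, the version string is assumed to be supplied by the caller (for example from Netty's own version metadata), and it is assumed that every release from 4.1.111 onward keeps the new slice/duplicate behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of grpc-java or Netty. */
final class NettyVersionCheck {
  // Matches the leading "major.minor.patch" of strings such as "4.1.111.Final".
  private static final Pattern VERSION = Pattern.compile("^(\\d+)\\.(\\d+)\\.(\\d+)");

  // First release containing netty/netty#14093, per this thread.
  private static final int[] THRESHOLD = {4, 1, 111};

  /** Parses a Netty version string into {major, minor, patch}. */
  static int[] parse(String version) {
    Matcher m = VERSION.matcher(version);
    if (!m.find()) {
      throw new IllegalArgumentException("Unrecognized Netty version: " + version);
    }
    return new int[] {
      Integer.parseInt(m.group(1)),
      Integer.parseInt(m.group(2)),
      Integer.parseInt(m.group(3))
    };
  }

  /**
   * Returns true if the version is 4.1.111 or newer. Assumption: all such
   * versions have the new behavior, so the old index-correcting workaround
   * must be skipped for them.
   */
  static boolean hasNewDuplicateBehavior(String version) {
    int[] v = parse(version);
    for (int i = 0; i < THRESHOLD.length; i++) {
      if (v[i] != THRESHOLD[i]) {
        return v[i] > THRESHOLD[i];
      }
    }
    return true; // exactly 4.1.111
  }

  public static void main(String[] args) {
    for (String v : new String[] {"4.1.110.Final", "4.1.111.Final"}) {
      System.out.println(v + " -> " + hasNewDuplicateBehavior(v));
    }
  }
}
```

A caller would branch on `hasNewDuplicateBehavior(...)` once at startup and pick the matching code path, rather than parsing the version on every read.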
@ejona86 so just to summarize you agree that what we did in netty is correct ? I think it is but wanted to double check :) |
If you want another internal datapoint, you can look at failure in https://screenshot.googleplex.com/59NDiYjVoeWcd6N |
@normanmaurer, I think Netty's new "duplicating a slice returns a slice" behavior is appropriate. The previous behavior confused me. I'll need to add a comment to netty/netty#12844. I think the failure is that a short-term workaround became a permanent fix (until it broke). |
@ejona86 @larry-safran What are the release and back-port plans for the fix? |
## Description

Netty 4.1.111.Final is incompatible with grpc-java < 1.65.0 and >= 1.51.0. It can introduce buffer corruption at any point, which if you're lucky is detected, but may also go unnoticed and result in garbage and waste of time. Wait until grpc-java releases a compatible version before updating to 4.1.111.Final. Note that this also blocks Spring Boot 3.3.1 which depends on Netty 4.1.111.Final.

See:
- grpc/grpc-java#11284 (comment)
- https://github.com/grpc/grpc-java/releases/tag/v1.64.0

## Related issues

blocks #19581
blocks #19265
blocks #19264
blocks #19259
blocks #19258
v1.65.0 is available with the Netty 4.1.111 fix. The other releases are still in-progress. |
### What changes are proposed in this pull request?

add back netty dependency within grpc

### Why are the changes needed?

previously we exclude netty dependency since we already have netty-all outside #9942

But due to grpc/grpc-java#11284, we sometimes have incompatibility between grpc and netty, and it's better to use shaded netty within grpc so we can be sure that they are compatible.

### Does this PR introduce any user facing changes?

na

pr-link: #18642
change-id: cid-65d86f315e023592060b6a9f864bfe2a972dfe68
Were other releases completed? |
For Netty 4.1.111 compat, it is best to use grpc-java 1.63.2, 1.64.2, 1.65.1, and later. We changed how we did the compat fix between 1.65.0 and 1.65.1. |
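The recommendation above can be written down as a small check, for instance as a build-time guard. This is a rough sketch only: `GrpcNettyCompat` and its method are hypothetical names, it handles plain `major.minor.patch` strings only, and it treats "and later" as meaning any higher patch on the 1.63.x and 1.64.x lines plus anything from 1.65.1 upward. Releases outside that list simply return false, since the thread names no fixed release for them.

```java
/** Hypothetical helper for illustration; not part of grpc-java. */
final class GrpcNettyCompat {

  /**
   * Returns true if the given grpc-java version is one the thread recommends
   * for use with Netty 4.1.111: 1.63.2+, 1.64.2+, or 1.65.1 and later.
   */
  static boolean isRecommendedForNetty41111(String grpcJavaVersion) {
    String[] parts = grpcJavaVersion.split("\\.");
    if (parts.length < 3) {
      throw new IllegalArgumentException("Expected major.minor.patch: " + grpcJavaVersion);
    }
    // NumberFormatException (an IllegalArgumentException) on suffixes like "-SNAPSHOT".
    int major = Integer.parseInt(parts[0]);
    int minor = Integer.parseInt(parts[1]);
    int patch = Integer.parseInt(parts[2]);

    if (major != 1) {
      return major > 1; // assumption: any future major line counts as "later"
    }
    if (minor == 63 || minor == 64) {
      return patch >= 2; // 1.63.2 and 1.64.2 are the named patch releases
    }
    if (minor == 65) {
      return patch >= 1; // the compat fix changed between 1.65.0 and 1.65.1
    }
    return minor > 65;
  }

  public static void main(String[] args) {
    for (String v : new String[] {"1.60.2", "1.64.0", "1.64.2", "1.65.0", "1.65.1"}) {
      System.out.println(v + " -> " + isRecommendedForNetty41111(v));
    }
  }
}
```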
Added by ejona86: CORRUPTION. See #11284 (comment)
What version of gRPC-Java are you using?
1.60.2, 1.64.0
What is your environment?
What did you expect to see?
I'd expect grpc-netty to work with Netty 4.1.111.Final. I understand that grpc-java currently supports Netty 4.1.100.Final. However, many libraries depend on grpc-netty and cannot switch to grpc-netty-shaded without relocating all of their classes to the packages under which grpc-netty-shaded ships its shaded copy of Netty. Examples of such popular libraries are io.etcd:jetcd-core and io.vertx:vertx-grpc.
What did you see instead?
There weren't any log messages or exceptions; communications just timed out. There was some debug logging about attempts. I'll share more details later.
Steps to reproduce the bug
I'll share more details later. I maintain the Apache Pulsar and Apache Bookkeeper projects, and this problem showed up there when Netty was upgraded from 4.1.108.Final to 4.1.111.Final. We use an older grpc version, but I also tested with 1.60.2 and 1.64.0 and the problem remained. I also tried upgrading to 4.1.110.Final, and that works fine with the jetcd-core test that fails on 4.1.111.Final.
This means that Netty release 4.1.111.Final introduces a change that breaks grpc-netty.
These are the 4.1.111.Final release notes and changes: