Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test for broken retries with netty backend #1417

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cb-nezasa
Copy link

@cb-nezasa cb-nezasa commented Jul 21, 2021

withConnectionAttempts() does not work as expected with the Netty backend.
i.e. if a request is sent before the server is started, the caller sees a
failure immediately and no retries are attempted.

With the Akka-Http backend, the same test works as expected, i.e. one of
the retries will reach the server once it's up and the caller will see
a success result.

This PR does not introduce any changes to code but adds a test case exposing the issue and the different behaviour of the Netty and Akka-Http backend.

notes:
Unfortunately I could not find the root cause of this problem. In a different project I observed the debug log messages from akka.grpc.internal.ChannelUtils#monitorChannel which suggested that exponential backoff is working but nothing actually gets retried and the caller already got back a failed Future anyway.

withConnectionAttempts() does not work as expected with the Netty backend:
if a request is sent before the server is started, the caller sees a
failure immediately and no retries are attempted.

with the Akka-Http backend, the same test works as expected, i.e. one of
the retries will reach the server once it's up and the caller will see
a succes result.
@lightbend-cla-validator

Hi @cb-nezasa,

Thank you for your contribution! We really value the time you've taken to put this together.

Before we proceed with reviewing this pull request, please sign the Lightbend Contributors License Agreement:

https://www.lightbend.com/contribute/cla

@cb-nezasa
Copy link
Author

cb-nezasa commented Jul 22, 2021

One of my colleagues found a tentative fix / workaround to make my test case pass even for Netty - see below. The fix (or workaround) is inspired by this post grpc/grpc-java#5724

The fix really tweaks a lot of internas on the NettyChannelBuilder. Based on the scaladoc of akka.grpc.GrpcClientSettings#withConnectionAttempts I would not have expected such extra config to be necessary but rather have assumed that akka-grpc would (internally) take care of the details of configuring the netty channel. But maybe I'm missing some important point or there would be simpler ways to get the same result?

      import scala.jdk.CollectionConverters._
      val retryPolicy = Map(
        "maxAttempts" -> Double.box(5),
        "initialBackoff" -> "1s",
        "maxBackoff" -> "30s",
        "backoffMultiplier" -> Double.box(2),
        "retryableStatusCodes" -> List("UNAVAILABLE").asJava).asJava
      val methodConfig =
        Map(
          "name" -> List(Map("service" -> "helloworld.GreeterService").asJava).asJava,
          "retryPolicy" -> retryPolicy).asJava
      val serviceConfig = Map("methodConfig" -> List(methodConfig).asJava).asJava

      val client = GreeterServiceClient(
        GrpcClientSettings
          .usingServiceDiscovery("greeter", discovery)
          .withTls(false)
          .withConnectionAttempts(20)
          .withChannelBuilderOverrides(_.enableRetry().defaultServiceConfig(serviceConfig)))

Note: patienceConfig at the top of the test class might need to be increased or the backoff params tweaked a bit so there's enough time for the scheduled retries to consistently succeed.

@jtjeferreira
Copy link
Contributor

hi @raboof @jrudolph

do you have any tips on how to fix this?

@raboof
Copy link
Member

raboof commented Aug 9, 2021

do you have any tips on how to fix this?

It would need some digging - I think we have some retry logic in Akka gRPC, and there is retry logic in Netty itself. We should probably:

  • figure out why this scenario isn't triggering either of the existing retry mechanisms
  • decide whether this scenario would best/easiest covered on the Akka gRPC or the Netty side (possibly by passing a service config to the Netty channel builder like @cb-nezasa did above, but 'inside' the Akka gRPC runtime)
  • check if this means we should remove retry logic on the 'other' side

One thing that's important to keep in mind (and probably already under test) is that we should make sure not to only retry making the actual connection, but also possibly re-trigger the discovery, in case we need to refresh the service information to learn about an endpoint that will actually work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants