
Fix Client hanging on unexpected Channel closures #2610

Merged · 17 commits · Jan 15, 2024

Conversation

@kyri-petrou (Collaborator) commented Jan 11, 2024

/claim #2479
/fixes #2479

The fix to the flaky test might seem trivial, but it actually surfaces a much more severe issue with the client's connection pool.

When a channel is marked as Invalid, the invalidation happens via ZPool's invalidate method (see here). However, the method (as pointed out in the Scaladoc) doesn't guarantee that the invalidation will take place immediately. This means that once the scope is closed (and the channel returned to the pool), it is possible for an invalid channel to be picked out of the connection pool before it is actually invalidated.

I'm not 100% sure what the best way forward is - I think perhaps the issue should be addressed in ZIO, where ZPool should either expose an invalidateNow method, or mark invalidated items as "to be discarded" and ensure they will not be handed out again.

Thoughts on this? Is this indeed an issue / limitation to be addressed in ZPool's implementation, or should we add some interim patch for this behaviour in zio-http?

@kyri-petrou (Collaborator, Author)

Upon further investigation, I'm beginning to suspect that this is related to how ZPool handles the "minimum" number of items in the pool.

With the connection pool enabled, using dynamicConnectionPool(x, 100, zio.Duration.fromMillis(10_000)):

  • x = 0: Not flaky
  • x = 1: Very rarely flaky
  • x = 2: Sometimes flaky

By x = 5 it's more or less guaranteed to be flaky.

@adamgfraser (Contributor)

I don't think this is correct regarding ZPool. invalidate is lazy in that the item may not be finalized immediately, but an item that has been invalidated will not be provided to a caller of get.
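The semantics described here can be illustrated with a small plain-Scala model. This is a toy, not ZPool's actual implementation - ToyPool and every name in it are made up for illustration - but it shows the key property: invalidate may be lazy about finalization, yet an invalidated item is never handed out by get again.

```scala
// Toy model (NOT ZPool's real implementation): invalidation marks an
// item to be skipped; finalization can happen later, lazily.
final class ToyPool[A](initial: List[A], create: () => A) {
  private var items: List[A] = initial
  private val invalidated = scala.collection.mutable.Set.empty[A]

  def invalidate(a: A): Unit = invalidated += a // finalized later, lazily

  // Hands out the next item, skipping anything that was invalidated;
  // creates a fresh item once the pool is drained.
  def get(): A = items match {
    case head :: tail =>
      items = tail
      if (invalidated(head)) get() else head
    case Nil => create()
  }
}
```

Under this model, a caller of get can never observe an item that was invalidated earlier, which is the guarantee being pointed out above.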

@adamgfraser (Contributor)

I think you just need to use the live clock.

@kyri-petrou (Collaborator, Author) commented Jan 11, 2024

@adamgfraser my bad, you're right regarding invalidate - I must have missed that in ZPool's implementation. However, there is an issue with the connection pool, and it's not related to the test setup. Running the code below never finishes, with Netty messages being printed continuously.

Note that the code below does work if the connection pool is disabled, or if a dynamic pool with a lower bound of 0 or 1 is used.

import zio.*
import zio.http.*
import zio.http.netty.NettyConfig
import zio.http.netty.client.NettyClientDriver

object Main extends ZIOAppDefault {
  
  private val untrusted = URL.decode("https://untrusted-root.badssl.com/").toOption.get

  override def run = {
    ZIO.foreach(1 to 20) { _ =>
      ZIO.scoped[Client](Client.request(Request.get(untrusted))).exit.debug
    }
  }.provide(
    ZLayer.succeed(ZClient.Config.default.ssl(sslConfig)),
    Client.customized,
    NettyClientDriver.live,
    DnsResolver.default,
    ZLayer.succeed(NettyConfig.default),
  )

  val sslConfig = ClientSSLConfig.FromTrustStoreResource(
    trustStorePath = "truststore.jks",
    trustStorePassword = "changeit",
  )
}

@adamgfraser (Contributor)

@kyri-petrou Can we create a reproducer that doesn't involve ZIO HTTP? I think we need to resolve the underlying issue versus disabling connection pooling.

@kyri-petrou (Collaborator, Author)

> @kyri-petrou Can we create a reproducer that doesn't involve ZIO HTTP? I think we need to resolve the underlying issue versus disabling connection pooling.

Agree 👍

@kyri-petrou (Collaborator, Author) commented Jan 12, 2024

Ok this took me some time (and a lot of banging my head on the wall) but I managed to isolate the issue.

The problem is that if a ZPool has a minimum size requirement and the effect used to create the resource fails, then the get method hangs indefinitely.

The very interesting thing in this case is that initially the effect that creates the channel succeeds - it's only after some time that it starts failing. The reason is that Netty connects to the host lazily, so during the initial creation of the pool the effect succeeds. Once we start invalidating entries, however, and need to create a new Channel, Netty (likely due to caching) immediately returns an error, and the client then hangs because it's unable to obtain a channel from the ZPool.

I can provide a reproducer tomorrow demonstrating how ZPool hangs when the resource effect fails.

PS: I have a fix in mind that doesn't require changes to ZPool - I'll update the PR tomorrow. However, I do think it would be good if ZPool didn't hang indefinitely when it's unable to obtain the minimum number of resources.

@vigoo (Contributor) commented Jan 12, 2024

Nice investigation, thank you!

@adamgfraser (Contributor)

Thanks! Will take a look!

@adamgfraser (Contributor)

I'm not sure this is correct.

I created the following to reproduce the circumstances you describe. We create a pool with a minimum size and a resource that succeeds the first thousand times it is created but fails thereafter. We then get items from the pool a hundred thousand times, invalidating each one that is successfully created.

This program completes normally. It is obviously hard to prove a negative so maybe there is some other logic path that is creating an issue. But in general the pool maintains a minimum number of items, whether successful or not, to avoid exactly this issue. So definitely looking forward to seeing the reproducer. 😄

import zio._

sealed trait Item {
  def isValid: Boolean
}

object Item {
  def make: ZIO[Any, String, Item] =
    ZIO.suspendSucceed {
      if (counter.incrementAndGet() > 1000) ZIO.fail("Couldn't create item")
      else ZIO.succeed(new Item { def isValid = false })
    }

  private val counter = new java.util.concurrent.atomic.AtomicInteger(0)
}

object Example extends ZIOAppDefault {

  def invalidate(pool: ZPool[String, Item], item: Item): ZIO[Any, Nothing, Unit] =
    if (item.isValid) ZIO.unit
    else pool.invalidate(item)

  def worker(pool: ZPool[String, Item]): ZIO[Any, String, Unit] =
    ZIO.scoped {
      pool.get.flatMap { item =>
        invalidate(pool, item)
      }
    }

  val run =
    for {
      pool <- ZPool.make(Item.make, 2 to 10, 1.second)
      _    <- ZIO.sleep(1.second)
      _    <- ZIO.foreach(1 to 100000)(n => worker(pool).exit.debug(n.toString))
    } yield ()
}

@kyri-petrou kyri-petrou changed the title Disable connection pool in ClientHttpsSpec Fix Client hanging on unexpected Channel closures Jan 12, 2024
@kyri-petrou (Collaborator, Author) commented Jan 13, 2024

@adamgfraser first of all, apologies for sending you down the wrong track. Seems that I was wrong about the cause of the connection pool hanging. It turns out it was caused by us manually invalidating the entries via the forked effect, which meant that entries kept invalidating themselves after being returned to the pool.

I updated the PR to fix the issue in these ways:

  1. When we obtain a channel from the pool, we check whether the channel is still open. If it isn't, we invalidate the entry and retry, up to the minimum number of entries + 1 times. While this works, it feels a bit hacky, so I'm happy to hear any better alternatives. I'm wondering whether ZPool could expose a constructor with an isAlive: A => IO[E, Boolean] argument that lets it check whether the items in it are still alive.
  2. When we call requestOnChannel, we fork an effect that fulfills the promises if the channel was closed. This listener only runs for non-WebSocket requests, just in case there is some unintentional interop happening here. I'm happy to have it run for WebSocket channels as well if we think that's safe.

Please let me know what you think. While the PR fixes the issue, it's a bit rough around the edges, so I'm happy to hear any recommendations you might have.
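As a rough sketch of what fix (1) boils down to, here's a simplified plain-Scala model. It involves no ZIO or Netty - Channel, Pool, and acquireOpen are hypothetical names, and the real code retries a ZIO effect rather than looping - but it captures the check-on-get, invalidate, and bounded-retry behaviour:

```scala
// Hypothetical model of fix (1): a channel taken from the pool is
// checked for openness; closed channels are invalidated and the
// acquisition is retried a bounded number of times.
final case class Channel(id: Int, isOpen: Boolean)

final class Pool(initial: List[Channel]) {
  private var items: List[Channel] = initial
  private var created = 0

  // Returns the next pooled channel, or creates a fresh (open) one
  // when the pool is empty, mirroring the pool replenishing itself.
  def get(): Channel = items match {
    case head :: tail => items = tail; head
    case Nil          => created += 1; Channel(1000 + created, isOpen = true)
  }

  def invalidate(c: Channel): Unit = () // the stale item is simply dropped
}

// Retry up to maxRetries times; give up with None if every attempt
// yielded a closed channel.
def acquireOpen(pool: Pool, maxRetries: Int): Option[Channel] = {
  var attempt = 0
  while (attempt <= maxRetries) {
    val c = pool.get()
    if (c.isOpen) return Some(c)
    pool.invalidate(c) // closed channel: drop it and try again
    attempt += 1
  }
  None
}
```

With three stale channels pooled, three retries are enough to drain them and reach a freshly created channel, while a single retry gives up too early.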

@codecov-commenter commented Jan 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (2d267b5) 64.76% vs head (8b76885) 64.87%.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2610      +/-   ##
==========================================
+ Coverage   64.76%   64.87%   +0.11%     
==========================================
  Files         140      140              
  Lines        8488     8487       -1     
  Branches     1574     1567       -7     
==========================================
+ Hits         5497     5506       +9     
+ Misses       2991     2981      -10     


@adamgfraser (Contributor)

@kyri-petrou So I think the issue is more fundamental here. We need to invalidate the entries, or have confirmed that we do not need to invalidate them because we successfully cleaned them up, before we return them to the pool. Otherwise we are just creating a race condition when someone else uses them.

@kyri-petrou (Collaborator, Author) commented Jan 13, 2024

> So I think the issue is more fundamental here. We need to invalidate the entries, or have confirmed that we do not need to invalidate them because we successfully cleaned them up, before we return them to the pool. Otherwise we are just creating a race condition when someone else uses them.

I think we could do that by adding a finalizer to the ZioNettyConnectionPool#get method. However, this won't fix the issue where a connection becomes invalid while it's in the pool. AFAICT, without changes to ZPool, the only way we can handle that without causing the connection pool to hang is by checking whether the connection is open when we retrieve it from the pool.

EDIT: @adamgfraser I added the finalizer to the .get method

@adamgfraser (Contributor)

So I think we need to think about why the connection is invalid.

If the connection becomes invalid because of something we did with it when we used it before then we need to check for that before returning it to the pool. If we check the connection was successfully cleaned up after we used it and invalidate it if not then another user of the pool should never observe an invalid connection because of our prior use.

If the connection becomes invalid just because connections randomly become invalid at some frequency, then that is a problem the pool cannot solve. Even if we checked that the connection was valid one moment, it could become invalid the next. So if that is a possibility, it has to be built into the use of the connection that it may essentially become invalid at any point, and the user has to handle that, likely by invalidating the connection, returning it to the pool, and retrying.
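The use-side discipline described above can be sketched in plain Scala (hypothetical names, no ZIO; the real client does this with effects and promises): the request path fails fast on a dead connection, invalidates it, and retries with a freshly acquired one.

```scala
final case class Conn(id: Int, healthy: Boolean)

// A send on a dead connection must fail, not suspend indefinitely.
def send(c: Conn, req: String): Either[String, String] =
  if (c.healthy) Right(s"ok:$req") else Left(s"conn ${c.id} closed")

// Use-side retry loop: on failure, invalidate the connection and try
// again with a newly acquired one, up to maxAttempts.
def request(
    acquire: () => Conn,
    invalidate: Conn => Unit,
    req: String,
    maxAttempts: Int
): Either[String, String] = {
  var last: Either[String, String] = Left("no attempt made")
  var n = 0
  while (n < maxAttempts) {
    val c = acquire()
    last = send(c, req)
    if (last.isRight) return last
    invalidate(c) // the connection went bad at some point; drop it
    n += 1
  }
  last
}
```

Note that validity checks at acquisition time remain best-effort in this model; the failure handling in the loop is what actually makes the caller robust.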

@kyri-petrou (Collaborator, Author) commented Jan 13, 2024

> If the connection becomes invalid just because connections randomly become invalid

This is the case we're facing here. There are a couple of reasons the connection might become invalid while in the pool, where we have no control over it:

  1. Connection gets dropped (either by the server or other means)
  2. Initial connection succeeded but failed afterwards - this is the case with SSL:
    i. Initial connection to the server succeeds. Channel is successfully created.
    ii. Channel is added to the pool
    iii. Channel starts negotiating SSL and fails
    iv. Channel is closed (while in the pool)

> So if that is a possibility it has to be built into the use of the connection

This is what I had in mind when adding the check during get, as well as the forked effect that checks whether the connection is still open while executing the request.

@adamgfraser (Contributor)

Okay, so I think we have a situation where one part of the system is expecting guarantees that another part of the system does not and cannot provide.

Given what you said above, it is not an accurate assumption that "any connection you get from the pool is valid"; in fact, even if you get a connection and determine it is valid, it may become invalid by the time you try to do anything with it. So the code that uses the connection needs to fail if the connection is invalid, instead of suspending indefinitely.

@kyri-petrou (Collaborator, Author)

> So the code that uses the connection needs to fail if the connection is invalid instead of suspending indefinitely.

I agree. This is the purpose of this addition.

PS: I wasn't suggesting that this needs to be handled by ZPool in its entirety. I was only suggesting that it might be good if it provided some guarantee that what's retrieved from it is valid. Nonetheless, ZClient still needs to be robust enough to handle the connection being dropped at any point across the whole lifecycle of the request.

@adamgfraser (Contributor)

Right. I guess what I'm saying is that that guarantee is not possible in the presence of nondeterministic failures, because even if an item was valid immediately before being gotten, it could be invalid by the time it is gotten. So the best you can do is get an item and then check whether it is valid; even then it could become invalid immediately after the check, so you still have to handle the possibility of failure afterwards.

You could also have some other process that periodically gets items and invalidates them if necessary, though that would also be at most probabilistic.

Anyway, I think you're on the right track making the use of the connection robust to the connection becoming invalid.

if (channel.isOpen) ZIO.succeed(channel)
else invalidate(channel) *> ZIO.fail(None)
}
.retry(retrySchedule(minSize(key) + 1))
@adamgfraser (Contributor)
I'm not sure it makes sense to have this logic because we already know that the channel might become invalid immediately after we check it so we still need to fail if we use a channel that is invalid and so can handle all of the logic there.

Also, if we do have something like this, I don't think there is any logical relationship between the pool size and the number of times we should retry. This seems to be based on the idea that we're checking "all" the values in the pool before giving up, but the items in the pool are potentially constantly changing, so a fixed number of retries might make more sense here.

@kyri-petrou (Collaborator, Author)

Yeah, I think you might be right. I was thinking of covering a potential edge case where all the channels in the pool are closed at the same time and there is only a single thread executing a request. However, it's likely better for the behaviour of the pool to be more deterministic, with a fixed number of retries. I used 3 (for no specific reason) but I'm happy to change it to something lower / higher.

@kyri-petrou (Collaborator, Author)

@adamgfraser I had to change the code back to dynamic retries. I added a test showing that a fixed number of retries would cause issues in cases where all channels in the connection pool are closed (e.g., in case of server idle timeouts).

@kyri-petrou (Collaborator, Author)

> This seems to be based on the idea that we're checking "all" the values in the pool before giving up but the items in the pool are potentially constantly changing so a fixed number of retries might make more sense here.

PS: I agree with your point here; having the number of retries equal to the maximum number of items that can exist in the pool doesn't guarantee that we checked every single item. However, it does guarantee that ZPool's get effect is executed at least once.
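The arithmetic behind this can be sketched with a toy simulation (plain Scala, hypothetical names; the real pool is concurrent): if the pool can hold up to maxSize closed channels, then maxSize attempts can all hit stale entries, while maxSize + 1 attempts must drain the pool and force at least one fresh creation.

```scala
// Toy model: staleCount closed channels sit in the pool; a get on the
// empty pool creates a fresh, open channel. Returns whether any of the
// `attempts` gets yielded an open channel.
def freshWithin(staleCount: Int, attempts: Int): Boolean = {
  val pool = scala.collection.mutable.Queue.fill(staleCount)(())
  (1 to attempts).exists { _ =>
    if (pool.nonEmpty) { pool.dequeue(); false } // stale channel: invalidate, retry
    else true                                    // pool drained: fresh channel created
  }
}
```

This mirrors the test result mentioned above: a fixed retry count smaller than the number of pooled items can fail when every channel in the pool happens to be closed.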

@@ -49,11 +49,11 @@ object ClientHttpsSpec extends ZIOHttpSpec {
test("respond Ok") {
val actual = Client.request(Request.get(waterAerobics))
assertZIO(actual)(anything)
},
} @@ ignore,
@adamgfraser (Contributor)

I think you can fix this one and the one below it by using a different URL. The water aerobics one has an expired SSL certificate.

@kyri-petrou (Collaborator, Author)

@adamgfraser just going through the code to double-check everything and I stumbled across this. Do you reckon we should only do onComplete.succeed(State.Invalid) here, in order to avoid any race condition with the closeFuture listener?

@adamgfraser (Contributor)

@kyri-petrou You're saying only call onComplete there, as opposed to in the new place we add it? If it is sufficient to complete it in only one place, that would definitely be better, but if there is some circumstance in which that won't complete it, we need to think that through.

And we want a situation where we know, by the time the promise is completed, that the user is completely done with the connection, since as soon as the promise is completed the connection may be reallocated to someone else who is working under the assumption that they have exclusive access to it.

@kyri-petrou (Collaborator, Author)

@adamgfraser thanks a lot for the pointers. I think the PR is ready to be re-reviewed now. Just some notes on the most recent changes:

  • We unfortunately can't rely on a fixed number of retries when obtaining a channel from the pool. The test I added in NettyConnectionPoolSpec fails if the number of retries is less than the number of items in the pool.
  • I changed the logic in ClientResponseStreamHandler so that the onComplete promise is fulfilled before we close the channel, to avoid situations where we fulfill the promise with an interruption from the closeFuture listener.

I think we could perhaps benefit from adding a "ConnectionManager" that monitors items in the pool and invalidates / refreshes them asynchronously, but I think we can flag that as a separate issue.

@adamgfraser (Contributor) left a review

Good to merge when tests are passing!

@kyri-petrou (Collaborator, Author) commented Jan 14, 2024

@adamgfraser do you happen to have any idea why the two tests that I re-enabled (which are not related to this PR) pass when I run them locally but fail with java.net.NoRouteToHostException in CI?

My gut tells me it's an issue with the DNS configuration of the GHA workers. I tried different endpoints (google.com, example.com, jsonplaceholder.typicode.com/posts) and they all fail with the same error.

@kyri-petrou (Collaborator, Author)

OK, I ended up using https://zio.dev for the test, which now works.

@adamgfraser adamgfraser merged commit bc595f3 into zio:main Jan 15, 2024
14 checks passed
@kyri-petrou kyri-petrou deleted the fix-2479 branch January 15, 2024 05:19
Successfully merging this pull request may close these issues.

Flaky ClientHttpsSpec Test