
Retrying on connection closed #104

Open
jbransen opened this issue Jan 6, 2025 · 6 comments

jbransen commented Jan 6, 2025

I am running this library in a production setting, and on a regular basis I observe the following error:

RpcError(Status { code: Cancelled, message: "operation was canceled", source: Some(tonic::transport::Error(Transport, hyper::Error(Canceled, "connection closed"))) })

When reading the Bigtable connection pool documentation this seems to be expected, as connections are refreshed every hour. However, as a client user I am surprised: why does this error make its way to the end user? Is that a conscious choice, or is this an oversight? I would expect this library to handle this gracefully by retrying (which should always work fine), so that clients never end up with such errors.

liufuyang (Owner) commented Jan 6, 2025

Hey @jbransen, thanks for raising this issue. You might be the first person to use this crate in a production environment. I created this repo/crate while I worked at Spotify - at that moment I thought it might be handy in case we wanted to try Rust in some of our backend systems, as many of them needed to talk to Bigtable (however, that never really happened before I left Spotify). I do remember that in those days, Spotify engineers needed to add Bigtable channel refreshing (client swapping) manually in their code bases to deal with the hourly spikes, and these days that auto-refresh feature seems to have been merged into the standard Bigtable Java client (as described in the doc you linked). On the Go side, the doc you linked points to some simple example code that does the same thing, but users need to implement it themselves, I suppose.

So basically, without looking into all the details: if one wishes to have a similar feature (client refreshing), one might just need to implement it in a similar fashion in Rust. On your side, you could make it work by periodically creating a new BigTableConnection and replacing the old one. Or, as you said, this could be done inside BigTableConnection itself, with some settings/flags to turn the feature on and configure it.
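For illustration, here is a minimal sketch of that manual approach, assuming the tokio and arc-swap crates. `Client` and `connect()` are placeholders standing in for BigTableConnection and however you construct it, not this crate's actual API, and the 45-minute period is just an example value below the hourly server-side close:

```rust
use std::sync::Arc;
use std::time::Duration;

use arc_swap::ArcSwap; // lock-free pointer swap, so readers never block

// Placeholder for the real connection type (e.g. BigTableConnection).
struct Client;

// Placeholder for however the connection is actually created.
async fn connect() -> Client {
    Client
}

#[tokio::main]
async fn main() {
    let shared = Arc::new(ArcSwap::from_pointee(connect().await));

    // Background task: build the new client first, then swap it in,
    // so in-flight requests keep using the old (still warm) one.
    let refresher = Arc::clone(&shared);
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(45 * 60));
        interval.tick().await; // the first tick fires immediately; skip it
        loop {
            interval.tick().await;
            refresher.store(Arc::new(connect().await));
        }
    });

    // Request path: grab the current client without taking a lock.
    let client = shared.load();
    // ... use `client` for reads/writes ...
    let _ = client;
}
```

Old Arc handles stay alive until their last in-flight request finishes, so the swap never interrupts an ongoing call.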

Unfortunately, I no longer use this crate in my work, nor do I work with Bigtable anymore. So even if I try to implement a feature like this, it might be a bit hard for me to test it out (I am not sure whether the Bigtable emulator can simulate a connection close - perhaps it can?).

But if you want, you are very welcome to contribute to this crate: add such a feature and test it out in your environment before merging.

Let me know if this helps :)

jbransen (Author) commented Jan 6, 2025

Thank you for the fast reply, this helps. I am happy to craft a solution, try it out, and make a PR.

liufuyang (Owner) commented

Oh cool, thanks a lot. I haven't been following Rust gRPC development closely, so I am not sure whether there is some off-the-shelf implementation on the tonic/tokio side that could help in this case. You might want to check the API there a bit, or perhaps ask in the tonic Discord channel; people are helpful there, and I got a lot of good help while implementing this crate.

liufuyang (Owner) commented Jan 7, 2025

Thanks. In this case, though, I am not 100% sure that "retry-on-channel-close" is really what we need, or that it would solve your issue gracefully. From the linked Google doc, and from public information such as this Stack Overflow page, what I know is that reconnecting gRPC and opening warm channels is not cheap, so relying on any retry mechanism to fix your issue would cause hourly latency spikes against Bigtable.

What I mentioned above, and what the Go example in the linked Google doc suggests, is not a retry but rather "keep creating a new client (and opening new channels), and periodically replace the old client with the new one". So it is about auto client/channel refreshing rather than any sort of retry.
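To make that distinction concrete, here is a hedged variant of the earlier sketch where the replacement client is warmed up before the swap, so the expensive channel setup never sits on the request path. `probe()` is a hypothetical cheap call (e.g. reading a known row), not part of this crate:

```rust
use std::sync::Arc;

use arc_swap::ArcSwap;

struct Client; // placeholder for the real connection type

async fn connect() -> Client {
    Client // placeholder: open a fresh connection here
}

// Hypothetical warm-up call: any cheap request that forces the new
// channels to be fully established before real traffic hits them.
async fn probe(_client: &Client) {}

// Build and warm the replacement off the request path, then swap
// atomically; callers holding the old Arc finish their in-flight
// calls on the old connection undisturbed.
async fn refresh_once(shared: &ArcSwap<Client>) {
    let fresh = connect().await;
    probe(&fresh).await; // pay the warm-up cost here, not on a retry
    shared.store(Arc::new(fresh));
}
```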

liufuyang (Owner) commented
And you might want to update the issue title from "Retrying on connection closed" to something like "Auto-refreshing the client to deal with Bigtable's hourly server-side gRPC channel close".
