Retry on Broken Connection #5589

westonpace · 2024-04-04T15:34:41Z

Describe the bug
object_store appears to be fairly conservative when it comes to retrying errors:

The error is:

Generic GCS error: Error after 0 retries in 122.602µs, max_retries:10, retry_timeout:180s, source:error sending request for url (https://storage.googleapis.com/some-file.lance): connection error: unexpected end of file

Doing some reverse engineering we can see that this maps to:

reqwest::Error(kind = Kind::Request) // error sending request
hyper::Error(kind = Kind::Io) // connection error
std::io::Error(kind = Kind::UnexpectedEof) // unexpected end of file

I wonder if we were able to catch this in earlier versions of the retry logic because we had is_connect:

if let Some(e) = source.downcast_ref::<hyper::Error>() {
    if e.is_connect() || e.is_closed() || e.is_incomplete_message() {
        do_retry = true;
    }
}

Regrettably, is_connect seems to have disappeared with the latest version of hyper and now we have:

if let Some(e) = e.downcast_ref::<hyper::Error>() {
    do_retry = e.is_closed() || e.is_incomplete_message();
    break
}

Note that e.is_incomplete_message will be false in this case (this is hyper::Error(kind = Kind::Io) and not hyper::Error(kind = Kind::IncompleteMessage))

To Reproduce
Regrettably, I cannot easily reproduce this. I believe this is an error originating from the TCP stack (tokio) and so some kind of TCP emulator (or mock TCP implementation) would be needed.

Expected behavior
The request is retried

Additional context
It may be possible to catch this by further downcasting the source to std::io::Error and retrying if that succeeds. Or, maybe just retry anytime we can downcast to hyper::Error?

The text was updated successfully, but these errors were encountered:

tustvold · 2024-04-04T15:47:42Z

I wonder if we were able to catch this in earlier versions of the retry logic because we had is_connect:

This version hasn't been released yet, and even then #5536 which made this change moved this retry to a different location (which also might have fixed this).

I believe this is probably a duplicate of #5545

westonpace · 2024-04-04T16:37:09Z

We can close as duplicate if you'd like. Both are reqwest::Request error not being retried. I had opened a separate one since this is a GET and not a PUT. I'll comment over there.

westonpace added the bug label Apr 4, 2024

tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Apr 4, 2024

tustvold changed the title ~~object_store does not retry in certain cases where it should~~ Retry on Broken Connection Apr 4, 2024

westonpace closed this as completed Apr 4, 2024

therealsharath mentioned this issue May 20, 2024

build(deps): bump object_store from 0.9.1 to 0.10.1 apache/datafusion-python#707

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry on Broken Connection #5589

Retry on Broken Connection #5589

westonpace commented Apr 4, 2024 •

edited

Loading

tustvold commented Apr 4, 2024 •

edited

Loading

westonpace commented Apr 4, 2024

Retry on Broken Connection #5589

Retry on Broken Connection #5589

Comments

westonpace commented Apr 4, 2024 • edited Loading

tustvold commented Apr 4, 2024 • edited Loading

westonpace commented Apr 4, 2024

westonpace commented Apr 4, 2024 •

edited

Loading

tustvold commented Apr 4, 2024 •

edited

Loading