Skip to content

Check for additional IO errors that should be retried #319

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hachikuji
Copy link

I have been testing a slatedb-based database using antithesis with minio as the s3 backend. Antithesis is able to simulate various connection/transport issues which are currently causing test failures. Here is an example of one of the errors that we see:

Error: "SlateDbError(ObjectStoreError(Generic { store: \"S3\", source: ListRequest { source: RetryError { method: GET, uri: Some(http://minio:9000/rs3-antithesis?&list-type=2&prefix=rs3%2Fb1a45157-e2f0-4698-be0e-5bf3a9b8e9d1%2F0%2Fmanifest%2F), retries: 0, max_retries: 18446744073709551615, elapsed: 13.543569863s, retry_timeout: 1800s, inner: Http(HttpError { kind: Unknown, source: reqwest::Error { kind: Request, source: hyper_util::client::legacy::Error(SendRequest, hyper::Error(Io, Os { code: 104, kind: ConnectionReset, message: \"Connection reset by peer\" })) } }) } } }))

Note that although max_retries is set to its limit and the timeout is set to 30min, the request failed without any retries after 13s. It looks like the underlying ConnectionReset comes to object_store as an Unknown error which aborts the retry logic.

I tested this patch as a workaround and it addressed the issue. Feedback on the patch or alternative suggestions are welcome.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way we could test this?

@@ -108,12 +108,15 @@ impl HttpError {
} else if e.is_timeout() {
kind = HttpErrorKind::Timeout;
}
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this means that now we will potentially recategorize an error already categorized by hyper, this seems undesirable.

Copy link
Author

@hachikuji hachikuji Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problematic part is that we break regardless of whether any of the conditions (is_closed, is_timeout...) are true. It would probably be fine to only break if one of them is satisfied.

@tustvold tustvold marked this pull request as draft April 21, 2025 15:27
@tustvold
Copy link
Contributor

Marking as draft as I don't believe this is currently waiting on review, feel free to mark ready for review when you would like someone to take another look.

@hachikuji hachikuji force-pushed the fix-retry-io-errors branch from 7faceef to 9bc2c13 Compare April 24, 2025 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants