-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
move retries lower and retry rename ops #82
Conversation
Instead or retrying the whole operation, retry the specific parts that can fail. I believe this also fixes cases where we might end up with a retry loop within a retry loop? I'm not sure.
Retry renames when the error wasn't due to a file not existing. This case can come up on windows when some other process opens the file for reading while we're trying to rename it.
@willscott could you check this logic thoroughly. I think I got everything right, but I'm not sure. |
for i := 0; i < RetryAttempts; i++ { | ||
err = os.Rename(tmpPath, path) | ||
// if there's no error, or the source file doesn't exist, abort. | ||
if err == nil || os.IsNotExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's worth going down this rabbit hole of understanding what classes of errors are possible here.
- it looks like you'll get an
EEXIST
error in some cases if you try renaming a file to itself. - renaming across disks (https://stackoverflow.com/questions/50740902/move-a-file-to-a-different-drive-with-go) can fail in some cases. I wonder what happens when your tmp dir is on a different disk than your main data cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're putting the temporary directory inside the datastore directory, so that shouldn't be an issue.
Really, we probably don't even need "is not exist", but we might as well have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were also retrying put/delete write operations, which are no longer retried. I think transient failures in those aren't as much as much of a problem, but noting the functional change.
Yeah. I don't think we should be retrying write operations, but we probably should retry remove operations. |
Co-Authored-By: Will <will.scott@protocol.ai>
Ok, I've added retries to the delete ops as well. |
No description provided.