-
-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix avoidable S3 race condition (#693) #735
Conversation
smart_open/s3.py
Outdated
if 'Error' in err.response and err.response['Error'].get('Code') == '404': | ||
logger.warning( | ||
"Encountered '404 Not Found' error for key #%i", | ||
key_no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it perhaps make sense to log the key, rather than the key_no? If anybody does care to see the log, I'm not sure the number would be as helpful in debugging a situation.
An argument could perhaps be made that it's not actually worthy of being logged as a warning. An end-user could read this as something being wrong or broken, when this is completely valid behavior that is to be expected from time to time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very good point. If a 404 error occurs during iter_bucket
, it is almost certainly not the users fault, but rather the fault of the internal list_keys
function. I agree that logging the error would just be confusing to the user. I've removed the warning in my latest commit. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me 👍
5e1db28
to
8b7d202
Compare
smart_open/s3.py
Outdated
while True: | ||
try: | ||
(key, content) = result_iterator.__next__() | ||
if True or key_no % 1000 == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any need for the True or
bit?
edit: I realize that was there before. I wonder why 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also wondering this while I was looking at the code. I am not sure why it is there, I can remove it if its confirmed that it is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps @mpenkov knows more about this.
Thank you @RachitSharma2001 ! |
Motivation
Fixes #693
As proposed by @ronkorving, it may be common for the s3
iter_bucket
function to encounter404 not found
errors due to the s3list bucket
command sometimes returning recently deleted keys. So a workaround of this problem is to surpress 404 errors foriter_bucket
, and instead log them as warnings for the user to assess.Tests
All existing tests related to
iter_bucket
withins3.py
pass. I also added two new tests:test_iter_bucket_404
andtest_iter_bucket_non_404
. The first test simulates the scenario withiter_bucket
for a key throws a 404 error, and expects that error to be suppressed and for computation to carry on for the rest of the keys. The latter test simulates the scenario where a non 404 error occurs, in which case it should not be suppressed.