Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 multipart uploads to text streams are committed on exception #763

Open
3 tasks done
nelhage opened this issue Mar 8, 2023 · 5 comments
Open
3 tasks done

S3 multipart uploads to text streams are committed on exception #763

nelhage opened this issue Mar 8, 2023 · 5 comments

Comments

@nelhage
Copy link

nelhage commented Mar 8, 2023

Problem description

If I smart_open an S3 file in text mode, write some data, and then raise an exception, I expect the multipart upload to be terminated, and no key to be created. That's what I see for binary-mode smart_open streams.

This is almost the same issue as #684, but in the context of using text streams, instead of compression. It suggests there might be some broader architectural fix, although I'm not sure what it would be.

Steps/code to reproduce the problem

Given this code:

with smart_open.open("s3://some_bucket/path/to/key.txt", "w") as fh:
  fh.write("hello\n")
  raise AssertionError()

I expect key.txt not to be created. Instead, I see it created with the content "hello\n".

For the almost-identical binary version:

with smart_open.open("s3://some_bucket/path/to/key.txt", "wb") as fh:
  fh.write(b"hello\n")
  raise AssertionError()

it works as intended.

The proximate problem is that while s3.MultipartWriter implements cancellation on exception, when we open in text mode, we wrap the object in a TextIOWrapper.

TextIOWrapper inherits from IOBase, and IOBase.__exit__ unconditionally calls self.close(), which completes the write and creates the partial object.

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
Linux-5.4.209-116.367.amzn2.x86_64-x86_64-with-glibc2.35
Python 3.10.8 (main, Feb 21 2023, 22:10:24) [GCC 11.3.0]
smart_open 6.3.0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
@donsokolone
Copy link
Contributor

donsokolone commented Apr 19, 2024

This issue affects SinglePartWriter as well.

The problem boils down to how garbage collection of both SinglepartWriter and MultipartWriter is being done. Both writers inherit from io.BufferedIOBase which calls self.close() on garbage collection when __del__() descriptor is invoked.

If writer was properly marked as closed on exception then calling del() by GC would not invoke close on the writer. Unfortunately current implementation of closed look like this:

@property
def closed(self):
    return self._buf is None

Hence it will always return True in case of unhandled exception. I suppose an easy way to fix it would be to set self._buf = None at the end of terminate() method of both writers.

@ddelange
Copy link
Contributor

ddelange commented Apr 19, 2024

The proximate problem is that while s3.MultipartWriter implements cancellation on exception, when we open in text mode, we wrap the object in a TextIOWrapper.

TextIOWrapper inherits from IOBase, and IOBase.__exit__ unconditionally calls self.close(), which completes the write and creates the partial object.

TextIoWrapper was patched in v7.0.0 ref 42682e7 (#783)

So using the with-statement will now also abort upload on exception when in text mode.

I think that fixes this issue

@donsokolone
Copy link
Contributor

donsokolone commented Apr 19, 2024

@ddelange regardless of patches for TextIOWrapper implemented in v7.0.0 close() will be called on both writers when garbage collected, resulting in unwanted writes to S3.

@mpenkov
Copy link
Collaborator

mpenkov commented Apr 20, 2024

@donsokolone Can you think of a way we can avoid the unwanted writes?

@donsokolone
Copy link
Contributor

@mpenkov After doing some more troubleshooting I've created separate issue #819 as the problem I've mentioned affects only SinglepartWriter at the moment. Fix for that is quite simple and I will try to PR it today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants