S3 Multipart uploads committed on exception when using compression via context manager #684

Closed
Limess opened this issue Jan 17, 2022 · 10 comments · Fixed by #786

@Limess

Limess commented Jan 17, 2022

Problem description

  • We're trying to upload gzipped objects to S3 using multipart upload.
  • We're using a context manager to open a gzipped output file in S3 and writing to it. When a write raises an exception, we expect the multipart upload to be terminated (aborted).
  • Instead, the multipart upload is committed.

This is because, when a context manager wraps a compressed file, the MultipartWriter __exit__ method is never called; see https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L938-L942
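For reference, the linked __exit__ looks roughly like this (paraphrased from s3.py at that commit; terminate() aborts the multipart upload, close() completes it):

def __exit__(self, exc_type, exc_val, exc_tb):
    if exc_type is not None:
        self.terminate()   # abort the multipart upload
    else:
        self.close()       # complete the multipart upload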

We've seen two resulting behaviours from this issue:

  1. Empty or incomplete multipart uploads are written; this seems to be the expected outcome when __exit__ is not called correctly.
  2. Corrupted gzipped files are written. We're less sure how this occurred.

Steps/code to reproduce the problem

import smart_open

# "s3://output-bucket/test-file.gz" stands in for the original report's
# s3.s3_uri(output_bucket, "test-file.gz") helper; gzip compression is
# inferred from the .gz extension.
with smart_open.open(
    "s3://output-bucket/test-file.gz",
    "wb",
) as fout:
    for data in ["valid-data", Exception("An exception")]:
        if isinstance(data, Exception):
            raise data
        fout.write((data + "\n").encode("utf-8"))
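One way to confirm which behaviour you got (not part of the original report; bucket and key are placeholders) is to inspect the bucket with boto3 after the exception:

import boto3

client = boto3.client("s3")

# A dangling in-progress upload shows up here if neither complete nor
# abort was ever called.
print(client.list_multipart_uploads(Bucket="output-bucket").get("Uploads", []))

# A committed (possibly corrupt) object shows up here if close()
# completed the upload despite the exception.
print(client.list_objects_v2(Bucket="output-bucket", Prefix="test-file.gz").get("KeyCount", 0))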

Versions

>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-debian-bullseye-sid
>>> print("Python", sys.version)
Python 3.7.11 (default, Jan 13 2022, 14:48:06)
[GCC 9.3.0]
>>> print("smart_open", smart_open.__version__)
smart_open 5.2.1
@mpenkov
Collaborator

mpenkov commented Jan 17, 2022

Thank you for reporting this. Are you interested in making a PR?

@Limess
Author

Limess commented Jan 17, 2022

Not right now, sorry.

@mpenkov mpenkov added the bug label Jan 17, 2022
@lociko

lociko commented Jan 3, 2023

Any update? If not, I can try to fix it.

@mpenkov
Collaborator

mpenkov commented Jan 3, 2023

No updates on this. Yes, please go ahead.

@mpenkov
Collaborator

mpenkov commented Jan 3, 2023

I think the problem is here: https://github.com/RaRe-Technologies/smart_open/blob/18128f15e50c5ceda065a6d7d041eab6cb0933ad/smart_open/compression.py#L140

We should be making the result of compression_wrapper a context manager too, so it cleans up the underlying stream correctly.

Right?
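A minimal sketch of that idea (hypothetical, not smart_open's actual API): wrap the compressed stream in an object whose __exit__ delegates to the inner writer, so terminate() still fires on exceptions:

class CompressedWriter:
    """Hypothetical wrapper making the compression layer a context manager
    that forwards exception state to the underlying smart_open stream."""

    def __init__(self, outer, inner):
        self._outer = outer  # e.g. a gzip.GzipFile wrapping `inner`
        self._inner = inner  # e.g. smart_open.s3.MultipartWriter

    def __enter__(self):
        return self._outer

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None:
            # Normal exit: flush the gzip trailer; closing the outer stream
            # then closes the inner one, completing the multipart upload.
            self._outer.close()
        else:
            # Error exit: skip the outer close and let the inner stream's
            # __exit__ call terminate(), aborting the upload.
            self._inner.__exit__(exc_type, exc_val, exc_tb)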

@lociko

lociko commented Jan 4, 2023

I think that is not enough. We already have tweak_close:

def tweak_close(outer, inner):
    """Ensure that closing the `outer` stream closes the `inner` stream as well.

    Use this when your compression library's `close` method does not
    automatically close the underlying filestream.  See
    https://github.com/RaRe-Technologies/smart_open/issues/630 for an
    explanation why that is a problem for smart_open.
    """
    outer_close = outer.close

    def close_both(*args):
        nonlocal inner
        try:
            outer_close()
        finally:
            if inner:
                inner, fp = None, inner
                fp.close()

    outer.close = close_both

Which closes the underlying streams, but in the case of the S3 MultipartWriter we want to terminate (abort) the upload if an error is raised, like in this piece of code:
https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L938-L942

So what I see is that when an exception is thrown, tweak_close is called, but MultipartWriter just finishes the upload as if nothing happened. I think that is the main problem: when an exception occurs, we should call MultipartWriter.terminate to abort the upload.
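For context, terminating boils down to S3's AbortMultipartUpload call. A standalone boto3 sketch (bucket and key are placeholders, not smart_open internals):

import boto3

client = boto3.client("s3")

# Start a multipart upload, then abandon it: no object is created and
# any uploaded parts are discarded.
resp = client.create_multipart_upload(Bucket="output-bucket", Key="test-file.gz")
client.abort_multipart_upload(
    Bucket="output-bucket",
    Key="test-file.gz",
    UploadId=resp["UploadId"],
)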

@mpenkov
Collaborator

mpenkov commented Jan 4, 2023

That sounds right. Is tweak_close able to determine when it is being called as part of exception handling?

If not, we may have to look at the stack.

@lociko

lociko commented Jan 4, 2023

@mpenkov I'm not sure how to determine when it is being called as part of exception handling or not so I decided to use sys.exc_info().
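Something along these lines (a sketch of the sys.exc_info() approach, not the final PR): inside close_both, check whether an exception is in flight and, if the inner stream supports it, terminate instead of close:

import sys

def tweak_close(outer, inner):
    """Variant sketch: abort the inner writer when closing during an exception."""
    outer_close = outer.close

    def close_both(*args):
        nonlocal inner
        try:
            outer_close()
        finally:
            if inner:
                inner, fp = None, inner
                if sys.exc_info()[0] is not None and hasattr(fp, "terminate"):
                    fp.terminate()  # abort the multipart upload
                else:
                    fp.close()

    outer.close = close_both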

@ddelange
Contributor

ddelange commented Oct 5, 2023

Similar issue with Azure: #783

@ddelange
Contributor

ddelange commented Oct 5, 2023

> @mpenkov I'm not sure how to determine when it is being called as part of exception handling or not so I decided to use sys.exc_info().

Using sys.exc_info() will cause erroneous behaviour: if you want to write a log file to S3 during handling of some other upstream exception, you're inside an exception context (and you still want your multipart upload of the log file to succeed).
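A concrete illustration of that failure mode (hypothetical bucket and path): the upload itself is fine, but sys.exc_info() is non-None for the whole handler:

import smart_open

try:
    raise RuntimeError("some upstream failure")
except RuntimeError:
    # Inside an except block, sys.exc_info() is non-None even though
    # this upload succeeds, so an exc_info-based check in close()
    # would wrongly abort the multipart upload of the log file.
    with smart_open.open("s3://output-bucket/error.log.gz", "wb") as fout:
        fout.write(b"upstream failure details\n")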
