Slow performance due to lack of buffering for GzipFile.write #760
Comments
What we do in xopen is simply wrap all the GzipFile handling in an io.BufferedWriter and return that to the user. It should be a one- or two-line fix for the smart_open people. I'd also like to plug xopen here: it has a much narrower focus than smart_open, but as a result it is very fast, especially for any gzip-related work. I notice that the smart_open library does not use accelerated libraries such as python-isal; those can have a great impact on performance.
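A minimal sketch of that approach using only the standard library (the filename, data, and buffer size are illustrative choices, not xopen's actual code):

```python
import gzip
import io

# Wrap the GzipFile in an io.BufferedWriter so that many small writes are
# collected into fewer, larger GzipFile.write() calls before compression.
gz = gzip.GzipFile("example.csv.gz", mode="wb")
with io.BufferedWriter(gz, buffer_size=128 * 1024) as fileobj:
    for i in range(1_000_000):
        fileobj.write(f"{i},value\n".encode())
```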
@svisser Are you interested in making a PR to implement @rhpvorderman's suggestion?
Is smart_open core the right place for such buffering? IMO that should either go directly into Python (as your CPython ticket suggests) or live in user-land as a custom registered extension handler (as you say). But I can also see that this is a pretty common use case, and smart_open does focus on performance, so it's not a bad fit. I'm on the fence here. How far do we want to take this? Implement buffering / parallelized compression, or is it not an issue with just buffered writes?
I think we should just do the bare minimum to address the issue: add buffering to gzip as an edge case. Once the CPython developers fix this on their side, we can remove the buffering on ours.
This fix could be really important for my similar use case: streaming large gzipped tar files from an HTTPS endpoint to an S3 bucket. The throughput I've been seeing with smart_open (6.4.0 on Python 3.10) is inadequate, around 130 seconds per ~50 MB part, meaning over 8 hours to transfer an 11 GB file! Previously we streamed files to temporary storage before performing a multipart upload to S3. That was very quick, given sufficient disk space, as there is far less overhead than with the current smart_open implementation. Tackling a new, larger dataset prompted the move to chaining the streams, as the cloud service's disk-space constraints are not sufficient for the largest files.
Problem description

When writing a .csv.gz file to S3, performance was slow because GzipFile.write() is called for every single line in the CSV file. That by itself is not the issue, but there is also a lack of buffering in GzipFile.write() (see: python/cpython#89550), and I was able to improve performance by implementing a solution similar to python/cpython#101251 (i.e., registering a custom compression handler for the '.gz' file extension in which GzipFile.write does have buffering); a sketch of that workaround is shown below.

I'm opening this issue for the smart_open library to discuss:

- should smart_open add buffering around GzipFile.write() for Python versions that don't yet have the above fix?
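A minimal sketch of the workaround described above, using smart_open's register_compressor hook; the handler name, buffer size, and read-mode branch are illustrative choices rather than the exact code behind this report:

```python
import gzip
import io

from smart_open import register_compressor

# Re-register the '.gz' handler so the GzipFile is wrapped in a buffer,
# batching many small per-row writes into fewer GzipFile.write() calls.
def _handle_gz_buffered(file_obj, mode):
    gz = gzip.GzipFile(fileobj=file_obj, mode=mode)
    if "r" in mode:
        return io.BufferedReader(gz, buffer_size=128 * 1024)
    return io.BufferedWriter(gz, buffer_size=128 * 1024)

register_compressor(".gz", _handle_gz_buffered)
```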
Steps/code to reproduce the problem

Use smart_open.open in write mode for a .csv.gz file to S3:
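A minimal sketch of the described pattern (the bucket, key, and row data are placeholders):

```python
import csv

from smart_open import open

# Placeholder rows; the real input is a large CSV.
rows = [[i, "some", "value"] for i in range(1_000_000)]

# smart_open selects gzip compression from the '.gz' extension; csv.writer
# then drives one GzipFile.write() call per row, which is where the
# slowdown shows up.
with open("s3://my-bucket/data/example.csv.gz", "w") as fileobj:
    csv.writer(fileobj).writerows(rows)
```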
(writerows, in C, calls writerow for each line, which in turn calls GzipFile.write() for each line.)

Versions