Slow performance due to lack of buffering for GzipFile.write #760
Comments
What we do in xopen is simply wrap all the GzipFile handling in an io.BufferedWriter and return that to the user. It should be a one- or two-line fix for the smart_open people. I'd also like to plug xopen here: it has a much narrower focus than smart_open, but as a result it is very fast, especially for any gzip-related work. I notice that the smart_open library does not use accelerated libraries such as python-isal; those can have a great impact on performance.
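A minimal sketch of that approach using only the standard library (the filename, data, and buffer size are illustrative choices, not xopen's actual code):

```python
import gzip
import io

# Wrap the GzipFile in an io.BufferedWriter so that many small writes are
# collected into fewer, larger GzipFile.write() calls before compression.
gz = gzip.GzipFile("example.csv.gz", mode="wb")
with io.BufferedWriter(gz, buffer_size=128 * 1024) as fileobj:
    for i in range(1_000_000):
        fileobj.write(f"{i},value\n".encode())
```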
@svisser Are you interested in making a PR to implement @rhpvorderman's suggestion?
Is smart_open core the right place for such buffering? IMO that should either go directly into Python (as your CPython ticket suggests) or live in user-land as a custom registered extension handler (as you say). But I can also see that this is a pretty common use case, and smart_open does focus on performance, so it's not a bad fit. I'm on the fence here. How far do we want to take this? Implement buffering / parallelized compression, or is it not an issue with just buffered writes?
I think we should just do the bare minimum to address the issue: add buffering to gzip as an edge case. Once the CPython developers fix this on their side, we can remove the buffering on ours.
This fix could be really important for my similar use case: streaming large gzipped tar files from an HTTPS endpoint to an S3 bucket. The throughput I've been seeing with smart_open (6.4.0 on Python 3.10) is inadequate, around 130 seconds per ~50 MB part, meaning over 8 hours to transfer an 11 GB file! Previously we streamed files to temporary storage before performing a multipart upload to S3. That was very quick, given sufficient disk space, as there is far less overhead than with the current smart_open implementation. Tackling a new, larger dataset prompted the move to chaining the streams, as the cloud service's disk-space constraints are not sufficient for the largest files.
Problem description

When writing a .csv.gz file to S3, performance was slow because GzipFile.write() is called for every single line in the CSV file. That by itself is not the issue, but there is also a lack of buffering in GzipFile.write() (see: python/cpython#89550), and I was able to improve performance by implementing a solution similar to python/cpython#101251 (i.e., registering a custom compression handler for the '.gz' file extension in which GzipFile.write does have buffering); a sketch of that workaround is shown below.

I'm opening this issue for the smart_open library to discuss:

- should smart_open add buffering around GzipFile.write() for Python versions that don't yet have the above fix?
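A minimal sketch of the workaround described above, using smart_open's register_compressor hook; the handler name, buffer size, and read-mode branch are illustrative choices rather than the exact code behind this report:

```python
import gzip
import io

from smart_open import register_compressor

# Re-register the '.gz' handler so the GzipFile is wrapped in a buffer,
# batching many small per-row writes into fewer GzipFile.write() calls.
def _handle_gz_buffered(file_obj, mode):
    gz = gzip.GzipFile(fileobj=file_obj, mode=mode)
    if "r" in mode:
        return io.BufferedReader(gz, buffer_size=128 * 1024)
    return io.BufferedWriter(gz, buffer_size=128 * 1024)

register_compressor(".gz", _handle_gz_buffered)
```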
Steps/code to reproduce the problem

Use smart_open.open in write mode for a .csv.gz file to S3:
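A minimal sketch of the described pattern (the bucket, key, and row data are placeholders):

```python
import csv

from smart_open import open

# Placeholder rows; the real input is a large CSV.
rows = [[i, "some", "value"] for i in range(1_000_000)]

# smart_open selects gzip compression from the '.gz' extension; csv.writer
# then drives one GzipFile.write() call per row, which is where the
# slowdown shows up.
with open("s3://my-bucket/data/example.csv.gz", "w") as fileobj:
    csv.writer(fileobj).writerows(rows)
```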
(writerows, in C, calls writerow for each line, which in turn calls GzipFile.write() for each line.)

Versions