
gh-103477: Read and write gzip header and trailer with zlib #103478

Open
wants to merge 1 commit into main from gh-103477

Conversation

iii-i
Contributor

@iii-i iii-i commented Apr 12, 2023

RHEL, SLES and Ubuntu for IBM zSystems (aka s390x) ship with a zlib optimization [1] that significantly improves deflate and inflate performance on this platform by using a specialized CPU instruction.

This instruction not only compresses the data, but also computes a checksum. At the moment Python's gzip support performs compression and checksum calculation separately, which creates unnecessary overhead on s390x.

The reason is that Python needs to write specific values into the gzip header; when this support was introduced in 1997, there was indeed no better way to do this.
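
For illustration, a minimal sketch of this separate-checksum pattern, simplified relative to what gzip.py actually does (flag, XFL and filename handling are omitted):

import struct
import time
import zlib

def gzip_compress_sketch(data, level=9):
    # Minimal 10-byte gzip header: magic, CM=8 (deflate), FLG=0,
    # MTIME, XFL=0, OS=255 (unknown).
    header = struct.pack("<BBBBLBB", 0x1F, 0x8B, 8, 0,
                         int(time.time()), 0, 0xFF)
    # Raw deflate stream (negative wbits): zlib maintains no checksum...
    co = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = co.compress(data) + co.flush()
    # ...so the CRC32 is computed in a separate pass over the data,
    # which is exactly the overhead described above.
    trailer = struct.pack("<LL", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + body + trailer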

Since v1.2.2.1 (2011) zlib provides inflateGetHeader() and deflateSetHeader() functions for that, so Python does not have to deal with the exact header and trailer format anymore.

Add new interfaces to zlibmodule.c that make use of these functions:

  • Add mtime argument to zlib.compress().
  • Add mtime and fname arguments to zlib.compressobj().
  • Add gz_header_mtime and gz_header_done properties to ZlibDecompressor.

In Python modules, replace raw streams with gzip streams, make use of these new interfaces, and remove all mentions of crc32.

[1] madler/zlib#410
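
For illustration, intended usage looks roughly like this (hypothetical: the mtime and fname parameters and the gz_header_* properties exist only in this PR, and the exact signatures are an assumption):

import time
import zlib

# Hypothetical: mtime/fname and gz_header_* are proposed by this PR
# and do not exist in released CPython.
blob = zlib.compress(b"data", wbits=31, mtime=int(time.time()))

co = zlib.compressobj(wbits=31, mtime=int(time.time()), fname=b"data.txt")
blob = co.compress(b"data") + co.flush()

do = zlib._ZlibDecompressor(wbits=31)
data = do.decompress(blob)
do.gz_header_done   # per this PR: 1 once the header is fully parsed
do.gz_header_mtime  # per this PR: MTIME recovered from the header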

@encukou
Member

encukou commented Apr 17, 2023

ping @Yhg1s @gpshead as zlib experts

@iii-i
Contributor Author

iii-i commented May 31, 2023

Rebased. Could someone have a look please?

The Tests / Hypothesis tests failure on Ubuntu seems to be unrelated to this change and is affecting other PRs as well.

@iii-i iii-i force-pushed the gh-103477 branch 2 times, most recently from 0ed5988 to 1f28fdd on June 29, 2023
@iii-i
Contributor Author

iii-i commented Jun 29, 2023

  • Rebase.
  • Handle gz.name in zlib_Compress_copy_impl().
  • Fix free()/PyMem_Free() mixup.
  • Drop the no longer used _read_exact().
  • Adapt the test_flush_flushes_compressor() test now that
    _read_gzip_header() is gone.

@iii-i
Contributor Author

iii-i commented Jul 18, 2023

Rebased.
@Yhg1s & @gpshead could you have a look please?

@rhpvorderman
Contributor

This will break:

import gzip

try:
    ...
except gzip.BadGzipFile:
    ...

I don't know if this is a big problem, though, given that a zlib.error can theoretically also occur in such scenarios anyway.

@iii-i
Contributor Author

iii-i commented Oct 9, 2023

Thanks for having a look! That's indeed what test_bad_gzip_file() has detected. I thought we could live with that, but now that you brought it up, I found a simple workaround: convert zlib.error exceptions that happen while gz_header_done != 1 into BadGzipFile.
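
Roughly like this (a sketch; gz_header_done is the property added by this PR):

import gzip
import zlib

def _decompress_chunk(decompressor, chunk):
    # A zlib.error raised before the gzip header has been fully parsed
    # (gz_header_done != 1, per this PR) means the header itself is
    # malformed, so re-raise it as BadGzipFile.
    try:
        return decompressor.decompress(chunk)
    except zlib.error as exc:
        if decompressor.gz_header_done != 1:
            raise gzip.BadGzipFile(str(exc)) from exc
        raise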

@iii-i
Contributor Author

iii-i commented Oct 9, 2023

The Windows failure is:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'D:\\a\\cpython\\cpython\\build\\test_python_3232�'

It looks unrelated to me.

@rhpvorderman
Contributor

@iii-i I have been thinking about this. I am not a CPython core developer, but I did contribute quite heavily to the gzip module and I maintain python-isal.

This is a pretty massive code change, only so that zlib can write the header instead of Python. It does not add any extra functionality and results in the same behaviour. In effect the change only allows passing wbits=31 directly to compressobj and decompressobj rather than wbits=-15, which is only beneficial on the one platform that supports the optimization.

Isn't it possible to add a different wbits value for gzip reading and writing that does not read/write headers and trailers, and simply expose the z_stream's crc32 via an extra member (.crc) on the decompressobj and compressobj classes? Then all the Python header-writing code can remain the same, all the zlib functions can remain the same, and much less logic would need to be implemented. It would still achieve your goal of being faster on s390x, without the massive code overhaul.
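
Something along these lines (entirely hypothetical; neither the special wbits value nor the .crc attribute exists today):

import struct
import zlib

GZIP_BODY_ONLY = 100  # hypothetical marker: "gzip CRC, no header/trailer"

co = zlib.compressobj(9, zlib.DEFLATED, GZIP_BODY_ONLY)
body = co.compress(b"data") + co.flush()
# Python keeps writing the header itself, but reads the CRC32 that
# zlib maintained during deflate instead of running a separate
# zlib.crc32() pass over the input.
trailer = struct.pack("<LL", co.crc, len(b"data") & 0xFFFFFFFF)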

@iii-i
Contributor Author

iii-i commented Oct 25, 2023

Isn't it possible to add a different wbits value for gzip reading and writing that does not read/write headers and trailers, and simply expose the z_stream's crc32 via an extra member (.crc) on the decompressobj and compressobj classes?

Do you mean extending zlib itself? It should be quite simple code-wise, but zlib accepts very few PRs these days, and I fear the chances that this would go through are low.

Also, we could consider making Python depend on the DFLTCC zlib patch (a very bad idea, mentioned just for the sake of argument) and saying that zlib should update z_stream.adler even for raw streams when DFLTCC is in use. However, there is code out there that relies on this not happening: zlib-ng/zlib-ng#1390

@rhpvorderman
Contributor

Do you mean extending zlib itself? It should be quite simple code-wise, but zlib accepts very few PRs these days, and I fear the chances that this would go through are low.

No, I mean that it should be possible to tell the z_stream that a header has already been written and that the compression should be gzip. In that case the header can still be written by Python while the CRC and deflate are computed in one go. I believe this is already possible in current zlib. Unfortunately, there are no hooks in zlibmodule.c for this. I suppose it can be done by implementing an extra wbits value in zlibmodule.c (not in zlib itself) for this specific use case. That would require much less code than fully exposing the zlib gz_header API in the zlib module.

I think this can be accomplished using Z_BLOCK:

  The flush parameter of inflate() can be Z_NO_FLUSH, Z_SYNC_FLUSH, Z_FINISH,
Z_BLOCK, or Z_TREES.  Z_SYNC_FLUSH requests that inflate() flush as much
output as possible to the output buffer.  Z_BLOCK requests that inflate()
stop if and when it gets to the next deflate block boundary.  When decoding
the zlib or gzip format, this will cause inflate() to return immediately
after the header and before the first block. 

Feed the z_stream a fake but valid gzip header. Run inflate() with Z_BLOCK, and the z_stream ends up in a state where it considers the header already processed. Now next_in can be set to the actual data.
The same should be possible with deflate, although zlib.h is much less explicit about that.
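
At the Python level, the decompression side of this idea looks roughly as follows (a sketch; the actual change would live in zlibmodule.c and use Z_BLOCK as described above):

import zlib

# A minimal but valid 10-byte gzip header: magic, CM=8 (deflate),
# FLG=0, MTIME=0, XFL=0, OS=255 (unknown).
FAKE_HEADER = bytes([0x1F, 0x8B, 0x08, 0x00, 0, 0, 0, 0, 0x00, 0xFF])

do = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip format
# Consuming the fake header produces no output but advances the stream
# past header parsing; zlib then verifies the CRC32 of whatever deflate
# data and trailer are fed in afterwards.
assert do.decompress(FAKE_HEADER) == b""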

Given that the use case is speeding up one specific target platform, I think the PR should do just that, rather than also complicating the current Python zlib API as a side effect. My gut feeling is that if these features were truly wanted, they would have been requested years ago. So it is better to keep it simple and create a small code path that covers this use case. This is just my opinion, however, that of one person.

iii-i added a commit to iii-i/cpython that referenced this pull request Nov 17, 2023
RHEL, SLES and Ubuntu for IBM zSystems (aka s390x) ship with a zlib
optimization [1] that significantly improves deflate performance by
using a specialized CPU instruction.

This instruction not only compresses the data, but also computes a
checksum. At the moment Python's gzip support performs compression and
checksum calculation separately, which creates unnecessary overhead.
The reason is that Python needs to write specific values into the gzip
header, so it uses a raw stream instead of a gzip stream, and zlib
does not compute a checksum for raw streams.

The challenge with using gzip streams instead of raw streams is
dealing with the zlib-generated gzip header, which we would rather
generate manually. Implement the method proposed by @rhpvorderman: use
Z_BLOCK on the first deflate() call in order to stop before the first
deflate block is emitted. The data that is emitted up until this point
is zlib-generated gzip header, which should be discarded.

Expose this new functionality by adding a boolean gzip_trailer argument
to zlib.compress() and zlib.compressobj(). Make use of it in
gzip.compress() and GzipFile. The performance improvement varies
depending on the data being compressed, but it's in the ballpark of 40%.

An alternative approach is to use the deflateSetHeader() function,
introduced in zlib v1.2.2.1 (2011). This also works, but the change
was deemed too intrusive [2].

[1] madler/zlib#410
[2] python#103478
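
For illustration, usage of the proposed flag would look roughly like this (hypothetical: gzip_trailer exists only in the follow-up PR, and its exact semantics here are an assumption based on the description above):

import struct
import zlib

data = b"example payload"
# Python still writes its own header (MTIME=0 here for brevity)...
header = struct.pack("<BBBBLBB", 0x1F, 0x8B, 8, 0, 0, 0, 0xFF)
# ...while zlib computes the CRC32 during deflate and appends the gzip
# trailer; the zlib-generated header has already been discarded
# internally via the Z_BLOCK trick.
body_and_trailer = zlib.compress(data, gzip_trailer=True)
gz = header + body_and_trailer
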
@iii-i
Contributor Author

iii-i commented Nov 17, 2023

Thank you for the suggestion. I've implemented the compression part of it in #112199. If it's accepted, I will send the decompression part separately.

iii-i added a commit to iii-i/cpython that referenced this pull request Nov 17, 2023
RHEL, SLES and Ubuntu for IBM zSystems (aka s390x) ship with a zlib
optimization [1] that significantly improves deflate performance by
using a specialized CPU instruction.

This instruction not only compresses the data, but also computes a
checksum. At the moment Python's gzip support performs compression and
checksum calculation separately, which creates unnecessary overhead.
The reason is that Python needs to write specific values into the gzip
header, so it uses a raw stream instead of a gzip stream, and zlib
does not compute a checksum for raw streams.

The challenge with using gzip streams instead of raw streams is
dealing with the zlib-generated gzip header, which we would rather
generate manually. Implement the method proposed by @rhpvorderman: use
Z_BLOCK on the first deflate() call in order to stop before the first
deflate block is emitted. The data that is emitted up until this point
is zlib-generated gzip header, which should be discarded.

Expose this new functionality by adding a boolean gzip_trailer argument
to zlib.compress() and zlib.compressobj(). Make use of it in
gzip.compress(), GzipFile and TarFile. The performance improvement
varies depending on the data being compressed, but it's in the ballpark of
40%.

An alternative approach is to use the deflateSetHeader() function,
introduced in zlib v1.2.2.1 (2011). This also works, but the change
was deemed too intrusive [2].

[1] madler/zlib#410
[2] python#103478
@gpshead gpshead self-assigned this Nov 17, 2023
iii-i added a commit to iii-i/cpython that referenced this pull request Jan 29, 2024
iii-i added a commit to iii-i/cpython that referenced this pull request Feb 28, 2024
Labels
awaiting review stdlib Python modules in the Lib dir
6 participants