bpo-43613: Faster implementation of gzip.compress and gzip.decompress #27941

Merged
merged 22 commits on Sep 2, 2021
Changes from all commits
Commits (22)
608876c
Add test for zlib.compress wbits
rhpvorderman Mar 24, 2021
12fb5a5
Add wbits argument to zlib.compress
rhpvorderman Mar 24, 2021
1b34510
Use clinic to generate input
rhpvorderman Mar 24, 2021
4e90311
Update documentation for zlib
rhpvorderman Mar 24, 2021
9717252
Add blurb news entry for zlib.compress wbits parameter
rhpvorderman Mar 24, 2021
d70390e
Fix doc typo
rhpvorderman Mar 24, 2021
c819e3e
Remove unnecessary whitespace, add punctuation and complete sentences.
rhpvorderman Jun 1, 2021
1f3481f
Break line to comply with PEP-7
rhpvorderman Jun 1, 2021
8019932
Update blurb to include :func: reference
rhpvorderman Jun 1, 2021
0ea98cf
Remove erroneous double backticks
rhpvorderman Jun 1, 2021
d1c86dc
Faster gzip.compress implementation
rhpvorderman Aug 24, 2021
19a0358
More efficiently decompress gzip files in memory
rhpvorderman Aug 24, 2021
5155857
Ensure correct endianness
rhpvorderman Aug 24, 2021
fa188a6
Remove redundant line
rhpvorderman Aug 24, 2021
4e76cf5
Fix typos and test errors
rhpvorderman Aug 24, 2021
0280c95
Revert changing default on compress for backwards compatibility
rhpvorderman Aug 25, 2021
8ddee29
Update documentation with gzip speed improvements
rhpvorderman Aug 25, 2021
77f79fd
Add a blurb for gzip speed improvements
rhpvorderman Aug 25, 2021
ca3e543
Use + instead of bytes.join() method
rhpvorderman Aug 25, 2021
f881f7e
Reword docstring for read_gzip_header
rhpvorderman Aug 30, 2021
97a8100
Update docstring for gzip.compress
rhpvorderman Aug 30, 2021
eeb7766
Use subtest for zlib.compress/decompress test.
rhpvorderman Aug 30, 2021
17 changes: 14 additions & 3 deletions Doc/library/gzip.rst
@@ -174,19 +174,30 @@ The module defines the following items:

Compress the *data*, returning a :class:`bytes` object containing
the compressed data. *compresslevel* and *mtime* have the same meaning as in
the :class:`GzipFile` constructor above.
the :class:`GzipFile` constructor above. When *mtime* is set to ``0``, this
function is equivalent to :func:`zlib.compress` with *wbits* set to ``31``.
The zlib function is faster.

.. versionadded:: 3.2
.. versionchanged:: 3.8
Added the *mtime* parameter for reproducible output.
.. versionchanged:: 3.11
Speed is improved by compressing all data at once instead of in a
streamed fashion. Calls with *mtime* set to ``0`` are delegated to
:func:`zlib.compress` for better speed.

.. function:: decompress(data)

Decompress the *data*, returning a :class:`bytes` object containing the
uncompressed data.
uncompressed data. This function is capable of decompressing multi-member
gzip data (multiple gzip blocks concatenated together). When the data is
certain to contain only one member, the :func:`zlib.decompress` function with
*wbits* set to ``31`` is faster.

.. versionadded:: 3.2

.. versionchanged:: 3.11
Speed is improved by decompressing members at once in memory instead of in
a streamed fashion.

.. _gzip-usage-examples:

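A minimal sketch of the equivalence documented above, assuming an interpreter that already includes this change; note that gzip.compress defaults to compresslevel 9 while zlib.compress defaults to level -1, so the level has to be passed explicitly for the outputs to match byte for byte:

import gzip
import zlib

data = b"example payload " * 1024

# With mtime=0 and matching levels, the two calls produce identical bytes.
assert gzip.compress(data, compresslevel=9, mtime=0) == zlib.compress(
    data, level=9, wbits=31
)

# gzip.decompress() also handles multi-member input (concatenated gzip
# streams); zlib.decompress(..., wbits=31) decodes only the first member.
two_members = gzip.compress(data, mtime=0) + gzip.compress(data, mtime=0)
assert gzip.decompress(two_members) == data + data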
46 changes: 28 additions & 18 deletions Doc/library/zlib.rst
@@ -47,19 +47,43 @@ The available exception and functions in this module are:
platforms, use ``adler32(data) & 0xffffffff``.


.. function:: compress(data, /, level=-1)
.. function:: compress(data, /, level=-1, wbits=MAX_WBITS)

Compresses the bytes in *data*, returning a bytes object containing compressed data.
*level* is an integer from ``0`` to ``9`` or ``-1`` controlling the level of compression;
``1`` (Z_BEST_SPEED) is fastest and produces the least compression, ``9`` (Z_BEST_COMPRESSION)
is slowest and produces the most. ``0`` (Z_NO_COMPRESSION) is no compression.
The default value is ``-1`` (Z_DEFAULT_COMPRESSION). Z_DEFAULT_COMPRESSION represents a default
compromise between speed and compression (currently equivalent to level 6).

.. _compress-wbits:

The *wbits* argument controls the size of the history buffer (or the
"window size") used when compressing data, and whether a header and
trailer is included in the output. It can take several ranges of values,
defaulting to ``15`` (MAX_WBITS):

* +9 to +15: The base-two logarithm of the window size, which
therefore ranges between 512 and 32768. Larger values produce
better compression at the expense of greater memory usage. The
resulting output will include a zlib-specific header and trailer.

* −9 to −15: Uses the absolute value of *wbits* as the
window size logarithm, while producing a raw output stream with no
header or trailing checksum.

* +25 to +31 = 16 + (9 to 15): Uses the low 4 bits of the value as the
window size logarithm, while including a basic :program:`gzip` header
and trailing checksum in the output.

Raises the :exc:`error` exception if any error occurs.

.. versionchanged:: 3.6
*level* can now be used as a keyword parameter.

.. versionchanged:: 3.11
The *wbits* parameter is now available to set window bits and
compression type.

.. function:: compressobj(level=-1, method=DEFLATED, wbits=MAX_WBITS, memLevel=DEF_MEM_LEVEL, strategy=Z_DEFAULT_STRATEGY[, zdict])

@@ -76,23 +100,9 @@ The available exception and functions in this module are:
*method* is the compression algorithm. Currently, the only supported value is
:const:`DEFLATED`.

The *wbits* argument controls the size of the history buffer (or the
"window size") used when compressing data, and whether a header and
trailer is included in the output. It can take several ranges of values,
defaulting to ``15`` (MAX_WBITS):

* +9 to +15: The base-two logarithm of the window size, which
therefore ranges between 512 and 32768. Larger values produce
better compression at the expense of greater memory usage. The
resulting output will include a zlib-specific header and trailer.

* −9 to −15: Uses the absolute value of *wbits* as the
window size logarithm, while producing a raw output stream with no
header or trailing checksum.

* +25 to +31 = 16 + (9 to 15): Uses the low 4 bits of the value as the
window size logarithm, while including a basic :program:`gzip` header
and trailing checksum in the output.
The *wbits* parameter controls the size of the history buffer (or the
"window size"), and what header and trailer format will be used. It has
the same meaning as `described for compress() <#compress-wbits>`__.

The *memLevel* argument controls the amount of memory used for the
internal compression state. Valid values range from ``1`` to ``9``.
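A short sketch of the *wbits* ranges described above. Only the *wbits* argument to zlib.compress is new here; zlib.decompress has long accepted it, and the exact byte counts in the last comment assume the standard zlib (6-byte) and gzip (18-byte) framing overheads:

import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 100

zlib_framed = zlib.compress(data, wbits=15)   # zlib container: header + Adler-32 trailer
raw_deflate = zlib.compress(data, wbits=-15)  # raw deflate: no header, no checksum
gzip_framed = zlib.compress(data, wbits=31)   # gzip container: header + CRC-32 trailer

# Decompression has to request the same container format.
assert zlib.decompress(zlib_framed, wbits=15) == data
assert zlib.decompress(raw_deflate, wbits=-15) == data
assert zlib.decompress(gzip_framed, wbits=31) == data

# Same deflate payload, different framing overhead (0, 6 and 18 bytes).
assert len(raw_deflate) < len(zlib_framed) < len(gzip_framed)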
161 changes: 108 additions & 53 deletions Lib/gzip.py
@@ -403,6 +403,59 @@ def __iter__(self):
return self._buffer.__iter__()


def _read_exact(fp, n):
'''Read exactly *n* bytes from `fp`

This function is required because fp may be unbuffered,
i.e. it may return short reads.
'''
data = fp.read(n)
while len(data) < n:
b = fp.read(n - len(data))
if not b:
raise EOFError("Compressed file ended before the "
"end-of-stream marker was reached")
data += b
return data


def _read_gzip_header(fp):
'''Read a gzip header from `fp` and progress to the end of the header.

Returns last mtime if header was present or None otherwise.
'''
magic = fp.read(2)
if magic == b'':
return None

if magic != b'\037\213':
raise BadGzipFile('Not a gzipped file (%r)' % magic)

(method, flag, last_mtime) = struct.unpack("<BBIxx", _read_exact(fp, 8))
if method != 8:
raise BadGzipFile('Unknown compression method')

if flag & FEXTRA:
# Read & discard the extra field, if present
extra_len, = struct.unpack("<H", _read_exact(fp, 2))
_read_exact(fp, extra_len)
if flag & FNAME:
# Read and discard a null-terminated string containing the filename
while True:
s = fp.read(1)
if not s or s==b'\000':
break
if flag & FCOMMENT:
# Read and discard a null-terminated string containing a comment
while True:
s = fp.read(1)
if not s or s==b'\000':
break
if flag & FHCRC:
_read_exact(fp, 2) # Read & discard the 16-bit header CRC
return last_mtime


class _GzipReader(_compression.DecompressReader):
def __init__(self, fp):
super().__init__(_PaddedFile(fp), zlib.decompressobj,
@@ -415,53 +468,11 @@ def _init_read(self):
self._crc = zlib.crc32(b"")
self._stream_size = 0 # Decompressed size of unconcatenated stream

def _read_exact(self, n):
'''Read exactly *n* bytes from `self._fp`

This method is required because self._fp may be unbuffered,
i.e. return short reads.
'''

data = self._fp.read(n)
while len(data) < n:
b = self._fp.read(n - len(data))
if not b:
raise EOFError("Compressed file ended before the "
"end-of-stream marker was reached")
data += b
return data

def _read_gzip_header(self):
magic = self._fp.read(2)
if magic == b'':
last_mtime = _read_gzip_header(self._fp)
if last_mtime is None:
return False

if magic != b'\037\213':
raise BadGzipFile('Not a gzipped file (%r)' % magic)

(method, flag,
self._last_mtime) = struct.unpack("<BBIxx", self._read_exact(8))
if method != 8:
raise BadGzipFile('Unknown compression method')

if flag & FEXTRA:
# Read & discard the extra field, if present
extra_len, = struct.unpack("<H", self._read_exact(2))
self._read_exact(extra_len)
if flag & FNAME:
# Read and discard a null-terminated string containing the filename
while True:
s = self._fp.read(1)
if not s or s==b'\000':
break
if flag & FCOMMENT:
# Read and discard a null-terminated string containing a comment
while True:
s = self._fp.read(1)
if not s or s==b'\000':
break
if flag & FHCRC:
self._read_exact(2) # Read & discard the 16-bit header CRC
self._last_mtime = last_mtime
return True

def read(self, size=-1):
@@ -524,7 +535,7 @@ def _read_eof(self):
# We check that the computed CRC and size of the
# uncompressed data matches the stored values. Note that the size
# stored is the true file size mod 2**32.
crc32, isize = struct.unpack("<II", self._read_exact(8))
crc32, isize = struct.unpack("<II", _read_exact(self._fp, 8))
if crc32 != self._crc:
raise BadGzipFile("CRC check failed %s != %s" % (hex(crc32),
hex(self._crc)))
@@ -544,21 +555,65 @@ def _rewind(self):
super()._rewind()
self._new_member = True


def _create_simple_gzip_header(compresslevel: int,
mtime = None) -> bytes:
"""
Write a simple gzip header with no extra fields.
:param compresslevel: Compresslevel used to determine the xfl bytes.
:param mtime: The mtime (must support conversion to a 32-bit integer).
:return: A bytes object representing the gzip header.
"""
if mtime is None:
mtime = time.time()
if compresslevel == _COMPRESS_LEVEL_BEST:
xfl = 2
elif compresslevel == _COMPRESS_LEVEL_FAST:
xfl = 4
else:
xfl = 0
# Pack ID1 and ID2 magic bytes, method (8=deflate), header flags (no extra
# fields added to header), mtime, xfl and os (255 for unknown OS).
return struct.pack("<BBBBLBB", 0x1f, 0x8b, 8, 0, int(mtime), xfl, 255)


def compress(data, compresslevel=_COMPRESS_LEVEL_BEST, *, mtime=None):
"""Compress data in one shot and return the compressed string.
Optional argument is the compression level, in range of 0-9.

compresslevel sets the compression level in range of 0-9.
mtime can be used to set the modification time. The modification time is
set to the current time by default.
"""
buf = io.BytesIO()
with GzipFile(fileobj=buf, mode='wb', compresslevel=compresslevel, mtime=mtime) as f:
f.write(data)
return buf.getvalue()
if mtime == 0:
# Use zlib as it creates the header with 0 mtime by default.
# This is faster and with less overhead.
return zlib.compress(data, level=compresslevel, wbits=31)
header = _create_simple_gzip_header(compresslevel, mtime)
trailer = struct.pack("<LL", zlib.crc32(data), (len(data) & 0xffffffff))
# wbits=-15 creates a raw deflate block (no header, no checksum).
return header + zlib.compress(data, level=compresslevel, wbits=-15) + trailer


def decompress(data):
"""Decompress a gzip compressed string in one shot.
Return the decompressed string.
"""
with GzipFile(fileobj=io.BytesIO(data)) as f:
return f.read()
decompressed_members = []
while True:
fp = io.BytesIO(data)
if _read_gzip_header(fp) is None:
return b"".join(decompressed_members)
# Use a zlib raw deflate decompressor
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
# Read all the data except the header
decompressed = do.decompress(data[fp.tell():])
crc, length = struct.unpack("<II", do.unused_data[:8])
if crc != zlib.crc32(decompressed):
raise BadGzipFile("CRC check failed")
if length != (len(decompressed) & 0xffffffff):
raise BadGzipFile("Incorrect length of data produced")
decompressed_members.append(decompressed)
data = do.unused_data[8:].lstrip(b"\x00")


def main():
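To make the new layout concrete, here is a hand-rolled gzip member built the same way the updated compress() assembles one for a non-zero *mtime*: a fixed 10-byte header, a raw deflate body, and a CRC-32/length trailer. This is only an illustration (the real code path goes through _create_simple_gzip_header()) and it assumes the new *wbits* parameter of zlib.compress is available:

import gzip
import struct
import zlib

data = b"hello gzip" * 500
mtime = 1630000000  # arbitrary fixed timestamp so the header is reproducible

# 10-byte header: magic, method 8 (deflate), no flags, mtime, xfl=2 (level 9), OS=255.
header = struct.pack("<BBBBLBB", 0x1f, 0x8b, 8, 0, mtime, 2, 255)
body = zlib.compress(data, level=9, wbits=-15)  # raw deflate, no framing
trailer = struct.pack("<LL", zlib.crc32(data), len(data) & 0xffffffff)

# The stdlib accepts the hand-built member like any other gzip stream.
assert gzip.decompress(header + body + trailer) == data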
7 changes: 7 additions & 0 deletions Lib/test/test_zlib.py
@@ -831,6 +831,13 @@ def test_wbits(self):
dco = zlib.decompressobj(32 + 15)
self.assertEqual(dco.decompress(gzip), HAMLET_SCENE)

for wbits in (-15, 15, 31):
with self.subTest(wbits=wbits):
expected = HAMLET_SCENE
actual = zlib.decompress(
zlib.compress(HAMLET_SCENE, wbits=wbits), wbits=wbits
)
self.assertEqual(expected, actual)

def choose_lines(source, number, seed=None, generator=random):
"""Return a list of number lines randomly chosen from the source"""
@@ -0,0 +1,5 @@
:func:`zlib.compress` now accepts a wbits parameter which allows users to
compress data as a raw deflate block without zlib headers and trailers in
one go. Previously this required instantiating a ``zlib.compressobj``. It
also provides a faster alternative to ``gzip.compress`` when wbits=31 is
used.
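A sketch of the contrast described in this entry, the old compressobj route versus the new one-shot call; both yield a raw deflate stream that decompresses back to the original data:

import zlib

data = b"x" * 10000

# Previously, a raw deflate stream required a compression object:
co = zlib.compressobj(wbits=-15)
old_way = co.compress(data) + co.flush()

# With the new parameter a single call is enough:
new_way = zlib.compress(data, wbits=-15)

assert zlib.decompress(old_way, wbits=-15) == data
assert zlib.decompress(new_way, wbits=-15) == data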
@@ -0,0 +1,3 @@
Improve the speed of :func:`gzip.compress` and :func:`gzip.decompress` by
compressing and decompressing at once in memory instead of in a streamed
fashion.
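For anyone wanting to check the speedup locally, a rough measurement could look like the following; results depend on the machine, the Python version, and the data, and no numbers are claimed here:

import gzip
import timeit

data = bytes(range(256)) * 4096
blob = gzip.compress(data)

print("compress:  ", timeit.timeit(lambda: gzip.compress(data), number=200))
print("decompress:", timeit.timeit(lambda: gzip.decompress(blob), number=200))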