The compression code in CPython's zlibmodule.c drops the GIL when an inflate or deflate call is invoked. This means that it is, in theory, possible to compress files with multiple threads. This is not implemented in CPython's gzip.py module yet, but python-zlib-ng is an interesting project to try this out in. Once it works there, it can be backported to CPython and python-isal. (Threaded level 1 in ISA-L should allow for compressed writing speeds of multiple gigabytes per second 😄.)
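For illustration, a minimal sketch (not an existing gzip.py API, and the chunk data is made up) of why this matters: because the GIL is released inside the compress call, a plain thread pool already gives parallel speedup when compressing independent chunks.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(chunk: bytes) -> bytes:
    # Each chunk is compressed independently here; a real gzip writer would
    # still need to stitch the resulting streams together (see below).
    return zlib.compress(chunk, 1)  # level 1, analogous to ISA-L's fastest level

data_chunks = [b"some data" * 100_000 for _ in range(8)]  # hypothetical input
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(compress_chunk, data_chunks))
```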
The gzip format consists of a header and a trailer with one or more DEFLATE blocks in between. Each DEFLATE block starts with a 3-bit header, and one of those bits marks the last DEFLATE block in the stream. I should look at the pigz implementation to see how this is handled.
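As a rough sketch of the pigz-style trick (the helper name `deflate_member` is made up here, and this is an assumption about the approach, not a description of pigz's exact code): each worker emits a raw DEFLATE stream (`wbits=-15`, so no zlib/gzip framing) that is sync-flushed instead of finished. A sync flush ends on a byte boundary without setting the last-block bit, so the pieces can be concatenated into one valid DEFLATE stream; only the final piece is finished normally, which sets the bit.

```python
import zlib

def deflate_member(chunk: bytes, is_last: bool) -> bytes:
    cobj = zlib.compressobj(level=1, wbits=-zlib.MAX_WBITS)  # raw DEFLATE, no header/trailer
    data = cobj.compress(chunk)
    if is_last:
        data += cobj.flush()                     # Z_FINISH: emits a block with the last-block bit set
    else:
        data += cobj.flush(zlib.Z_SYNC_FLUSH)    # byte-aligns without marking the stream as finished
    return data
```

The gzip header would still have to be written up front, and a running `zlib.crc32` over the uncompressed data plus the total length would go into the trailer.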
As for the Python implementation (a sketch follows after this list):

- Once the buffer is full or flushed, the crc32 and the length should be updated.
- The bytes object should be pushed onto a queue together with an index (to keep the order correct).
- Worker threads should take the bytes object from the queue, compress it, and push the resulting DEFLATE block onto another queue together with the index.
- A writer thread should take the compressed blocks from that queue, using the index to write them incrementally and in order to the output file.
- The first queue should be FIFO, ensuring that the next output block to be written is always worked on. It should have a maximum size and block at the put step to prevent memory overflow. The second queue should also have a maximum size, and its get step should release the blocks in order. Its put step should be rigged so that it accepts a block whenever its index is the next index required by get. This way we prevent the pipeline from stalling when the next block takes a long time to be processed while the blocks after it have already filled up the queue.
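A rough sketch of that thread and queue layout, under the assumptions above (names such as `OrderedResultQueue`, `worker`, and `write_threaded` are illustrative, not existing APIs; the compression step is a placeholder):

```python
import threading
import zlib
from queue import Queue

class OrderedResultQueue:
    """Bounded, index-ordered hand-off between compression workers and the writer."""

    def __init__(self, maxsize: int):
        self._maxsize = maxsize
        self._items = {}   # index -> compressed block
        self._next = 0     # index the writer needs next
        self._cond = threading.Condition()

    def put(self, index: int, data: bytes) -> None:
        with self._cond:
            # Block while full, unless this is exactly the block get() is waiting for.
            while len(self._items) >= self._maxsize and index != self._next:
                self._cond.wait()
            self._items[index] = data
            self._cond.notify_all()

    def get(self) -> bytes:
        with self._cond:
            while self._next not in self._items:
                self._cond.wait()
            data = self._items.pop(self._next)
            self._next += 1
            self._cond.notify_all()
            return data

def worker(in_q: Queue, out_q: OrderedResultQueue) -> None:
    while True:
        item = in_q.get()
        if item is None:   # sentinel: reader is done
            break
        index, chunk = item
        # Placeholder compression step; a real implementation would emit raw
        # DEFLATE blocks as in the earlier sketch and handle the last-block bit.
        out_q.put(index, zlib.compress(chunk, 1))

def write_threaded(chunks, fileobj, threads: int = 4) -> None:
    in_q = Queue(maxsize=2 * threads)            # bounded FIFO: put() blocks when full
    out_q = OrderedResultQueue(maxsize=2 * threads)
    pool = [threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(threads)]
    for t in pool:
        t.start()

    def writer(n_chunks: int) -> None:
        for _ in range(n_chunks):
            fileobj.write(out_q.get())           # always released in index order
        # A real gzip writer would also keep a running zlib.crc32 and total
        # length here and write the gzip trailer at the end.

    chunks = list(chunks)
    writer_thread = threading.Thread(target=writer, args=(len(chunks),))
    writer_thread.start()
    for i, chunk in enumerate(chunks):
        in_q.put((i, chunk))                     # blocks when the FIFO is full
    for _ in pool:
        in_q.put(None)
    for t in pool:
        t.join()
    writer_thread.join()
```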
I think this will be an interesting alternative for xopen compared to piping the data through external applications. Ping @marcelm.