The compression code in CPython's zlibmodule.c drops the GIL when an inflate or deflate call is invoked. This means that it is, in theory, possible to compress files with multiple threads. This is not implemented in CPython's gzip.py module yet, but python-zlib-ng is an interesting project to try this out in. Once it works there, it can be backported to CPython and python-isal. (Threaded level 1 in ISA-L should allow for compressed writing speeds of multiple gigabytes per second 😄.)
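For illustration, a minimal sketch (not an existing gzip.py API, and the chunk data is made up) of why this matters: because the GIL is released inside the compress call, a plain thread pool already gives parallel speedup when compressing independent chunks.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(chunk: bytes) -> bytes:
    # Each chunk is compressed independently here; a real gzip writer would
    # still need to stitch the resulting streams together (see below).
    return zlib.compress(chunk, 1)  # level 1, analogous to ISA-L's fastest level

data_chunks = [b"some data" * 100_000 for _ in range(8)]  # hypothetical input
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(compress_chunk, data_chunks))
```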
The gzip format consists of a header and a trailer with one or more DEFLATE blocks in between. Each DEFLATE block starts with a 3-bit header, and one of those bits marks the last DEFLATE block in the stream. I should look at the pigz implementation to see how this is handled.
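As a rough sketch of the pigz-style trick (the helper name `deflate_member` is made up here, and this is an assumption about the approach, not a description of pigz's exact code): each worker emits a raw DEFLATE stream (`wbits=-15`, so no zlib/gzip framing) that is sync-flushed instead of finished. A sync flush ends on a byte boundary without setting the last-block bit, so the pieces can be concatenated into one valid DEFLATE stream; only the final piece is finished normally, which sets the bit.

```python
import zlib

def deflate_member(chunk: bytes, is_last: bool) -> bytes:
    cobj = zlib.compressobj(level=1, wbits=-zlib.MAX_WBITS)  # raw DEFLATE, no header/trailer
    data = cobj.compress(chunk)
    if is_last:
        data += cobj.flush()                     # Z_FINISH: emits a block with the last-block bit set
    else:
        data += cobj.flush(zlib.Z_SYNC_FLUSH)    # byte-aligns without marking the stream as finished
    return data
```

The gzip header would still have to be written up front, and a running `zlib.crc32` over the uncompressed data plus the total length would go into the trailer.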
As for the Python implementation (a sketch follows after this list):

- Once the buffer is full or flushed, the crc32 and the length should be updated.
- The bytes object should be pushed onto a queue together with an index (to keep the order correct).
- Worker threads should take the bytes object from the queue, compress it, and push the resulting DEFLATE block onto another queue together with the index.
- A writer thread should take the compressed blocks from that queue, using the index to write them incrementally and in order to the output file.
- The first queue should be FIFO, ensuring that the next output block to be written is always worked on. It should have a maximum size and block at the put step to prevent memory overflow. The second queue should also have a maximum size, and its get step should release the blocks in order. Its put step should be rigged so that it accepts a block whenever its index is the next index required by get. This way we prevent the pipeline from stalling when the next block takes a long time to be processed while the blocks after it have already filled up the queue.
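A rough sketch of that thread and queue layout, under the assumptions above (names such as `OrderedResultQueue`, `worker`, and `write_threaded` are illustrative, not existing APIs; the compression step is a placeholder):

```python
import threading
import zlib
from queue import Queue

class OrderedResultQueue:
    """Bounded, index-ordered hand-off between compression workers and the writer."""

    def __init__(self, maxsize: int):
        self._maxsize = maxsize
        self._items = {}   # index -> compressed block
        self._next = 0     # index the writer needs next
        self._cond = threading.Condition()

    def put(self, index: int, data: bytes) -> None:
        with self._cond:
            # Block while full, unless this is exactly the block get() is waiting for.
            while len(self._items) >= self._maxsize and index != self._next:
                self._cond.wait()
            self._items[index] = data
            self._cond.notify_all()

    def get(self) -> bytes:
        with self._cond:
            while self._next not in self._items:
                self._cond.wait()
            data = self._items.pop(self._next)
            self._next += 1
            self._cond.notify_all()
            return data

def worker(in_q: Queue, out_q: OrderedResultQueue) -> None:
    while True:
        item = in_q.get()
        if item is None:   # sentinel: reader is done
            break
        index, chunk = item
        # Placeholder compression step; a real implementation would emit raw
        # DEFLATE blocks as in the earlier sketch and handle the last-block bit.
        out_q.put(index, zlib.compress(chunk, 1))

def write_threaded(chunks, fileobj, threads: int = 4) -> None:
    in_q = Queue(maxsize=2 * threads)            # bounded FIFO: put() blocks when full
    out_q = OrderedResultQueue(maxsize=2 * threads)
    pool = [threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(threads)]
    for t in pool:
        t.start()

    def writer(n_chunks: int) -> None:
        for _ in range(n_chunks):
            fileobj.write(out_q.get())           # always released in index order
        # A real gzip writer would also keep a running zlib.crc32 and total
        # length here and write the gzip trailer at the end.

    chunks = list(chunks)
    writer_thread = threading.Thread(target=writer, args=(len(chunks),))
    writer_thread.start()
    for i, chunk in enumerate(chunks):
        in_q.put((i, chunk))                     # blocks when the FIFO is full
    for _ in pool:
        in_q.put(None)
    for t in pool:
        t.join()
    writer_thread.join()
```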
I think this will be an interesting alternative for xopen compared to piping the data through external applications. Ping @marcelm.