Threaded GzipNGFile implementation #9

Closed
rhpvorderman opened this issue Feb 6, 2023 · 0 comments · Fixed by #27

Comments

@rhpvorderman
Contributor

The compression code in CPython's zlibmodule.c drops the GIL when an inflate or deflate call is invoked. This means that, in theory, files can be compressed using multiple threads. This is not yet implemented in CPython's gzip.py module. However, python-zlib-ng is an interesting project to try this out in. Once it works, it can be backported to CPython and python-isal. (Threaded level 1 in ISA-L should allow compressed writing speeds of multiple GB/s 😄.)
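A minimal sketch of what the GIL release makes possible, using only the standard library `zlib` module (python-zlib-ng is not required for the idea). The helper name `compress_chunks_threaded` and its parameters are hypothetical, chosen for illustration only:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks_threaded(chunks, level=1, threads=4):
    """Compress each chunk independently in a thread pool.

    zlib releases the GIL while compressing, so the worker threads
    can run on several cores at once. Results keep the input order.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda chunk: zlib.compress(chunk, level), chunks))
```

Each result here is a complete, separate zlib stream; this only shows that the compression calls can overlap, not how to produce a single gzip file.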

The gzip format consists of a header and a trailer with one or more DEFLATE blocks in between. Each DEFLATE block starts with a 3-bit header: one bit (BFINAL) marks the last block in the stream, and the other two (BTYPE) select the block type. I should look at the pigz implementation and see how this is handled.
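One possible way to deal with the last-block bit, sketched with the standard `zlib` module: compress each chunk as raw DEFLATE and end it with `Z_SYNC_FLUSH`, which byte-aligns the output without setting BFINAL, then append a single empty final block. This is an assumption about how the pieces could be joined, not necessarily what pigz does; the function names are hypothetical.

```python
import gzip
import struct
import zlib


def deflate_chunk(data, level=1):
    """Raw DEFLATE (wbits=-15) for one chunk, byte-aligned, BFINAL not set."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush(zlib.Z_SYNC_FLUSH)


def gzip_from_chunks(chunks, level=1):
    """Build one gzip member from independently compressed chunks."""
    # 10-byte header: magic, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255.
    header = b"\x1f\x8b\x08\x00" + struct.pack("<I", 0) + b"\x00\xff"
    crc = 0
    length = 0
    body = []
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
        length += len(chunk)
        body.append(deflate_chunk(chunk, level))
    # An empty raw stream finished with Z_FINISH is just one block with BFINAL set.
    final_block = zlib.compressobj(level, zlib.DEFLATED, -15).flush(zlib.Z_FINISH)
    # Trailer: CRC32 and uncompressed size modulo 2**32, both little-endian.
    trailer = struct.pack("<II", crc, length & 0xFFFFFFFF)
    return header + b"".join(body) + final_block + trailer
```

Because every chunk starts with a fresh compressor, no chunk refers back to data from an earlier one, which is what allows the chunks to be compressed in any order. The cost is a slightly worse compression ratio at chunk boundaries.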

As for the Python implementation:

  • Buffering should be implemented first: Buffered writing for GzipFile #8
  • Once the buffer is full or flushed, the CRC32 and the length should be updated
  • The bytes object should be pushed into a queue with an index (to keep the order correct)
  • Worker threads should take the bytes object from a queue, compress it and push the deflate block onto another queue together with the index.
  • A writer thread should take the compressed blocks from the queue ensuring it uses the index to incrementally write to an output file.
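The steps above could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the eventual implementation: the function name `compress_pipeline` and the `write` callback are hypothetical, the output queue is left unbounded for brevity, and the writer reorders blocks with a plain dict. It emits raw DEFLATE data only; a gzip header, a final block, and the trailer would still have to be added around it.

```python
import queue
import threading
import zlib


def compress_pipeline(chunks, write, threads=2, level=1):
    """Compress chunks in worker threads; call write() on the results in order.

    Returns the CRC32 and total length of the uncompressed input.
    """
    todo = queue.Queue(maxsize=2 * threads)  # bounded: put() blocks when full
    done = queue.Queue()                     # unbounded in this sketch

    def worker():
        while True:
            item = todo.get()
            if item is None:  # sentinel: no more work
                break
            index, data = item
            compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
            block = compressor.compress(data) + compressor.flush(zlib.Z_SYNC_FLUSH)
            done.put((index, block))

    def writer():
        pending = {}
        next_index = 0
        while True:
            item = done.get()
            if item is None:  # sentinel: all workers have finished
                break
            pending[item[0]] = item[1]
            # Write every block that is now contiguous with what was written.
            while next_index in pending:
                write(pending.pop(next_index))
                next_index += 1

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    writer_thread = threading.Thread(target=writer)
    for thread in workers:
        thread.start()
    writer_thread.start()

    crc, length = 0, 0
    for index, data in enumerate(chunks):
        crc = zlib.crc32(data, crc)  # checksum is updated on the calling thread
        length += len(data)
        todo.put((index, data))
    for _ in workers:
        todo.put(None)
    for thread in workers:
        thread.join()
    done.put(None)
    writer_thread.join()
    return crc, length
```

The checksum and length are computed on the submitting thread, in input order, so they do not depend on the order in which workers finish.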

The first queue should be FIFO, ensuring that the next output block to be written is always being worked on. It should have a maximum size and block at the put step to prevent unbounded memory use. The second queue should also have a maximum size, and its get step should release the blocks in index order. Its put step should be rigged so that it always accepts a block whose index is the next one required by get, even when the queue is full. This prevents a deadlock when the next block takes a long time to process and the blocks after it have filled up the queue.
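A minimal sketch of such a second queue, built on `threading.Condition`. The class name `OrderedQueue` and its interface are hypothetical; the only point is the "rigged" put, which lets the next required index through even when the queue is at its maximum size.

```python
import threading


class OrderedQueue:
    """Bounded queue whose get() returns items in index order (0, 1, 2, ...)."""

    def __init__(self, maxsize):
        self._maxsize = maxsize
        self._items = {}
        self._next = 0  # index that get() will return next
        self._cond = threading.Condition()

    def put(self, index, item):
        with self._cond:
            # Block while full, unless this is exactly the item get() needs.
            while len(self._items) >= self._maxsize and index != self._next:
                self._cond.wait()
            self._items[index] = item
            self._cond.notify_all()

    def get(self):
        with self._cond:
            # Wait until the next index in sequence has arrived.
            while self._next not in self._items:
                self._cond.wait()
            item = self._items.pop(self._next)
            self._next += 1
            # Wake blocked producers: space was freed and the next index changed.
            self._cond.notify_all()
            return item
```

With `maxsize=2`, putting indices 1 and 2 fills the queue, yet `put(0, ...)` still succeeds immediately, so a slow block 0 cannot be locked out by faster later blocks. The queue can therefore briefly hold one item more than `maxsize`.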

I think this will be an interesting alternative for xopen compared to piping data into external applications. Ping @marcelm.
