append to brotli compressed file #628

Open
worenga opened this issue Dec 6, 2017 · 10 comments

@worenga

worenga commented Dec 6, 2017

Is it possible to append data to a brotli compressed file such that it is compressed automatically?

The following works for gzip, for example:

$ echo 'foo' | gzip >! test.gz
$ echo 'foo' | gzip >> test.gz
$ zcat test.gz
foo
foo

The same approach fails with brotli, which reports:

corrupt input
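
For reference, the gzip behaviour described above can be reproduced without a shell. This is a minimal sketch using only Python's standard `gzip` module; it shows that two independently compressed gzip members decode as one stream once concatenated. It says nothing about brotli, where the same concatenation is rejected.

```python
import gzip

# Compress two payloads independently, as the two shell commands do.
first = gzip.compress(b"foo\n")
second = gzip.compress(b"foo\n")

# Appending one gzip member to another yields a valid multi-member file.
combined = first + second

# gzip.decompress reads every member and joins the decoded data.
restored = gzip.decompress(combined)
print(restored)
```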

@eustas
Collaborator

eustas commented Dec 6, 2017

No. Brotli is a bare data stream format, so we decided to make it an error if anything follows the end of a stream.

The good news is that we are working on a framing format that will (likely) provide this ability... and much more =) (with some overhead being paid, of course).

@eustas eustas closed this as completed Dec 6, 2017
@danielrh

It might be possible to structure a Brotli stream to have such a property with a few small tweaks to the compressor:

  • First, make sure the last metablock is empty, so that it is easy to identify and drop (it is the final two 1 bits, followed by zeros)

  • Then emit an ISUNCOMPRESSED metablock of size two containing the first two bytes of the data stream, so that your priors are chosen correctly

  • Then set the compressor to ignore matches in the dictionary

With these restrictions, you should be able to concatenate the streams after dropping the last 2 nonzero bits, right?

You may need a bit more magic, such as inserting an MNIBBLES=0 block to byte-align the metablocks at the end, if you don't want to concatenate at the bit level.
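
The "drop the last 2 nonzero bits" step can be sketched as plain bit manipulation. This is an illustration only, not code from brotli or rust-brotli: the function name `drop_final_empty_metablock` is hypothetical, and it assumes the stream really does end with an empty last metablock (ISLAST=1, ISLASTEMPTY=1) followed by zero padding, with bits packed least-significant bit first. It does not parse the stream, so it cannot confirm that assumption.

```python
from typing import Tuple


def drop_final_empty_metablock(stream: bytes) -> Tuple[bytes, int]:
    """Clear the trailing ISLAST/ISLASTEMPTY bits (hypothetical helper).

    Returns the remaining bytes and the number of meaningful bits in them.
    Assumes the stream ends with an empty last metablock plus zero padding.
    """
    if not stream or stream[-1] == 0:
        raise ValueError("stream does not end with a set bit")

    # The highest set bit of the final byte is assumed to be ISLASTEMPTY.
    empty_pos = (len(stream) - 1) * 8 + stream[-1].bit_length() - 1
    # ISLAST is the bit just before it; it may sit in the previous byte.
    last_pos = empty_pos - 1
    if last_pos < 0:
        raise ValueError("stream is too short")

    buf = bytearray(stream)
    if not (buf[last_pos // 8] >> (last_pos % 8)) & 1:
        raise ValueError("ISLAST bit is not set; not an empty last metablock")

    # Clear both bits so another stream's metablocks could follow here.
    buf[empty_pos // 8] &= ~(1 << (empty_pos % 8)) & 0xFF
    buf[last_pos // 8] &= ~(1 << (last_pos % 8)) & 0xFF

    nbits = last_pos  # number of bits that remain meaningful
    return bytes(buf[: (nbits + 7) // 8]), nbits


# 0x06 is bits 0,1,1 (LSB first): a one-bit window header, ISLAST, ISLASTEMPTY.
print(drop_final_empty_metablock(b"\x06"))
```

The returned bit count matters because the remaining data generally does not end on a byte boundary, which is why the comment above suggests a byte-aligning MNIBBLES=0 block for anyone who wants to concatenate whole bytes.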

@danielrh

I solved this issue in the drop-in brotli library at https://github.com/dropbox/rust-brotli by implementing the three ideas above and by disabling the recent items in the distance map.
Programmatic usage can be seen here: https://github.com/dropbox/rust-brotli/blob/master/c/catbrotli.c
The brotli.c file included in this package was modified slightly to pass in the CATABLE option: https://github.com/dropbox/rust-brotli/blob/master/c/brotli.c#L412

Hope that helps!

@eustas
Collaborator

eustas commented Oct 30, 2018

The problem with "catable" brotli is that it is impossible to prove that a given file is "catable" without fully decompressing it.
Internally we have an encoder that allows parallel encoding, i.e. encoders can produce streams that can be appended one after another, and we are going to publish it soon. Of course, that is a different story, but it does not ignore matches in the dictionary, which might produce a denser stream.

BTW, thank you, Daniel for developing rust-brotli, that is awesome project!

@danielrh

Yes, it's true that it is impossible to verify without decompressing. That is one of the main reasons I added a header magic number as the first metadata metablock: the header records whether the file was designed to be concatenated. Of course that is advisory; you would still need to decompress to fully verify. It is still likely faster than compressing the concatenated chunk. Perhaps the default mode should refuse to concatenate a file if the header is missing. It already has some heuristics to look for 'concatability', which rule out files generated with default brotli-like tools.

And you are correct: rust-brotli uses this catable flag internally to produce files with multiple threads. I think that for internally created files it could allow dictionary usage from any of the threads, since we prepend the previous parts of the file to the ring buffer, so naturally it should look farther back for the dictionary, but I haven't tried that mode of operation.

It doesn't seem to me that splitting a file N ways often results in an Nx improvement: certain parts of the file appear to require significantly more CPU time than others. I haven't profiled the compression much yet.
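
The split-and-concatenate approach to multithreaded compression can be sketched with a format that already allows concatenation. The example below uses Python's standard `gzip` module as a stand-in, not brotli or rust-brotli, and the function name `compress_in_parts` is made up for illustration. Unlike the mode described above, each chunk here is compressed with no knowledge of the preceding data.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def compress_in_parts(data: bytes, parts: int) -> bytes:
    """Compress `data` as `parts` independent gzip members and join them."""
    # Ceiling division so that at most `parts` chunks are produced.
    size = max(1, -(-len(data) // parts))
    chunks = [data[i:i + size] for i in range(0, len(data), size)] or [b""]

    # Each chunk is compressed on its own worker thread; map keeps the order.
    with ThreadPoolExecutor(max_workers=parts) as pool:
        members = list(pool.map(gzip.compress, chunks))

    # Concatenated members form one valid multi-member gzip stream.
    return b"".join(members)


payload = b"some repetitive payload " * 10_000
joined = compress_in_parts(payload, 4)
print(len(payload), len(joined))
```

Timing each `gzip.compress` call separately would be one way to check the observation that some parts of a file cost far more CPU time than others, which limits the speedup from an even split.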

@eustas eustas reopened this Aug 27, 2020
@rifler

rifler commented Jan 13, 2021

> The good news is that we are working on a framing format that will (likely) provide this ability... and much more =) (with some overhead being paid, of course).

Hi, can you please provide issues/plans, where we can read about this?

@s-sols

s-sols commented Mar 23, 2021

> No. Brotli is a bare data stream format, so we decided to make it an error if anything follows the end of a stream.
>
> The good news is that we are working on a framing format that will (likely) provide this ability... and much more =) (with some overhead being paid, of course).

There is a need to combine several precompressed chunks. These chunks might be precompressed in a special format that allows concatenating them in any order. The use case is serving precached content from a web server.

Have any related features been implemented in the library?

Thanks.

@danielrh

@s-sols: I have created a binary-compatible brotli library, with a new option flag to create concatenable files, at
https://github.com/dropbox/rust-brotli by implementing the three ideas above and by disabling the recent items in the distance map.
Programmatic usage can be seen here: https://github.com/dropbox/rust-brotli/blob/master/c/catbrotli.c
The brotli.c file included in this package was modified slightly to pass in the CATABLE option: https://github.com/dropbox/rust-brotli/blob/master/c/brotli.c#L412
If you don't have access to a Rust compiler, C code can be generated with the mrustc package. Let me know if you have issues.

@dgtlmoon

dgtlmoon commented Nov 2, 2022

@danielrh hey super cool work! I got the c libraries to compile without problem

question about python and appending to an existing brotli "stream" https://github.com/dropbox/rust-brotli/blob/master/c/py/brotli_test.py

how would that work?

Are you able to add a Python test case for that there?

Not quite sure where that existing compressed brotli would get passed to BrotliCompress

        output = BrotliCompress(self.test_data,
                                {
                                    BROTLI_PARAM_QUALITY:5,
                                    BROTLI_PARAM_CATABLE:1,
                                    BROTLI_PARAM_MAGIC_NUMBER:1,
                                },
                                1)
        rt = BrotliDecode(output, 2)

btw https://lib.rs/crates/brotli-ffi

@eustas
Collaborator

eustas commented Jun 20, 2023

Hmm. Reconsidering. While a brotli stream does not like "tails", we can overcome that in the CLI. We will see whether we can have it in v1.1.
