-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not raise error if code equals range in get_bit #16
Conversation
It appears that this is a valid state and that some files can have data encoded like this. An example of such a file has been added to the tests. Fixes gendx#15
Thanks for your contribution! One thing I wanted to do but didn't have the time yet was to have tests comparing the implementation to established libraries such as https://crates.io/crates/lzma-sys or https://crates.io/crates/xz2 (backed by a reference C implementation). That is, the I don't know how hard it would be to decompress a raw LZMA stream with these crates, as opposed to an xz file. If you manage to implement something like that, it would be very helpful in testing the correctness of such changes! Otherwise, I'll get back to the details when I have more time. But if you have a link to another LZMA implementation where the |
Ah, yes. Sorry, I intended to link to the relevant code in XZ Utils, but forgot to do so. I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here: It is used from here: As you can see, there is no test here, but a few additional operations after subtracting the range from the code. I am not sure what they correspond to in lzma_rs. I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well. We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways. By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? 😅 There appears to be a quite a few resources out there on LZMA, but the details between them appear to vary quite a lot. It was just interesting to see how different your implementation was to that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...) |
Thanks, this file will be helpful in double-checking how the range coder works in detail :)
On this topic, turns out that I already wrote a dumb xz encoder https://github.com/gendx/lzma-rs/blob/master/src/encode/xz.rs, that currently packages an LZMA2 stream which itself encapsulates raw uncompressed data. But it shouldn't be too hard to turn that into a packaging tool that takes an already encoded LZMA stream and packages it as an xz archive, which would allow easier comparison with other libraries.
Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.
I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples. Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code? I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;) |
tests/files/README.md
Outdated
This is a file that causes the code and range to be equal at some point during decoding LZMA data. | ||
Previously, this file would raise an `LZMAError("Corrupted range coding")`, | ||
although the file is a valid LZMA file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you document how you generated this file? From the looks of it, it's not really "random" in the sense of head -c N /dev/urandom
. Or is the file more structured than that (you mentioned the OpenCTM format, could you add a link to what OpenCTM is)?
My concerns are that:
- The file is quite large, which will impact the size of the repository.
- Given the file size, I also want to make sure that the license for the file's contents is compatible with the repository's license (MIT), and that it doesn't contain any sensitive/personal/private data.
Otherwise I'm wondering how many small random files would be necessary to reproduce the code == range
case. Or whether the dumb encoder could be used to generate this case with a much smaller file (e.g. with an all-zeros or all-ones file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to hear that you are making sure it adheres to the license because I actually put down a bit of work to make it sure it does.
It is not entirely random, but not so far from it. It comes from this beauty of a Blender scene I created using a mix of the Array and Build modifier and a lot of duplication.
It has subsequently been optimized and packed into multiple OpenCTM files. After that, I manually extracted the LZMA-encoded part (the vertices) of a bad file and added the unpacked size to the binary, which is missing in OpenCTM (see #17).
The reason the file was created was exactly because I wanted to reproduce the problem in a file that did not contain sensitive or private data. The problem was originally discovered in other files that I do not have the liberty to share freely ;)
So the licensing is no problem - we are happy to share the attached file under the MIT license or any other license of your choosing.
I am not sure if I am able to reproduce the issue in a smaller file. I tried writing some "random" data with a similar structure (a lot of zeros at the start of the file), but was unable to reproduce the issue.
That would be great! Performance is in fact very important to us. I have actually been looking into optimizing lzma_rs after profiling a bit. If I recall correctly, it is about 3x-6x slower than xz when running natively and about as fast as lzma-js when running in the browser. I think what I found was that there was a fair bit of allocation and deallocation going on when we were parsing a large number of files/chunks. I was therefore thinking about restructuring the code so the decoder could be re-used instead of set up and torn down for each file/chunk. It might be that this is not so important if the performance impact is actually coming from the log-statements. I actually did not know that those were not optimized out in a release build.
Cool! I will check that out!
Not that I know of. That is partially why I asked, in case you knew of some resourced I had not found ;)
That would be awesome! I was thinking about doing something similar: To write down what I learned along the way while trying to understand more of LZMA. However, I do not think my understanding is anywhere near a level where I can confidently explain how it all works :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. I checked against https://github.com/xz-mirror/xz/tree/master/src/liblzma/lzma and https://github.com/jljusten/LZMA-SDK, which both have the behavior you describe.
More tests can be added later once #20 is implemented, and once someone figures out how to unit test this.
In the meantime, there's no need to block this pull request, I just have a comment on the phrasing.
bors r+ |
Merge conflict |
bors r+ |
16: Do not raise error if code equals range in get_bit r=gendx a=dragly It appears that code being equal to range is a valid state and that some files can have data encoded like this. An example of such a file has been added to the tests. ### Pull Request Overview This pull request fixes #15 ### Testing Strategy This pull request was tested by... - [ ] Added relevant unit tests. - [x] Added relevant end-to-end tests (such as `.lzma`, `.lzma2`, `.xz` files). ### Supporting Documentation and References The original data was produced by processing a randomly generated geometry with OpenCTM. OpenCTM files can contain embedded LZMA compressed data. This appears to be generated with the liblzma library. The data was then extracted to make a standalone file that is attached in this pull request. ### TODO or Help Wanted None Co-authored-by: Svenn-Arne Dragly <dragly@cognite.com> Co-authored-by: gendx <gendx@users.noreply.github.com> Co-authored-by: Svenn-Arne Dragly <s@dragly.com>
Build succeeded |
It appears that code being equal to range is a valid state and that some files can have
data encoded like this. An example of such a file has been added to the
tests.
Pull Request Overview
This pull request fixes #15
Testing Strategy
This pull request was tested by...
.lzma
,.lzma2
,.xz
files).Supporting Documentation and References
The original data was produced by processing a randomly generated geometry with OpenCTM. OpenCTM files can contain embedded LZMA compressed data. This appears to be generated with the liblzma library. The data was then extracted to make a standalone file that is attached in this pull request.
TODO or Help Wanted
None