Add support for specifying the unpacked size outside of header #17

dragly · 2019-12-09T20:44:09Z

Some LZMA streams, such as those used in OpenCTM, are encoded without
the unpacked size specified in the header. Some C implementations of
LZMA can read such streams when given a header size and the unpacked
size as an option to the decoder.

This change adds the same possibility to lzma_rs in a typesafe manner,
where the unpacked size and whether it should be written to the header
are specified explicitly.

Pull Request Overview

This pull request adds Options objects to two new public modules named compress and decompress that can be used to specify whether the unpacked size should be written to and read from the LZMA header.
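As a rough illustration of the decoder side of this idea, here is a minimal sketch. The type, variant, and function names are assumptions made for the example (the naming was still open for bikeshedding in this PR), not the lzma_rs API. It relies only on the `.lzma` convention that the optional size field is 8 little-endian bytes and that all bits set means "unknown".

```rust
/// Where the decoder should take the unpacked size from.
/// NOTE: names are illustrative assumptions, not the lzma_rs API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum UnpackedSize {
    /// An 8-byte size field follows the props and dict size in the header.
    ReadFromHeader,
    /// The header has no size field; use this value instead
    /// (`None` means the stream must end with an end-of-stream marker).
    UseProvided(Option<u64>),
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Options {
    unpacked_size: UnpackedSize,
}

/// Given the bytes that follow the dict size, return the unpacked size
/// (if known) and how many header bytes were consumed to obtain it.
/// Returns `None` if a size field is expected but the input is too short.
fn resolve_unpacked_size(after_dict_size: &[u8], options: Options) -> Option<(Option<u64>, usize)> {
    match options.unpacked_size {
        UnpackedSize::ReadFromHeader => {
            let mut field = [0u8; 8];
            field.copy_from_slice(after_dict_size.get(..8)?);
            let raw = u64::from_le_bytes(field);
            // All bits set means "size unknown, rely on the end marker".
            let size = if raw == u64::MAX { None } else { Some(raw) };
            Some((size, 8))
        }
        UnpackedSize::UseProvided(size) => Some((size, 0)),
    }
}

fn main() {
    // Standard .lzma layout: the size is stored in the header.
    let field = 1000u64.to_le_bytes();
    let from_header = Options { unpacked_size: UnpackedSize::ReadFromHeader };
    println!("{:?}", resolve_unpacked_size(&field, from_header));

    // OpenCTM-style layout: no size field, the caller supplies the size.
    let provided = Options { unpacked_size: UnpackedSize::UseProvided(Some(1000)) };
    println!("{:?}", resolve_unpacked_size(&[], provided));
}
```

Encoding the choice as an enum means a caller cannot both ask for the header field to be read and silently override it, which is the "typesafe" property the description refers to.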

Testing Strategy

This pull request was tested by...

- [x] Added relevant unit tests.
- [ ] Added relevant end-to-end tests (such as .lzma, .lzma2, .xz files).

Supporting Documentation and References

This exotic use of the LZMA header is only indicated in the OpenCTM specification itself, where it specifically states that the offset of the stream is only 9 bytes, while it should have been 17 bytes with the unpacked size in place. This is based on 4 bytes from OpenCTM itself, 1 byte for the props, 4 bytes for the dict size and the missing 8 bytes for the unpacked size.

It can also be seen from the OpenCTM source code that the LZMA header written is only 5 bytes:

https://github.com/Danny02/OpenCTM/blob/243a343bd23bbeef8731f06ed91e3996604e1af4/lib/stream.c#L311
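The byte accounting above can be sketched as follows. This is only an illustration: the constant and function names are hypothetical, and the parsing assumes the usual LZMA encoding of the properties byte, `props = (pb * 5 + lp) * 9 + lc`, followed by a little-endian 32-bit dictionary size.

```rust
// Hypothetical names; sizes are taken from the description above.
const OPENCTM_PREFIX_LEN: usize = 4; // bytes written by OpenCTM itself
const PROPS_LEN: usize = 1;
const DICT_SIZE_LEN: usize = 4;
const UNPACKED_SIZE_LEN: usize = 8; // the field OpenCTM leaves out

/// Offset of the compressed stream, with or without the size field.
fn stream_offset(has_unpacked_size: bool) -> usize {
    let size_field = if has_unpacked_size { UNPACKED_SIZE_LEN } else { 0 };
    OPENCTM_PREFIX_LEN + PROPS_LEN + DICT_SIZE_LEN + size_field
}

/// Fields of the short, 5-byte LZMA header.
#[derive(Debug, PartialEq)]
struct ShortHeader {
    lc: u32,
    lp: u32,
    pb: u32,
    dict_size: u32,
}

/// Parse the props byte and the dictionary size from the first 5 bytes.
fn parse_short_header(bytes: &[u8]) -> Option<ShortHeader> {
    if bytes.len() < PROPS_LEN + DICT_SIZE_LEN {
        return None;
    }
    let props = u32::from(bytes[0]);
    if props >= 9 * 5 * 5 {
        return None; // not a valid lc/lp/pb combination
    }
    let dict_size = u32::from_le_bytes([bytes[1], bytes[2], bytes[3], bytes[4]]);
    Some(ShortHeader {
        lc: props % 9,
        lp: (props / 9) % 5,
        pb: props / 45,
        dict_size,
    })
}

fn main() {
    println!("without size field: {}", stream_offset(false));
    println!("with size field:    {}", stream_offset(true));
    // 0x5D encodes lc = 3, lp = 0, pb = 2; the dict size here is 8 MiB.
    println!("{:?}", parse_short_header(&[0x5D, 0x00, 0x00, 0x80, 0x00]));
}
```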

TODO or Help Wanted

This pull request still needs a bit of bikeshedding on the names of the modules and enums used in the options :)

Some LZMA streams, such as those used in OpenCTM, are encoded without the unpacked size specified in the header. This is possible to read in some of the C implementations of LZMA by specifying a header size and providing the unpacked size as an option to the decoder. This change adds the same possibility to lzma_rs in a typesafe manner, where the unpacked size and whether it should be written to the header is specified explicitly.

gendx

Overall looks great :) Just a few comments.

src/decode/options.rs

src/encode/options.rs

src/lib.rs

tests/lzma.rs

dragly · 2019-12-11T08:58:04Z

Thanks for the comments!

After writing the tests, I realized that the end-of-stream marker is always written. However, the XZ documentation says the marker is only supposed to be written if the unpacked size is not provided in the header.

We should probably not write the marker if the size is provided and written in the header. Is removing the bits encoded in Encoder::finish the way to achieve this?

lzma-rs/src/encode/dumbencoder.rs

Lines 71 to 97 in 4d46835

    
    // Write end-of-stream marker
    let pos_state = input_len & 3;
    // Match
    self.rangecoder
        .encode_bit(&mut self.is_match[pos_state], true)?;
    // New distance
    self.rangecoder.encode_bit(&mut 0x400, false)?;
    // Dummy len, as small as possible (len = 0)
    for _ in 0..4 {
        self.rangecoder.encode_bit(&mut 0x400, false)?;
    }
    // Distance marker = 0xFFFFFFFF
    // pos_slot = 63
    for _ in 0..6 {
        self.rangecoder.encode_bit(&mut 0x400, true)?;
    }
    // num_direct_bits = 30
    // result = 3 << 30 = C000_0000
    //        + 3FFF_FFF0  (26 bits)
    //        + F          ( 4 bits)
    for _ in 0..30 {
        self.rangecoder.encode_bit(&mut 0x400, true)?;
    }
    //        = FFFF_FFFF

Not sure exactly what should be done if the unpacked size is known but not stored in the header, as in the case of OpenCTM. I will see if I can figure out what other libraries do in that case.

gendx · 2019-12-11T22:28:21Z

As the name suggests, the dumbencoder is currently quite dumb. But I guess removing that part of Encoder::finish would work. You can also test the result against unlzma on the command line to see if that works.


dragly · 2019-12-13T23:45:21Z

Seems like it works fine with unlzma, so I removed writing the end marker if the unpacked size is written to the header.

I also realized that a couple of the tests did not really make sense after this and removed those.
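The rule adopted here (only write the end marker when the unpacked size is not in the header) can be sketched on the encoder side as below. The enum and function names are assumptions for illustration, not necessarily what the PR merged, and the sketch assumes the marker is kept in every case where no concrete size is written. The bit count is tallied from the dumbencoder excerpt quoted earlier in this thread.

```rust
/// How the encoder treats the unpacked size.
/// NOTE: names are illustrative assumptions, not the lzma_rs API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum UnpackedSize {
    /// Write the 8-byte field; `None` writes the "unknown" value.
    WriteToHeader(Option<u64>),
    /// Emit only props + dict size (the 5-byte header OpenCTM uses).
    SkipWritingToHeader,
}

/// Range-coded bits in the end marker, counted from the quoted excerpt:
/// 1 (match) + 1 (new distance) + 4 (len) + 6 (pos_slot) + 30 (direct bits).
const END_MARKER_BITS: u32 = 1 + 1 + 4 + 6 + 30;

/// Bytes of the optional size field, if one is written at all.
fn size_field(opt: UnpackedSize) -> Option<[u8; 8]> {
    match opt {
        UnpackedSize::WriteToHeader(Some(n)) => Some(n.to_le_bytes()),
        UnpackedSize::WriteToHeader(None) => Some([0xFF; 8]),
        UnpackedSize::SkipWritingToHeader => None,
    }
}

/// The marker is skipped only when a concrete size is in the header.
fn needs_end_marker(opt: UnpackedSize) -> bool {
    match opt {
        UnpackedSize::WriteToHeader(Some(_)) => false,
        UnpackedSize::WriteToHeader(None) | UnpackedSize::SkipWritingToHeader => true,
    }
}

fn main() {
    for opt in [
        UnpackedSize::WriteToHeader(Some(1000)),
        UnpackedSize::WriteToHeader(None),
        UnpackedSize::SkipWritingToHeader,
    ] {
        println!(
            "{:?}: size field = {:?}, end marker = {}",
            opt,
            size_field(opt),
            needs_end_marker(opt)
        );
    }
    println!("end marker costs {} encoded bits", END_MARKER_BITS);
}
```

Whether the marker should also be dropped when the size is known to the caller but not stored in the header (the OpenCTM case) is the question left open in the earlier comment; this sketch keeps the marker there.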


gendx · 2019-12-16T21:56:15Z

bors r+

17: Add support for specifying the unpacked size outside of header r=gendx a=dragly

Some LZMA streams, such as those used in OpenCTM, are encoded without the unpacked size specified in the header. This is possible to read in some of the C implementations of LZMA by specifying a header size and providing the unpacked size as an option to the decoder.

This change adds the same possibility to lzma_rs in a typesafe manner, where the unpacked size and whether it should be written to the header is specified explicitly.

### Pull Request Overview

This pull request adds `Options` objects to two new public modules named `compress` and `decompress` that can be used to specify whether the unpacked size should be written to and read from the LZMA header.

### Testing Strategy

This pull request was tested by...

- [x] Added relevant unit tests.
- [ ] Added relevant end-to-end tests (such as `.lzma`, `.lzma2`, `.xz` files).

### Supporting Documentation and References

This exotic use of the LZMA header is only indicated in the [OpenCTM specification itself](http://openctm.sourceforge.net/media/FormatSpecification.pdf), where it specifically states that the offset of the stream [is only 9 bytes](https://github.com/Danny02/OpenCTM/blob/243a343bd23bbeef8731f06ed91e3996604e1af4/doc/FormatSpecification.tex#L91), while it should have been 17 bytes with the unpacked size in place. This is based on 4 bytes from OpenCTM itself, 1 byte for the props, 4 bytes for the dict size and the missing 8 bytes for the unpacked size.

It can also be seen from the OpenCTM source code that the LZMA header written is only 5 bytes:

https://github.com/Danny02/OpenCTM/blob/243a343bd23bbeef8731f06ed91e3996604e1af4/lib/stream.c#L311

### TODO or Help Wanted

This pull request still needs a bit of bikeshedding on the names of the modules and enums used in the options :)

Co-authored-by: Svenn-Arne Dragly <s@dragly.com>
Co-authored-by: Svenn-Arne Dragly <dragly@cognite.com>
Co-authored-by: G. Endignoux <ggendx@gmail.com>

bors · 2019-12-16T22:10:34Z

Build succeeded

continuous-integration/travis-ci/push

dragly mentioned this pull request Dec 10, 2019

Do not raise error if code equals range in get_bit #16

Merged


gendx requested changes Dec 10, 2019


Refactor and update according to suggestions

1e9784c

dragly added 2 commits December 13, 2019 22:45

Only write end marker if unpacked size not present in header

223ce43

Also, make the Options derive Clone and Debug.

Skip tests that no longer makes sense

f9e38c2

These tests do not make sense because they provided a value (None) that indicated an end marker to be expected in cases where there should be none.

dragly and others added 3 commits December 14, 2019 00:47

Merge remote-tracking branch 'origin/master' into dragly/add-support-…

85cdd78

…for-custom-unpacked-size

Cargo format

69b0491

Make options Copy.

cb55801

gendx approved these changes Dec 16, 2019


bors bot merged commit cb55801 into gendx:master Dec 16, 2019

This was referenced Dec 16, 2019

Support legacy LZMA format with unpacked_size 32bit long #13

Open

LZMAError("Expected unpacked size of 149198 but decompressed to 483334")' #11

Open
