Data compressed with compressor other than BloscLZ using 1.11.1 not decompressible with 1.7.0 #215
After this commit, data compressed with compressors other than BloscLZ is no longer decompressible with older versions of the library. Was this commit supposed to maintain forward compatibility?
Yes, I can reproduce this, and I confirm that this change was consciously introduced and that I tried hard to keep it backward compatible. Unfortunately, however, it did create a forward compatibility issue. As you reported, the BLOSC_LZ4, BLOSC_LZ4HC, BLOSC_ZLIB and BLOSC_ZSTD codecs are all affected. However, as Zstd was included only very recently (out of beta in C-Blosc 1.11.0), and Zlib is not supposed to be widely used inside Blosc (other codecs like LZ4HC probably do a much better job), I suspect that the main codecs affected in practice are LZ4/LZ4HC.

For the record, what I intended with this change was to avoid splitting the block for some codecs (both for better speed and better compression ratios). After extensive experimentation, I concluded that LZ4/LZ4HC, Zlib and Zstd were the codecs that benefited the most from not splitting. To keep backward compatibility, I started storing whether the block was split or not in bit 4 of the flags field of the header: a 0 means the block is split (the default pre-1.11.0), while a 1 means it isn't. As Blosc pre-1.11.0 always split blocks, it has no machinery for dealing with non-split blocks, so there is essentially no way to decompress Blosc post-1.11.0 buffers created with the aforementioned LZ4/LZ4HC, Zlib and Zstd codecs using a Blosc pre-1.11 library.

With that, I am afraid that the only solution for reading files built with Blosc post-1.11 (and with codecs other than BloscLZ or Snappy) on machines with C-Blosc pre-1.11 is to update the Blosc library on the latter. FWIW, for people using the shared library (the most common case, I'd say), the LD_PRELOAD trick would be a relatively easy way to do this.

Although I am afraid that introducing this sort of issue is somewhat unavoidable in software crafting, in the future I'll think twice before introducing forward-incompatible changes. Sorry for the inconvenience this might have created. Perhaps it would be a good time to introduce a database of buffers created with different versions of Blosc and make sure that old Blosc versions can read buffers from newer ones (and the other way around, of course).
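For readers who want to check whether a given compressed buffer is affected, here is a minimal sketch in C. It assumes the standard 16-byte Blosc header with the flags byte at offset 2 and the split status in bit 4, as described above; the helper name is made up:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: returns 1 if the buffer uses split blocks (readable by
   pre-1.11 Blosc), 0 if it uses the post-1.11 non-split encoding,
   and -1 if the buffer is too short to contain a Blosc header. */
static int is_pre_1_11_readable(const uint8_t* cbuffer, size_t len) {
  if (len < 16)                 /* a Blosc header is 16 bytes */
    return -1;
  uint8_t flags = cbuffer[2];   /* flags byte of the header */
  return (flags & 0x10) == 0;   /* bit 4 clear -> block was split */
}
```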
@FrancescAlted Thanks for the detailed and prompt response. I sort of suspected that was the case. Would it be possible to introduce an API like e.g.:

```c
enum blosc_split_mode {
  auto_split = 1,
  never_split = 2,
  always_split = 3
};

int blosc_set_split_mode(blosc_split_mode mode);
```

where `always_split` would restore the pre-1.11 behavior? It won't help us right now on Ubuntu 17.10, but if such an API is added relatively soon and a new release is made, then as we migrate to Ubuntu 18.04 later in spring, we'd have a chance of having it available there. In our case, we're using Blosc through an HDF5 plugin, which we could easily patch to use such an API.

For our current situation, I guess our only options are, as you say, to either update Blosc on the reader side or downgrade it on the writer side. Both of these mean we'll have to bundle our own Blosc, which is doable but a bit awkward.

Compatibility tests would definitely be a good idea. I can imagine many use cases where it's desirable (and easy) to update Blosc on the producing side (or, as in our case, where it's just a side effect of a system upgrade), but not so easy to update all consumers.
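If such an API landed, the HDF5 plugin patch could be as small as the following sketch (hypothetical: `blosc_set_split_mode` and `always_split` come from the proposal above, not from the existing Blosc API):

```c
#include <stddef.h>
#include <blosc.h>

/* Hypothetical patch to an HDF5 filter's compression path: force the
   pre-1.11 on-disk format before compressing. blosc_set_split_mode()
   and always_split are the proposed API above and do not exist in
   current Blosc releases. */
static int compat_compress(int clevel, int doshuffle, size_t typesize,
                           size_t nbytes, const void* src,
                           void* dest, size_t destsize) {
  blosc_set_split_mode(always_split);  /* always split, as pre-1.11 did */
  return blosc_compress(clevel, doshuffle, typesize, nbytes,
                        src, dest, destsize);
}
```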
(And for full disclosure: We're using …)
Another option would of course be to make the old behavior (always split) the default, to maintain forward compatibility with older versions, and have new, incompatible (but perhaps performance-enhancing) features be opt-in. This is the approach taken by e.g. HDF5; see their page on backward/forward compatibility.
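As an illustration of HDF5's opt-in model, with the standard HDF5 C API (the file name is made up): by default HDF5 writes files readable by old library versions, and newer, potentially forward-incompatible file features must be requested explicitly.

```c
#include <hdf5.h>

int main(void) {
  /* Newer, potentially forward-incompatible file features are opt-in:
     they are only used when the library-version bounds allow them. */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
  hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  H5Fclose(file);
  H5Pclose(fapl);
  return 0;
}
```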
Thanks for your comments. After thinking about this, I like your proposal of an API like the one you sketched above, with the default set to `always_split` so that forward compatibility is preserved out of the box. And I also agree that releasing a new Blosc library as soon as possible, so as to maximize the likelihood of it being included in the forthcoming Ubuntu, is important. I'll work on this as time allows.
After more consideration, I think that introducing a new mode called, say, `forward_compat`, and making it the default would be even better: it would keep splitting blocks for the codecs that existed before 1.11 (so buffers stay readable by older libraries), while allowing non-split blocks where forward compatibility is not at stake (e.g. for the newly introduced Zstd).
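A sketch of how such a mode could extend the enum proposed earlier (hypothetical values; only the `forward_compat` semantics are as just described):

```c
/* Hypothetical extension of the proposed API above. */
enum blosc_split_mode {
  auto_split = 1,      /* let Blosc decide (the 1.11+ behavior) */
  never_split = 2,     /* never split blocks */
  always_split = 3,    /* always split, as pre-1.11 did */
  forward_compat = 4   /* split unless forward compatibility is not
                          at stake for the codec (e.g. Zstd) */
};
```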
@FrancescAlted That sounds very reasonable: make the default as good performance-wise as possible, without sacrificing forward compatibility.

@FrancescAlted Great! I can do some ad-hoc tests on our data when I'm back at work tomorrow. We've always been happy with the performance we got from 1.7.0 though, so for our case, always splitting will be business as usual performance-wise. But I can try doing a comparison of the modes.
@FrancescAlted Didn't have much time today, but I did a quick'n'dirty comparison between splitting and not splitting. Our data is 32-bit float values with quite a lot of zeroes; the non-zero areas are fairly smooth, with values around 0.1-20.0. I measured compressing/decompressing a 50 MB chunk of in-memory data to/from a ramdisk, using `h5repack`.
@estan Thanks. When I say performance I normally include the compression ratio metric, not only the speed. In fact, I was expecting a bigger change in the former. Have you noticed a change in this?
@FrancescAlted Aha. The difference in size seems minuscule for this particular file: the compressed file sizes for the two modes are almost the same.
Great. It is curious, because splitting a block seems beneficial for your scenario (which is a bit surprising for LZ4HC), but in general that will largely depend on the specific dataset.
@FrancescAlted Yes, maybe it also has to do with the fact that HDF5 datasets are chunked? The compressor will only ever work with one HDF5 chunk at a time. In this particular test I used a 40x40x40 HDF5 chunk size (normally we let …). In any case, I can say for sure that we're happy with always splitting, especially since it allows us to maintain forward compatibility for already-deployed readers that are using 1.7.0.
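For context, the chunk shape is fixed per dataset when it is created, and any attached filter (such as the Blosc HDF5 plugin) compresses one chunk at a time. A sketch with the standard HDF5 C API; the dataset name and overall shape are made up, only the 40x40x40 chunk shape is from this test:

```c
#include <hdf5.h>

int main(void) {
  hsize_t dims[3] = {400, 400, 400};     /* assumed dataset shape */
  hsize_t chunk_dims[3] = {40, 40, 40};  /* chunk shape used in the test */
  hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hid_t space = H5Screate_simple(3, dims, NULL);
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  /* Each chunk of 32-bit floats is 40*40*40*4 = 256,000 bytes (~250 KB);
     the compressor never sees more than one chunk at once. */
  H5Pset_chunk(dcpl, 3, chunk_dims);
  hid_t dset = H5Dcreate2(file, "volume", H5T_NATIVE_FLOAT, space,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);
  H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
  return 0;
}
```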
Yeah, in the h5repack test you are using chunks of less than 256 KB (40x40x40 float32 values is 40*40*40*4 = 256,000 bytes, just under the threshold), and that means that Blosc is not splitting anything: for the LZ4HC codec at clevel 4, the minimum blocksize to start splitting is 256 KB. This explains why you are not seeing much difference between splitting and not splitting. A much better test would probably use a chunksize of 1 MB or so.
@FrancescAlted Aha, yes that explains it. Though the 40x40x40 is something I picked to be roughly the same size as …
For the record, I thought that sharing this issue in a more public way would be useful for the community, so I blogged about it: http://blosc.org/posts/new-forward-compat-policy/.
Merged #216 into master. Closing this.
@FrancescAlted I submitted an updated package to the Debian maintainer and he was kind enough to package it quickly on his own. I also filed an Ubuntu bug asking for a feature freeze exception and a sync of the Debian package, which they granted straight away. So 1.14 will be in Ubuntu 18.04 LTS. Many thanks for being so prompt with this issue!
Cool! Thanks for dealing with the distributions 👍
In migrating from Ubuntu 16.04 to Ubuntu 17.10, we've run into a snag with libblosc.
It seems that data compressed with libblosc 1.11.1 (the version in Ubuntu 17.10) is not decompressible using libblosc 1.7.0 (the version in Ubuntu 16.04) if using a compressor != BloscLZ.
Below is a test program (a modified version of `examples/simple.c`). To reproduce the problem (using the LZ4 compressor here as an example): compress some data into `test.blosc` on the Ubuntu 17.10 system, then try to decompress `test.blosc` on the Ubuntu 16.04 system; the decompression fails. The same can be seen when trying the `zlib` compressor. However (!): using the `"blosclz"` compressor when compressing the file, the decompression succeeds.

Info from the Ubuntu 16.04 system
Info from the Ubuntu 17.10 system
Test program (compile with `gcc -o simple simple.c -lblosc`):
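A minimal sketch in the spirit of the description above (based on `examples/simple.c`, compressing with LZ4 and writing/reading `test.blosc`; the exact structure and names are assumptions):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <blosc.h>

#define SIZE (100 * 100 * 100)

int main(int argc, char* argv[]) {
  static float data[SIZE];
  static float data_dest[SIZE];
  static char data_out[SIZE * sizeof(float) + BLOSC_MAX_OVERHEAD];
  size_t isize = SIZE * sizeof(float);
  int csize, dsize;
  FILE* f;

  for (int i = 0; i < SIZE; i++)
    data[i] = (float)i;

  blosc_init();

  if (argc > 1 && strcmp(argv[1], "decompress") == 0) {
    /* Read test.blosc and try to decompress it. On 1.7.0 this is the
       step that fails when the buffer was written by 1.11.1 with LZ4. */
    f = fopen("test.blosc", "rb");
    if (f == NULL) return EXIT_FAILURE;
    if (fread(data_out, 1, sizeof(data_out), f) == 0) {
      fclose(f);
      return EXIT_FAILURE;
    }
    fclose(f);
    dsize = blosc_decompress(data_out, data_dest, isize);
    if (dsize < 0) {
      printf("Decompression error. Error code: %d\n", dsize);
      return EXIT_FAILURE;
    }
    printf("Decompression successful!\n");
  } else {
    /* Compress with LZ4 instead of the default BloscLZ, and write the
       compressed buffer to test.blosc. */
    blosc_set_compressor("lz4");
    csize = blosc_compress(5, 1, sizeof(float), isize, data,
                           data_out, sizeof(data_out));
    if (csize <= 0) {
      printf("Compression error. Error code: %d\n", csize);
      return EXIT_FAILURE;
    }
    f = fopen("test.blosc", "wb");
    if (f == NULL) return EXIT_FAILURE;
    fwrite(data_out, 1, (size_t)csize, f);
    fclose(f);
    printf("Wrote test.blosc (%d bytes)\n", csize);
  }

  blosc_destroy();
  return EXIT_SUCCESS;
}
```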