
Data compressed with compressor other than BloscLZ using 1.11.1 not decompressible with 1.7.0 #215

Closed
estan opened this issue Feb 14, 2018 · 22 comments


estan commented Feb 14, 2018

In migrating from Ubuntu 16.04 to Ubuntu 17.10, we've run into a snag with libblosc.

It seems that data compressed with libblosc 1.11.1 (the version in Ubuntu 17.10) is not decompressible with libblosc 1.7.0 (the version in Ubuntu 16.04) when using a compressor other than BloscLZ.

Below is a test program (modified version of examples/simple.c).

To reproduce the problem (using the LZ4 compressor here as an example):

  1. On Ubuntu 17.10 (libblosc 1.11.1), create a compressed file test.blosc:
estan@balder-101:~$ ./simple compress lz4 test.blosc
Blosc version info: 1.11.1 ($Date:: 2016-09-03 #$)
Compression: 4000000 -> 142070 (28.2x)
Wrote test.blosc
  2. Copy the file to an Ubuntu 16.04 (libblosc 1.7.0) machine and try to decompress it:
[estan@newton ~]$ ./simple decompress lz4 test.blosc 
Blosc version info: 1.7.0 ($Date:: 2015-07-05 #$)
Read test.blosc
Decompression error.  Error code: -1
[estan@newton ~]$

The same error occurs with the zlib compressor. However (!): if the file is compressed with the "blosclz" compressor, decompression succeeds.

Info from the Ubuntu 16.04 system

[estan@newton ~]$ dpkg -l | egrep "libblosc|liblz4"
ii  libblosc-dev                                    1.7.0-1                                                  amd64        High performance meta-compressor optimized for binary data (development files)
ii  libblosc1                                       1.7.0-1                                                  amd64        High performance meta-compressor optimized for binary data
ii  liblz4-1:amd64                                  0.0~r131-2ubuntu2                                        amd64        Fast LZ compression algorithm library - runtime
ii  liblz4-dev:amd64                                0.0~r131-2ubuntu2                                        amd64        Fast LZ compression algorithm library - development files
[estan@newton ~]$

Info from the Ubuntu 17.10 system

estan@balder-101:~$ dpkg -l | egrep "libblosc|liblz4"
ii  libblosc-dev                                  1.11.1+ds2-2                             amd64        high performance meta-compressor optimized for binary data (development files)
ii  libblosc1                                     1.11.1+ds2-2                             amd64        high performance meta-compressor optimized for binary data
ii  liblz4-1:amd64                                0.0~r131-2ubuntu2                        amd64        Fast LZ compression algorithm library - runtime
ii  liblz4-tool                                   0.0~r131-2ubuntu2                        amd64        Fast LZ compression algorithm library - tool

Test Program (compile with gcc -o simple simple.c -lblosc)

#include <stdio.h>
#include <blosc.h>
#include <string.h>

#define SIZE 100*100*100

int main(int argc, char *argv[]){
  static float data[SIZE];
  static float data_out[SIZE];
  static float data_dest[SIZE];
  int isize = SIZE*sizeof(float), osize = SIZE*sizeof(float);
  int dsize = SIZE*sizeof(float), csize;
  int i;

  FILE *f;

  if (argc != 4) {
    printf("Usage: %s [compress|decompress] <codec> <file>\n", argv[0]);
    return 1;
  }

  for(i=0; i<SIZE; i++){
    data[i] = i;
  }

  /* Print info about the Blosc library in use */
  printf("Blosc version info: %s (%s)\n", BLOSC_VERSION_STRING, BLOSC_VERSION_DATE);

  /* Initialize the Blosc compressor */
  blosc_init();

  /* Use the argv[2] compressor. The supported ones are "blosclz",
  "lz4", "lz4hc", "snappy", "zlib" and "zstd"*/
  blosc_set_compressor(argv[2]);

  if (strcmp(argv[1], "compress") == 0) {
    /* Compress with clevel=4 and shuffle active  */
    csize = blosc_compress(4, 1, sizeof(float), isize, data, data_out, osize);
    if (csize == 0) {
      printf("Buffer is uncompressible.  Giving up.\n");
      return 1;
    }
    else if (csize < 0) {
      printf("Compression error.  Error code: %d\n", csize);
      return csize;
    }

    printf("Compression: %d -> %d (%.1fx)\n", isize, csize, (1.*isize) / csize);

    /* Write the whole data_out buffer (compressed data plus trailing padding) to argv[3] */
    f = fopen(argv[3], "wb+");
    if (f == NULL) {
      printf("Could not open %s for writing\n", argv[3]);
      return 1;
    }
    if (fwrite(data_out, sizeof(float), SIZE, f) == SIZE) {
      printf("Wrote %s\n", argv[3]);
    } else {
      printf("Write failed\n");
    }
    fclose(f);
  } else {
    /* Read from argv[3] into data_out. */
    f = fopen(argv[3], "rb");
    if (f == NULL) {
      printf("Could not open %s for reading\n", argv[3]);
      return 1;
    }
    if (fread(data_out, sizeof(float), SIZE, f) == SIZE) {
      printf("Read %s\n", argv[3]);
    } else {
      printf("Read failed\n");
    }
    fclose(f);

    /* Decompress  */
    dsize = blosc_decompress(data_out, data_dest, dsize);
    if (dsize < 0) {
      printf("Decompression error.  Error code: %d\n", dsize);
      return dsize;
    }

    printf("Decompression succesful!\n");
  }

  /* After using it, destroy the Blosc environment */
  blosc_destroy();

  return 0;
}

ghost commented Feb 14, 2018

git bisect showed that the offending commit is 9d06255.

After this commit, data compressed with BLOSC_LZ4HC, BLOSC_ZLIB or BLOSC_ZSTD is no longer decompressible using 1.7.0.

Was this commit supposed to maintain forward compatibility?


estan commented Feb 14, 2018

@FrancescAlted

@FrancescAlted

Yes, I can reproduce this, and I confirm that this change was consciously introduced and that I tried hard to keep it backward compatible. However, and unfortunately, it did create a forward compatibility issue.

As you reported, the BLOSC_LZ4, BLOSC_LZ4HC, BLOSC_ZLIB and BLOSC_ZSTD codecs are all affected. However, as Zstd was included only very recently (out of beta in C-Blosc 1.11.0), and Zlib is not supposed to be widely used inside Blosc (other codecs like LZ4HC probably do a much better job), I suspect that the main codecs affected in practice are LZ4/LZ4HC.

For the record, what I intended with this change was to avoid splitting the block for some codecs (both for better speed and better compression ratios). After extensive experimentation, I concluded that LZ4/LZ4HC, Zlib and Zstd were the codecs that benefited the most from not splitting. To keep backward compatibility, I started to store whether the block was split or not in bit 4 of the flags field in the header: a 0 means the block is split (the default pre-1.11.0), while a 1 means it isn't. As Blosc pre-1.11.0 always split blocks, it has no machinery for dealing with non-split blocks, so there is essentially no way to decompress Blosc post-1.11.0 buffers created with the aforementioned LZ4/LZ4HC, Zlib and Zstd codecs using a Blosc pre-1.11 library.
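
For illustration, here is a minimal sketch of how the flag described above can be inspected. It assumes the standard 16-byte Blosc header with the flags field at byte offset 2 (per the Blosc header documentation); the helper function is hypothetical, not part of the Blosc API:

#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the Blosc buffer's block is stored split (readable by
 * pre-1.11 libraries), 0 if it is non-split (post-1.11 only), and -1
 * if the buffer is too short to contain a full 16-byte Blosc header.
 * Assumes the flags field sits at byte offset 2 of the header. */
int blosc_buffer_is_split(const uint8_t *buffer, size_t len)
{
  if (len < 16)
    return -1;                        /* no complete Blosc header */
  return (buffer[2] & 0x10) ? 0 : 1;  /* bit 4 set => block not split */
}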

With that, I am afraid that the only solution for reading files built with Blosc post-1.11 (and with codecs other than BloscLZ or Snappy) on machines with C-Blosc pre-1.11 is to update the Blosc library on the latter. FWIW, for people using the shared library (the most common case, I'd say), the LD_PRELOAD trick would be a relatively easy way to fix this.

Although I am afraid that introducing this sort of issue is somewhat unavoidable in software crafting, in the future I'll think twice before introducing forward-incompatible changes. Sorry for the inconvenience this might have created. Perhaps it would be a good time to introduce a database of buffers created with different versions of Blosc and make sure that old Blosc versions can read buffers from new versions (and the other way around, of course).


estan commented Feb 14, 2018

@FrancescAlted Thanks for the detailed and prompt response. I sort of suspected that was the case.

Would it be possible to introduce an API like e.g.

enum blosc_split_mode {
    auto_split = 1,
    never_split = 2,
    always_split = 3
};

int blosc_set_split_mode(blosc_split_mode mode);

where auto_split would be the current behavior and would be made the default. That way, producers that wish to upgrade to 1.11+ but want to maintain forward compatibility with < 1.11 can use always_split.

It won't help us right now on Ubuntu 17.10, but if such an API is added relatively soon and a new release is made, then as we migrate to Ubuntu 18.04 later this spring, we'd have a chance of having it available there. In our case, we're using Blosc through an HDF5 plugin, which we could easily patch to use such an API.

For our current situation, I guess our only option is as you say: either update Blosc on the reader side, or downgrade it on the writer side. Both of these mean we'll have to bundle our own Blosc, which is doable but a bit awkward.

Compatibility tests would definitely be a good idea. I can imagine many use cases where it's desirable (and easy) to update Blosc on the producing side (or as in our case just a side-effect of a system upgrade), but not so easy to update all consumers.


estan commented Feb 14, 2018

(And for full disclosure: We're using LZ4HC at level 4, since benchmarking showed this to be the best option for our use case).


estan commented Feb 14, 2018

Another option would of course be to make the old behavior (always split) the default, to maintain forward compatibility with older versions, and have new incompatible (but perhaps performance-enhancing) features be opt-in. This is the approach taken by e.g. HDF5. Excerpt from their page on backward/forward compatibility:

That is, files are written with the earliest version of the file format that
describes the information, rather than always using the latest version
possible. This provides the best forward compatibility by allowing the
maximum number of older versions of the library to read new files.

If library features are used that require new file format features, or if
the application requests that the library write out only the latest version
of the file format, the files produced with a newer version of the HDF5
Library may not be readable by older versions of the HDF5 Library. 

@FrancescAlted

Thanks for your comments. After thinking about this, I like your proposal of an API like:

enum blosc_split_mode {
    auto_split = 1,
    never_split = 2,
    always_split = 3
};

int blosc_set_split_mode(blosc_split_mode mode);

and set the default to always_split to allow maximum forward compatibility. The user will need to adjust the split mode to get generally better compression ratios (and probably better speed too), but that is less of a problem than creating, by default, buffers that cannot be decoded with older versions of Blosc. Using an environment variable (e.g. BLOSC_SPLIT_MODE=['auto', 'never', 'always']) to set this would probably be interesting too.

And I also agree that releasing a new Blosc library as soon as possible, so as to maximize the likelihood of it being included in the forthcoming Ubuntu, is important. I'll work on this as time allows.


FrancescAlted commented Feb 15, 2018

After more consideration, I think that introducing a new mode called, say, forward_compat_split would be interesting: it would avoid the no-split behavior for LZ4/LZ4HC and Zlib, while Zstd would continue not to split. The rationale for this is that Zstd was introduced almost at the same time as the split flag, so this should not have any practical effect on forward compatibility. This new mode should be the default.


estan commented Feb 15, 2018

@FrancescAlted That sounds very reasonable: make the default as good performance-wise as possible, without sacrificing forward compatibility.

@FrancescAlted

PR #216 introduces a new blosc_set_splitmode() function. @estan, could you give it a try and tell me whether it alleviates your problem at hand? Also, does it reduce performance for your case?
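
For producers that must stay readable by pre-1.11 consumers, usage could look roughly like this. This is a sketch, not a definitive recipe: the BLOSC_ALWAYS_SPLIT constant name is assumed from the split-mode naming discussed above, so check the blosc.h shipped with the PR:

#include <blosc.h>
#include <stddef.h>

/* Compress nbytes of float data into dest, forcing split blocks so
 * that pre-1.11 readers can decompress the result even with LZ4HC.
 * BLOSC_ALWAYS_SPLIT is the assumed constant name; see blosc.h. */
int compress_forward_compatible(const float *src, size_t nbytes,
                                void *dest, size_t destsize)
{
  blosc_init();
  blosc_set_compressor("lz4hc");            /* the codec used in this thread */
  blosc_set_splitmode(BLOSC_ALWAYS_SPLIT);  /* always split, as pre-1.11 did */
  int csize = blosc_compress(4, 1, sizeof(float), nbytes, src, dest, destsize);
  blosc_destroy();
  return csize;  /* compressed size; 0 means incompressible, <0 is an error */
}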


estan commented Feb 15, 2018

@FrancescAlted Great! I can do some ad-hoc tests on our data when I'm back at work tomorrow. We've always been happy with the performance we got from 1.7.0 though, so for our case, always splitting will be business as usual performance-wise. But I can try doing a comparison of the modes.


estan commented Feb 16, 2018

@FrancescAlted Didn't have much time today, but I did a quick'n'dirty comparison between BLOSC_SPLITMODE=NEVER and BLOSC_SPLITMODE=ALWAYS for compression/decompression of our data. There was no significant difference between the two modes.

Our data is 32 bit float values, quite a lot of zeroes. The non-zero areas are fairly smooth, with values around 0.1-20.0. The numbers below are from compressing/decompressing a 50 MB chunk of in-memory data to/from a ramdisk, using LZ4HC at level 4.

             NEVER                   ALWAYS
Compress     0.041 s +/- 0.0025 s    0.040 s +/- 0.0021 s
Decompress   0.021 s +/- 0.0015 s    0.022 s +/- 0.0019 s
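
For reference, a rough sketch of how such an A/B comparison can be driven from C. It assumes, as the transcript above shows, that the library honors the BLOSC_SPLITMODE environment variable; the buffer size and codec settings mirror the test described above, and the synthetic data is only a stand-in for the real dataset:

#include <blosc.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NBYTES (50 * 1024 * 1024)  /* ~50 MB buffer, as in the test above */

static double now_sec(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
  size_t n = NBYTES / sizeof(float), i;
  float *src = malloc(NBYTES), *dst = malloc(NBYTES);
  void *comp = malloc(NBYTES + BLOSC_MAX_OVERHEAD);
  const char *modes[] = {"NEVER", "ALWAYS"};
  int m;

  for (i = 0; i < n; i++)
    src[i] = (i % 1000) * 0.02f;  /* smooth synthetic stand-in data */

  for (m = 0; m < 2; m++) {
    setenv("BLOSC_SPLITMODE", modes[m], 1);  /* read by the Blosc library */
    blosc_init();
    blosc_set_compressor("lz4hc");
    double t0 = now_sec();
    int csize = blosc_compress(4, 1, sizeof(float), NBYTES, src, comp,
                               NBYTES + BLOSC_MAX_OVERHEAD);
    double t1 = now_sec();
    if (csize <= 0) {
      printf("%s: compression failed (%d)\n", modes[m], csize);
      blosc_destroy();
      continue;
    }
    blosc_decompress(comp, dst, NBYTES);
    double t2 = now_sec();
    printf("%s: %d bytes, compress %.3f s, decompress %.3f s\n",
           modes[m], csize, t1 - t0, t2 - t1);
    blosc_destroy();
  }
  free(src); free(dst); free(comp);
  return 0;
}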

@FrancescAlted

@estan Thanks. When I say performance, I normally include the compression ratio metric, not only speed. In fact, I was expecting a bigger change in the former. Have you noticed a change in this?


estan commented Feb 16, 2018

@FrancescAlted Aha. The difference in size seems minuscule for this particular file. Here are the compressed file sizes for NEVER and ALWAYS:

-rw-rw-r-- 1 estan estan 28409877 feb 16 14:03 rec-0660mm-0680mm.hdf5.always
-rw-rw-r-- 1 estan estan 28603051 feb 16 14:02 rec-0660mm-0680mm.hdf5.never

@FrancescAlted

Great. It is curious, because splitting a block seems beneficial for your scenario (which is a bit surprising for LZ4HC), but in general that will largely depend on the specific dataset.


estan commented Feb 16, 2018

@FrancescAlted Yes, maybe it also has to do with the fact that HDF5 datasets are chunked? The compressor will only ever work with one HDF5 chunk at a time. In this particular test I used a 40x40x40 HDF5 chunk size (normally we let h5py decide upon an appropriate chunk size for us, but this time I used h5repack to compress the HDF5 file, so I chose it manually). Not sure how HDF5's chunking affects the compression ratio when using Blosc.

In any case, I can say for sure that we're happy with always doing splitting, especially since it allows us to maintain forward compatibility of already deployed readers that are using 1.7.0.

@FrancescAlted

Yeah, in the h5repack test you are using a chunk of less than 256 KB (a 40x40x40 chunk of 4-byte floats is 40^3 x 4 = 256,000 bytes, i.e. ~250 KiB), and that means that Blosc is not splitting anything (for the LZ4HC codec at clevel 4, the minimum blocksize to start splitting is 256 KB). This explains why you are not seeing much difference between splitting and not splitting. A much better test would be to use a chunksize of 1 MB or so.


estan commented Feb 16, 2018

@FrancescAlted Aha, yes, that explains it. Though 40x40x40 is something I picked to be roughly the same size as h5py would auto-pick, which is what we use in production. We don't want too large chunk sizes, as that would mean more data having to be read/decompressed than strictly necessary when selecting hyperslabs from the datasets. But we've never done any thorough testing of different chunk sizes for our use cases. I guess we should do this at some point, but time is precious :)

@FrancescAlted

For the record, I thought that sharing this issue in a more public way would be useful for the community, so I blogged about this: http://blosc.org/posts/new-forward-compat-policy/.

@FrancescAlted

Merged #216 into master. Closing this.


estan commented Mar 13, 2018

@FrancescAlted I submitted an updated package to the Debian maintainer and he was kind enough to package it quickly on his own. I also filed an Ubuntu bug asking for a feature freeze exception and a sync of the Debian package, which they granted straight away. So 1.14 will be in Ubuntu 18.04 LTS.

Many thanks for being so prompt with this issue!

@FrancescAlted

Cool! Thanks for dealing with distributions 👍
