
numpy savez_compressed much smaller filesizes for small arrays #116

Open · lopsided opened this issue Jan 18, 2022 · 5 comments

@lopsided
I have a few million images to save to disk and have been trying a few options out. I thought blosc/bloscpack would be well suited, but I'm getting far larger file sizes than with the standard numpy savez_compressed.

My images are size (3,200,200) and dtype=float32. Typical file sizes I'm getting are:

  • np.savez ~ 470k
  • np.savez_compressed ~ 53k
  • blosc.pack_array ~ 200k
  • blosc.compress_ptr ~ 200k
  • bloscpack.pack_ndarray_to_file ~ 200-400k

For a sample of 370 images this gives:

67M      ./blosc_packarray
67M      ./blosc_pointer
121M     ./bp
19M      ./npz
172M     ./uncompressed

For the blosc_* methods I'm writing the packed bytes like this:

with open(dest, 'wb') as f:
    f.write(packed)

Is there anything I'm missing or is numpy's compression just as good as it gets for small images like these?
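
For reference, a minimal sketch of how I'm writing each variant (the name variable, the directory layout and the .bl extension are just placeholders from my pipeline; the blosc call is on default settings):

import os
import blosc
import numpy as np

# images: a (3, 200, 200) float32 array for one sample, name is the sample id
np.savez(os.path.join('uncompressed', name), images)        # ~470k per file
np.savez_compressed(os.path.join('npz', name), images)      # ~53k per file

packed = blosc.pack_array(images)                            # ~200k per file
with open(os.path.join('blosc_packarray', name + '.bl'), 'wb') as f:
    f.write(packed)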

@esc (Member) commented Jan 18, 2022

@lopsided thank you for asking about this. What settings are you using for Blosc and bloscpack? You may need to use a higher compression level (like 9) and/or change the internal compression algorithm. I think it could be worth a shot.
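
For example (an untested sketch; 'zstd' and bit-shuffling are just two of the options to explore):

import blosc

# try a stronger codec and bit-level shuffling instead of the defaults
packed = blosc.pack_array(
    images,
    clevel=9,                  # maximum compression level
    shuffle=blosc.BITSHUFFLE,  # often helps on float data
    cname='zstd',              # also available: 'blosclz', 'lz4', 'lz4hc', 'zlib'
)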

@esc (Member) commented Jan 18, 2022

@lopsided a list of settings to explore is here: https://github.com/Blosc/bloscpack#settings

If you can share the data, or an anonymized variant with similar entropy, we could look into this in more detail.
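
For bloscpack, the same knobs are passed via BloscArgs, roughly like this (a sketch from memory, not tested against your data):

import bloscpack as bp
from bloscpack import BloscArgs

# raise the compression level and switch the internal codec
blosc_args = BloscArgs(typesize=4, clevel=9, shuffle=True, cname='zstd')
bp.pack_ndarray_to_file(images, dest, blosc_args=blosc_args)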

@lopsided (Author)

Thanks for the quick reply!

I've just been using pretty much the default settings:

import blosc
import bloscpack as bp

# images is a (3, 200, 200) float32 ndarray
packed = blosc.compress_ptr(
    address=images.__array_interface__['data'][0],
    items=images.size,
    typesize=images.dtype.itemsize,
    clevel=9,
    shuffle=blosc.SHUFFLE,
)

# or, alternatively:
packed = blosc.pack_array(images)
bp.pack_ndarray_to_file(images, dest)

I've attached an example image (actually a triplet of greyscale images), saved uncompressed using np.savez. (I had to rename it to .zip to make GitHub happy.)

000000.zip

@esc (Member) commented Jan 18, 2022

> Thanks for the quick reply!

Thank you, it may take me a few days to tinker.
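
When I do, I'll probably start with a quick codec sweep along these lines (an untested sketch; assuming the array sits under the default 'arr_0' key once the file is renamed back to .npz):

import blosc
import numpy as np

# rename 000000.zip back to 000000.npz first
with np.load('000000.npz') as npz:
    images = npz['arr_0']  # assuming the default positional key

# compare packed sizes across codecs at maximum compression level
for cname in ('blosclz', 'lz4hc', 'zlib', 'zstd'):
    packed = blosc.pack_array(images, clevel=9,
                              shuffle=blosc.BITSHUFFLE, cname=cname)
    print(cname, len(packed))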

@esc (Member) commented Jan 27, 2022

I'm sorry, but I haven't yet found space in my schedule to look into this.
