Compression failure caused by incorrect ChunkSize #208

Open
allwaysFindFood opened this issue Aug 2, 2023 · 2 comments

In the Mat_H5GetChunkSize function in the mat73.c file:


static void
Mat_H5GetChunkSize(size_t rank, hsize_t *dims, hsize_t *chunk_dims)
{
    hsize_t i, j, chunk_size = 1;

    for ( i = 0; i < rank; i++ ) {
        /* Pick the largest power of two <= dims[i], starting at 4096 / chunk_size,
           so that the total number of elements per chunk never exceeds 4096. */
        chunk_dims[i] = 1;
        for ( j = 4096 / chunk_size; j > 1; j >>= 1 ) {
            if ( dims[i] >= j ) {
                chunk_dims[i] = j;
                break;
            }
        }
        chunk_size *= chunk_dims[i];
    }
}

The intention of the code appears to be to find an optimal chunk size that maximizes compression efficiency.
In practice, however, the code above can roughly double the compressed file size.

Suppose you have a dataset with dimensions (17, 1000) and you store this data using HDF5 with compression enabled.
The chunk size obtained from the code above is ChunkSize = (16, 512).

With a chunk size of (16, 512), the data is divided into chunks of shape (16, 512). Since the first dimension is 17, two chunks are needed along that axis, and along the second dimension of 1000 two chunks are needed as well, so the entire dataset is stored in 4 chunks. However, because the chunk size does not divide the data dimensions evenly, each edge chunk contains unused padding, which increases the file size.

In the second case, with ChunkSize = (9, 512), you still need a total of 4 chunks to store the data, but each chunk occupies less space. This results in more efficient storage utilization and a smaller file size than in the first case.
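
The padding overhead can be reproduced with a small standalone calculation (a sketch, independent of matio and HDF5; it only repeats the arithmetic above for the two chunk shapes discussed here):

#include <stdio.h>

/* Padded extent of one axis: number of chunks times chunk length. */
static unsigned long long padded(unsigned long long dim, unsigned long long chunk)
{
    unsigned long long nchunks = (dim + chunk - 1) / chunk; /* ceil(dim / chunk) */
    return nchunks * chunk;
}

int main(void)
{
    unsigned long long dims[2] = {17, 1000};
    unsigned long long a[2] = {16, 512}; /* chunk shape discussed above */
    unsigned long long b[2] = {9, 512};  /* alternative chunk shape */

    unsigned long long data = dims[0] * dims[1];
    unsigned long long pa = padded(dims[0], a[0]) * padded(dims[1], a[1]);
    unsigned long long pb = padded(dims[0], b[0]) * padded(dims[1], b[1]);

    printf("data elements:         %llu\n", data);                            /* 17000 */
    printf("padded with (16, 512): %llu (x%.2f)\n", pa, (double)pa / data);   /* 32768, ~1.93 */
    printf("padded with (9, 512):  %llu (x%.2f)\n", pb, (double)pb / data);   /* 18432, ~1.08 */
    return 0;
}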

In practical testing, when continuously appending data blocks with dimensions of (17, 1000), the sizes of the two files differ by approximately a factor of two, consistent with the reasoning above.

tbeu (Owner) commented Aug 8, 2023

Thanks for bringing this topic up. I evaluated the code of Mat_H5GetChunkSize and also compared it with the auto-chunk feature of h5py.

I see that Mat_H5GetChunkSize always sets the chunk dimensions to powers of 2, with a maximal chunk size of 4096 elements. This indeed might be inappropriate in many cases.

Suppose you have a dataset with dimensions (17, 1000) and you store this data using HDF5 with compression enabled.
The chunk size obtained from the code above is ChunkSize = (16, 512).

It is (16, 256), right? But it does not influence your follow-up reasoning.
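
For reference, a minimal standalone re-run of the same loop (with hsize_t replaced by unsigned long long so it builds without the HDF5 headers) reproduces (16, 256): the first axis picks 16, so the search for the second axis starts at 4096 / 16 = 256, and 1000 >= 256.

#include <stdio.h>

/* Same logic as Mat_H5GetChunkSize, minus the HDF5 types. */
static void get_chunk_size(size_t rank, const unsigned long long *dims,
                           unsigned long long *chunk_dims)
{
    unsigned long long j, chunk_size = 1;
    size_t i;
    for ( i = 0; i < rank; i++ ) {
        chunk_dims[i] = 1;
        for ( j = 4096 / chunk_size; j > 1; j >>= 1 ) {
            if ( dims[i] >= j ) {
                chunk_dims[i] = j;
                break;
            }
        }
        chunk_size *= chunk_dims[i];
    }
}

int main(void)
{
    unsigned long long dims[2] = {17, 1000}, chunk_dims[2];
    get_chunk_size(2, dims, chunk_dims);
    printf("(%llu, %llu)\n", chunk_dims[0], chunk_dims[1]); /* prints (16, 256) */
    return 0;
}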

What is your proposal?

  1. Improve Mat_H5GetChunkSize in the same way as guess_chunk of h5py (roughly sketched below).
  2. Offer a public API to manually set the chunk size for datasets of HDF5 MAT variables.
  3. Keep as is, but document it better.
  4. Increase the maximal chunk size from 4096 elements to some higher value.
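
To illustrate option 1, here is a rough sketch in the spirit of h5py's guess_chunk (the 1 MiB cap and the round-robin halving are assumptions of this sketch, not h5py's exact rules and not matio code): start from the full dataset shape and halve the axes until the chunk fits a byte budget, so the chunk edges follow the data shape instead of being forced to powers of two.

#include <stdio.h>

#define CHUNK_MAX_BYTES (1024 * 1024) /* assumed 1 MiB upper bound per chunk */

/* guess_chunk-style heuristic (sketch): start from the full dataset shape and
 * halve the axes in round-robin order until the chunk fits CHUNK_MAX_BYTES. */
static void guess_chunk(size_t rank, const unsigned long long *dims,
                        size_t elem_size, unsigned long long *chunk_dims)
{
    size_t i, axis = 0;
    for ( i = 0; i < rank; i++ )
        chunk_dims[i] = dims[i] > 0 ? dims[i] : 1;

    for ( ;; ) {
        unsigned long long bytes = elem_size;
        int reducible = 0;
        for ( i = 0; i < rank; i++ ) {
            bytes *= chunk_dims[i];
            if ( chunk_dims[i] > 1 )
                reducible = 1;
        }
        if ( bytes <= CHUNK_MAX_BYTES || !reducible )
            break;
        chunk_dims[axis] = (chunk_dims[axis] + 1) / 2; /* halve, keep >= 1 */
        axis = (axis + 1) % rank;
    }
}

int main(void)
{
    unsigned long long dims[2] = {17, 1000}, chunk_dims[2];
    guess_chunk(2, dims, sizeof(double), chunk_dims);
    printf("(%llu, %llu)\n", chunk_dims[0], chunk_dims[1]); /* (17, 1000): already under the cap */
    return 0;
}

For the (17, 1000) dataset of doubles above, this keeps the chunk equal to the data shape (136,000 bytes < 1 MiB), so there is no padding at all.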

Thanks again for your feedback.

tbeu (Owner) commented Jan 6, 2024

@allwaysFindFood Any feedback would be appreciated.
