
Zstd decompression fails due to unknown frame content size #625

Open
mkitti opened this issue Jul 25, 2024 · 4 comments · May be fixed by #639

mkitti commented Jul 25, 2024

Neuroglancer appears to be having difficulty decompressing a Zstd compressed dataset:
https://neuroglancer-demo.appspot.com/#!https://s3proxy.janelia.org/shroff-public/shroff_20220523_Janelia_embryo2/20220523i_Janelia_embryo2_OTO_6x6x6nm_2xbin.zstd.zarr/s6/ng.json

The Google Chrome developer tools console reports the following error:

Error retrieving chunk [object Object]:0,0,0: Error: Invalid typed array length: 4294967226

Raw or Gzip compressed Zarr v2 datasets seem to load just fine:

Raw compressed Zarr v2 dataset:
https://neuroglancer-demo.appspot.com/#!https://s3proxy.janelia.org/shroff-public/shroff_20220523_Janelia_embryo2/20220523i_Janelia_embryo2_OTO_6x6x6nm_2xbin.raw.zarr/s6/ng.json

Gzip compressed Zarr v2 dataset:
https://neuroglancer-demo.appspot.com/#!https://s3proxy.janelia.org/shroff-public/shroff_20220523_Janelia_embryo2/20220523i_Janelia_embryo2_OTO_6x6x6nm_2xbin.gzip.zarr/s6/ng.json

The value 4294967226 is the 32-bit value 0xFFFFFFBA; sign-extended to 64 bits it is 0xFFFFFFFFFFFFFFBA, which corresponds to the Zstd error code for "Destination buffer is too small".
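As a minimal sketch (assuming zstd 1.5.x, where the error code for dstSize_tooSmall is 70), this is how that bit pattern maps back to a zstd error name:

```cpp
#include <zstd.h>
#include <cstdio>

int main() {
    // On a 32-bit target such as WASM, -70 wraps to 0xFFFFFFBA = 4294967226,
    // the value reported in the console error above.
    size_t code = static_cast<size_t>(0) - 70;  // assumed dstSize_tooSmall code in zstd 1.5.x
    if (ZSTD_isError(code)) {
        // Prints "Destination buffer is too small"
        std::printf("%s\n", ZSTD_getErrorName(code));
    }
    return 0;
}
```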

The dataset was encoded by Tensorstore 0.1.62 via the conda-forge package (build py312h7e2185d_0) with zstd version 1.5.6 ha6fb4c9_0.

mkitti commented Jul 25, 2024

This Zstandard-compressed dataset works:
https://neuroglancer-demo.appspot.com/#!https://s3proxy.janelia.org/shroff-public/shroff_20220523_Janelia_embryo2/20220523i_Janelia_embryo2_OTO_6x6x6nm_2xbin.zstd.zarr.mark/s6/ng.json

The problem is that ZSTD_getFrameContentSize returns 0xffffffffffffffff (ZSTD_CONTENTSIZE_UNKNOWN) for the chunk encoded by Tensorstore. This is because the streaming API is used and ZSTD_CCtx_setPledgedSrcSize is not called to hint the content size, so the content size is never written to the frame header. The chunk at the following URL does not have an encoded content size in its header:
https://neuroglancer-demo.appspot.com/#!https://s3proxy.janelia.org/shroff-public/shroff_20220523_Janelia_embryo2/20220523i_Janelia_embryo2_OTO_6x6x6nm_2xbin.zstd.zarr.mark/s6/0/0/0.bad

A chunk with a correctly encoded content size in its header is located at the following URL:
https://neuroglancer-demo.appspot.com/#!https://s3proxy.janelia.org/shroff-public/shroff_20220523_Janelia_embryo2/20220523i_Janelia_embryo2_OTO_6x6x6nm_2xbin.zstd.zarr.mark/s6/0/0/0.good
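The two chunks can be distinguished by inspecting their frame headers; a diagnostic sketch (the helper name is illustrative, and `data`/`size` stand for the raw chunk bytes):

```cpp
#include <zstd.h>
#include <cstdio>

// Report what the zstd frame header says about the decompressed size.
void inspect_chunk(const void* data, size_t size) {
    unsigned long long c = ZSTD_getFrameContentSize(data, size);
    if (c == ZSTD_CONTENTSIZE_UNKNOWN)
        std::puts("content size not recorded in the frame header (the 0.bad chunk)");
    else if (c == ZSTD_CONTENTSIZE_ERROR)
        std::puts("not a valid zstd frame");
    else
        std::printf("frame header declares %llu decompressed bytes\n", c);
}
```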

numcodecs.js does not handle the return value of ZSTD_getFrameContentSize properly: it passes the unchecked value directly to malloc.
https://github.com/manzt/numcodecs.js/blob/b9a8ca932240a6de9ad6c00f9820ab5e6bbca4c1/codecs/zstd/zstd_codec.cpp#L28-L29
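A sketch of how the allocation path could validate the return value first (illustrative only, not the actual numcodecs.js patch):

```cpp
#include <zstd.h>
#include <cstdlib>

// Check the frame header before sizing the destination buffer, rather than
// passing ZSTD_getFrameContentSize's raw return value to malloc.
void* alloc_dest(const void* src, size_t src_size, size_t* out_size) {
    unsigned long long content_size = ZSTD_getFrameContentSize(src, src_size);
    if (content_size == ZSTD_CONTENTSIZE_ERROR)
        return nullptr;  // corrupt or non-zstd data
    if (content_size == ZSTD_CONTENTSIZE_UNKNOWN)
        return nullptr;  // caller should fall back to streaming decompression
    *out_size = static_cast<size_t>(content_size);
    return std::malloc(*out_size);
}
```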

Suggested actions for remediation:

  1. Neuroglancer should communicate the expected chunk size to decoding codecs. Chunk data should not be able to dictate the destination buffer size without constraint.
  2. numcodecs.js should correctly handle ZSTD_CONTENTSIZE_UNKNOWN and ZSTD_CONTENTSIZE_ERROR rather than just passing the result to malloc.
  3. Tensorstore should encode the content size in the Zstd header by using ZSTD_CCtx_setPledgedSrcSize (see the sketch below).
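For point 3, a sketch of the encode side (assuming the streaming compression path; the function name is illustrative): pledging the source size up front makes the encoder record the content size in the frame header.

```cpp
#include <zstd.h>
#include <vector>

// Compress one chunk with the streaming API while still recording the
// decompressed size in the frame header.
std::vector<char> compress_chunk(const void* src, size_t src_size, int level) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    // Without this call, the streaming path leaves the content size unknown.
    ZSTD_CCtx_setPledgedSrcSize(cctx, src_size);

    std::vector<char> dst(ZSTD_compressBound(src_size));
    ZSTD_inBuffer in{src, src_size, 0};
    ZSTD_outBuffer out{dst.data(), dst.size(), 0};
    size_t remaining;
    do {
        remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
    } while (remaining != 0 && !ZSTD_isError(remaining));
    ZSTD_freeCCtx(cctx);
    dst.resize(ZSTD_isError(remaining) ? 0 : out.pos);
    return dst;
}
```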

mkitti changed the title from "Zarr v2 zstd decompression" to "Zstd decompression fails due to unknown frame content size" on Jul 26, 2024

jbms commented Jul 26, 2024

Thanks for the investigation!

Suggested actions for remediation:

  1. Neuroglancer should communicate the expected chunk size to decoding codecs. Chunk data should not be able to dictate the destination buffer size without constraint.

Agreed in the typical case when the size is known, but with zarr v3 chained codecs, or future support for variable-length strings, that isn't always possible.

  2. numcodecs.js should correctly handle ZSTD_CONTENTSIZE_UNKNOWN and ZSTD_CONTENTSIZE_ERROR rather than just passing the result to malloc.

Agreed.

  3. Tensorstore should encode the content size in the Zstd header by using ZSTD_CCtx_setPledgedSrcSize.

Agreed when possible.

mkitti commented Aug 14, 2024

manzt/numcodecs.js#47 was merged earlier today and released as v0.3.2. This adds Zstd streaming decompression, which does not require a known frame content size in the Zstd header.
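For reference, a minimal sketch of the streaming approach (independent of the actual numcodecs.js implementation): grow the output as the frame is consumed, so nothing needs to be read from the frame header.

```cpp
#include <zstd.h>
#include <vector>

// Decompress a zstd frame whose header carries no content size.
std::vector<char> decompress_stream(const void* src, size_t src_size) {
    ZSTD_DCtx* dctx = ZSTD_createDCtx();
    std::vector<char> dst;
    std::vector<char> buf(ZSTD_DStreamOutSize());  // recommended output chunk size
    ZSTD_inBuffer in{src, src_size, 0};
    while (in.pos < in.size) {
        ZSTD_outBuffer out{buf.data(), buf.size(), 0};
        size_t ret = ZSTD_decompressStream(dctx, &out, &in);
        if (ZSTD_isError(ret)) { dst.clear(); break; }
        dst.insert(dst.end(), buf.data(), buf.data() + out.pos);
        if (ret == 0) break;  // frame fully decoded
    }
    ZSTD_freeDCtx(dctx);
    return dst;
}
```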

mkitti commented Sep 5, 2024

@jbms I helped to address point 2 above by implementing streaming decompression in numcodecs.js 0.3.2. Bumping the version as in #639 would help alleviate the immediate problem of neuroglancer not being able to read some datasets created by tensorstore.

Let me know if you think more needs to be done there. Next, I was going to try to dig into point 3 to figure out why the pledged size is not making it into the frame header when tensorstore uses Zstandard.
