Support for non-zip archive Stores? #209
Regarding the zarr spec: the zarr v2 spec does not mention stores at all, and in practice the supported stores vary greatly between implementations. In zarr v3 there may be some mention of stores, but that does not preclude an implementation from supporting additional ones.

I believe you can already use 7z archives with zarr-python via fsspec: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.libarchive.LibArchiveFileSystem However, I have not actually tried that myself.

Regarding the size increase, I'm rather surprised that the size increases significantly. I would expect only a minimal size increase, since, as far as I am aware, the per-file metadata in a zip file does not take up that much space. Only if your chunks are extremely small would I expect it to have a significant impact.

In general, for choosing an archive format, since the chunks can already be compressed by zarr, I would not expect it to matter much what compression options the archive format supports; you can just use no compression. I would expect the compression provided by the archive to be particularly useful only if you are storing a lot of JSON metadata rather than chunk data.

The main requirement for any archive format is the ability to read individual files efficiently. For example, tar is a poor choice because it only supports sequential access.
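For anyone who wants to try the fsspec route, here is a minimal, untested sketch (the archive name `data.7z` is illustrative; it assumes `fsspec` plus the `libarchive-c` dependency are installed, and, as noted above, none of this has been verified in practice):

```python
# Untested sketch: reading a zarr hierarchy from a 7z archive through
# fsspec's libarchive backend. "data.7z" is a hypothetical archive
# containing an archived DirectoryStore at its root.
import fsspec
import zarr

# LibArchiveFileSystem is read-only; the archive is passed via `fo=`.
fs = fsspec.filesystem("libarchive", fo="data.7z")

# Expose the archive contents as the key/value mapping zarr expects.
store = fs.get_mapper("")

root = zarr.open(store, mode="r")
print(root.info)
```

One caveat worth testing before relying on this: whether the libarchive backend gives efficient random access to individual chunk files, given the sequential-access concern raised above.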
I’ve toyed with python-zarr and 7z, and so far as I can tell, if you start with a DirectoryStore and compress with 7z, you have to use the -tzip flag to yield a file that is readable by python-zarr (and you have to name it .zip), which is also much larger than the original DirectoryStore’s size, presumably because the -tzip flag tells 7z to use zip as the archive format rather than the 7z-native format.
On your surprise at my observation that zip can increase file sizes, I do think it’s the number of files, and you’re right to mention chunk size, as I’m probably setting that rather non-optimally. I’m using zarr in a setting where data is visualized in real time as it’s collected, by a separate process from the process writing to zarr, and rather than send over a queue, I just write to zarr. To optimize for latency, then, I made my chunk size 1 in the sample dimension, which therefore makes for lots of chunk files. Seeing this failure mode amidst python-zarr’s zip-only limitations, I should probably revert to sending data over a queue and make the zarr chunks bigger.
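As a concrete illustration of that write pattern, here is a minimal sketch in zarr-python v2; the path, channel count, and dtype are all made up for the example:

```python
# Hypothetical version of the real-time write pattern described above:
# one sample per chunk minimizes write latency, but every appended
# sample becomes its own tiny file in the DirectoryStore.
import numpy as np
import zarr

n_channels = 16  # illustrative; not from the thread

z = zarr.open(
    "session.zarr",          # a DirectoryStore on disk
    mode="w",
    shape=(0, n_channels),   # start empty, grow along the sample axis
    chunks=(1, n_channels),  # one sample per chunk
    dtype="f8",
)

# Acquisition loop: each append writes a new single-sample chunk file.
for _ in range(1000):
    sample = np.random.randn(1, n_channels)
    z.append(sample)
```

A thousand samples later, that is a thousand chunk files, which is exactly the condition under which per-file overhead dominates.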
If you want to play with an example data file of the type I’m working with, I’ve uploaded one here: [p10enrollment-112722_2022_11_…](https://drive.google.com/file/d/1pcuaqqdebZopcL7pfsAQJH2fLwnKA4T-/view?usp=drivesdk)
It’s a DirectoryStore that’s been zipped using the point-and-click “compress” built into Nautilus on Ubuntu (note: while they both presumably use zip as a format, the files created using “compress” are about 2x bigger than those created with 7z -tzip).
@mike-lawrence - this is exactly one of the use cases that sharding (#134, #152) is designed to address.
I took a look at your zip file; the issue is that your chunks are way too small for efficient access or storage. Some of your chunks contain just a single 8-byte value. Zarr compresses each chunk individually, and no compression is possible for only 8 bytes. Blosc adds a 16-byte header, such that each chunk in that case is a 24-byte file (already tripling the size). But that ignores the per-file overhead required by the filesystem or archive.

On most filesystems, files always consume a multiple of the block size, typically 4 KB, so when using a local filesystem, each of your 8 bytes of data actually consumes 4 KB. In a zip archive the file size won't be padded, but there is still per-file overhead to store the filename, etc.

Even with sharding I would still recommend a much larger chunk size, as most zarr implementations will have poor performance with such small chunks.
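The arithmetic is easy to check with a back-of-envelope script; the 16-byte Blosc header and the 4 KiB block size are taken from the comment above, and everything else is plain arithmetic rather than measured data:

```python
# Back-of-envelope estimate of per-chunk overhead for 8-byte chunks,
# using the figures from the comment above (not measured data).
payload = 8          # one float64 value per chunk
blosc_header = 16    # fixed Blosc frame header
block_size = 4096    # typical filesystem allocation unit

chunk_file = payload + blosc_header                  # bytes per chunk file
on_disk = -(-chunk_file // block_size) * block_size  # rounded up to a full block

print(f"chunk file size: {chunk_file} B ({chunk_file / payload:.0f}x the data)")
print(f"disk usage on a 4 KiB-block filesystem: {on_disk} B "
      f"({on_disk / payload:.0f}x the data)")
```

So even before any archive overhead, each 8-byte chunk costs 24 bytes in a zip and a full 4 KiB block on a typical local filesystem, a 512x blow-up.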
Should we move this issue to zarr-python? It doesn't seem like a spec issue.
Sure; the only reason I posted here is that the zarr-python issue page recommends putting feature requests here rather than there.
Ah, silly me. I'd forgotten that I'd set all the arrays to store in that one-sample-per-chunk mode, when only one was intended to be stored that way (and I should experiment to check whether increasing the chunk size in that one even affects my real-time use case's performance; I can't remember if I tried that already).
When data is initially collected as a DirectoryStore and then compressed using `7z a -tzip ...` as suggested in the docs, the resulting zip file is larger (~4x) than the original .zarr directory, and substantially larger (~40x) than if compressed without the `-tzip` flag (presumably thanks to zip's well-known overhead with large numbers of files?). Is it fundamentally not possible to support non-zip-formatted archives (like 7zip's native format, or xz, etc.)?
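For context, the zip workflow the docs describe can be read back with zarr-python's built-in ZipStore, the only archive store it ships with; a minimal sketch, with an illustrative file name:

```python
# Minimal sketch (zarr-python v2): reading a DirectoryStore that was
# archived with e.g. `7z a -tzip data.zip data.zarr/.`.
# "data.zip" is an illustrative name, not a real file from this thread.
import zarr

store = zarr.ZipStore("data.zip", mode="r")  # read-only zip-backed store
root = zarr.open(store, mode="r")
print(root.info)
store.close()  # ZipStore should be closed explicitly
```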