Pure C implementation #9
It would be great to have a pure C Zarr library. I really hope to see this happen. Here is a conceptual design issue that I have not sorted out. Here is how the Zarr spec defines a storage system:
This is fairly language-agnostic. But in practice, the idea of a "key/value interface" is closely coupled to Python's MutableMapping.
Thanks @rabernat. Great point. There are a number of ways we could imagine doing this. One might be to use a plugin architecture. This allows users to define their own particular mapping, with some way to initialize, register, and destroy each plugin. Each plugin can then provide its own functions for standard operations on its specified storage type (opening/closing a storage format, getting/setting/deleting a key, etc.). To aggregate things a bit, we can imagine that a generic store interface ties these operations together. Though there are other options. It would be interesting to hear what others think about how best to implement this. Edit: Here's a great article that goes into more detail about plugin architecture. It also discusses C++, though the basic functionality can be emulated in plain C.
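To make the plugin idea concrete, here is a minimal sketch of what such a storage-plugin interface might look like in C. All names are hypothetical; nothing like this exists in a Zarr C library yet.

```c
#include <stddef.h>

typedef struct zarr_store zarr_store_t;  /* opaque per-store state */

/* Table of operations a storage plugin registers with the library.
 * Hypothetical sketch; every name here is invented for illustration. */
typedef struct {
    const char *name;  /* e.g. "file", "s3", "zip" */
    zarr_store_t *(*open)(const char *uri);
    int (*get)(zarr_store_t *s, const char *key, void **buf, size_t *len);
    int (*set)(zarr_store_t *s, const char *key, const void *buf, size_t len);
    int (*del)(zarr_store_t *s, const char *key);
    void (*close)(zarr_store_t *s);
} zarr_store_ops_t;

/* The core library keeps a registry of these tables; a plugin calls
 * this once from its initialization entry point. */
int zarr_register_store(const zarr_store_ops_t *ops);
```

A generic store object in the library would then just hold a `zarr_store_ops_t *` plus the opaque state, dispatching every key/value operation through the table.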
It may be worth researching what TileDB does; apparently it can also talk to various cloud back-ends.
Thanks for extracting this issue! One thing not to be overlooked is memory management. It is important that plugins never have to free memory allocated inside the Zarr lib, and vice versa. The usual HDF5 solution is simple but flawed: plugins making use of H5allocate_memory(), H5resize_memory(), and H5free_memory() have to be linked against the HDF5 library; but the actual name/location of the run-time HDF5 lib, specifically on the most problematic systems (Windows), is never reliably known in advance. (Basically, for the same reasons there is no system-wide glibc.) So, pre-built HDF5 compression plugins are of little use, as they are still only applicable to a specific HDF5 instance, and not shareable between e.g. HDFView, HDF5 tools, client applications, etc. Please, do not require Zarr plugins to link to the Zarr C lib! The correct design is yet to be discovered, but one could e.g. pass a pointer to the allocator functions into each plugin. Another HDF5 C API problem that was at some point considered serious, but proved a non-issue in practice: C
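One way to honor that constraint, sketched below under the assumption of the plugin-table design above: the library hands every plugin a small table of allocation callbacks at registration time, so each side always frees what it allocated. Names are again hypothetical.

```c
#include <stddef.h>

/* Allocation callbacks supplied by the core library. A plugin uses
 * these for any buffer it hands back to the library, so malloc/free
 * never cross the DLL boundary. Hypothetical sketch. */
typedef struct {
    void *(*alloc)(size_t size);
    void *(*resize)(void *ptr, size_t size);
    void  (*free)(void *ptr);
} zarr_allocator_t;

/* A single well-known entry point the library resolves after loading
 * the plugin; the plugin stores the table and registers its ops. */
int zarr_plugin_init(const zarr_allocator_t *allocator);
```

Because the table is passed in at run time, the plugin binary never links against the Zarr library itself, which is exactly the property missing from the HDF5 filter model described above.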
The folks at QuantStack have been working very hard at writing numerical array stuff in C++, xtensor for example. They might have some good ideas about how to approach this. They might even want to work on it! ;) Tagging @SylvainCorlay and @maartenbreddels to see if they want to weigh in.
Hi @rabernat, thanks for the heads up! It turns out that there already exists a project involving xtensor and Zarr: z5 by @constantinpape. It could be interesting to discuss this in greater depth! We are holding a public developer meeting for xtensor and related projects this Wednesday (5pm Paris time, 11am EST, 8am PST). Would you like to join us?
Hey, just to chime in here; as @SylvainCorlay mentioned, z5 / z5py (https://github.com/constantinpape/z5) is using xtensor as the multiarray (and for python bindings). We also have an issue on C bindings (see constantinpape/z5#68) with some discussion. If you are considering basing the zarr C bindings on this:
@constantinpape @jakirkham is this something you would like to discuss at the xtensor meeting?
@SylvainCorlay Unfortunately, I cannot make it today.
@constantinpape we should arrange a Skype meeting. I would love to dive into the internals of z5!
Sounds great!
Forgive me for asking an ignorant question... For our purposes here, is C++ the same as C? Z5 already implements zarr in C++. Is an additional C implementation still necessary?
It's OK as long as the public interface is declared with extern "C".
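For illustration, this is roughly what an extern "C" facade over a C++ implementation such as z5 could look like; the function names are invented for the sketch.

```c
#include <stddef.h>

/* A C-callable header that a C++ library can implement. The extern "C"
 * guard gives the functions unmangled names, so a plain C compiler and
 * linker can consume them. All names are hypothetical. */
#ifdef __cplusplus
extern "C" {
#endif

typedef struct zarr_array zarr_array_t;  /* opaque handle */

zarr_array_t *zarr_open(const char *path);
int zarr_read_chunk(zarr_array_t *a, const size_t *chunk_index,
                    void *out, size_t out_len);
void zarr_close(zarr_array_t *a);

#ifdef __cplusplus
}
#endif
```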
Another point: a pure C implementation only needs a C compiler, not a C++ compiler. This might be a consideration for simplifying the build environment for library clients. I'm thinking specifically about netcdf-c (ping @WardF).
Thanks all for joining the conversation. Sorry for being slow to reply (was out sick most of last week). Part of the reason I raised this issue is feedback we got from PR ( zarr-developers/zarr-python#276 ), specifically this comment (though there may have been other comments that I have forgotten), which suggested that we need to supply a pure C implementation of the spec (not using C++). Without knowing the details about the motivation, I can't comment on that (hopefully others can fill this in). Though there are certainly a few reasons I can think of that one would want a pure C implementation. These could vary from providing an easily accessible API/ABI, working on embedded systems, interfacing with other libraries, portability, easy-to-build language bindings, organizational requirements, etc. Now there is certainly a lot of value in having a C++ implementation (and a very effective one at that), which I think @constantinpape and others have demonstrated. The C++ ecosystem is very rich and diverse, making it a great place to explore a spec like Zarr. Certainly it is possible to wrap a C++ library and use extern "C", which we were discussing in issue ( constantinpape/z5#68 ). Though it sounds like that doesn't meet the needs of all of our community members. If that's the case, going with C should address this. Not to mention it would be easy for various new and existing implementations to build off of the C implementation, either to quickly bootstrap themselves or to lighten their existing maintenance burden. So overall I see this as a win for the community.
@dopplershift and others are correct; for this to be something usable for netCDF, a pure C library is required. I suspect there are other projects out there which would benefit as well. The C++ interface is fantastic but wouldn't work for our needs.
Do you have any thoughts on the technical design of this implementation, @WardF? Was thinking about using plugins to handle different
No, but after discussion with @DennisHeimbigner I think we need to sketch out what is needed from the netCDF point of view, infer what would be needed from a more broad point of view, and see what the intersection is. I'll review the plugin guide that you linked to, thanks! I hadn't had a chance to review it yet. The focus of discussion on our end internally has been around the intersection of the data model and an I/O API; I'll let Dennis make his own points here, but he has pointed out several things I hadn't previously considered. To answer your question, however, plugins do seem like a reasonable approach. In terms of using codecs for compression/decompression, I infer it would be something similar to how HDF5 uses compression/decompression plugins for compressed I/O? That would make sense in broad terms; there are considerations in regards to the netCDF community and what 'core' compression schemes to support, but from a technical standpoint it seems like a reasonable, path-of-least-resistance approach. I will follow up with any additional thoughts on the technical design as they occur to me :).
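Following the plugin pattern sketched earlier in the thread, a codec could register a pair of encode/decode entry points keyed by the compressor id that appears in the array metadata. This is only a sketch under that assumption; the names are invented.

```c
#include <stddef.h>

/* Hypothetical codec-plugin table, loosely analogous to an HDF5
 * filter: one entry point to compress a chunk, one to decompress. */
typedef struct {
    const char *id;  /* matches the compressor id in array metadata, e.g. "zlib" */
    int (*encode)(const void *in, size_t in_len, void **out, size_t *out_len);
    int (*decode)(const void *in, size_t in_len, void **out, size_t *out_len);
} zarr_codec_ops_t;

int zarr_register_codec(const zarr_codec_ops_t *ops);
```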
With respect to compression plugins, I would push for using the actual
A question: I hear the term "I/O API", but I cannot figure out what that term means.
It is one appealing possibility, but please also be aware of the HDF5 compression API drawbacks.
I fail to see this as much of a problem from the netcdf-c point of view.
@DennisHeimbigner Ah, it means you are lucky enough not to be burdened with the peculiarities/ugliness of the "Windows way" :-) Let's take HDF5 as the example. On GNU/Linux, filter plugins are easy enough: you build them against the
On my Windows laptop, I now have:
To what dll shall I link my plugin? Any answer will be sub-optimal, as it will reliably teach only that one program to understand your compressed data (and for
This problem would not be present if plugin dlls didn't link to the HDF5 dll.
@aparamon Thank you for the illustrative example; this is something we will have to keep in mind, given the cross-platform nature of netCDF.
What if we had an environment variable that acted as a plugin search path? I think all the major platforms have some C API for loading libraries at runtime.
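On POSIX systems that could be as small as the sketch below (Windows would use LoadLibrary/GetProcAddress instead). The ZARR_PLUGIN_PATH variable name and the zarr_plugin_init entry point are invented for illustration.

```c
#include <dlfcn.h>   /* dlopen/dlsym on POSIX */
#include <stdio.h>
#include <stdlib.h>

/* Load one plugin from the directory named by a hypothetical
 * ZARR_PLUGIN_PATH environment variable and run its init hook. */
int load_plugin(const char *filename)
{
    const char *dir = getenv("ZARR_PLUGIN_PATH");
    if (dir == NULL)
        return -1;

    char path[4096];
    snprintf(path, sizeof path, "%s/%s", dir, filename);

    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return -1;

    /* Well-known entry point every plugin would be required to export. */
    int (*init)(void) = (int (*)(void))dlsym(handle, "zarr_plugin_init");
    return init != NULL ? init() : -1;
}
```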
I see. You are correct. From the point of view of zarr embedded in netcdf-c
I think HDF5 has such a variable, named something like HDF5_PLUGIN_PATH?
@DennisHeimbigner You are correct, HDF5 has HDF5_PLUGIN_PATH. Having memory allocation callbacks in Please note that optional (and rarely used) The mutual-reference architecture used in HDF5 seems rather inelegant, but on *NIX systems it works fine almost always, due to the single instance of The question is whether proper Windows support is worth the pain. As one data point, for my company it would still be desirable, but much less so than, say, 3-5 years ago.
You are correct, I had it backwards. The filter needs to be given a table
It appears to be the case that many (most) contributed HDF5 filters
I am currently working on the insertion of C-language Zarr support into Unidata's netcdf-c library. The goal is to provide an extended Zarr that will more closely support the existing netcdf-4 (aka netcdf enhanced) data model. We also intend to support two forms of interoperability with existing Zarr data and implementations:
1. An existing Zarr implementation can read an NCZarr dataset.
2. NCZarr can provide read-only access to existing standard Zarr datasets.
I have created a draft document that is an early description of the extensions of Zarr to what I am calling NCZarr. Ideally, this could eventually become some kind of official standard Zarr extension. It also describes how we propose to represent existing Zarr datasets (our second goal). The document is currently kept here: https://github.com/Unidata/netcdf-c/blob/cloud.tmp/docs/nczextend.md (it is in Doxygen markdown format, so there may be minor display glitches depending on your viewer).
Thanks for the update, @DennisHeimbigner. Will mull over this a bit. Have a few questions to start us off. Could you please share briefly in what ways the NCZarr spec differs (a very rough overview/executive summary here is fine, as people can go read the spec for details)? Do these changes overlap with what other groups are looking for (e.g. https://github.com/zarr-developers/zarr/issues/333 and NeurodataWithoutBorders/pynwb#230, or others)? Were there specific pain points you encountered and/or areas in which you were hoping Zarr could grow? It may even make sense to break these out into a series of GitHub issues that we can discuss independently. Though feel free to add them here first if that is easiest. Also, it would be good to hear a bit more about how you are handling things like different key-value stores and compression algorithms. Are users free to bring their own, and if so, how? Will there be any of either that are pre-included?
I guess I had hoped this document would serve as that diff document :-)
Frankly, I do not know in detail. My current goal is to get as
There are two such "pain points":
I considered starting a new issue, but do not want to pollute the issue space.
I have separate internal architecture documents where I am describing this. In one of these issues, we discussed the Filter problem. My current thinking
Thanks Dennis. Just wanted to mention that zarr does support variable-length strings, as well as variable-length sequences of atomic types. See the sections on string arrays and object arrays in the tutorial. For variable-length string arrays I would recommend using the VLenUTF8 encoding, as it should be the simplest to implement in plain C. IIRC it is basically the same as the Parquet encoding.
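A minimal C sketch of decoding that encoding follows, assuming the framing used by numcodecs' VLenUTF8 as I understand it: a little-endian uint32 item count, then for each item a little-endian uint32 byte length followed by the UTF-8 bytes. Treat the framing details as an assumption to verify against numcodecs.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Read a little-endian uint32 without assuming host byte order. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode a VLenUTF8-framed buffer into NUL-terminated C strings.
 * Returns the item count, or -1 on malformed input. Caller frees. */
long vlen_utf8_decode(const unsigned char *buf, size_t len, char ***out)
{
    if (len < 4)
        return -1;
    uint32_t n = le32(buf);
    size_t pos = 4;

    if (n == 0) { *out = NULL; return 0; }

    char **items = calloc(n, sizeof *items);
    if (items == NULL)
        return -1;

    for (uint32_t i = 0; i < n; i++) {
        if (pos + 4 > len)
            goto fail;
        uint32_t item_len = le32(buf + pos);
        pos += 4;
        if (item_len > len - pos)
            goto fail;
        items[i] = malloc((size_t)item_len + 1);
        if (items[i] == NULL)
            goto fail;
        memcpy(items[i], buf + pos, item_len);
        items[i][item_len] = '\0';
        pos += item_len;
    }
    *out = items;
    return (long)n;

fail:
    for (uint32_t i = 0; i < n; i++)
        free(items[i]);
    free(items);
    return -1;
}
```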
> 1. The Zarr spec does not conform to the "write narrowly, read broadly" heuristic, in that it says that any annotations not specified in the Zarr spec are prohibited. It preferably should say that unrecognized keys/objects/etc. should be ignored.

Apologies, I haven't yet read the NCZarr spec carefully, sorry if I've misread anything, but I thought it might be worth a couple of comments here.
For an array at logical path "foo/bar", core metadata for the array are
stored as a JSON document under the key "foo/bar/.zarray", and attributes
are stored as a separate JSON document under the key "foo/bar/.zattrs".
As an extension developer, you are free to use the attributes (i.e. the
.zattrs JSON document) however you like. You can put whatever you like in
there. This is the natural place to put any netcdf information, in an
analogous way to how netcdf4 uses hdf5 attributes.
Similarly for the group at logical path "foo", there is a JSON document
under the key "foo/.zgroup" for core metadata, and another JSON document
under "foo/.zattrs" for attributes. Again, you are free to use the
attributes (.zattrs) document however you like.
I.e., the Zarr spec constrains the contents of .zarray and .zgroup
documents, but does not place any constraint on .zattrs. So you should be
able to do whatever you need with .zattrs, without requiring a spec change.
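To make the layout concrete, here is the set of store keys involved for an array at logical path "foo/bar"; the attribute contents are invented for illustration, but the key names come straight from the v2 spec.

```
foo/.zgroup      -> {"zarr_format": 2}
foo/bar/.zarray  -> {"zarr_format": 2, "shape": [100, 100], "chunks": [10, 10], ...}
foo/bar/.zattrs  -> {"any": "extension metadata you like"}
foo/bar/0.0      -> (compressed bytes of the first chunk)
```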
Also, I saw your doc proposes putting some extra metadata in separate JSON documents under other keys like .zdims and .ztypedefs. There is nothing wrong with this; it is perfectly legal according to the standard zarr spec as-is, and these keys would just get ignored by a standard zarr implementation.
However you could put everything into .zattrs, in which case it will be
visible as attributes via standard zarr APIs. I would suggest using .zattrs
unless you have a really strong reason not to.
Hope that makes sense, please feel free to follow up if anything isn't
immediately clear.
I must disagree. I am going from the spec; I consider the tutorial irrelevant.
I have considered putting some of my extensions as attributes, so that is still an option.
@alimanfoo Is there room to codify some of those items within the spec? Or is there something being misunderstood here?
Well, we are all feeling our way on this because we are in a complex design space,
Sorry @DennisHeimbigner, I must have missed it. What was the disagreement?
This NCZarr spec seems like a great development! Surely lots of people will have ideas and opinions (pinging @shoyer and @jhamman who have previously weighed in on this, e.g. "DOC: zarr spec v3: adds optional dimensions and the 'netZDF' format" #276). Regarding shared dimensions, we have already done something with this in an ad-hoc way in xarray. We chose to add an attribute called _ARRAY_DIMENSIONS listing the dimension names for each variable. I know this isn't part of any spec. We just did it so we could get xarray to work with zarr. It may well turn out that this data can't be read properly by NCZarr (the price we pay for moving forward without a spec), but I thought I would at least mention this in hopes of some backwards compatibility.
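For reference, the xarray convention is just an extra entry in each array's .zattrs document; the dimension names and other attributes below are invented for the example.

```
{"_ARRAY_DIMENSIONS": ["time", "lat", "lon"], "units": "degC"}
```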
Thank you @DennisHeimbigner for bringing it up! Do I understand correctly that in order to locate a type definition, a client is expected to walk up the directory/group structure, parsing
From reading the Zarr spec, it seems that unlike unknown attributes, unknown files (e.g.
I'm guessing this is regarding storage of variable length strings (and other objects). The zarr Python implementation supports an object ('O') data type, but going back to the spec I see this is not mentioned anywhere. Apologies, this is an omission. The spec should state that an object ('O') data type is supported, and that this should be used for all variable length data types. When encoding an array with an object data type, there are various options for how objects are encoded. If objects are strings then my recommendation would be to use the VLenUTF8 encoding, defined in numcodecs.
I think the spec is reasonably clear about the fact that, within the .zarray metadata object, only certain keys are allowed. However, within the .zattrs metadata object you can use any key you like. And within the store, you can store other data under other keys like .zdims or whatever. However, I would still encourage using .zattrs for any extension metadata. This is what the xarray convention does. Sorry for brevity; happy to expand on any of this if helpful.
So just to be clear, the word "key" here is being used in two different ways. There are keys within the .zarray metadata objects. And there are keys that are used to store and retrieve data from the store.
The zarr spec constrains what keys you are allowed to use within .zarray and .zgroup metadata objects. But you can use any key you like within .zattrs metadata objects. And you can store other objects using store keys like ".ztypedefs"; i.e., ".zarray" and ".zgroup" are reserved store keys, and you don't want to clash with chunk keys, but otherwise you can store data under any key you like. Although I would still recommend using .zattrs for metadata wherever possible. Hth.
I have also been confused about how zarr supports variable-length strings. I know the Python library can do it, but how such data is stored is not at all clear from reading the spec alone.
Similarly to what I wrote in #276, I would prefer for both dimensions and named type definitions to be standardized for zarr, as optional metadata fields. NetCDF imposes some high-level consistency requirements, but at least dimension names (without consistency requirements) are universal enough that they could be a first-class part of zarr's data model. Potentially these specs could be layered, e.g.,
The reason why I'm advocating for layering these specs is that I think there's significant value in standardizing additional optional metadata fields. I think it would be much harder to convince non-geoscience users to use a full netCDF spec, but named dimensions alone would make their data much more self-describing. We don't necessarily need to store these fields in
I agree that the Zarr spec should be updated in this way. In practice, it's hard for me to imagine a sensible implementation not adhering to this spec, but I still think it's good not to preclude the possibility of future (backwards compatible) extensions.
Agreed that this is important. Given that we'll need to increment the Zarr spec to include discussion of the
Both, I think. Yes, there is definitely room to improve the spec in terms of clarifying anything implicit or not entirely clear.
FWIW I'd like to edit the spec to explain the 'O' dtype without incrementing the version number. It was implicit in the reference to numpy's dtypes, but it could be made explicit. I think it's just a clarification, not a backwards-incompatible change.
NumPy's 'O' dtype stores references to Python objects. I don't think the meaning of a "Python object" is at all obvious for a serialization format. As one example of how this is an incompatible deviation, every dtype explicitly listed in Zarr's spec has a fixed size. So an implementation of Zarr might reasonably assume that array elements always have a fixed size. I think I'll raise a separate issue to unpack this in more detail.
If I understand this correctly, then it is being proposed to include Python-specific objects into the Zarr spec. Do I understand correctly?
The question was raised whether we could store NCZarr-specific metadata as attributes rather than in e.g. .zdims. The answer is yes, because that is exactly how it is done with the existing netcdf-4-over-HDF5 implementation. One reason for this is that we had no way to access a lower-level representation that can co-exist with the HDF5 storage format. That is not true for the Zarr spec. That spec pretty much exposes the underlying S3-like key-value-pair (aka KVP) storage format. So I exploited that exposure to store the NCZarr metadata at the KVP level.

As an aside, the Zarr spec should probably be divided into two parts: (1) a specification of the Zarr data model and (2) a specification of the underlying KVP model and the mapping of Zarr to that model. Instead, the current spec combines both 1 and 2 into a single document. It is not explicit about the KVP model, but it should be. In any case, ideally, NCZarr should extend Zarr, not the underlying KVP. This would suggest that using Zarr-level attributes is better than using KVP-level objects like .zdims.

If one uses Zarr-level attributes to store netcdf-4 metadata (like our existing HDF5 implementation), then the cost is that if some other pure-Zarr reader looks at the NCZarr dataset, it will see a bunch of attributes that mean nothing to it. You can see this for netcdf-4 over HDF5 by running h5dump on a netcdf-4 file. There are a number of HDF5 objects that are netcdf-4-specific and are just irritating noise to any client reading the file as an HDF5 file. The saving grace is that it turns out to be uncommon for HDF5 code to be reading a netcdf-4 file. I am not sure that is going to be the case with Zarr. One of the goals for NCZarr is to have existing Zarr readers able to access NCZarr datasets, so I expect that NCZarr datasets will be routinely read by pure-Zarr software. So I do not know whether this extra metadata will be more than a minor issue.
It sounds like there are a few different needs that are diverging from the question of building a pure C implementation of Zarr. Would suggest we open some new issues for these specific needs. Does this sound reasonable?
Ok, I am opening an NCZarr-specific issue to continue the discussion:
> If I understand this correctly then it is being proposed to include Python specific objects into the Zarr spec. Do I understand correctly?

No, we definitely want the spec to remain language-agnostic. And it is possible with the current zarr implementation to encode "object" arrays (including arrays of variable-length strings) in a way that is portable across languages. But I think we do need to clarify the status and semantics of the "object" data type in the v2 spec, as well as how it is currently implemented in the Python zarr implementation. I'll break this out into a separate issue.
I've created an issue on the new zarr spec repo to follow up discussion of the object data type: zarr-developers/zarr-specs#6
Great discussion! I am wondering whether the interest in a C API is still alive? Also, has anyone considered making a Zarr C API compatible with the existing HDF5 C API? Curious to hear your opinions on the pros and cons of both.
Since much of this discussion has now migrated to zarr-developers/zarr-specs#41, "NCZarr - Netcdf Support for Zarr", which is concerned with the finer details of supporting the full NetCDF format, I'd like to point out on this issue that as of 4.8.0 the NetCDF-C library does have support for Zarr: https://twitter.com/zarr_dev/status/1379875020059635713 If others have C libraries they would like to list, please feel free to add them.
Raising based on feedback in a number of different issues (xref'd below). The suggestion is to implement a pure C Zarr library. There are some questions that range from where the code should live all the way down to details about the code (licensing, build system, binary package availability, etc.). Others include scope, dependencies, etc. Am raising here to provide a forum focused on discussing these.
xref: https://github.com/zarr-developers/zarr/issues/285
xref: zarr-developers/zarr-python#276
xref: constantinpape/z5#68