Add ZEP 8 (URL syntax) draft #48
|
@normanrz Please take a look. |
|
@MSanKeys963 Looks like there is an issue with the docs build that is unrelated to this PR. |
|
@martindurant Would appreciate your perspective on this --- I imagine you might say that we should just use fsspec syntax instead, though. |
|
Well indeed, I could say "why invent another"; although translating between |
|
While standardizing a URL scheme has benefits on its own, I think the main benefit/motivation for this ZEP is the formalization of Zip stores. Essentially, to comply with this ZEP, implementations need to implement zip stores. Maybe that should be written out more explicitly? |
While this ZEP was prompted by our discussion about zip stores, my intention was that we standardize on the syntax for various protocols, but that implementations would choose which ones to support. I think we could also push implementations to support zip format, but I'm not sure I want to tie that to this URL syntax proposal. |
|
@bogovicj this might also be relevant for your OME transformations proposal. |
@jbms: I have added #51 to fix the RTD build. Can you please update your PR? |
|
Thanks @jbms for putting this together! There are a few situations I came up with for which I'm not sure what the relative URL should be. What does it look like to use ..: to "go up" multiple levels? Is it correct / valid to use .. in the "path part" of a relative URL, after a ..:? If one needs to add an adapter in a relative way, how does one go about it? Which, if any, of these do you think should be used? Are any of these invalid?
|
|
One more thing: We've found it useful to be able to reference a particular part of the attributes stored in JSON with a URL, for example for this zarr3 zarr.json.
Could you envision adding an attributes: or zarr.json:, or similar adapter, that enables this? A specific use case: I often re-use and reference transformations. Since these are described by metadata (not arrays), referencing the specific metadata is helpful. For example, if this were adopted, something like this would not be uncommon in my workflows: |
|
On Tue, Nov 14, 2023, 05:53 John Bogovic ***@***.***> wrote:
Thanks @jbms <https://github.com/jbms> for putting this together! There
are a few situations I came up with for which I'm not sure what the
relative URL should be
What does it look like to use ..: to "go up" multiple levels?
Is this correct / valid?
Base URL: gs://bucket/0.zip|zip:a|zarr3:i
Relative URL: ..:..:1.zip|zip:b|zarr3:ii
Resolved URL: gs://bucket/1.zip|zip:b|zarr3:ii
I was imagining that the relative url would be:
`|..|..:1.zip|zip:b|zarr3:ii`
The part after the | is always the scheme, and a scheme of .. is needed to
get to the parent store.
Is it correct / valid to use .. in the "path part" of relative URL, after
a ..:?
Base URL: gs://bucket/0/a/i.zarr|zarr3:foo
Relative URL: ..:../b/i.zarr|zarr3:foo
Resolved URL: gs://bucket/0/b/i.zarr|zarr3:foo
If one needs to add an adapter in a relative way, how does one go about it?
For example:
Base URL: gs://bucket/0/a/i.zarr
Desired Resolved URL: gs://bucket/0/a/i.zarr|zarr3:foo
Which, if any, of these do you think should be used? Are any of these
invalid?
- .|zarr3:foo (clearest to me)
- |zarr3:foo
- zarr3:foo
I was imagining `|zarr3:foo`
The existing standard interpretation of a relative url of `.` means to
strip everything after the last slash, and we should be consistent with
that. Therefore if the base url were specified as
`gs://bucket/0/a/i.zarr/` then `.|zarr3:foo` would also be valid, but
probably should not be preferred.
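To make the "part after the | is always the scheme" reading concrete, here is a minimal Python sketch that splits an absolute pipeline URL into (scheme, rest) pairs. The function name `split_pipeline` is hypothetical, and the sketch assumes that a literal `|` never appears unescaped inside a component; it does not attempt relative URL resolution.

```python
def split_pipeline(url: str) -> list[tuple[str, str]]:
    """Split a pipeline URL on "|" into (scheme, rest) pairs.

    Assumption: "|" only ever appears as the component separator.
    A component without ":" (e.g. "zip") is treated as a scheme
    with an empty remainder, matching "zip:".
    """
    components = []
    for part in url.split("|"):
        # Everything before the first ":" is the scheme.
        scheme, _, rest = part.partition(":")
        components.append((scheme, rest))
    return components


if __name__ == "__main__":
    for scheme, rest in split_pipeline("gs://bucket/0.zip|zip:a|zarr3:i"):
        print(scheme, repr(rest))
```

Under this reading, `gs://bucket/0.zip|zip|zarr3` and `gs://bucket/0.zip|zip:|zarr3:` split into the same components.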
…
|
|
On Tue, Nov 14, 2023, 07:21 John Bogovic ***@***.***> wrote:
One more thing:
We've found it useful to be able to reference a particular part of the
attributes stored in json
with a URL. For example, for
this zarr3 zarr.json
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"dimension_names": ["rows", "columns"],
"data_type": "float64",
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "gzip",
"configuration": {
"level": 1
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
- /attributes/baz[0] points to 1
- /shape points to [10000, 1000]
- /chunk_grid/configuration points to { "chunk_shape": [1000, 100] }
Could you envision adding an attributes: or zarr.json:, or similar
adapter, that enables this?
Yes, having a scheme for accessing an attribute sounds like a good idea.
One option would be a specific scheme for zarr attributes, like zarr3a, e.g:
"gs://bucket/0.zip|zip:a|zarr3:i|zarr3a:/foo"
or
"gs://bucket/0.zip|zip:a/i|zarr3a:/foo"
Another option would be a json scheme for accessing any json file, e.g.:
"gs://bucket/0.zip|zip:a|zarr3:i/zarr.json|json:/attributes/foo"
Then there is the question of what syntax to use for specifying the path
within the json document. A natural choice would be the existing json
pointer syntax (https://datatracker.ietf.org/doc/html/rfc6901), e.g.
"/transform/1". The json pointer syntax does use an unusual escaping
syntax for handling member names containing "/": for example, if you have
an object like:
{"foo/bar": 10, "foo~bar": 11}
then to access the 10 value you use a json pointer of "/foo~1bar", and to
access the 11 value you use a json pointer of "/foo~0bar".
In my opinion this escaping mechanism is rather unfortunate since it is
easy to forget the meaning of "~0" and "~1", but it isn't an issue if you
can avoid using "/" or "~" in member names.
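For reference, the RFC 6901 escaping can be sketched in a few lines of Python. This is an illustrative resolver only, not part of any proposal in this thread; the function name `resolve_pointer` is hypothetical. Note that in JSON pointer syntax an array element is addressed as its own path token (`/attributes/baz/0`), not with bracket notation.

```python
def resolve_pointer(doc, pointer: str):
    """Resolve an RFC 6901 JSON pointer against a parsed JSON document."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON pointer must start with '/'")
    node = doc
    for token in pointer[1:].split("/"):
        # Unescape "~1" before "~0", as RFC 6901 requires.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]
        else:
            node = node[token]
    return node


if __name__ == "__main__":
    doc = {"foo/bar": 10, "foo~bar": 11, "attributes": {"baz": [1, 2, 3, 4]}}
    print(resolve_pointer(doc, "/foo~1bar"))
    print(resolve_pointer(doc, "/foo~0bar"))
    print(resolve_pointer(doc, "/attributes/baz/0"))
```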
… For example: gs://bucket/0.zip|zip:a|zarr3:i|zarr.json:attributes/foo
A specific use case: I often re-use and reference transformations. Since
these are described by metadata (not arrays),
referencing the specific metadata is helpful.
For example, if this were adopted, something like this would not be uncommon
in my workflows:
{
"type" : "sequence",
"transformations" : [
{ "url" : "..:/localTransformations|zarr.json:/transform[1]" },
{ "url" : "gs://bucket/path/to/templateTransformation.zarr|zarr3:sharedTransforms|zarr.json:/transform[0]" }
]
}
|
- New datasource URL syntax based on ZEP 8 proposal (zarr-developers/zeps#48) - Support for ZIP archives
|
From today's Zarr community meeting, @jbms has implemented this ZEP in Neuroglancer. Check here: google/neuroglancer#696 |
This is in line with zarr-developers/zeps#48 and the syntax supported by Neuroglancer. Currently, zip is supported. OCDBT support will be added in a subsequent commit. PiperOrigin-RevId: 755691199 Change-Id: Ia6cb84c12a986a7dd0ba65e41454fbe6d415aed0
|
@jbms: I tried pushing a merge of origin to try fixing the build, but was rejected. Could you give it a try? |
ianhi
left a comment
I just did a thorough read of this to understand it and I have left some comments with a few typo fixes.
I also left comments on parts that took me a decent bit of work to understand, or that I don't fully understand in the hope that it's a helpful perspective. I'd rate myself as a competent but not expert reader of a document like this
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
Co-authored-by: Sanket Verma <svsanketverma5@gmail.com> Co-authored-by: Ian Hunt-Isaak <ianhuntisaak@gmail.com> Co-authored-by: Joe Hamman <jhamman1@gmail.com>
Thanks very much for your review. Based on your comments I made some significant revisions and would appreciate feedback. Based on my revisions it occurs to me that this may be better as an independent standard, and the zarr spec could just recommend that implementations support it. |
ianhi
left a comment
Thank you for the updates, I found it significantly easier to understand on this close reading. I've left a few more comments on the few remaining areas where I found myself confused.
- `gs://bucket/path/to/data|byte-range:1000-2000`

- `tiff:`, `jpeg:`, `png:`, `bmp:`, `avif:`, `webp:`
For both these and byte-range, I don't know how something like zarr-python is meant to handle this. Surely zarr can't be responsible for reading different image formats?
This feels like it gets to your point:
Based on my revisions it occurs to me that this may be better as an independent standard,
Yes, some of these aren't relevant to a zarr implementation at the moment. But note that they might be relevant in the context of the proposed chunk manifest or other virtual array proposals. In particular the byte-range URL scheme could eliminate the need to separately specify offset and length inside a chunk manifest.
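As a rough illustration of how a byte-range component could stand in for separate offset and length fields, here is a Python sketch. The exact semantics of `byte-range:START-END` are not stated in this thread; the sketch assumes a half-open range `[START, END)`, and the function name `parse_byte_range` is hypothetical.

```python
def parse_byte_range(component: str) -> tuple[int, int]:
    """Turn "byte-range:START-END" into an (offset, length) pair.

    Assumption (not confirmed by the thread): START is inclusive and
    END is exclusive, so length = END - START.
    """
    scheme, _, spec = component.partition(":")
    if scheme != "byte-range":
        raise ValueError(f"not a byte-range component: {component!r}")
    start_text, _, end_text = spec.partition("-")
    start, end = int(start_text), int(end_text)
    if end < start:
        raise ValueError("range end precedes range start")
    return start, end - start


if __name__ == "__main__":
    url = "gs://bucket/path/to/data|byte-range:1000-2000"
    base, _, adapter = url.rpartition("|")
    print(base, parse_byte_range(adapter))
```

If the range turned out to be inclusive of END, the length computation would need a `+ 1`.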
draft/ZEP0008.md
Outdated
- `https://example.com/path/to/archive.zip|zip|zarr3` is equivalent to
  `https://example.com/path/to/archive.zip|zip:|zarr3:`.

It is expected that additional URL schemes may be standardized in the future.
Two questions:
- Just schemes (i.e. root URL) or schemes + adapters?
- Where would this happen?
Something I'm already seeing that I want is a log: (for a logging store wrapper) or a latency:150 (for a zarr-python latency store with 150 ms latency).
Which also gets to the fact that adapters define both or either of: a path, or metadata about how to read (e.g. byte range/icechunk branch). Would it be good to consider the general case of what an adapter can be, and how it specifies both or either of path and metadata here, or is that overkill given that there is a limited set here?
I added a comment --- yes, both root and adapter schemes.
In general an adapter can be any transformation of the resource. Logically, given some "handle" to the base resource, it can produce a handle to the adapted resource.
Ultimately the syntax depends on the specific scheme but it is useful to be as consistent as possible.
Re log: and latency:: I am unsure whether we should attempt to standardize things like this, that are mostly for the purpose of testing and debugging. There should certainly be some defined naming convention for vendor-specific schemes, like zarr-python.log:.
On the one hand adapters like this (and similar things like cache:) are certainly convenient in cases where the URL can easily be altered to enable such features. However, if the URL is already stored somewhere, e.g. in a chunk manifest or hypothetically within some future version of icechunk, then enabling such features within the URL may be less convenient than some out-of-band mechanism.
is ambiguous with the `file://hostname/path` syntax defined by
[RFC8089](https://datatracker.ietf.org/doc/html/rfc8089).

If the path is empty or ends with `/`, the resultant resource kind
What use case is this supporting? Is it really helpful or a somewhat arbitrary decision that may generate incompatibilities in the future?
Note that this particular comment applies specifically to the file: scheme, not to all schemes.
What exactly are you concerned about?
But there are similar comments in other protocols, like S3. My only concern is do we really need to talk about files and directories? I guess I miss how this information is helping implementers or users, to me it feels like it's creating extra constraints for no added value. But I must be missing something
For format auto-detection, if a URL is known to refer to a directory resource rather than a file resource, you can skip trying to detect the file format (e.g. by attempting to read from it as a file) and just proceed with detecting the directory format.
With S3 it is technically allowed to have an object name ending in / but then it is unclear how relative URLs should behave. Arguably we should require that if an object name ends in /, the final / be percent-encoded as %2F. Alternatively I could remove this statement for S3.
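A small Python sketch of the two ideas above: classifying a path as a directory by its trailing `/`, and percent-encoding a final `/` in an object name so it is not mistaken for a directory. Both function names are hypothetical, and treating "empty or ends with `/`" as meaning "directory" is an assumption drawn from this discussion rather than quoted from the spec text.

```python
from urllib.parse import quote


def is_directory_path(path: str) -> bool:
    """Assumed rule: an empty path or one ending in "/" names a directory."""
    return path == "" or path.endswith("/")


def encode_object_name(name: str) -> str:
    """Percent-encode an object name, escaping a final "/" as %2F.

    Interior slashes are kept so the result still reads as a path.
    """
    encoded = quote(name, safe="/")
    if encoded.endswith("/"):
        encoded = encoded[:-1] + "%2F"
    return encoded


if __name__ == "__main__":
    key = "path/to/object/"  # an object whose name really ends in "/"
    encoded = encode_object_name(key)
    print(encoded, is_directory_path(encoded))
```

With this rule an implementation doing format auto-detection could skip file-format probing whenever `is_directory_path` returns true.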
For example:

- `file:///path/to/repo.zarr.icechunk/|icechunk:|zarr3:path/to/array/`
I'm confused, do I need to specify |zarr3: as an extra adapter after icechunk? In the template given above the path to the node is specified directly after icechunk: without zarr3:
Things to take into account:
- Icechunk doesn't support Zarr 2
- The concept of node in Icechunk is somewhat independent of Zarr
I'm confused, do I need to specify |zarr3: as an extra adapter after icechunk? In the template given above the path to the node is specified directly after icechunk: without zarr3:
Technically without |zarr3 it is referring to the raw kvstore rather than the zarr node. But implementations may choose to interpret it either as the kvstore or assume/detect zarr3. E.g. in the context of opening a kvstore it would refer to the kvstore but in the context of opening an array it would get auto-detected as zarr3.
Things to take into account:
- Icechunk doesn't support Zarr 2
- The concept of node in Icechunk is somewhat independent of Zarr
From the perspective of this url syntax, icechunk is just another container like zip that just happens to only be able to contain zarr3-like data.
Alright, I think I like it. Icechunk implementations are free to accept icechunk:/foo/bar.
I do want to avoid differences in behavior between implementations. But you raise a good point regarding a leading slash in the path.
Also, the absolute URL syntax would be confusing if a leading slash is not allowed.
So I think for all of the adapters where there is a path, a leading /, if present, should be ignored.
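A minimal Python sketch of that normalization, so that `icechunk:/foo/bar` and `icechunk:foo/bar` refer to the same path. The function name is hypothetical, and stripping exactly one leading slash (rather than all of them) is an assumption.

```python
def normalize_adapter_path(path: str) -> str:
    """Ignore a single leading "/" in an adapter path, if present."""
    return path[1:] if path.startswith("/") else path


if __name__ == "__main__":
    for component in ("icechunk:/foo/bar", "icechunk:foo/bar"):
        scheme, _, path = component.partition(":")
        print(scheme, normalize_adapter_path(path))
```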
- `s3://bucket/path/within/bucket` for AWS S3

The endpoint, appropriate credentials, and bucket region (for
Shouldn't we dictate a way to optionally specify these options? Otherwise different implementations will do very different things, like url query arguments, inconsistent environment variables, etc.
To specify the endpoint you can use s3+https.
For region, that can always be determined automatically for AWS s3 at least. Are you aware of cases where it can't be determined automatically?
For credentials it seems like they would typically be specified out of band somehow. How do you imagine they might be specified in the url?
A few thoughts:
- there is a performance impact on discovering the region
- some object stores may not have this ability
- there are other "properties" that are needed, like for example, if the request should be anonymous or not.
I agree credentials shouldn't be placed in the URL.
I wonder if it would save us future trouble to define at least a syntax for properties in the URL, and leave the property names (for now) as protocol or implementation specific.
Re region: The region only needs to be determined once per bucket and can be cached, so the cost is pretty minimal as long as you are making more than one request per bucket. And for anonymous requests you don't need to know the region.
Are you aware of a case where you do need to determine the region?
In general the URL pipeline syntax allows each sub-URL to have its own ?query string and #fragment portion. However, being that the goal here is to define a standard syntax for interoperability across software, I would like to avoid differences between implementations.
An unfortunate consequence of specifying necessary parameters like regions or something to do with credentials as query parameters is that it interferes with supporting tab completion as the user types, since such parameters would come after the path but are needed in order to offer completions for the path. A workaround would be for the user to use the .: relative URL scheme in order to first specify the query parameters and then specify the path, e.g. s3://bucket?region=us-east-1|.:path/goes/here
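To illustrate the per-component `?query` and `#fragment` idea, here is a Python sketch that splits each sub-URL into scheme, path, query parameters, and fragment. The names `split_component` and `parse_pipeline` are hypothetical, the `region` parameter is only the example from the comment above and not a standardized property, and the sketch assumes `|`, `?`, and `#` appear only as delimiters.

```python
from urllib.parse import parse_qs


def split_component(component: str) -> dict:
    """Split one sub-URL into scheme, path, query parameters, and fragment."""
    rest, _, fragment = component.partition("#")
    rest, _, query = rest.partition("?")
    scheme, _, path = rest.partition(":")
    return {
        "scheme": scheme,
        "path": path,
        "query": parse_qs(query),  # e.g. {"region": ["us-east-1"]}
        "fragment": fragment,
    }


def parse_pipeline(url: str) -> list[dict]:
    """Apply split_component to every "|"-separated sub-URL."""
    return [split_component(part) for part in url.split("|")]


if __name__ == "__main__":
    for part in parse_pipeline("s3://bucket?region=us-east-1|.:path/goes/here"):
        print(part)
```

In the workaround URL above, the query parameters land on the first component, so they are available before the path in the second component is typed.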
|
I have migrated this proposal over to a separate repository, https://github.com/jbms/url-pipeline, and if there is interest it could potentially be moved to the zarr-developers organization. In general I have made some major editorial changes to the proposal:
For now I have removed the relative URL support because that has not been implemented by anyone and it introduced a lot of complexity. I expect it to be added later since relative URLs are important but we can address absolute URLs first. I also changed the OCDBT and Icechunk syntax to use I'd very much welcome any feedback on the specification. |