Add ZEP 8 (URL syntax) draft #48
Open. jbms wants to merge 1 commit into zarr-developers:main from jbms:zep8
---
layout: default
title: ZEP0008
description: URL syntax
parent: draft ZEPs
nav_order: 1
---

# ZEP 8 — URL syntax

---

```
Author: Jeremy Maitin-Shepard <jbms@google.com>, Google Research

Status: Draft

Type: Specification

Created: 2023-09-07
```

## Abstract

This proposal defines a URL syntax for specifying how to locate a zarr node.

## Motivation and Scope

A URL syntax for zarr nodes that is common across multiple zarr implementations
enables users to more easily share dataset locations between different tools.

In simple cases it is sufficient to specify the zarr node location using
an existing, well-established URL scheme like `file` or `http`, or existing but
less-standard URL schemes like `s3` and `gs`. For nested storage mechanisms like
the ZIP format, and for nodes within a zarr hierarchy, however, there is no
existing established URL syntax.

Additionally, for implementations that support data formats other than zarr v3,
it may be necessary to also indicate the data format as part of the URL syntax.

This proposal defines a new absolute and relative URL syntax that addresses this
need.

## Usage and Impact

Zarr implementations that support this URL syntax are expected to provide an API
for opening a Zarr node at a given URL, and for obtaining the URL corresponding
to an open Zarr node.

Using these APIs, data can easily be shared between different Zarr
implementations that support the proposed URL syntax.

## Detailed description

This ZEP defines a URL syntax that may be optionally supported by Zarr
implementations in order to allow the location of a zarr array or group to be
specified in a convenient, implementation-independent way.

### Requirements

This proposal takes into account several key considerations:

- The URL syntax must support specifying:
  - The underlying key-value store;
  - The path within the key-value store of the root Zarr node;
  - Optionally, a path within the Zarr hierarchy starting from the root Zarr
    node. Note: Currently, as no storage transformers have been defined, the
    path to any Zarr node may be specified directly as a path within the
    underlying key-value store, making this additional path unnecessary.
- The URL syntax must support nested key-value stores, like one or more layers
  of a ZIP archive within some other key-value store.
- The URL syntax must be compatible with interactive completion as the user
  types.

### Absolute URL syntax

The proposed "zarr URL syntax" has the following grammar:

```
absolute_zarr_url := base_url *( "|" adapter )
```

where a `base_url` specifies an absolute resource location, such as
`file:/path/to/local/file`, and `adapter` specifies a nested resource using a
specified protocol, such as `zip:path/within/zip`. The `base_url` and each of
the `adapter` portions are considered "sub-URLs".

For a given `adapter`, the sequence of `base_url` and prior `adapter` sub-URLs is
called the "parent URL".

The following `base_url` schemes are defined:

- `file:` as defined by [RFC8089](https://datatracker.ietf.org/doc/html/rfc8089).

- `http:` and `https:`

- `s3://bucket/path/within/bucket` for AWS S3

- `gs://bucket/path/within/bucket` for Google Cloud Storage (GCS)

The following `adapter` URL schemes are defined:

- `zip:path/within/zip`

  The parent URL must refer to a ZIP file.

- `zarr3:path/within/hierarchy`

  The parent URL must refer to a directory containing a zarr v3 group, with a
  path that optionally ends in `/`.

  Depending on the context in which the URL is specified, if it is not already
  implied that the URL specifies a Zarr v3 node, it may be required to
  terminate the URL with `|zarr3` to indicate that it is a Zarr v3 node even if
  no path within a Zarr hierarchy needs to be specified.

- `..:path/within/outer/sub-URL` and `..:/path/within/outer/sub-URL` may be used
  to traverse out from the prior adapter.

  Any `..` adapters are resolved in order. The presence of a `..` adapter
  causes the prior adapter to be discarded. The path component is interpreted
  relative to the path of the sub-URL immediately prior to the discarded adapter
  sub-URL.

  It is an error if there are no remaining adapter sub-URLs when resolving the
  `..` adapter.

  For security, implementations SHOULD place limits on where this scheme is
  permitted.

If the adapter URL would otherwise consist of just the scheme followed by ":",
it is permitted to omit the final ":". For example:

- `https://example.com/path/to/archive.zip|zip|zarr3` is equivalent to
  `https://example.com/path/to/archive.zip|zip:|zarr3:`.

It is expected that additional URL schemes may be standardized in the future.

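As a minimal sketch of the grammar above, the following Python splits an absolute zarr URL into its sub-URLs and treats a bare adapter scheme such as `zip` as shorthand for `zip:`. The names `SubURL`, `parse_zarr_url`, and `format_zarr_url` are hypothetical and not part of this proposal, and the sketch assumes that `|` never appears unescaped inside a sub-URL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubURL:
    """One sub-URL: `scheme` is e.g. "gs" or "zip"; `rest` is the text after the first ":"."""

    scheme: str
    rest: str


def parse_zarr_url(url: str) -> list[SubURL]:
    """Split an absolute zarr URL into the base URL followed by its adapters."""
    base, *adapters = url.split("|")
    scheme, sep, rest = base.partition(":")
    if not sep or not scheme:
        raise ValueError(f"base URL has no scheme: {base!r}")
    parts = [SubURL(scheme, rest)]
    for adapter in adapters:
        if not adapter:
            raise ValueError("empty adapter sub-URL")
        # A bare scheme ("zip") and the explicit form ("zip:") parse identically.
        scheme, _, rest = adapter.partition(":")
        parts.append(SubURL(scheme, rest))
    return parts


def format_zarr_url(parts: list[SubURL]) -> str:
    """Inverse of parse_zarr_url; always writes the explicit ":" form."""
    return "|".join(f"{p.scheme}:{p.rest}" for p in parts)
```

With this sketch, the two equivalent forms in the example above parse to the same list of sub-URLs. The sketch does not validate schemes against the lists defined in this section.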
#### Examples

Examples:

- `https://server.example.com:1234/path/to/array`

  Specifies a normal HTTPS URL.

- `s3://bucket/path/to/file`

  Specifies:
  - within the AWS S3 bucket named `bucket`,
  - the path `path/to/file`.

- `gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|zip:path/to/zarr/hierarchy|zarr3:path/to/array`

  Specifies:
  - within the GCS bucket named `bucket`,
  - within the ZIP file at the path `path/to/outer.zip`,
  - within the ZIP file at the path `path/to/inner.zip`,
  - within the Zarr v3 hierarchy at the path `path/to/zarr/hierarchy/`,
  - the Zarr v3 node at the path `path/to/array/`.

- `gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|..:other/zarr/hierarchy|zarr3:path/to/array`

  Normalizes to:

  `gs://bucket/path/to/other/zarr/hierarchy/|zarr3:path/to/array`

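The `..` resolution rule can be sketched in Python as follows. This is one illustrative reading of the rule, not a reference implementation: the function names are hypothetical, dot segments inside paths are not collapsed, and the sketch does not append the trailing `/` that appears in the normalized example above.

```python
import posixpath


def _join(sub_url: str, ref: str) -> str:
    """Resolve the path `ref` against the path component of `sub_url`."""
    scheme, _, rest = sub_url.partition(":")
    authority = ""
    if rest.startswith("//"):
        # Keep "//host" (or "//bucket") separate from the path.
        host, slash, tail = rest[2:].partition("/")
        authority, rest = "//" + host, slash + tail
    if ref.startswith("/"):
        new_path = ref  # an absolute path replaces the existing path
    else:
        new_path = posixpath.join(posixpath.dirname(rest), ref)
    return f"{scheme}:{authority}{new_path}"


def resolve_parent_adapters(url: str) -> str:
    """Resolve `..` adapters left to right, as described for the `..` scheme."""
    resolved: list[str] = []
    for i, sub_url in enumerate(url.split("|")):
        scheme, _, ref = sub_url.partition(":")
        if i == 0 or scheme != "..":
            resolved.append(sub_url)
            continue
        if len(resolved) < 2:
            raise ValueError("'..' adapter has no prior adapter to discard")
        resolved.pop()  # discard the prior adapter
        resolved[-1] = _join(resolved[-1], ref)
    return "|".join(resolved)
```

Applied to the last example above, this sketch yields `gs://bucket/path/to/other/zarr/hierarchy|zarr3:path/to/array`. Any security limits on where `..` is permitted would need to be enforced separately.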
### Relative URL syntax

```
relative_zarr_url := ( absolute_path / relative_path ) *( "|" adapter )
                   / absolute_zarr_url
```

A relative zarr URL is always resolved relative to a specified
`absolute_zarr_url`. The initial `absolute_path` or `relative_path` applies to
the path component of the inner-most (last) sub-URL. If the `relative_path` is
the empty string, the path component of the inner-most sub-URL remains
unchanged. After applying the `absolute_path` or `relative_path` to the
existing absolute URL, any specified adapter sub-URLs are appended.

As with regular URL syntax, it is not permitted for the first component of
`relative_path` to contain a colon (`:`), e.g. `a:b`, since that would be
ambiguous with specifying the base URL scheme for an absolute URL. Instead,
such a relative URL must be prefixed with `./`, e.g. `./a:b`.

#### Examples

- Base URL: `gs://bucket/path/to/`

  Relative URL: `file.zip|zip:path/within/zip`

  Resolved URL: `gs://bucket/path/to/file.zip|zip:path/within/zip`

- Base URL: `gs://bucket/path/to/file.zip|zip:path/within/zip`

  Relative URL: `..:/path/to/other.zip|zip:path/in/other/zip`

  Resolved URL: `gs://bucket/path/to/other.zip|zip:path/in/other/zip`

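The resolution procedure described in this section can be sketched in Python as follows. The function name `resolve_relative` is hypothetical, and the sketch rests on several assumptions: a leading RFC 3986 scheme marks an absolute URL, dot segments are not collapsed, and `..` adapters are left unresolved. It covers the first example above; a relative URL that begins with a `..` adapter, as in the second example, is not handled.

```python
import posixpath
import re

# An RFC 3986 scheme followed by ":" marks an absolute zarr URL.
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def resolve_relative(base: str, relative: str) -> str:
    """Resolve `relative` against the absolute zarr URL `base`."""
    if _SCHEME.match(relative):
        return relative  # already absolute; the base is ignored
    path_ref, *adapters = relative.split("|")
    *outer, innermost = base.split("|")
    if path_ref:
        # Apply the path to the inner-most (last) sub-URL of the base.
        scheme, _, rest = innermost.partition(":")
        authority = ""
        if rest.startswith("//"):
            host, slash, tail = rest[2:].partition("/")
            authority, rest = "//" + host, slash + tail
        if path_ref.startswith("/"):
            new_path = path_ref  # absolute_path replaces the path
        else:
            new_path = posixpath.join(posixpath.dirname(rest), path_ref)
        innermost = f"{scheme}:{authority}{new_path}"
    # An empty path leaves the inner-most sub-URL unchanged.
    return "|".join([*outer, innermost, *adapters])
```

For instance, `resolve_relative("gs://bucket/path/to/", "file.zip|zip:path/within/zip")` returns the resolved URL given in the first example.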
## Rationale

The use of outer-to-inner order for the sub-URLs enables completion of both
paths and sub-URL schemes as the user types.

The sub-URL delimiter of `|` was chosen because it is not a valid URL character,
and therefore does not have any existing valid interpretation within URLs, and
also is evocative of POSIX shell pipe syntax.

## Related Work

### fsspec

The [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) library is
widely used with the zarr-python library to access a variety of storage systems,
and includes support for ZIP files and other nested stores.

Like this proposal, the fsspec URL syntax consists of a sequence of sub-URLs
separated by a delimiter, but differs as follows:

- fsspec uses a delimiter of `::` rather than `|` as used in this proposal.
- fsspec orders sub-URLs from innermost to outermost, which is the opposite
  order from what is proposed here.

The use of `::` as a delimiter of the sub-URLs means that fsspec URLs may
conform to the syntax of a normal URL, because `::` is permitted within the
path, query, and fragment components of a URL. This has both advantages and
disadvantages:

- An fsspec URL may be accepted by existing URL parsers/matchers not
  specifically designed for fsspec.
- Because the interpretation of the `::` delimiter within an fsspec URL differs
  from the normal interpretation within a URL, operations such as relative path
  resolution designed to operate on URLs generically may execute without errors
  on an fsspec URL but produce an incorrect result. In contrast, the use of `|`
  within this proposal ensures that the resultant syntax will not be confused
  with a valid regular URL, because `|` is not a permitted character within
  URLs.

The inner-to-outer order of sub-URLs in the fsspec URL syntax is not compatible
with the usual operation of text completion as the user types. It is also
opposite to the outer-to-inner order used for specifying paths within URLs.

### Apache Commons VFS

The [Apache Commons VFS](https://commons.apache.org/proper/commons-vfs/) is a
Java library that provides capabilities similar to those of the fsspec Python
library.

The Apache Commons VFS URL syntax specifies the base scheme and all of the
sub-schemes, in inner-to-outer order, delimited by `:`, followed by the paths
for each scheme, in outer-to-inner order, delimited by `!`.

For example:

- `gz:tar:file:///extra/data/tryVfs/archive.tar!/tardir/content.txt.gz!content.txt`,
  which under this proposal would be
  `file:///extra/data/tryVfs/archive.tar|tar:tardir/content.txt.gz|gz` (assuming
  the existence of `tar` and `gz` adapter schemes).

As with the fsspec syntax, this URL syntax conforms to the standard URL syntax
but has a different interpretation, which has both advantages and disadvantages.

Separating the adapter scheme from the adapter path makes the association of
adapter and path less obvious, particularly if there is more than one adapter.

While the outer-to-inner order of the nested paths makes text completion of the
paths feasible, the URL syntax is not readily compatible with completion of the
nested schemes.

## Backward Compatibility

If Zarr implementations wish to add support for this proposed URL syntax to an
existing generic "open" interface that already supports other syntax, such as a
plain non-absolute file path or the fsspec URL syntax, there are potential
ambiguities:

- A relative file path such as `file:/abc` can also be interpreted as a URL.
  Presumably implementations would disambiguate this as a URL, which may (in
  rare cases) change the behavior of existing code.
- A nested fsspec URL is unlikely to be a valid URL under this proposal, but a
  non-nested fsspec URL may well be a valid URL under this proposal. In many
  cases the interpretation will also be the same, but in some cases it may be
  subtly different.

## Implementation

None yet.

## Discussion

None yet.

## Copyright

This document has been placed in the public domain.