Add ZEP 8 (URL syntax) draft #48

Open
wants to merge 1 commit into
base: main
293 changes: 293 additions & 0 deletions draft/ZEP0008.md
@@ -0,0 +1,293 @@
---
layout: default
title: ZEP0008
description: URL syntax
parent: draft ZEPs
nav_order: 1

Suggested change
nav_order: 1
nav_order: 8

---

# ZEP 8 — URL syntax

---

```
Author: Jeremy Maitin-Shepard <jbms@google.com>, Google Research

Status: Draft

Type: Specification

Created: 2023-09-07
```

## Abstract

This proposal defines a URL syntax for specifying how to locate a zarr node.

## Motivation and Scope

A URL syntax for zarr nodes that is common across multiple zarr implementations
enables users to more easily share dataset locations between different tools.

While in simple cases it is sufficient to specify the zarr node location using
an existing, well-established URL scheme like `file` or `http`, or existing but
less-standard URL schemes like `s3` and `gs`, for nested storage mechanisms like
the ZIP format, and for nodes within a zarr hierarchy, there is no existing
established URL syntax.

Additionally, for implementations that support other data formats than zarr v3,
it may be necessary to also indicate the data format as part of the URL syntax.

This proposal defines a new absolute and relative URL syntax that addresses this
need.

## Usage and Impact

Zarr implementations that support this URL syntax are expected to provide an API
for opening a Zarr node at a given URL, and for obtaining the URL corresponding
to an open Zarr node.

Using these APIs, data can easily be shared between different Zarr
implementations that support the proposed URL syntax.

## Detailed description

This ZEP defines a URL syntax that may be optionally supported by Zarr
implementations in order to allow the location of a zarr array or group to be
specified in a convenient, implementation-independent way.

### Requirements

This proposal takes into account several key considerations:

- The URL syntax must support specifying:
  - The underlying key-value store;
  - The path within the key-value store of the root Zarr node;
  - Optionally, a path within the Zarr hierarchy starting from the root Zarr
    node. Note: Currently, as no storage transformers have been defined, the
    path to any Zarr node may be specified directly as a path within the
    underlying key-value store, making this additional path unnecessary.
- The URL syntax must support nested key-value stores, like one or more layers
  of a ZIP archive within some other key-value store.
- The URL syntax must be compatible with interactive completion as the user
  types.

### Absolute URL syntax

The proposed "zarr URL syntax" has the following grammar:

```
absolute_zarr_url := base_url *( "|" adapter )
```

where a `base_url` specifies an absolute resource location, such as
`file:/path/to/local/file`, and `adapter` specifies a nested resource using a
specified protocol, such as `zip:path/within/zip`. The `base_url` and each of
the `adapter` portions are considered "sub-URLs".

For a given `adapter`, the sequence of `base_url` and prior `adapter` sub-URLs is
called the "parent URL".
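
As an illustration of the grammar above, the following minimal Python sketch splits a zarr URL into its `base_url` and `adapter` sub-URLs and computes the parent URL of a given adapter. The function names `split_sub_urls` and `parent_url` are hypothetical and not part of this proposal, and the sketch assumes that `|` never occurs inside a sub-URL.

```python
def split_sub_urls(url: str) -> tuple[str, list[str]]:
    """Split a zarr URL into (base_url, [adapter, ...]).

    Assumes "|" only ever appears as the sub-URL delimiter.
    """
    base_url, *adapters = url.split("|")
    if not base_url:
        raise ValueError("an absolute zarr URL must start with a base URL")
    return base_url, adapters


def parent_url(url: str, adapter_index: int) -> str:
    """Return the parent URL of the adapter at `adapter_index` (0-based)."""
    base_url, adapters = split_sub_urls(url)
    if not 0 <= adapter_index < len(adapters):
        raise IndexError("no adapter at that index")
    # The parent URL is the base URL plus every adapter before this one.
    return "|".join([base_url, *adapters[:adapter_index]])
```

For example, with `gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|zarr3:path/to/array`, the parent URL of the `zarr3` adapter (index 1) is `gs://bucket/path/to/outer.zip|zip:path/to/inner.zip`.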

The following `base_url` schemes are defined:

- `file:` as defined by [RFC8089](https://datatracker.ietf.org/doc/html/rfc8089).

- `http:` and `https:`

- `s3://bucket/path/within/bucket` for AWS S3

- `gs://bucket/path/within/bucket` for Google Cloud Storage (GCS)

The following `adapter` URL schemes are defined:

- `zip:path/within/zip`

  The parent URL must refer to a ZIP file.

- `zarr3:path/within/hierarchy`

  The parent URL must refer to a directory containing a zarr v3 group, with a
  path that optionally ends in `/`.

  Depending on the context in which the URL is specified, if it is not already
  implied that the URL specifies a Zarr v3 node, it may be required to
  terminate the URL with `|zarr3` to indicate that it is a Zarr v3 node even if
  no path within a Zarr hierarchy needs to be specified.

- `..:path/within/outer/sub-URL` and `..:/path/within/outer/sub-URL` may be used
  to traverse out from the prior adapter.

  Any `..` adapters are resolved in order. The presence of a `..` adapter
  causes the prior adapter to be discarded. The path component is interpreted
  relative to the path of the sub-URL immediately prior to the discarded adapter
  sub-URL.

  It is an error if there are no remaining adapter sub-URLs when resolving the
  `..` adapter.

  For security, implementations SHOULD place limits on where this scheme is
  permitted.

If the adapter URL would otherwise consist of just the scheme followed by ":",
it is permitted to omit the final ":". For example:

- `https://example.com/path/to/archive.zip|zip|zarr3` is equivalent to
  `https://example.com/path/to/archive.zip|zip:|zarr3:`.
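
The shorthand rule can be sketched in Python as follows. This is only an illustration under the assumption that an adapter's scheme is everything before its first `:`; the names `parse_adapter` and `expand_shorthand` are hypothetical and not defined by this proposal.

```python
def parse_adapter(adapter: str) -> tuple[str, str]:
    """Split an adapter sub-URL into (scheme, path).

    "zip", "zip:" and "zip:a/b" all parse, because str.partition
    returns an empty path when the ":" is missing.
    """
    scheme, _, path = adapter.partition(":")
    return scheme, path


def expand_shorthand(url: str) -> str:
    """Rewrite every adapter in `url` in its explicit "scheme:path" form."""
    base_url, *adapters = url.split("|")
    explicit = [f"{scheme}:{path}" for scheme, path in map(parse_adapter, adapters)]
    return "|".join([base_url, *explicit])
```

With this sketch, `expand_shorthand("https://example.com/path/to/archive.zip|zip|zarr3")` yields the explicit form given in the example above, and the base URL is passed through unchanged.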

It is expected that additional URL schemes may be standardized in the future.

#### Examples

Examples:

- `https://server.example.com:1234/path/to/array`

  Specifies a normal HTTPS URL.

- `s3://bucket/path/to/file`

  Specifies:
  - within the AWS S3 bucket named `bucket`,
  - the path `path/to/file`.

- `gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|zip:path/to/zarr/hierarchy|zarr3:path/to/array`

  Specifies:
  - within the GCS bucket named `bucket`,
  - within the ZIP file at the path `path/to/outer.zip`,
  - within the ZIP file at the path `path/to/inner.zip`,
  - within the Zarr v3 hierarchy at the path `path/to/zarr/hierarchy/`,
  - the Zarr v3 node at the path `path/to/array/`.

- `gs://bucket/path/to/outer.zip|zip:path/to/inner.zip|..:other/zarr/hierarchy|zarr3:path/to/array`

  Normalizes to:

  `gs://bucket/path/to/other/zarr/hierarchy/|zarr3:path/to/array`
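
One possible reading of the `..` resolution rule is sketched below in Python. All function names are hypothetical. The sketch makes several simplifying assumptions that this proposal does not settle: query strings and fragments are ignored, `.` and `..` path segments are not collapsed, an absolute path simply replaces the existing path, and no trailing `/` is added to the result (so its output for the last example above lacks the trailing `/` shown there).

```python
def _split_sub_url(sub_url: str) -> tuple[str, str]:
    """Split a sub-URL into ("scheme:" plus optional "//authority", path)."""
    scheme, _, rest = sub_url.partition(":")
    authority = ""
    if rest.startswith("//"):
        end = rest.find("/", 2)
        if end == -1:
            end = len(rest)
        authority, rest = rest[:end], rest[end:]
    return f"{scheme}:{authority}", rest


def _apply_path(base_path: str, rel: str) -> str:
    """Resolve `rel` against `base_path`, replacing the last path segment."""
    if not rel:
        return base_path
    if rel.startswith("/"):
        return rel  # assumption: an absolute path replaces the whole path
    if "/" not in base_path:
        return rel
    return base_path.rsplit("/", 1)[0] + "/" + rel


def resolve_parent_adapters(url: str) -> str:
    """Resolve every ".." adapter in order, left to right."""
    base_url, *adapters = url.split("|")
    resolved = [base_url]
    for adapter in adapters:
        scheme, _, rel = adapter.partition(":")
        if scheme != "..":
            resolved.append(adapter)
            continue
        if len(resolved) < 2:
            raise ValueError("'..' adapter has no prior adapter to discard")
        resolved.pop()  # discard the prior adapter
        prefix, path = _split_sub_url(resolved[-1])
        resolved[-1] = prefix + _apply_path(path, rel)
    return "|".join(resolved)
```

Keeping the resolved sub-URLs in a list makes the error case explicit: a `..` adapter is only valid while at least one adapter remains after the base URL.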


### Relative URL syntax

```
relative_zarr_url := ( absolute_path / relative_path ) *( "|" adapter )
/ absolute_zarr_url
```

A relative zarr URL is always resolved relative to a specified
`absolute_zarr_url`. The initial `absolute_path` or `relative_path` applies to
the path component of the inner-most (last) sub-URL. If the `relative_path` is
the empty string, the path component of the inner-most sub-URL remains
unchanged. After applying the `absolute_path` or `relative_path` to the
existing absolute URL, any specified adapter sub-URLs are appended.

As with regular URL syntax, it is not permitted for the first component of
`relative_path` to contain a colon (`:`), e.g. `a:b`, since that would be
ambiguous with specifying the base URL scheme for an absolute URL. Instead,
such a relative URL must be prefixed with `./`, e.g. `./a:b`.

#### Examples

- Base URL: `gs://bucket/path/to/`
Relative URL: `file.zip|zip:path/within/zip`
Resolved URL: `gs://bucket/path/to/file.zip|zip:path/within/zip`

- Base URL: `gs://bucket/path/to/file.zip|zip:path/within/zip`
Relative URL: `..:/path/to/other.zip|zip:path/in/other/zip`
Resolved URL: `gs://bucket/path/to/other.zip|zip:path/in/other/zip`
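
The resolution rule for relative URLs that begin with a path can be sketched in Python as follows. The name `resolve_relative` and its helper are hypothetical. The sketch assumes that a leading `./` is simply stripped, that other `.` and `..` path segments, query strings, and fragments are out of scope, and that resolving any `..` adapters (as in the second example above) is a separate step not shown here.

```python
import re

# A leading "scheme:" marks an absolute zarr URL (assumption: RFC 3986 scheme syntax).
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def _split_sub_url(sub_url: str) -> tuple[str, str]:
    """Split a sub-URL into ("scheme:" plus optional "//authority", path)."""
    scheme, _, rest = sub_url.partition(":")
    authority = ""
    if rest.startswith("//"):
        end = rest.find("/", 2)
        if end == -1:
            end = len(rest)
        authority, rest = rest[:end], rest[end:]
    return f"{scheme}:{authority}", rest


def resolve_relative(base_url: str, relative_url: str) -> str:
    """Resolve `relative_url` against the absolute zarr URL `base_url`."""
    first, *adapters = relative_url.split("|")
    if _SCHEME.match(first):
        return relative_url  # already an absolute zarr URL
    sub_urls = base_url.split("|")
    if first.startswith("./"):
        first = first[2:]  # "./a:b" names the relative path "a:b"
    if first:
        # The path applies to the inner-most (last) sub-URL of the base.
        prefix, path = _split_sub_url(sub_urls[-1])
        if first.startswith("/"):
            path = first
        elif "/" in path:
            path = path.rsplit("/", 1)[0] + "/" + first
        else:
            path = first
        sub_urls[-1] = prefix + path
    return "|".join(sub_urls + adapters)
```

Under these assumptions, resolving `file.zip|zip:path/within/zip` against `gs://bucket/path/to/` gives the resolved URL of the first example above, and an empty leading path (for example `|zip:x`) leaves the inner-most sub-URL of the base unchanged before the adapters are appended.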

## Rationale

The use of outer-to-inner order for the sub-URLs enables completion of both
paths and sub-URL schemes as the user types.

The sub-URL delimiter of `|` was chosen because it is not a valid URL character,
and therefore does not have any existing valid interpretation within URLs, and
also is evocative of POSIX shell pipe syntax.

## Related Work

### fsspec

The [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) library is
widely used with the zarr-python library to access a variety of storage systems,
and includes support for ZIP files and other nested stores.

Like this proposal, the fsspec URL syntax consists of a sequence of sub-URLs
separated by a delimiter, but differs as follows:

- fsspec uses a delimiter of `::` rather than `|` as used in this proposal.
- fsspec orders sub-URLs from innermost to outermost, which is the opposite
order from what is proposed here.

The use of `::` as a delimiter of the sub-URLs means that fsspec URLs may
conform to the syntax of a normal URL, because `::` is permitted within the
path, query, and fragment components of a URL. This has both advantages and
disadvantages:

- An fsspec URL may be accepted by existing URL parsers/matchers not
specifically designed for fsspec.
- Because the interpretation of the `::` delimiter within an fsspec URL differs from
the normal interpretation within a URL, operations such as relative path
resolution designed to operate on URLs generically may execute without errors
on an fsspec URL but produce an incorrect result. In contrast, the use of `|`
within this proposal ensures that the resultant syntax will not be confused
with a valid regular URL, because `|` is not a permitted character within
URLs.

The inner-to-outer order of sub-URLs in the fsspec URL syntax is not compatible
with the usual operation of text completion as the user types. It is also
opposite to the outer-to-inner order used for specifying paths within URLs.

### Apache Commons VFS

The [Apache Commons VFS](https://commons.apache.org/proper/commons-vfs/) is a
Java library that provides capabilities similar to those of the fsspec Python
library.

The Apache Commons VFS URL syntax specifies the base scheme and all of the
sub-schemes, in inner-to-outer order, delimited by `:`, followed by the paths
for each scheme, in outer-to-inner order, delimited by `!`.

For example:

- `gz:tar:file:///extra/data/tryVfs/archive.tar!/tardir/content.txt.gz!content.txt`,
which under this proposal would be
`file:///extra/data/tryVfs/archive.tar|tar:tardir/content.txt.gz|gz` (assuming
the existence of `tar` and `gz` adapter schemes).

As with the fsspec syntax, this URL syntax conforms to the standard URL syntax
but has a different interpretation, which has both advantages and disadvantages.

Separating the adapter scheme from the adapter path makes the association of
adapter and path less obvious, particularly if there is more than one adapter.

While the outer-to-inner order of the nested paths makes text completion of the
paths feasible, the URL syntax is not readily compatible with completion of the
nested schemes.

## Backward Compatibility

If Zarr implementations wish to add support for this proposed URL syntax to an
existing generic "open" interface that already supports other syntax, such as a
plain non-absolute file path or the fsspec URL syntax, there are potential
ambiguities:

- A relative file path such as `file:/abc` can also be interpreted as a URL.
Presumably implementations would disambiguate this as a URL, which may (in
rare cases) change the behavior of existing code.
- A nested fsspec URL is unlikely to be a valid URL under this proposal, but a
non-nested fsspec URL may well be a valid URL under this proposal. In many
cases the interpretation will also be the same, but in some cases it may be
subtly different.

## Implementation

None yet.

## Discussion

None yet.

## Copyright

This document has been placed in the public domain.