Conversation

@jbms
Copy link

@jbms jbms commented Sep 10, 2023

No description provided.

@jbms
Copy link
Author

jbms commented Sep 10, 2023

@normanrz Please take a look.

@jbms
Copy link
Author

jbms commented Sep 10, 2023

@MSanKeys963 Looks like there is an issue with the docs build that is unrelated to this PR.

@jbms
Copy link
Author

jbms commented Sep 10, 2023

@martindurant Would appreciate your perspective on this --- I imagine you might say that we should just use fsspec syntax instead, though.

@martindurant
Copy link
Member

Well indeed, I could say "why invent another"; although translating between | and :: syntax ought to be straightforward. fsspec also cares about fs parameters that might be embedded in URLs and wildcards for globbing.

@normanrz
Copy link
Member

While standardizing a URL scheme has benefits on its own, I think the main benefit/motivation for this ZEP is the formalization of Zip stores. Essentially, to comply with this ZEP, implementations need to implement zip stores. Maybe that should be written out more explicitly?

@jbms
Copy link
Author

jbms commented Sep 22, 2023

While standardizing a URL scheme has benefits on its own, I think the main benefit/motivation for this ZEP is the formalization of Zip stores. Essentially, to comply with this ZEP, implementations need to implement zip stores. Maybe that should be written out more explicitly?

While this ZEP was prompted by our discussion about zip stores, my intention was that we standardize on the syntax for various protocols, but that implementations would choose which ones to support.

I think we could also push implementations to support zip format, but I'm not sure I want to tie that to this URL syntax proposal.

@normanrz
Copy link
Member

@ap-- I think this might also be interesting for upath to implement.

@normanrz
Copy link
Member

@bogovicj this might also be relevant for your OME transformations proposal.

@sanketverma1704
Copy link
Member

sanketverma1704 commented Oct 25, 2023

@MSanKeys963 Looks like there is an issue with the docs build that is unrelated to this PR.

@jbms: I have added #51 to fix the RTD build. Can you please update your PR?
(Seems like I'm unable to update your PR)

@bogovicj
Copy link

bogovicj commented Nov 14, 2023

Thanks @jbms for putting this together! There are a few situations I came up with for which I'm not sure what the relative URL should be.

What does it look like to use ..: to "go up" multiple levels?
Is this correct / valid?

Base URL: gs://bucket/0.zip|zip:a|zarr3:i
Relative URL: ..:..:1.zip|zip:b|zarr3:ii
Resolved URL: gs://bucket/1.zip|zip:b|zarr3:ii

Is it correct / valid to use .. in the "path part" of relative URL, after a ..:?

Base URL: gs://bucket/0/a/i.zarr|zarr3:foo
Relative URL: ..:../b/i.zarr|zarr3:foo
Resolved URL: gs://bucket/0/b/i.zarr|zarr3:foo

If one needs to add an adapter in a relative way, how does one go about it?
For example:

Base URL: gs://bucket/0/a/i.zarr
Desired Resolved URL: gs://bucket/0/a/i.zarr|zarr3:foo

Which, if any, of these do you think should be used? Are any of these invalid?

  • .|zarr3:foo (clearest to me)
  • |zarr3:foo
  • zarr3:foo

@bogovicj
Copy link

One more thing:

We've found it useful to be able to reference a particular part of the attributes stored in JSON with a URL. For example, for this zarr3 zarr.json:

{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 1000],
    "dimension_names": ["rows", "columns"],
    "data_type": "float64",
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [1000, 100]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "codecs": [{
        "name": "gzip",
        "configuration": {
            "level": 1
        }
    }],
    "fill_value": "NaN",
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}
  • /attributes/baz[0] points to 1
  • /shape points to [10000, 1000]
  • /chunk_grid/configuration points to { "chunk_shape": [1000, 100] }
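
A minimal sketch of how such paths could be resolved against parsed `zarr.json` metadata. The helper name `resolve_metadata_path` and the path grammar (slash-separated keys with optional `[index]` suffixes) are assumptions inferred from the three examples above, not part of any specification.

```python
import re

# Hypothetical path grammar, inferred from the examples: each segment is a
# key optionally followed by one or more "[n]" list indices.
_SEGMENT = re.compile(r"([^\[\]]+)((?:\[\d+\])*)$")


def resolve_metadata_path(metadata, path):
    """Walk `metadata` (a parsed zarr.json dict) along `path`."""
    node = metadata
    for segment in path.strip("/").split("/"):
        if not segment:
            continue  # an empty path refers to the whole document
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        key, indices = match.groups()
        node = node[key]
        for index in re.findall(r"\[(\d+)\]", indices):
            node = node[int(index)]
    return node


# Abbreviated copy of the zarr.json shown above.
metadata = {
    "shape": [10000, 1000],
    "chunk_grid": {
        "name": "regular",
        "configuration": {"chunk_shape": [1000, 100]},
    },
    "attributes": {"foo": 42, "bar": "apples", "baz": [1, 2, 3, 4]},
}

print(resolve_metadata_path(metadata, "/attributes/baz[0]"))
print(resolve_metadata_path(metadata, "/shape"))
print(resolve_metadata_path(metadata, "/chunk_grid/configuration"))
```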

Could you envision adding an attributes: or zarr.json:, or similar adapter, that enables this?

For example: gs://bucket/0.zip|zip:a|zarr3:i|zarr.json:attributes/foo

A specific use case: I often re-use and reference transformations. Since these are described by metadata (not arrays), referencing the specific metadata is helpful.

For example, if this were adopted, something like this would not be uncommon in my workflows:

{
    "type" : "sequence",
    "transformations" : [
        { "url" : "..:/localTransformations|zarr.json:/transform[1]" },
        { "url" : "gs://bucket/path/to/templateTransformation.zarr|zarr3:sharedTransforms|zarr.json:/transform[0]" }
    ]
}

@jbms
Copy link
Author

jbms commented Nov 14, 2023 via email

@jbms
Copy link
Author

jbms commented Nov 15, 2023 via email

jbms added a commit to google/neuroglancer that referenced this pull request Jan 17, 2025
- New datasource URL syntax based on ZEP 8
proposal (zarr-developers/zeps#48)
- Support for ZIP archives
jbms added a commit to google/neuroglancer that referenced this pull request Jan 18, 2025
- New datasource URL syntax based on ZEP 8
proposal (zarr-developers/zeps#48)
- Support for ZIP archives
jbms added a commit to google/neuroglancer that referenced this pull request Jan 19, 2025
- New datasource URL syntax based on ZEP 8
proposal (zarr-developers/zeps#48)
- Support for ZIP archives
@sanketverma1704
Copy link
Member

sanketverma1704 commented Jan 22, 2025

From today's Zarr community meeting, @jbms has implemented this ZEP in Neuroglancer. Check here: google/neuroglancer#696

copybara-service bot pushed a commit to google/tensorstore that referenced this pull request May 7, 2025
This is in line with zarr-developers/zeps#48 and
the syntax supported by Neuroglancer.

Currently, zip is supported.  OCDBT support will be added in a
subsequent commit.

PiperOrigin-RevId: 755691199
Change-Id: Ia6cb84c12a986a7dd0ba65e41454fbe6d415aed0
@joshmoore
Copy link
Member

@jbms: I tried pushing a merge of origin to try fixing the build, but was rejected. Could you give it a try?

Copy link

@ianhi ianhi left a comment

I just did a thorough read of this to understand it and I have left some comments with a few typo fixes.

I also left comments on parts that took me a decent bit of work to understand, or that I don't fully understand in the hope that it's a helpful perspective. I'd rate myself as a competent but not expert reader of a document like this

jbms and others added 5 commits September 25, 2025 13:46
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
Co-authored-by: Sanket Verma <svsanketverma5@gmail.com>
Co-authored-by: Ian Hunt-Isaak <ianhuntisaak@gmail.com>
Co-authored-by: Joe Hamman <jhamman1@gmail.com>
@jbms
Copy link
Author

jbms commented Sep 26, 2025

I just did a thorough read of this to understand it and I have left some comments with a few typo fixes.

I also left comments on parts that took me a decent bit of work to understand, or that I don't fully understand in the hope that it's a helpful perspective. I'd rate myself as a competent but not expert reader of a document like this

Thanks very much for your review. Based on your comments I made some significant revisions and would appreciate feedback.

Based on my revisions it occurs to me that this may be better as an independent standard, and the zarr spec could just recommend that implementations support it.

Copy link

@ianhi ianhi left a comment

Thank you for the updates, I found it significantly easier to understand on this close reading. I've left a few more comments on the few remaining areas where I found myself confused.


- `gs://bucket/path/to/data|byte-range:1000-2000`

- `tiff:`, `jpeg:`, `png:`, `bmp:`, `avif:`, `webp:`
Copy link

For both these and byte-range, I don't know how something like zarr-python is meant to handle this. Surely zarr can't be responsible for reading different image formats?

This feels like it gets to your point:

Based on my revisions it occurs to me that this may be better as an independent standard,

Copy link
Author

Yes, some of these aren't relevant to a zarr implementation at the moment. But note that they might be relevant in the context of the proposed chunk manifest or other virtual array proposals. In particular the byte-range URL scheme could eliminate the need to separately specify offset and length inside a chunk manifest.
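
As a rough illustration of that idea, the sketch below splits a URL ending in a `byte-range:` adapter into the base URL and the two numbers. The helper name `parse_byte_range` is hypothetical, and whether the second number is an inclusive or exclusive end is deliberately left uninterpreted here, since that is for the specification to define.

```python
def parse_byte_range(url):
    """Split "<base>|byte-range:<start>-<end>" into (base, start, end).

    Hypothetical helper: only the surface syntax from the example
    `gs://bucket/path/to/data|byte-range:1000-2000` is assumed.
    """
    base, sep, adapter = url.rpartition("|")
    scheme, _, spec = adapter.partition(":")
    if not sep or scheme != "byte-range":
        raise ValueError(f"not a byte-range URL: {url!r}")
    start_text, dash, end_text = spec.partition("-")
    if not dash:
        raise ValueError(f"malformed byte range: {spec!r}")
    return base, int(start_text), int(end_text)


base, start, end = parse_byte_range("gs://bucket/path/to/data|byte-range:1000-2000")
print(base, start, end)
```

A chunk manifest entry could then store this single string in place of separate URL, offset, and length fields.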

Karthicks1206 pushed a commit to Karthicks1206/tensorstore that referenced this pull request Oct 24, 2025
This is in line with zarr-developers/zeps#48 and
the syntax supported by Neuroglancer.

Currently, zip is supported.  OCDBT support will be added in a
subsequent commit.

PiperOrigin-RevId: 755691199
Change-Id: Ia6cb84c12a986a7dd0ba65e41454fbe6d415aed0
draft/ZEP0008.md Outdated
- `https://example.com/path/to/archive.zip|zip|zarr3` is equivalent to
`https://example.com/path/to/archive.zip|zip:|zarr3:`.
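
A minimal sketch of the equivalence stated above, assuming that `|` never appears unescaped inside a single stage. The function names `split_pipeline` and `normalize_pipeline` are illustrative only.

```python
def split_pipeline(url):
    """Split a pipeline URL on "|" into (scheme, rest) pairs.

    Assumption: a literal "|" inside a stage would be percent-encoded,
    so a plain split is sufficient for this sketch.
    """
    stages = []
    for part in url.split("|"):
        scheme, _, rest = part.partition(":")
        stages.append((scheme, rest))
    return stages


def normalize_pipeline(url):
    """Rewrite every stage as "scheme:rest", adding an omitted colon."""
    return "|".join(f"{scheme}:{rest}" for scheme, rest in split_pipeline(url))


short = "https://example.com/path/to/archive.zip|zip|zarr3"
full = "https://example.com/path/to/archive.zip|zip:|zarr3:"
print(normalize_pipeline(short) == normalize_pipeline(full))
```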

It is expected that additional URL schemes may be standardized in the future.
Copy link

@ianhi ianhi Nov 1, 2025

Two questions:

  1. Just schemes (i.e. root URL) or schemes + adapters?

  2. Where would this happen?

Something I'm already seeing that I want is a log: (for a logging store wrapper) or a latency:150 (for a zarr-python latency store with 150 ms latency).

Which also gets to the fact that adapters define either or both of: a path, or metadata about how to read (e.g. byte range/icechunk branch). Would it be good to consider the general case of what an adapter can be, and how it specifies either or both of path and metadata here, or is that overkill given that there is a limited set here?

Copy link
Author

I added a comment --- yes, both root and adapter schemes.

In general an adapter can be any transformation of the resource. Logically, given some "handle" to the base resource, it can produce a handle to the adapted resource.

Ultimately the syntax depends on the specific scheme but it is useful to be as consistent as possible.

Re log: and latency:: I am unsure whether we should attempt to standardize things like this, that are mostly for the purpose of testing and debugging. There should certainly be some defined naming convention for vendor-specific schemes, like zarr-python.log:.

On the one hand adapters like this (and similar things like cache:) are certainly convenient in cases where the URL can easily be altered to enable such features. However, if the URL is already stored somewhere, e.g. in a chunk manifest or hypothetically within some future version of icechunk, then enabling such features within the URL may be less convenient than some out-of-band mechanism.

is ambiguous with the `file://hostname/path` syntax defined by
[RFC8089](https://datatracker.ietf.org/doc/html/rfc8089).

If the path is empty or ends with `/`, the resultant resource kind
Copy link

What use case is this supporting? Is it really helpful or a somewhat arbitrary decision that may generate incompatibilities in the future?

Copy link
Author

Note that this particular comment applies specifically to the file: scheme, not to all schemes.

What exactly are you concerned about?

Copy link

But there are similar comments in other protocols, like S3. My only concern is: do we really need to talk about files and directories? I guess I'm missing how this information helps implementers or users; to me it feels like it's creating extra constraints for no added value. But I must be missing something.

Copy link
Author

For format auto-detection, if a URL is known to refer to a directory resource rather than a file resource, you can skip trying to detect the file format (e.g. by attempting to read from it as a file) and just proceed with detecting the directory format.

With S3 it is technically allowed to have an object name ending in / but then it is unclear how relative URLs should behave. Arguably we should require that if an object name ends in /, the final / be percent-encoded as %2F. Alternatively I could remove this statement for S3.
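
The auto-detection shortcut described above could look roughly like this. It is a sketch only: the function names and the probe labels `"file-format"` and `"directory-format"` are hypothetical, and trying the file probe before the directory probe for ambiguous URLs is an assumption rather than something the proposal states.

```python
from urllib.parse import urlsplit


def refers_to_directory(root_url):
    """True if the root URL's path is empty or ends with "/"."""
    path = urlsplit(root_url).path
    return path == "" or path.endswith("/")


def detection_plan(root_url):
    """Return the (hypothetical) probes to attempt, in order."""
    if refers_to_directory(root_url):
        # Known directory: skip reading the resource as a file.
        return ["directory-format"]
    # Otherwise the kind is unknown, so both probes may be needed.
    return ["file-format", "directory-format"]


print(detection_plan("file:///path/to/repo.zarr.icechunk/"))
print(detection_plan("s3://bucket/path/within/bucket"))
```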


For example:

- `file:///path/to/repo.zarr.icechunk/|icechunk:|zarr3:path/to/array/`
Copy link

I'm confused, do I need to specify |zarr3: as an extra adapter after icechunk? In the template given above the path to the node is specified directly after icechunk: without zarr3:

Copy link

Things to take into account:

  • Icechunk doesn't support Zarr 2
  • The concept of node in Icechunk is somewhat independent of Zarr

Copy link
Author

I'm confused, do I need to specify |zarr3: as an extra adapter after icechunk? In the template given above the path to the node is specified directly after icechunk: without zarr3:

Technically without |zarr3 it is referring to the raw kvstore rather than the zarr node. But implementations may choose to interpret it either as the kvstore or assume/detect zarr3. E.g. in the context of opening a kvstore it would refer to the kvstore but in the context of opening an array it would get auto-detected as zarr3.

Copy link
Author

Things to take into account:

  • Icechunk doesn't support Zarr 2
  • The concept of node in Icechunk is somewhat independent of Zarr

From the perspective of this url syntax, icechunk is just another container like zip that just happens to only be able to contain zarr3-like data.

Copy link

Alright, I think I like it. Icechunk implementations are free to accept icechunk:/foo/bar.

Copy link
Author

I do want to avoid differences in behavior between implementations. But you raise a good point regarding a leading slash in the path.

Also, the absolute URL syntax would be confusing if a leading slash is not allowed.

So I think for all of the adapters where there is a path, a leading /, if present, should be ignored.
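
A sketch of that rule, assuming it applies to exactly one leading slash in an adapter's path. The helper name `normalize_adapter` is hypothetical, and the rule itself is only the suggestion made in this comment.

```python
def normalize_adapter(stage):
    """Drop a single leading "/" from the path of an adapter stage.

    `stage` is one "|"-separated component such as "icechunk:/foo/bar".
    """
    scheme, _, path = stage.partition(":")
    if path.startswith("/"):
        path = path[1:]
    return f"{scheme}:{path}"


# Under the suggested rule these two spellings name the same resource.
print(normalize_adapter("icechunk:/foo/bar") == normalize_adapter("icechunk:foo/bar"))
```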


- `s3://bucket/path/within/bucket` for AWS S3

The endpoint, appropriate credentials, and bucket region (for
Copy link

Shouldn't we dictate a way to optionally specify these options? Otherwise different implementations will do very different things, like url query arguments, inconsistent environment variables, etc.

Copy link
Author

To specify the endpoint you can use s3+https.

For region, that can always be determined automatically for AWS s3 at least. Are you aware of cases where it can't be determined automatically?

For credentials it seems like they would typically be specified out of band somehow. How do you imagine they might be specified in the url?

Copy link

A few thoughts:

  • there is a performance impact on discovering the region
  • some object stores may not have this ability
  • there are other "properties" that are needed, like for example, if the request should be anonymous or not.

I agree credentials shouldn't be placed in the URL.

I wonder if it would save us future trouble to define at least a syntax for properties in the URL, and leave the property names (for now) as protocol or implementation specific.

Copy link
Author

Re region: The region only needs to be determined once per bucket and can be cached, so the cost is pretty minimal as long as you are making more than one request per bucket. And for anonymous requests you don't need to know the region.

Are you aware of a case where you do need to determine the region?

In general the URL pipeline syntax allows each sub-URL to have its own ?query string and #fragment portion. However, since the goal here is to define a standard syntax for interoperability across software, I would like to avoid differences between implementations.

An unfortunate consequence of specifying necessary parameters like regions or something to do with credentials as query parameters is that it interferes with supporting tab completion as the user types, since such parameters would come after the path but are needed in order to offer completions for the path. A workaround would be for the user to use the .: relative URL scheme in order to first specify the query parameters and then specify the path, e.g. s3://bucket?region=us-east-1|.:path/goes/here
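
To make the workaround concrete, here is a minimal sketch that splits each sub-URL of that example into scheme, remainder, query, and fragment. The function names are illustrative, `region` is only the hypothetical parameter from the example, and a hand-written split is used because the `.:` stage does not begin with a conventional scheme name.

```python
from urllib.parse import parse_qs


def split_stage(stage):
    """Split one sub-URL into its scheme, remainder, query and fragment."""
    stage, _, fragment = stage.partition("#")
    stage, _, query = stage.partition("?")
    scheme, _, rest = stage.partition(":")
    return {
        "scheme": scheme,
        "rest": rest,
        "query": parse_qs(query),  # e.g. {"region": ["us-east-1"]}
        "fragment": fragment,
    }


def parse_pipeline(url):
    """Parse every "|"-separated sub-URL independently."""
    return [split_stage(stage) for stage in url.split("|")]


for stage in parse_pipeline("s3://bucket?region=us-east-1|.:path/goes/here"):
    print(stage)
```

With this layout the query parameters are already known by the time the path in the second stage is being typed, which is what makes completion possible.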

briossant pushed a commit to briossant/neuroglancer that referenced this pull request Nov 10, 2025
- New datasource URL syntax based on ZEP 8
proposal (zarr-developers/zeps#48)
- Support for ZIP archives
@jbms
Copy link
Author

jbms commented Nov 25, 2025

I have migrated this proposal over to a separate repository:

https://github.com/jbms/url-pipeline

If there is interest this could potentially be moved to the zarr-developers organization.

In general I have made some major editorial changes to the proposal:

  • I have tried to address a lot of the comments that were raised.
  • The syntax is now specified formally with an ABNF grammar that is fully validated against the examples.

For now I have removed the relative URL support because that has not been implemented by anyone and it introduced a lot of complexity. I expect it to be added later since relative URLs are important but we can address absolute URLs first.

I also changed the OCDBT and Icechunk syntax to use //version/ rather than @version/ since that better matches normal URL syntax and in particular makes more sense in the context of (not yet specified) relative URLs. Note that Neuroglancer and tensorstore have not yet implemented that change and still use the old syntax, but I'm planning to fix that soon.

I'd very much welcome any feedback on the specification.
