Open on-disk kerchunk references as a virtual dataset #118

Closed · TomNicholas opened this issue May 16, 2024 · 6 comments

Labels: references generation (Reading byte ranges from archival files)

Comments

TomNicholas (Member) commented May 16, 2024

It might be useful to be able to open an existing kerchunk json/parquet reference file as a virtual dataset, e.g. to make changes to it before writing it back out.

This is essentially the kerchunk version of suggestion (2) here #63 (comment).

This should be really easy to implement: we already have a function for doing it (dataset_from_kerchunk_refs); we just have to teach open_virtual_dataset that existing kerchunk json/parquet files are also valid filetypes to pass in.
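A rough sketch of what that dispatch inside open_virtual_dataset might look like (only open_virtual_dataset and dataset_from_kerchunk_refs are named in this issue; the "kerchunk_json" filetype string, the import path, and the JSON loading shown here are assumptions for illustration):

```python
# Sketch only: the filetype string and the import path for
# dataset_from_kerchunk_refs are assumptions; only the two function names
# come from this issue.
import json

from virtualizarr.kerchunk import dataset_from_kerchunk_refs  # assumed location


def open_virtual_dataset(filepath: str, filetype: str | None = None, **kwargs):
    if filetype == "kerchunk_json":
        # The file already contains kerchunk references, so there are no
        # byte ranges to scan - just load the refs and convert them into a
        # ManifestArray-backed virtual dataset.
        with open(filepath) as f:
            refs = json.load(f)
        return dataset_from_kerchunk_refs(refs)

    # ... otherwise fall back to the existing readers that generate
    # references by reading byte ranges from the archival file itself.
    raise NotImplementedError
```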

TomNicholas added the references generation label on May 16, 2024
jsignell (Contributor) commented

I can take a crack at this one.

TomNicholas (Member, Author) commented

You might have to fight @norlandrhagen haha

jsignell (Contributor) commented

Oh! I can back off :) I do think there is an argument to be made for having different methods (open_virtual_dataset vs. open_as_virtual_dataset, or something like that) depending on whether the input is kerchunk-style refs or actual data.

norlandrhagen (Collaborator) commented

Sorry @jsignell! I should have mentioned it in this issue :) If I open a PR, would you mind taking a look at it?

TomNicholas (Member, Author) commented May 16, 2024

> I do think there is an argument to be made for having different methods (open_virtual_dataset vs. open_as_virtual_dataset, or something like that) depending on whether the input is kerchunk-style refs or actual data.

Yeah, this is an interesting question. The same thing will arise for Zarr stores too: should there be a different function for opening zarr arrays backed by chunk manifests vs. zarr arrays backed by actual bytes on disk in the store? I think in that context it would be confusing to have two functions, especially as "mixed" zarr stores are possible (and useful).

jsignell (Contributor) commented

> should there be a different function for opening zarr arrays backed by chunk manifests vs. zarr arrays backed by actual bytes on disk in the store? I think in that context it would be confusing to have two functions, especially as "mixed" zarr stores are possible (and useful).

I would say it makes sense to my brain to have separate functions for "just reading" vs. "doing work", so that I can form an expectation of how long something will take to run. But I would expect open_as_virtual_dataset to accept either a legacy file, a kerchunk reference file, a zarr store, or a reference-backed ("ref") zarr store.
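For illustration only, the catch-all signature described here might look like the stub below; open_as_virtual_dataset is just the placeholder name floated earlier in the thread, and the filetype strings are invented:

```python
# Placeholder stub only: open_as_virtual_dataset is a hypothetical name from
# this thread, and all filetype strings here are invented for illustration.
def open_as_virtual_dataset(path: str, filetype: str):
    if filetype in ("netcdf4", "grib"):
        # Legacy/archival file: generate references by scanning bytes ("doing work").
        ...
    elif filetype in ("kerchunk_json", "kerchunk_parquet"):
        # Pre-existing kerchunk references: no scanning needed ("just reading").
        ...
    elif filetype in ("zarr", "manifest_zarr"):
        # A zarr store, whether backed by actual chunks or by chunk manifests.
        ...
    else:
        raise ValueError(f"unrecognised filetype: {filetype}")
```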
