
collaboration? #9

Open
martindurant opened this issue Jul 29, 2021 · 4 comments

Comments

@martindurant

Hi there! I recently heard about this project via a colleague watching your presentation at ESIP.

I am the lead developer of fsspec and the fsspec-reference-maker. You might find the following articles interesting:

After only a brief perusal of this repo, the following comparisons come to mind.

Things we can learn from you (there are probably more!)

  • combining adjacent reads (like the corot project); this is tractable, but less important given that our reads are concurrent
  • caching redirects from auth mechanism

Things we do that you might find interesting in our project

  • combining many datasets into single ensemble datasets
  • not just HDF5, but also GRIB2 and TIFF, with more to come
  • any storage backend supported by fsspec (s3, gcs, azure, http, ftp, ..., see here); references and data can live separately. Could be made to work with data living on different storage backends but part of a single aggregate data set.
  • concurrent access to chunks for S3, Http, GCS, abfs
  • explicit ability to be serialised for distributed processing with dask
  • requires no extra installs aside from standard xarray, zarr, fsspec
  • doesn't necessarily need xarray; is intended to be multi-language
  • json, parquet or zarr storage of the reference metadata files
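For readers who have not seen the reference format: a Version 1 reference set is plain JSON, so it is easy to generate and inspect from any language. A minimal sketch, where the bucket name, variable name, offsets, and lengths are all invented for illustration:

```python
import json

# A sketch of a Version 1 reference set following the
# fsspec-reference-maker spec; all concrete values here are made up.
refs = {
    "version": 1,
    "templates": {"u": "s3://example-bucket/data.hdf5"},
    "refs": {
        # zarr metadata can be stored inline as JSON strings
        ".zgroup": "{\"zarr_format\": 2}",
        # a chunk reference is [url (possibly templated), byte offset, byte length]
        "temp/0.0": ["{{u}}", 20480, 16384],
    },
}

# Round-trip through JSON to show the on-disk form is trivial
parsed = json.loads(json.dumps(refs))
print(sorted(parsed))  # ['refs', 'templates', 'version']
```

Because the container is ordinary JSON, the same references can equally be stored in parquet or zarr when the ref count gets large.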

Let's work together and not invent more wheels!

@bilts
Member

bilts commented Sep 20, 2021

Hi, Martin! Sorry I'm late getting back to you and thanks for chiming in.

I've been keeping an eye on fsspec-reference-maker, which I also saw at ESIP, and trying to figure out what to do with it. My strong preference is to use something standard rather than something EOSDIS-specific, so it has me really excited.

Right now we already generate DMR++ for OPeNDAP support, so one of the purposes of this work was to take what we're already doing and build on it, rather than getting new metadata generated. The latter is possible, but trickier.

To that end, how stable is the fsspec format for storing chunk offsets, and how is it governed? (i.e. how often would we need to re-generate the metadata once we've done it once)

Thanks!

@martindurant
Author

To answer your questions

how stable is the fsspec format

Our intent is to make it fully backward compatible, adding only new features. In the readme, you will see that we already had a Version 0 (before the spec was written down) and Version 1.

how often would we need to re-generate the metadata once we've done it once

If the data does not change, you do not need to change the metadata. A common pattern, though, might be to generate metadata for individual files and save them, and then, later, create various aggregated views of these as requirements change. This would be relatively cheap. Also, since the metadata is fairly simple JSON, it could be readily edited if, for example, the file path naming of the originals were to change.
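As an illustration of how cheap such an edit can be, here is a sketch of rewriting URL prefixes in a Version 1 reference set; the bucket names and the helper function are hypothetical, not part of fsspec:

```python
import json

def rewrite_urls(refs_json: str, old_prefix: str, new_prefix: str) -> str:
    """Rewrite URL prefixes in a Version 1 reference set (hypothetical helper).

    Each ref is either an inline string or a [url, offset, length] list;
    only the url element needs updating when the original files move.
    """
    doc = json.loads(refs_json)
    for ref in doc.get("refs", {}).values():
        if isinstance(ref, list) and ref and isinstance(ref[0], str):
            if ref[0].startswith(old_prefix):
                ref[0] = new_prefix + ref[0][len(old_prefix):]
    return json.dumps(doc)

# Made-up example: the data moved from one bucket to another
before = json.dumps({
    "version": 1,
    "refs": {"x/0": ["s3://old-bucket/file.h5", 0, 100], ".zgroup": "{}"},
})
after = json.loads(rewrite_urls(before, "s3://old-bucket/", "s3://new-bucket/"))
print(after["refs"]["x/0"][0])  # s3://new-bucket/file.h5
```

Inline refs (plain strings) pass through untouched, so only the byte-range entries are rewritten.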

An additional note on my point previously

combining adjacent reads (like corot project); this is tractable, but less important given that our reads are concurrent

In separate work on fetching exactly those byte ranges of a parquet dataset that are actually required for the requested data, we are facing this problem independently. Any algorithm you have that considers a set of byte ranges, combines them on some heuristic (overlaps, gaps, expected latency), and then re-extracts the requested ranges after the fetch would be massively appreciated!

@bilts
Member

bilts commented Nov 16, 2021

We have three methods that detect needed byte ranges, merge adjacent ranges within n bytes of each other, and split the bytes back apart after retrieval: https://github.com/nasa/zarr-eosdis-store/blob/main/eosdis_store/stores.py#L170-L247

I'm sure they could be optimized more, but they've worked for the purpose of this library.

@martindurant
Author

martindurant commented Nov 17, 2021

An update on our end: we now have the following byte-range merge code in fsspec: https://github.com/fsspec/filesystem_spec/blob/642e94aac03b4fec9d438e32f5988bbf4d292184/fsspec/utils.py#L488 (currently used only by the parquet route).

I'll have a look at the code you linked to, to see if it can be adapted to our use case (cc @rjzamora).
