
consider using filesystem-spec #15

Closed
rabernat opened this issue Apr 3, 2020 · 13 comments · Fixed by #19
Labels
internals Internal machinery

Comments

@rabernat
Contributor

rabernat commented Apr 3, 2020

I had a quick look at your backend code, and I wanted to suggest you investigate filesystem-spec: https://filesystem-spec.readthedocs.io/en/latest

Filesystem Spec is a project to unify various projects and classes to work with remote filesystems and file-system-like abstractions using a standard pythonic interface.

Using fsspec might allow you to remove some of your code related to file downloading, caching, etc. It might also make it easier to point at different endpoints for the data (e.g. ftp, http, s3). We use it, for example, in llcreader, which is similar to this project (tries to provide a uniform API for reading ECCO LLC data regardless of where it is stored).

An added benefit of using fsspec is its end-to-end compatibility with dask, which is somewhat related to #14.
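To make the "standard pythonic interface" idea concrete, here is a simplified sketch of the pattern fsspec implements: a registry maps a protocol name ("file", "ftp", "s3", ...) to a filesystem class exposing the same `open`/`ls` methods. The class and function names below are illustrative stand-ins, not fsspec's actual implementation.

```python
# Simplified sketch of fsspec's core idea: one registry maps a protocol
# name to a filesystem class, and every class exposes the same interface.
# Names here are illustrative, not fsspec's real classes.
import io
import os


class LocalFileSystem:
    """Uniform interface over the local disk."""

    def open(self, path, mode="rb"):
        return open(path, mode)

    def ls(self, path):
        return sorted(os.listdir(path))


class MemoryFileSystem:
    """Same interface, backed by an in-memory dict (handy for tests)."""

    def __init__(self):
        self._store = {}

    def open(self, path, mode="rb"):
        if "w" in mode:
            buf = io.BytesIO()
            self._store[path] = buf
            return buf
        return io.BytesIO(self._store[path].getvalue())

    def ls(self, path):
        return sorted(k for k in self._store if k.startswith(path))


_registry = {"file": LocalFileSystem, "memory": MemoryFileSystem}


def filesystem(protocol, **kwargs):
    """Mimic fsspec.filesystem(): look the class up by protocol name."""
    return _registry[protocol](**kwargs)
```

Code written against this interface never needs to know which backend it is talking to, which is exactly why swapping ftp for http or s3 becomes a one-line change.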

@gmaze gmaze added the internals Internal machinery label Apr 6, 2020
@gmaze gmaze linked a pull request May 19, 2020 that will close this issue
@gmaze
Member

gmaze commented May 25, 2020

After some digging, it seems that the ftp protocol is not cachable (see https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/ftp.html#FTPFileSystem). That makes fsspec less attractive for now, but it still looks very promising for simplifying our code in the data fetchers.

@gmaze
Member

gmaze commented May 25, 2020

My mistake, this does work for caching ftp:

import fsspec
import xarray as xr

fs = fsspec.filesystem("filecache",
                       target_protocol='ftp',
                       target_options={'host': 'ftp.ifremer.fr'},
                       cache_storage='./tmp')
with fs.open('/ifremer/argo/dac/coriolis/1900067/1900067_prof.nc') as of:
    ds = xr.open_dataset(of)

@gmaze gmaze closed this as completed in #19 Jun 17, 2020
@rabernat
Contributor Author

Wow! I'm so glad to see that something came of this suggestion.

So exciting to see this library progress. 👏

@gmaze
Member

gmaze commented Jun 17, 2020

Thanks @rabernat
This has not been easy, and I'm sure there is still a lot of room for improvement with respect to file/resource access.
fsspec is great! Although the documentation is for experts!
I learned a lot digging into it, and hopefully this effort on argopy file systems will help us move forward with new and better data access points, like zarr or parquet on cloud storage systems; these will become available soon with Argo (and more) data.
g

@rabernat
Contributor Author

Cool. @martindurant, creator of fsspec, has always been very helpful and responsive when working with Pangeo folks. I'm sure he'd be glad to know you're using fsspec and would try to help resolve any technical challenges.

@martindurant

Indeed, happy to see it, and that it proved fairly easy to do

> the documentation is for experts

This is probably and unfortunately true - fsspec was born initially by factoring out internal implementation details from dask.
What do you think you miss the most in the docs?

@martindurant

Note that the concept of fetching and loading datasets with specific arguments is superficially similar to what Intake does. I don't know the world of Argo, but it might be interesting for you, as it has been for some of Pangeo, whether as online file-based catalogues of datasets (e.g., at https://catalog.pangeo.io/browse/master/) or as catalogs derived from online data services (intake-esm being perhaps a good example).
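For readers unfamiliar with Intake: a file-based catalog is just a YAML file describing named data sources. A hypothetical entry for an Argo profile file might look like the sketch below; the entry name is made up, the `urlpath` reuses the file from the earlier comment, and the `netcdf` driver assumes the intake-xarray plugin is installed.

```yaml
# Hypothetical Intake catalog entry (illustrative, not an existing catalog).
sources:
  argo_profile_example:
    description: A single Argo profile file from the Coriolis DAC
    driver: netcdf          # provided by the intake-xarray plugin
    args:
      urlpath: "ftp://ftp.ifremer.fr/ifremer/argo/dac/coriolis/1900067/1900067_prof.nc"
```

A user would then open the catalog and load the entry by name, without knowing anything about the underlying storage or format.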

@gmaze
Member

gmaze commented Jun 17, 2020

> Indeed, happy to see it, and that it proved fairly easy to do
>
> > the documentation is for experts
>
> This is probably and unfortunately true - fsspec was born initially by factoring out internal implementation details from dask.
> What do you think you miss the most in the docs?

I surely understand that, and to be honest, I started from very far with file systems!
So I'm not sure the doc needs more.
Maybe it was just the argopy library internal re-design that took most of my thinking, in terms of the key methods I needed (i.e. open, ls, etc.).
I mostly found myself digging into the fsspec API section and opening the source code, which greatly helped.

@gmaze
Member

gmaze commented Jun 17, 2020

> Note that the concept of fetching and loading datasets with specific arguments is superficially similar to what Intake does. I don't know the world of Argo, but it might be interesting for you, as it has been for some of Pangeo, whether as online file-based catalogues of datasets (e.g., at https://catalog.pangeo.io/browse/master/) or as catalogs derived from online data services (intake-esm being perhaps a good example).

Yes, I would love an Intake catalogue entry with Argo data, and I'm working with the French data center and Ifremer to get one.
But Argo data are fairly complex and need serious post-processing for the regular end user/scientist, so I'm not sure this is a viable solution.

@quai20
Member

quai20 commented Jul 1, 2020

Hi @martindurant, with fsspec implemented in argopy, I came up with an error on my workstation (I'm back at the office) that wasn't happening on my laptop:

[...]
[...]/site-packages/fsspec/implementations/cached.py in _open(self, path, mode, **kwargs)
    421                 # this only applies to HTTP, should instead use streaming
    422                 f2.write(f.read())
--> 423         self.save_cache()
    424         return self._open(path, mode)
    425 

[...]/site-packages/fsspec/implementations/cached.py in <lambda>(*args, **kw)
    314             # all the methods defined in this class. Note `open` here, since
    315             # it calls `_open`, but is actually in superclass
--> 316             return lambda *args, **kw: getattr(type(self), item)(self, *args, **kw)
    317         if item in ["__reduce_ex__"]:
    318             raise AttributeError

[...]/site-packages/fsspec/implementations/cached.py in save_cache(self)
    157         with open(fn2, "wb") as f:
    158             pickle.dump(cache, f)
--> 159         os.replace(fn2, fn)
    160 
    161     def _check_cache(self):

OSError: [Errno 18] Invalid cross-device link: '/tmp/tmpx3hp3qel' -> '/export/home1/PROJECTS/argopy-cache/cache'

Is that error familiar to you?
I'm on fsspec 0.7.4.

@quai20
Member

quai20 commented Jul 1, 2020

As mentioned here, using shutil.copy() instead of os.replace() fixed my issue.

@gmaze
Member

gmaze commented Jul 1, 2020

@quai20 this was discussed in fsspec/filesystem_spec#322
and fixed in fsspec/filesystem_spec#323
So the fix should be rolled out in the next fsspec release.
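For context on the error itself: os.replace() is a rename, and renames are only possible within a single filesystem; when the temporary file lives on a different device than the destination (here /tmp versus a mounted home directory), the kernel refuses with EXDEV ("Invalid cross-device link"). The usual fix, roughly in the spirit of the upstream patch, is to create the temporary file in the destination's own directory so the rename never crosses devices. The function below is a sketch of that pattern, not fsspec's exact code.

```python
import os
import pickle
import tempfile


def save_cache_atomically(cache, destination):
    """Write `cache` to `destination` without risking a cross-device rename.

    The temporary file is created in the destination's own directory, so
    os.replace() is a same-filesystem rename: atomic, and never EXDEV.
    """
    dest_dir = os.path.dirname(os.path.abspath(destination))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(cache, f)
        os.replace(tmp_path, destination)  # same device -> plain rename
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial file on failure
        raise
```

shutil.copy() also works around the error (it copies byte-by-byte across devices), but it loses the atomicity that the rename provided: a reader could observe a half-written cache file.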

@quai20
Member

quai20 commented Jul 1, 2020

My bad 😃 ! Thanks @gmaze
