Serialising a File will also serialise the cache which can grow very large #1747
You are right, AbstractBufferedFile could do with a reduce function. Note that OpenFile instances are designed exactly to encapsulate this kind of information without caches and other state - these are what should really be passed around. Additionally, I don't really understand exactly what you are pickling - is it the xarray object itself? I don't know that such a use is really anticipated.
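For reference, a minimal sketch of passing an OpenFile around, with an assumed bucket path; an OpenFile pickles cheaply because it only records the filesystem, path, and mode, and the real (cached) file object exists only inside the context manager:

```python
import fsspec

# An OpenFile is a lightweight, picklable "recipe" for opening the file;
# no bytes are read or cached at this point. The path is hypothetical.
of = fsspec.open("s3://some-bucket/data.nc", mode="rb", anon=True)

with of as f:  # the actual buffered file (and its read cache) lives only here
    header = f.read(16)
```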
No, not the xarray objects themselves. Xarray extracts some metadata for netcdf files via the opened file, and this file object is then put into the graph (that happens in open_mfdataset, fwiw). This causes the cache to be populated. I haven't dug through everything in the API yet, so I can't tell you exactly when and where this happens.
These don't appear to fit the open_mfdataset pattern in the OP.
Would someone like to write a reasonable reduce function for AbstractBufferedFile?
Yep, I would take a look at this.
Btw, just following up here. If I add a small computation like `ds.hurs.mean(dim=["lon", "lat"]).compute()` to @phofl's original example, the graph is now much smaller with the changes in #1753. With the latest …
I think the problem is still present with the latest main. I don't have reproducible code yet, but basically this happens only when my local machine, running a Coiled cluster to create zarr objects from NetCDF files opened with s3fs, fails during the initial write (something as simple as the local machine running out of battery). After this failure, I try to continue writing the zarr object. I first have to overwrite the data in some specific regions, and again there is an increase in memory and bandwidth usage on the local machine running the remote cluster, just as happened during the initial write.
This behaviour happens when I add the region argument (see the sketch below). I'm very aware there are probably many different issues in my message, but I find it hard to know which package (s3fs/xarray/dask...) has the bug.
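For context, a sketch of the write-then-overwrite pattern described above; the dataset, store path, and region are illustrative stand-ins, not taken from the report:

```python
import xarray as xr

# A tiny stand-in dataset; in the report this comes from NetCDF files on S3.
ds = xr.Dataset({"hurs": ("time", list(range(100)))})

store = "s3://some-bucket/output.zarr"  # hypothetical store

# Initial full write of the store.
ds.to_zarr(store, mode="w", storage_options={"anon": False})

# After a failure, overwrite just one region in place rather than rewriting
# everything; this is the step where the extra traffic was observed.
ds.isel(time=slice(0, 10)).to_zarr(
    store, region={"time": slice(0, 10)}, storage_options={"anon": False}
)
```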
Debugging dask within xarray is very hard! If you can show that the cache is being serialised using fsspec alone, we can fix it!
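A minimal, xarray-free check along those lines might look like this; the bucket/key is a placeholder, and on affected versions the second pickle should be dramatically larger:

```python
import pickle
import s3fs

fs = s3fs.S3FileSystem(anon=True)
f = fs.open("s3://some-bucket/some-file.nc", mode="rb")  # hypothetical object

cold = len(pickle.dumps(f))  # size of the pickled file before any reads
f.read(5 * 2**20)            # read ~5 MiB to populate the read-ahead cache
warm = len(pickle.dumps(f))  # if the cache is serialised, this balloons

print(cold, warm)
```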
@martindurant I'm no expert and only noticed this when using a combination of remote cluster/dask/s3fs/xarray. I could try to do this locally without dask/a remote cluster, but then I have no idea how to spot that the cache is being serialised, as I would be running everything on my local machine anyway.
@martindurant, I created this gist previously to highlight the serialisation issue: https://gist.github.com/lbesnard/97bdf0b4af9fa340e8ef47aa20b3cc93. I installed the latest commit of this repo.
I still have the same behaviour where data is sent back from the remote cluster to my local machine. If I add `s3_fs = s3fs.S3FileSystem(anon=True, default_fill_cache=None)` then the behaviour is as expected. However, I still have the same issue in the …
Perhaps testing …
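For anyone testing along these lines, the read-cache knobs in s3fs/fsspec look like this; the parameter names are real options, while the path and values are illustrative:

```python
import s3fs

# Filesystem-wide: don't fill a read-ahead cache on partial reads.
fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False)

# Per file: select the "none" cache implementation explicitly.
with fs.open("s3://some-bucket/file.nc", mode="rb", cache_type="none") as f:
    f.read(1024)
```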
@martindurant Just to let you know that I've commented on a related issue on xarray, which would (I think) trigger a bug in s3fs serializing data back to the local machine when …
Thanks for sharing, @lbesnard. Every time I've tried to debug something happening under several layers of xarray, I get lost. If this effect can be isolated to fsspec, I will definitely fix it!
We've run into this when using Xarray together with Dask. At the moment, the default way of calling this looks like the following:
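A minimal sketch of that pattern, assuming NetCDF files on S3 opened through s3fs; the bucket path and engine choice are illustrative, not taken from the original report:

```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)

# Each entry is an open S3File; these objects end up pickled into the
# Dask graph, together with whatever their read cache holds at that point.
files = [fs.open(path, mode="rb") for path in fs.glob("some-bucket/cmip6/*.nc")]

ds = xr.open_mfdataset(files, engine="h5netcdf", parallel=True)
```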
The files are initially accessed to read the metadata before they are serialised, and this initial access populates the cache with a lot of data.
This triggers a very large cache for 4 of the 130 files, which is pretty bad when serialising things.
Serialising the cache doesn't seem like a great idea generally and specifically for remote file systems. Was this decision made intentionally or is this rather something that hasn't been a problem so far?
Ideally, we could purge the cache before things are serialised in fsspec / the inheriting libraries. Is this something you would consider? I'd be happy to put up a PR if there is agreement about this.
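One possible shape for this is sketched below: a reduce hook that rebuilds the file from its filesystem, path, and mode instead of pickling the cached bytes. The attribute names mirror AbstractBufferedFile (fs, path, mode), but this is an illustration, not the change that actually landed:

```python
def _reopen(fs, path, mode):
    # Module-level helper so the reduce tuple itself is picklable.
    return fs.open(path, mode)


class CacheFreeFilePickling:
    """Hypothetical mixin for AbstractBufferedFile subclasses."""

    def __reduce__(self):
        # Recreate the file on the receiving side; the read cache
        # (self.cache) is deliberately not shipped over the wire.
        return _reopen, (self.fs, self.path, self.mode)
```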