-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent behavior with fsspec.open with mode='r' vs. mode='rb' #579
Comments
What is it about |
OK, so it's the line
which LocalFileOpener is not. The reason is, that to derive from io.IOBase would mean either using |
fsspec might be a decent place to create utility functions like |
But on a deeper level, don't you think it's weird that |
Not really? |
But why does does |
True, that could be fixed too. |
I am sitting with @cisaacstern working on pangeo forge and we have hit this issue again. Let me show why it is problematic in the context of pangeo forge. goal: open files from different locations (local, http, s3) using We will work with the following example dataset: https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc We will try to access the file both locally and over http. import fsspec
import xarray as xr
# open locally
of_local = fsspec.open("sresa1b_ncar_ccsm3-example.nc", mode="rb")
xr.open_dataset(of_local) # works
# open over http
of_http = fsspec.open(url, mode="rb")
xr.open_dataset(of_http) # does NOT work
with of_http as fp:
xr.open_dataset(of_http) # works The problem here is that we need this extra context manager only in the case of the http-based file, but not with the local file. This means that we need special case logic for different types of files. This feels wrong. @martindurant, do you have any recommendations here? How can we achieve our goal? Should we be using fsspec differently. |
The context manager or |
Interesting. I tried that over in pangeo-forge/pangeo-forge-recipes#370 and it turned up some serialization issues with beam. If I do with of_http as fp:
ds = xr.open_dataset(of_http) is the resulting dataset object serializable? |
I am not certain. I would have thought yes. |
This issue has been a thorn in our side for a while time, so I did a deep dive. Warning, copious detail ahead The following code tests all combinations of the following five different parameters
We test two things:
code to generate tablefrom pickle import dumps
import fsspec
from fsspec.implementations.http import HTTPFileSystem
from fsspec.implementations.local import LocalFileSystem
import xarray as xr
import pandas as pd
def open_from_fs(path):
if path.startswith("http"):
fs = HTTPFileSystem()
else:
fs = LocalFileSystem()
return fs.open(path)
def open_from_open(path):
return fsspec.open(path)
def xr_open_directly(thing, **kwargs):
ds = xr.open_dataset(thing, **kwargs)
return thing
def xr_open_with_context_manager(thing, **kwargs):
with thing as fp:
with xr.open_dataset(fp, **kwargs) as ds:
return ds
files = {
'netcdf4': 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc',
'netcdf3': 'https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc'
}
engines = {
'netcdf4': ('h5netcdf', 'netcdf4', None),
'netcdf3': ('scipy', 'netcdf4', None)
}
fsspec_open_methods = {
'fsspec.open': open_from_fs,
'fs.open': open_from_open
}
xr_open_methods = {
'direct_open': xr_open_directly,
'context_manager': xr_open_with_context_manager
}
results = []
columns = ('file type', 'url_type', 'fssepc_open_method', 'xr_open_method', 'engine', 'open_status', 'pickle_status')
for file_type, url in files.items():
local_path = url.split('/')[-1]
for url_type, path in zip(('http', 'local'), (url, local_path)):
for open_method, open_fn in fsspec_open_methods.items():
for xr_open_method, xr_open_fn in xr_open_methods.items():
for engine in engines[file_type]:
params = (file_type, url_type, open_method, xr_open_method, engine or "None")
pickle_status = open_status = ""
try:
open_file = open_fn(path)
ds = xr_open_fn(open_file, engine=engine)
open_status = "✅"
try:
_ = dumps(ds)
pickle_status = "✅"
except Exception as e1:
pickle_status = f"❌ {type(e1).__name__}: {e1}".replace('\n', ' ')
except Exception as e2:
open_status = f"❌ {type(e2).__name__}: {e2}".replace('\n', ' ')
pickle_status = "n/a"
finally:
open_file.close()
results.append(
params + (open_status, pickle_status)
)
df = pd.DataFrame(data=results, columns=columns)
display(df)
df.to_markdown("results.md", index=False, tablefmt="github") huge table with all the parameters
There is a lot of information in that table. So Let's focus just on netcdf4 files and the Always use Context ManagerAlways using the context manager (as recommend in #579 (comment)) narrows the results down to
Here see that we can't use Don't Use Context ManagerWe can disregard the advice of #579 (comment) and not use the context manager, passing the
Here we can't use url = 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc' # netcdf
open_file = fsspec.open(url)
ds = xr.open_dataset(open_file, engine='h5netcdf') traceback
Why does this matter?In Pangeo Forge, we need to open Xarray datasets and serialize them starting from all four of the following type of objects
As this list shows, we can't just decide to always use a context manager or never use a context manager. The only viable option today is what I do in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/370/files#r884065015: use try / except logic to find something that works. I would prefer instead to have a stronger contract with fsspec. What is the core problem in fsspec?IMO the core issue is the fact that We can continue working around this in Pangeo Forge, but having to try / except around this makes Pangeo Forge more fragile; what if a different, legitimate i/o error comes up inside the try block? It would be hard to detect. Furthermore, based on this analysis, we can only assume that more inconsistent behavior is lurking in all of the other implementations (gcsfs, s3fs, ftp, etc.), waiting to trip us up. I hope I have managed to articulate this problem. Martin, would really appreciate your advice on where to go from here. |
The resolution to this appears to be actually to not use the context manager. This is what we came up with in google/xarray-beam#49 (comment) if hasattr(open_file, "open"):
open_file = open_file.open()
ds = xr.open_dataset(open_file) This seems to work with whatever fsspec can throw at us. |
It's unclear to me how to use
fsspec.open
as a contextmanager. The behavior of this function seems inconsistent. The inconsistency is somehow related to themode
argument. I will try to illustrate this with an example.Create a test file
Bypass fsspec completely
Open the file with a FileSystem instance
When I open the file via an instantiated FileSystem instance, everything works as I expect
The objects yielded to the context manager look like standard open file objects and can be used as such throughout python.
Open the file via
fsspec.open
With⚠️ this is the key problem. In order to get a
mode='r'
, thefsspec.open
yields object is the same type of object asfs.open
. But withmode='rb'
, we get aLocalFileOpener
.BufferedReader
, we need an additional context manager!I can't figure out what this
LocalFileOpener
object is. It's not documented in the API docs. Most importantly for my purposes, xarray can't figure out what to do with it if I pass it toopen_dataset
. In contrast, it handles an_io.BufferedReader
object fine.Proposed resolution
I would propose to remove the inconsistency and have
with fsspec.open(fname, mode='rb')
yield an_io.BufferedReader
object. However, I recognize this could be a breaking change for some applications that rely on theLocalFileOpener
.The text was updated successfully, but these errors were encountered: