Loading a netCDF file with multiple variables is very slow #6223
Comments
Yes. But this is a downside of deliberate choices in Iris' data model. None of the dimensional metadata (coordinates, cell measures, etcetera) is shared between
We have tried. We implemented #5229 for truly absurd cases where tiny files were taking a long time to load. And we have a benchmark to make sure it doesn't get even worse: iris/benchmarks/benchmarks/load/__init__.py Line 101 in ea5a23e
There are ongoing discussions about opt-in sharing in some form (e.g. #3172), but we have nothing concrete at the moment.
This is presumably because the constraint gets applied after the
Thanks for your insight @trexfeathers, that all makes sense!
This reduces the runtime by more than 35%!
# Note: this example is run on another machine;
# that's why the numbers differ to those given in the PR description
%%timeit
iris.load(multi_path) # 362 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
ds = xr.open_dataset(multi_path, chunks='auto')
ncdata.iris_xarray.cubes_from_xarray(ds) # 224 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@schlunma good to hear about the speedup. I was actually imagining modifying the NetCDF dataset to remove the variables you are not interested in, rather than going via Xarray. You might get even more speedups that way.
You're right, extracting the variable in xarray and then using
%%timeit
ds = xr.open_dataset(multi_path, chunks='auto')[["tauu_sso", "clat_bnds", "clon_bnds"]]
ncdata.iris_xarray.cubes_from_xarray(ds) # 35.6 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I didn't know how to extract a variable from an
What I also found is that by bypassing
%%timeit
ncd = ncdata.netcdf4.from_nc4(multi_path)
ncdata.iris.to_iris(ncd) # 643 ms ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is more than 2 times longer than using
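For anyone who wants to double-check the ratios between the timings quoted in this thread, here is a small sketch. It is only arithmetic on the reported means (362 ms, 224 ms, 35.6 ms, 643 ms). The dictionary labels and the helper names `ratio` and `reduction` are made up for illustration, and it assumes all four measurements come from the same machine.

```python
# Mean timings in milliseconds, copied from the %%timeit results quoted above.
# The labels are illustrative only, not official names.
timings_ms = {
    "iris.load": 362.0,
    "xarray -> ncdata": 224.0,
    "xarray (3 variables) -> ncdata": 35.6,
    "ncdata.netcdf4 -> ncdata.iris": 643.0,
}


def ratio(slow, fast, timings=timings_ms):
    """Return how many times longer `slow` takes compared to `fast`."""
    return timings[slow] / timings[fast]


def reduction(before, after, timings=timings_ms):
    """Return the fractional runtime reduction when going from `before` to `after`."""
    return 1.0 - timings[after] / timings[before]


if __name__ == "__main__":
    # Going via xarray instead of plain iris.load
    print(f"{reduction('iris.load', 'xarray -> ncdata'):.1%} less runtime")
    # Pure ncdata route compared with the xarray route
    print(f"{ratio('ncdata.netcdf4 -> ncdata.iris', 'xarray -> ncdata'):.2f}x")
    # Plain iris.load compared with selecting three variables in xarray first
    print(f"{ratio('iris.load', 'xarray (3 variables) -> ncdata'):.1f}x")
```

So the "more than 35%" figure is consistent with the quoted numbers, and selecting the variables before building cubes is by far the biggest win.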
Thanks for looking! What you can do fairly easily is to remove unwanted variables, using code like
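The snippet that belonged here is not preserved, so the following is only a minimal sketch of the idea, not the code originally posted. It assumes that `ncd.variables` behaves like a dict keyed by variable name and supports `del`; the helper names `drop_unwanted` and `load_selected` are hypothetical. `ncdata.netcdf4.from_nc4` and `ncdata.iris.to_iris` are the calls already used earlier in this thread.

```python
def drop_unwanted(variables, wanted):
    """Remove, in place, every entry of a dict-like mapping whose name is not in `wanted`."""
    wanted = set(wanted)
    for name in list(variables):  # copy the keys, since we delete while looping
        if name not in wanted:
            del variables[name]
    return variables


def load_selected(path, wanted):
    """Read `path` with ncdata, drop the unwanted variables, then build Iris cubes."""
    # Imported lazily so drop_unwanted() can be used without ncdata/Iris installed.
    from ncdata.netcdf4 import from_nc4
    from ncdata.iris import to_iris

    ncd = from_nc4(path)
    # Assumption: ncd.variables is a dict-like container keyed by variable name.
    drop_unwanted(ncd.variables, wanted)
    return to_iris(ncd)
```

Note that `wanted` would need to include not just the data variable but also any coordinate and bounds variables it refers to (e.g. `"clat_bnds"` and `"clon_bnds"` in the xarray example above), otherwise the resulting cube would probably lose that metadata.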
So, I presume in that case you are back to loading all the variables again?
So, I think Xarray is helping here because it analyses the file and grabs the 'other' variables as coords, without making a big deal of it.
FWIW Iris can also skip building cubes for unwanted data variables, but only in the rather limited case where a single NameConstraint is provided, which matches just one data variable. See here, and the call to it here.
However, if this would be of practical use, we could possibly revisit that approach + extend the cases it can handle? It would certainly make sense to be able to say something like
Thanks for all the details @pp-mo, this really helps a lot to understand what's going on here.
Yes, I already tried that and fully agree!
I was just surprised that using xarray as an additional layer is faster than not using it. From what I understand, this effectively does
What we currently do in ESMValTool is to load all cubes without any constraint and then
This really only becomes a problem for "raw" climate model data where 10s or even 100s of variables are stored in one netCDF file. Here, the aforementioned preprocessing
So yes, being able to do something like
It's really great that we can do that now with ncdata! Thanks for all your work on that!!
📰 Custom Issue
Hi! While evaluating a large number of files with multiple variables each, I noticed that ESMValTool is much slower when files contain a lot of variables. I could trace that back to Iris' load function. Here is an example of loading files with 1 and 61 variables:
As you can see, loading the file with 61 variables takes ~51 times as long as loading the file with 1 variable. Using a constraint does not help.
Doing the same with xarray gives:
Here, the difference between 1 and 61 variables is only a factor of ~7.
If only a single file needs to be loaded, this is not a problem, but this quickly adds up to a lot of time if 100s or even 1000s of files need to be read (which can be the case for climate models that write one file with many variables per time step).
Have you ever encountered this problem? Are there any tricks to make loading faster? As mentioned, I tried with a constraint, but that didn't work.
Thanks for your help!
Sample data: