Load is VERY slow for a NetCDF multi-variable file #4134

Closed
senesis opened this issue May 14, 2021 · 16 comments · Fixed by #4135

@senesis

senesis commented May 14, 2021

📰 Custom Issue

When loading a single variable from a fairly small NetCDF file that contains 300 variables, the load time is very long: around 100 seconds (versus less than 0.1 s for a similar single-variable file).

This is a bottleneck when trying to use Iris (through ESMValTool) to handle some climate models' native data formats.

The attached notebook load_time_histmth.pdf demonstrates the issue and includes profiling output, which shows that the most time-consuming function is (by far) NetCDFDataProxy.__getitem__
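For readers who want to reproduce this kind of profile, here is a minimal sketch using only the standard library. The helper `profile_call` and the stand-in `fake_load` are hypothetical names introduced for illustration; they are not taken from the notebook. With Iris installed, the callable would presumably be something like `lambda: iris.load_cube(path, name)`, where `path` and `name` are your own file and variable.

```python
import cProfile
import io
import pstats


def profile_call(func, top=5):
    """Run func() under cProfile; return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def touch_variable(index):
    """Hypothetical stand-in for the per-variable work done during a load."""
    return sum(range(1000))


def fake_load(n_variables=300):
    """Hypothetical stand-in for loading a file with many variables."""
    return [touch_variable(i) for i in range(n_variables)]


if __name__ == "__main__":
    cubes, report = profile_call(fake_load)
    print(report)
```

Sorting by cumulative time makes a function that is called once per file variable stand out, even when each individual call is cheap.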

The data file is available here

System info:

uname -a
Linux ciclad-ng.private.ipsl.fr 2.6.32-754.35.1.el6.x86_64 #1 SMP Wed Oct 7 03:47:54 CDT 2020 x86_64 x86_64 x86_64 GNU/Linux

@bjlittle
Member

@senesis Thanks for taking the time to report this, much appreciated.

Could you just confirm the version of iris that you're using? I'm assuming v3.0.1?

@senesis
Author

senesis commented May 14, 2021

Yes it is v3.0.1

@bjlittle
Member

@senesis Great, thanks. And which version of Python? v3.8?

@bjlittle
Member

@senesis We have a patch in the pipeline that will go towards significantly alleviating this issue.

We're going to target this for the forthcoming iris v3.0.2 release 👍

See the v3.0.x Release Discussion for further details.

@bjlittle bjlittle pinned this issue May 14, 2021
@pp-mo
Member

pp-mo commented May 14, 2021

@senesis great issue -- your account is excellent + should make the problem reproducible !

I was already looking into issues with slow netcdf loads, specifically with lots of variables (i.e. we also found something similar).
I think I may already have a solution that will at least "alleviate" it ... coming shortly I hope.

In the meantime though, I wonder if you could put up your 'load_time_histmth' notebook so I can test out with that, maybe in a Gist ??

@pp-mo
Member

pp-mo commented May 14, 2021

Stop press: see #4135
I just tested this with your file 'Iris_multivar_data_file.nc'
For me, it speeds up loading that (300-odd cubes) from ~50 to ~5 secs.
Win ! 😀

@senesis
Author

senesis commented May 14, 2021

In the meantime though, I wonder if you could put up your 'load_time_histmth' notebook so I can test out with that, maybe in a Gist ??

I am not familiar with Gist. The notebook is available here

@senesis
Author

senesis commented May 14, 2021

For me, it speeds up loading that (300-odd cubes) from ~50 to ~5 secs.

Great, I am looking forward to getting 3.0.2 (and to ESMValTool using it).

@rcomer rcomer linked a pull request May 14, 2021 that will close this issue
@bjlittle
Member

@senesis GitHub gists are a fantastic way to easily share snippets of code and notebooks with your peers.

Check out the GitHub documentation for further details 👍

@pp-mo
Member

pp-mo commented May 17, 2021

The notebook is available here

Well I tried the notebook, but I'm not sure if it delivers any more info really, as I am just using it with the same 'Iris_multivar_data_file.nc' file you mentioned above, which may not be the same.
Anyway, it has ~260 variables with dimensions "t, y, x".
For me the basic load is taking ~46 seconds, which reduces to 5.7 secs with the #4135 fix.

@senesis
Author

senesis commented May 17, 2021

The notebook is available here

Well I tried the notebook, but I'm not sure if it delivers any more info really, as I am just using it with the same 'Iris_multivar_data_file.nc' file you mentioned above, which may not be the same.
Anyway, it has ~260 variables with dimensions "t, y, x".
For me the basic load is taking ~46 seconds, which reduces to 5.7 secs with the #4135 fix.

I actually used the same file in my notebook run.
Thanks again for the fix.

@bjlittle bjlittle unpinned this issue May 26, 2021
@bjlittle
Member

Closed by #4158

@pp-mo
Member

pp-mo commented Mar 14, 2022

I just realised, I'm not sure if you people are aware of the potential impact of #4572 ?
This should have considerable potential for improving load performance.

I believe we did discuss the issues raised by this in #3333.
In summary of #4572, I for one have now basically changed my mind on how to do this :

  • I concluded that it is not practical to specify loading controls in Iris/CF terms
    • since we can't easily identify what a given file variable would become, when loaded into Iris, before we have done so
  • so we must instead specify controls using more low-level file-level names
    • i.e. var_names and names of dimensions
  • However, focussing on dimensions rather than variables should allow controls to be more easily and consistently
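As a rough illustration of what "controls keyed on dimension names" could look like, here is a small sketch. It is not the API proposed in #4572: the function `chunks_for_variable`, the `dim_chunks` mapping, and the array sizes are all hypothetical, chosen only to show why one per-dimension setting applies consistently to every file variable that shares that dimension.

```python
def chunks_for_variable(dims, shape, dim_chunks):
    """Return a chunk shape for one file variable.

    dims       -- tuple of file-level dimension names, e.g. ("t", "y", "x")
    shape      -- tuple of dimension lengths, same order as dims
    dim_chunks -- hypothetical mapping {dimension name: requested chunk size}

    Dimensions with no entry in dim_chunks keep their full length.
    """
    if len(dims) != len(shape):
        raise ValueError("dims and shape must have the same length")
    return tuple(
        min(size, dim_chunks.get(name, size))
        for name, size in zip(dims, shape)
    )


if __name__ == "__main__":
    # Sizes below are made up for illustration.
    requested = {"t": 1}
    print(chunks_for_variable(("t", "y", "x"), (120, 143, 144), requested))
    print(chunks_for_variable(("y", "x"), (143, 144), requested))
```

Because the request names a dimension rather than a variable, it does not need to know what each file variable would become once loaded into Iris.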

I believe that the detail of what seems like a useful API and feature-set really needs some trials, to examine specific practical cases.
So feedback is very welcome. -- please review / try out #4572 + feed back !

@senesis
Author

senesis commented Mar 16, 2022

From the API doc, I do not understand how this feature could be used when the use case is 'just speed up loading a single variable from a multi-variable file, whatever the set of dimensions and their sizes'.

@pp-mo
Member

pp-mo commented Mar 16, 2022

@senesis I do not understand how this feature could be used when the use case is 'just speed up loading a single variable

Apologies, I think you are right -- it doesn't have much relevance to this case after all.

I think I got my wires crossed here -- I was looking for an ESMValTool-related loading issue I thought I remembered, where chunking definitely was an issue. But this isn't it !
Do you recall something like that problem elsewhere @senesis ?
( I couldn't find it, except maybe #3362 which I think is essentially solved )

@senesis
Author

senesis commented Mar 16, 2022

Do you recall something like that problem elsewhere @senesis ?

No, sorry.


3 participants