Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple support for selective netcdf loading. #4176

Merged
merged 6 commits into from
Aug 10, 2021

Conversation

pp-mo
Copy link
Member

@pp-mo pp-mo commented Jun 3, 2021

🚀 Pull Request

Description

An initial heads-up on a really simple way of speeding up netcdf loads ...
... with files of many variables, as in #4134

This is actually much simpler than I imagined, but ..
Blockers to completion

  • probably needs discussion
  • intended usage might require to select multiple data-vars
  • not sure how to test this

Background...

Following #4135, ESMValTool devs are reporting that iris loading is still too slow, where you want only one from a whole lot of diagnostics.

This example follows what we did for UM files in similar circumstances.
It's notable that, in creating a iris.fileformats.cf.CFReader, we are still doing a whole-file analysis, that includes the unwanted data-variables.
I actually don't think you can avoid that, as only context will distinguish a CF data-variable from an aux-coord.
However, the cost of this is not huge. I am finding <1sec for the testfile mentioned in #4134
( which is : ~250mB, ~300 variables of content float[1 * 79 * 143 * 144] )

Some sample timings:

Using testfiles with many identical (small) variables:
n-vars : timings without // with fix
1 : 0.04 // 0.01 [loading 1 of N named variables]
10 : 0.14 // 0.02
30 : 0.45 // 0.03
100 : 1.70 // 0.07
300 : 8.19 // 0.40
(314) : 45.36 // 0.59

case (314) is based on the testfile mentioned in #4134
( I suspect it may be slower than the '300' because the variables data is larger?? WIP )
with the code like

cube = iris.load_cube(
    'Iris_multivar_data_file.nc',
    NameConstraint(long_name='Air Surface Temperature'))

Consult Iris pull request check list

@pp-mo pp-mo changed the title Ultra-simple support for selective netcdf loading. Simple support for selective netcdf loading. Jun 3, 2021
@pp-mo pp-mo mentioned this pull request Jun 3, 2021
10 tasks
@bjlittle bjlittle self-assigned this Jun 4, 2021
@trexfeathers
Copy link
Contributor

I've stared at this for a while now, and it makes sense.

My main question is: what does this look like beyond MVP? There are obviously a lot of limitations and hard-coding in this implementation; how would you make it more generalisable? I.e. multiple constraints, non-NameConstraints, less code duplication? Since we don't know when we might have time to expand the functionality, it would be good to make sure the merged code is something that can be iterated on, rather than something that would need replacing.

You can ignore my question if this is far as we're expecting to take this functionality in the foreseeable future.

@pp-mo
Copy link
Member Author

pp-mo commented Jun 4, 2021

what does this look like beyond MVP?

Well, first we need to check that this is addressing the immediate needs of the ESMValTool devs, otherwise it's not worth it.

A simple list of constraints acts with an "OR"-like behaviour, so that is mostly a non-starter for a speedup solution.
The simplest exception to that is provided for in the PP implementation, where we handle a list of constraints all of which are of the form AttributeConstraint(STASH=<selected>).
You could in principle extend that to cases where all the top-level constraints in the list are formed as '&' -combinations (i.e. ConstraintCombination), and one of those is a type we can "translate".
But in the PP case we never even got that far.

Apart from that, I think in this case we can work out how to do the equivalent of a Constraint(name=string), allowing that we decode a possible STASH attribute on the variable.
So I think that's about all that is technically possible + we need anyway to first decide what is actually useful before committing more time.

@trexfeathers
Copy link
Contributor

trexfeathers commented Jun 4, 2021

Well, first we need to check that this is addressing the immediate needs of the ESMValTool devs, otherwise it's not worth it.

Yup, makes sense not to invest time thinking about the future yet. In that case do you think it would be worth including a note or todo in the docstring to indicate that it's a bit of an experiment? If it's valuable: we'll need to re-write it more generally, if it's not: we'll need to back it out.

Indeed, I'd quite like if we could move translate_constraints_to_var_callback to the experimental module.

@pp-mo
Copy link
Member Author

pp-mo commented Jun 7, 2021

move translate_constraints_to_var_callback to the experimental module.

Ok I don't think there's any point doing that directly, as it isn't a public thing
(though I can see it should maybe have a "_" -- will do that !)

My feeling is that this is (should be) an implementation shortcut guaranteed to have no effect on correctness I.E. the final result is the same in all cases.
Equally importantly, we can then provide it with no visible user control.
So if you can keep it like that, then it doesn't really matter which cases it covers and which not,
( except whether it addresses a performance need ).

@pp-mo
Copy link
Member Author

pp-mo commented Jun 7, 2021

Thanks for inputs @trexfeathers @bjlittle
Will address those points in due course.
But, I'm going to hold fire on this until it's clear whether it adresses the ESMValTool devs' need.

@pp-mo
Copy link
Member Author

pp-mo commented Jun 14, 2021

Some further practical discussion at ESMValGroup/ESMValCore#1180 ESMValGroup/ESMValCore#1153 (comment) (the discussion has been moved to its own issue).

In my reply , I tried to explain how the existing solution here is only a quick hack.

As-is, it only supports a constraint expression which is a single NameConstraint.
But as noted above, we already know how to support a list of several of those (done for PP).

Some possibly-useful further cases :

  • We should probably support a simple string as constraint --> Constraint(name) --> match any of : standard/long/var-name or STASH-attribute
    • (all-of which can still be found in raw netcdf data, so eminently possible)
  • a list of single constraints, of the types we can handle (NameConstraint or Constraint(string), aka "translatable constraints")
  • We could also support, for example, any &-combinations involving a single translatable-type-Constraint as one of the 'terms'.
  • We can also do AttributeConstraints if wanted.

But the approach does not generalise fully -- i.e. some types of expression which "could" be translated (+ therefore speeded up) will remain 'opaque' to this type of analysis and it won't be possible to do everything.
Most notably, anything involving a selection-function is obviously "opaque", and that includes 2 important common cases :

  • anything with a cube_func option
  • a CoordConstraint testing for a value-range, e.g. coord=lambda cell: min <= cell < max

Likewise, in any case where we are given a list of constraints, we need to be able to 'translate' everything on the list, or we can't select on data-variable at all.

@zklaus
Copy link

zklaus commented Jun 18, 2021

For us, the main use case is to load one variable from a multi-variable file. I think that is really all we need.

@pp-mo
Copy link
Member Author

pp-mo commented Jul 30, 2021

Status report

EMSValTool people have confirmed that they only need the simplest case of this (one named var from one file)
Reviewing of outstanding work, as we want to get this into Iris 3.1.0

  1. respond to outstanding (small) review comment, and any further
  2. fix for multiple constraints, as it is in PP code + should be easy (possibly combine in common code ??)
  3. likewise, should be easy to support Constraint(name) form in addition to NameConstraint
    • so should probably do that to avoid confusion (as NameConstraint is a bit 'specialist').
  4. add whatsnew

(2) and (3) can be dropped if there is any difficulty or hurry

I (@pp-mo) can tackle this in ~2 weeks when I return from holiday,
or someone else can do it ?

@lbdreyer lbdreyer force-pushed the nc_selective_loading branch from 8b17dfe to 5e5a0f4 Compare August 10, 2021 10:54
@lbdreyer lbdreyer marked this pull request as ready for review August 10, 2021 11:00
@lbdreyer lbdreyer self-assigned this Aug 10, 2021
@lbdreyer lbdreyer force-pushed the nc_selective_loading branch from 02c8b1b to 58399fd Compare August 10, 2021 11:03
@lbdreyer
Copy link
Member

lbdreyer commented Aug 10, 2021

I have taken this on whilst @pp-mo is on holiday.

Changes I have made:

@lbdreyer lbdreyer force-pushed the nc_selective_loading branch from 58399fd to fa5561b Compare August 10, 2021 12:05
@lbdreyer
Copy link
Member

Rebased to resolve merge conflict now that #4206 has gone in

Copy link
Member

@bjlittle bjlittle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Awesome! Thanks for helping to nudge this PR across the line 🍻

Just a couple of comments to service, then we're good to go 🚀

docs/src/whatsnew/latest.rst Outdated Show resolved Hide resolved
lib/iris/fileformats/netcdf.py Show resolved Hide resolved
@lbdreyer
Copy link
Member

@bjlittle Thanks for your review! I believe I have now resolved the review actions you raised, so could you have a final check?

@bjlittle bjlittle merged commit 5aae571 into SciTools:main Aug 10, 2021
pp-mo added a commit to pp-mo/iris that referenced this pull request Aug 17, 2021
lbdreyer pushed a commit that referenced this pull request Aug 17, 2021
* Replace interim 3.0.x whatsnews with single v3.1.rst, with concrete version and date.

* Mark as unreleased, fix links.

* Rearrange entries tagged 'pre-3.1', and recategorise #4176 change.

* Moved bugfix notes from v3.1 to v3.0 whatsnew page.

* Fixed broken pull-request refs.

* Fix link to PyKE.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants