
Dask chunking control for netcdf loading. #4572

Closed

Conversation


@pp-mo pp-mo commented Feb 8, 2022

Aims to address #3333

N.B. there is a reference docs-build here
that shows the API

Still a bit preliminary; testing and documentation probably need more work.
But here it is for discussion. I basically set out to do this because I couldn't stop thinking about it.
@cpelley @TomekTrzeciak @hdyson

I will also add some comments in the nature of a discussion...
Some of those are pointers to things we might want changed, and some stand in for missing documentation (?!)


pp-mo commented Feb 8, 2022

API discussion

The API (reference docs-build here) is actually rather different from what I had expected to end up with.
Here are some reasons..

I reasoned that, when loading several pieces of data, it will often make sense to align the chunking according to the dimensions.
That could be several datacubes with the same dimensions, but it also applies to the cube components, e.g. multiple AuxCoords.
The logic of using dimensions is that different AuxCoords (for instance) may not all have the same dimensions, but if arithmetic might be performed between different ones, aligning their chunks makes good sense. Whereas a full 'chunks' argument touching all the dimensions can only apply to variables with one specific set of dimensions.
Also, by pinning only certain dimensions, I can get the automatic operation to optimise the sizes in the other dims.
E.g. if I had automatic chunks of (1, 1, 100, 1000, 2000) and I then pin the 3rd dim to 3, I will get (1, 33, 3, 1000, 2000) instead, which is still "optimally" sized (33 × 3 ≈ 100, so the chunk volume stays roughly the same).
(TBD: this code needs proper testing !!)

Since that totally dimension-oriented approach is not always correct, though, you also need a way to give different settings for different variables. Hence the var-name-selective behaviour.

It then made sense to me to apply common dimension settings to individual result cubes, which results in the logic shown.
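To make that concrete, here is a minimal sketch of the intended usage, based on the API examples shown elsewhere in this PR (the file name and variable names are placeholders):

    import iris
    from iris.fileformats.netcdf import CHUNK_CONTROL

    # Pin the "pressure" dimension to single-level chunks, for every loaded variable.
    with CHUNK_CONTROL.set(pressure=1):
        cubes = iris.load("my_data.nc")

    # Or restrict a setting to one named variable only.
    with CHUNK_CONTROL.set("air_temperature", model_level_number=2):
        cube = iris.load_cube("my_data.nc", "air_temperature")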

Roads not taken

(1) I had expected to add loader kwargs, and involve #3720.
However, I instead took ideas from the unstructured data loading control to provide a context-manager approach.
One big advantage of this is that you can just invoke existing code, with maybe multiple load calls, without modifying the existing code at all.

(2) It is possible to make settings apply to a filename or file-path (pathlib.PurePath.match seems a good way).
This would obviously enable more selective usage when loading from a set of files, which might be needed if different files have vars with the same names.
I trialled this, but it only really saves you an automatic merge, while adding complexity and explanation that I'm not sure is worth it.
(Whereas we can't reasonably drop the var-name selectivity, since we can't avoid scanning multiple variables from a single file.)
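For reference, a small sketch of how that path-pattern selection could have worked, using pathlib.PurePath.match (the paths and pattern here are invented for illustration):

    from pathlib import PurePath

    def settings_apply_to(filepath, pattern):
        # True if chunking settings keyed on `pattern` should apply to `filepath`.
        return PurePath(filepath).match(pattern)

    print(settings_apply_to("/data/run1/prognostics_000.nc", "run1/*.nc"))  # True
    print(settings_apply_to("/data/run2/prognostics_000.nc", "run1/*.nc"))  # False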


pp-mo commented Feb 8, 2022

Usage examples.

This should really go in docs "somewhere"
For now, I put usage examples in some stopgap integration tests.

An example that is working for me solves the problem referred to here.

In that specific example ...

  • data was loaded from multiple 1-timestep files
    • with dims (realization: 18; pressure: 33; latitude: 960; longitude: 1280)
  • by loading multiple files -- e.g. 10 -- and with the auto-merge in load, we get a single cube
    • with dims (time:10, realization: 18; pressure: 33; latitude: 960; longitude: 1280)
    • which is chunked as (1, 1, 17, 960, 1280)
      • (the '17' is determined by the dask default chunksize).
  • the user then wants only one pressure level, so indexes out the pressure dimension (cube[:, :, 0] on the merged cube)
    • --> (time:10, realization: 18; latitude: 960; longitude: 1280)
    • chunked as (1, 1, 960, 1280)
  • but attempting to fetch that ~200Mb of data (in ~1Mb chunks) consumes several Gb of memory, and can crash the machine
  • alternatively, saving it to a file (dask streaming) works, but is extremely slow (roughly 20x slower than for similar data)

When applying CHUNK_CONTROL.set(pressure=1) to this, the chunking of the original merged cube changes
from (1, 1, 17, 960, 1280)
to (1, 17, 1, 960, 1280)
So that a chunk spans several realization points, but only one pressure level.
This then works much better.
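Putting that together, a sketch of the whole workflow, assuming the CHUNK_CONTROL API from this PR (file paths and the variable name are placeholders):

    import iris
    from iris.fileformats.netcdf import CHUNK_CONTROL

    filepaths = [f"forecast_{i:03d}.nc" for i in range(10)]  # ten 1-timestep files

    # Chunk with a single pressure level per chunk, so later per-level access is cheap.
    with CHUNK_CONTROL.set(pressure=1):
        cube = iris.load_cube(filepaths, "air_temperature")  # auto-merges to one 5-d cube

    single_level = cube[:, :, 0]   # select one pressure level
    data = single_level.data       # realising this now only reads the relevant chunks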


@TomekTrzeciak TomekTrzeciak left a comment


I basically set out to do this because I couldn't stop thinking about it.

That's always a good motivator 😉

First off, really glad to see progress on this. I think a context manager makes a lot of sense for this, given the many layers that these options would have to pass through. If I could have a wish, it would be for some way to disable Iris' logic entirely and fall back on Dask and its machinery to manage the chunking.

``dask.config.set({'array.chunk-size': '250MiB'})``.

"""
old_settings = deepcopy(self.var_dim_chunksizes)


you could use collections.ChainMap here for stacking the settings
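For illustration, a sketch of that ChainMap idea: each nested set(...) pushes a new layer instead of deep-copying the whole dict, and leaving the context just pops it again (the class and attribute names here are invented):

    from collections import ChainMap
    from contextlib import contextmanager

    class ChunkSettings:
        def __init__(self):
            self._settings = ChainMap({})  # most recent layer first

        @contextmanager
        def set(self, **dim_chunksizes):
            self._settings = self._settings.new_child(dim_chunksizes)
            try:
                yield
            finally:
                self._settings = self._settings.parents  # drop the layer on exit

    settings = ChunkSettings()
    with settings.set(pressure=1):
        with settings.set(latitude=100):
            print(dict(settings._settings))  # {'pressure': 1, 'latitude': 100}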

chunks = np.array(chunks)
dims_fixed_arr = np.array(dims_fixed)
# Reduce the target size by the fixed size of all the 'fixed' dims.
point_size_limit = point_size_limit // np.prod(chunks[dims_fixed_arr])


It would be nice to somehow allow Dask's special chunk values (-1, 'auto' and None), but I guess this is at odds with doing this logic on the Iris side. Have you tried to inquire whether Dask would be interested in adding/extending their chunking logic to support such use cases? I think it would make the most sense to boot this out of Iris, as per the note here.
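For reference, this is roughly how dask itself accepts those special values when rechunking (a standalone dask example, not Iris code):

    import dask.array as da

    x = da.zeros((10, 33, 960, 1280), chunks=(1, 17, 960, 1280))
    # -1: whole dimension in one chunk; None: leave unchanged; "auto": let dask pick
    # a size that fits its configured "array.chunk-size" target.
    y = x.rechunk({0: -1, 1: None, 2: "auto", 3: "auto"})
    print(y.chunks)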

Member Author


(background, not really an answer!)

The problem with the existing Dask built-in strategy is that it treats all dimensions as equal and tries to even out the chunking across all dims. That is pretty clearly unsuitable in most of our cases, given the known contiguity factors in file storage, and especially when the file contains 2d fields which cannot be efficiently sub-indexed.
That is why we provided the Iris version, which prioritises splitting the "outer" dims and keeping the "inner" ones together (see the sketch below).
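As a toy illustration of that "outer dims first" idea (this is not Iris's actual algorithm, just the shape of the strategy):

    def outer_dims_first_chunks(shape, point_limit):
        # Keep inner (fastest-varying) dims whole; split outer dims until a
        # chunk fits within `point_limit` array points.
        chunks = list(shape)
        for i in range(len(shape)):
            inner = 1
            for size in shape[i + 1:]:
                inner *= size
            chunks[i] = max(1, min(shape[i], point_limit // inner))
        return tuple(chunks)

    # e.g. a (10, 18, 33, 960, 1280) array with a ~160-million-point budget:
    print(outer_dims_first_chunks((10, 18, 33, 960, 1280), 160_000_000))
    # --> (1, 3, 33, 960, 1280) : latitude/longitude kept whole, outer dims split.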

But it's certainly true that we could feed back and engage, as per the note you mention.
Unfortunately though, the dask code is really complicated and engagement seems a bit of an uphill struggle IMHO.
By which I mean ... I have repeatedly attempted to engage with the dask codebase, and maybe contribute, but I've found it very hard to understand, with a lot of unexplained terminology and framework concepts.
So it seems to me that there is a problem with contributing unless you are an "expert", because there is a lot you need to know beyond what the "ordinary" user sees.

In fact, as a very relevant example, I considered addressing this same chunking-control problem by picking apart the graph of a loaded cube, to identify and adjust (or replace) the iris.fileformats.netcdf.NetCDFDataProxy objects which are in there. It's perfectly do-able, but you need to understand the structure of the dask graph and employ various assumptions about it. However, there is no public account of that, nor any apparent intention for "ordinary users" to manipulate graphs in this way.

So it would seem that a dask graph is really an "opaque" object whose internal structure might get refactored and change completely, in which case such a solution, being outside intended usage, would simply need re-writing. Effectively, AFAICT, only a dask "expert" is expected to work at that level and, crucially, such a solution only really makes sense if the code lives in the dask codebase.

Member Author

@pp-mo pp-mo Feb 24, 2022


@TomekTrzeciak It would be nice to somehow allow Dask's special chunk values (-1, 'auto' and None),

Reminder of how that is described :

Chunks also includes three special values:
   * -1: no chunking along this dimension
   * None: no change to the chunking along this dimension (useful for rechunk)
   * "auto": allow the chunking in this dimension to accommodate ideal chunk sizes

From a technical POV, I guess we could respect -1, but the others are a little tricky :
"Auto" is sort-of what we are already doing, with the dims that don't have an explicit control given.
But it isn't quite what Dask means, since we will still want to apply our dimension ordering priority.
"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present, we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.

Can you suggest what the interpretations would be -- how would this work ?


Unfortunately though, the dask code is really complicated and engagement seems a bit of an uphill struggle IMHO.
By which I mean ... I have repeatedly attempted to engage with the dask codebase, and maybe contribute, but I've found it very hard to understand, with a lot of unexplained terminology and framework concepts.

Oh, I didn't mean to suggest you make a PR to Dask, only that it may be good to have a conversation and make them aware of the issue.


From a technical POV, I guess we could respect -1, but the others are a little tricky :
"Auto" is sort-of what we are already doing, with the dims that don't have an explicit control given. But it isn't quite what Dask means, since we will still want to apply our dimension ordering priority.
"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present, we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.

Can you suggest what the interpretations would be -- how would this work ?

For None the only sensible interpretation I can think of is to keep the chunking from the netcdf file itself as you suggest, but dunno what Dask is doing in that case.

I reckon Auto would require repeating the logic that Dask is doing, but restricted only to the specified dimensions. Seems quite tricky; probably best not to support it if the implementation is going to live on the Iris side.


"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present, we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.

For None the only sensible interpretation I can think of is to keep the chunking from the netcdf file itself as you suggest, but dunno what Dask is doing in that case.

Sorry, I've re-read your comment and I think this is not what you suggest. My interpretation of None is that it behaves the same as if you had explicitly specified chunks equal to those already existing on the variable, i.e. this is their final rather than initial value.


#: A :class:`ChunkControl` object providing user-control of Dask chunking
#: when Iris loads netcdf files.
CHUNK_CONTROL: ChunkControl = ChunkControl()


Nitpick: I would go with lower case here, CHUNK_CONTROL.set(...) looks a bit shouty.

Contributor


Capitalised variables are an easy-to-notice way of indicating module-level constants, and I think it's common practice in Python; certainly it's common throughout Iris.

I agree with you it's a bit too shouty but that's overridden by the value of having this shared convention IMHO.


@TomekTrzeciak TomekTrzeciak Feb 18, 2022


Isn't this convention mostly for simple things with "fixed" value, rather than objects with complex behaviour like context managers?

Contributor


Oh. In that case I've messed up elsewhere! Look forward to seeing what @pp-mo has to say 😅

Member


I'm going to do a drive-by nitpick as I'm getting all the email alerts for this thread in my inbox :)

There should be no need for the type hint here; a Python type checker should be smart enough to know that

CHUNK_CONTROL = ChunkControl()

is indeed a ChunkControl instance.

Member Author


@TomekTrzeciak Nitpick

As regards capitalisation, I was basically mimicking iris.FUTURE.
Which makes some kind of sense IMHO as this is very much the same style of thing.

On the other hand it's possible that, when that was introduced, we were slightly conflicted about the difference between a "constant value" and a "global variable", as you suggest, since in Python there is essentially no actual distinction.


@jamesp drive-by nitpick

It wasn't to enable type hinting, but an attempt to inject the type into the docstring representation.
https://pp-iris.readthedocs.io/en/nc_chunk_control/generated/api/iris/fileformats/netcdf.html#iris.fileformats.netcdf.CHUNK_CONTROL
Hence the colon. But it doesn't seem to actually do so, presumably because Sphinx is not so configured 😞

Member Author

@pp-mo pp-mo Feb 24, 2022


I was basically mimicking iris.FUTURE.

Ooh, but then we have :
iris.fileformats.um.structured_um_loading
which is a generator-context-manager, rather than a flag-controlling object.
So it is used as with structured_um_loading(): ... ,
but internally it does with STRUCTURED_LOAD_CONTROLS.context(loads_use_structured=True): ...
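For anyone comparing the two styles, a minimal sketch (with invented names) of the difference: a plain generator-context-manager, as with structured_um_loading, versus an object whose set()/context() method returns a context manager, as CHUNK_CONTROL does:

    from contextlib import contextmanager

    @contextmanager
    def structured_loading_style():
        # used as:  with structured_loading_style(): ...
        print("enable special loading")
        try:
            yield
        finally:
            print("restore previous behaviour")

    class ControlObjectStyle:
        # used as:  with CONTROL.set(pressure=1): ...
        @contextmanager
        def set(self, **settings):
            print(f"apply {settings}")
            try:
                yield
            finally:
                print("restore previous settings")

    CONTROL = ControlObjectStyle()

    with structured_loading_style():
        pass

    with CONTROL.set(pressure=1):
        pass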

and isinstance(chunksize, int)
):
msg = (
"'dimension_chunksizes' kwargs should be an iterable "
Member Author


TODO: fix this : it's not an iterable in the API, it's a dictionary..

apply the ``dimension_chunksizes`` controls only to these variables,
or when building cubes from these data variables.
If None (the default), settings apply to all loaded variables.
dimension_chunksizes : dict: str --> int
Member Author


TODO: this might be extended to more complex chunking controls, if useful?
Probably could do nonuniform sizes within a dimension (e.g. (3, 5, 5, 2)), but not across multiple dims (not even sure how you would express that, but I think it is possible) -- see the sketch below.
Plus None/auto/-1 as already suggested
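Dask does already accept fully explicit, nonuniform block sizes per dimension, so an extension along those lines would have something to map onto. A standalone dask sketch (not part of this PR):

    import numpy as np
    import dask.array as da

    data = np.arange(15 * 4).reshape(15, 4)
    # Blocks of 3, 5, 5 and 2 along the first dimension; one block along the second.
    arr = da.from_array(data, chunks=((3, 5, 5, 2), (4,)))
    print(arr.chunks)  # ((3, 5, 5, 2), (4,))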

Comment on lines +70 to +71
with CHUNK_CONTROL.set(self.cube_varname, model_level_number=2):
cube = iris.load_cube(self.tempfile_path, self.cube_varname)


👍


pp-mo commented Oct 27, 2022

As noted on parent issue #3333, xarray already provides a (slightly less sophisticated) chunking control, and could be of use via #4994
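For comparison, the xarray control referred to is the per-dimension "chunks" mapping on open_dataset; a sketch with placeholder file and variable names:

    import xarray as xr

    ds = xr.open_dataset("my_data.nc", chunks={"pressure": 1})
    print(ds["air_temperature"].data.chunks)  # dask chunks, one pressure level each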

@trexfeathers
Contributor

We're going to make this happen! But since this PR has gone rather stale, and @pp-mo has a lot of assigned work, we're going to close this PR and aim to apply the code to the latest Iris. See #5398 for the task breakdown.
