
Dask chunking control for netcdf loading. #4572

Closed

Conversation


@pp-mo pp-mo commented Feb 8, 2022

Aims to address #3333

N.B. there is a reference docs-build here
that shows the API

Still a bit preliminary; testing and documentation probably need more work.
But here it is for discussion. I basically set out to do this because I couldn't stop thinking about it.
@cpelley @TomekTrzeciak @hdyson

I will also add some comments in the nature of a discussion...
Some of those are pointers to things we might want changed, and some stand in for missing documentation (?!)


pp-mo commented Feb 8, 2022

API discussion

The API (reference docs-build here) is actually rather different from what I had expected to end up with.
Here are some reasons..

I reasoned that, when loading several pieces of data, it will often make sense to align the chunking according to the dimensions.
That could be several datacubes with the same dimensions, but it also applies to the cube components, e.g. multiple AuxCoords.
The logic of using dimensions is that different AuxCoords (for instance) may not all have the same dimensions, but if arithmetic might be performed between different ones, aligning their chunks makes good sense. Whereas a full 'chunks' argument touching all the dimensions can only apply to variables with one specific set of dimensions.
Also, by pinning only certain dimensions, I can get the automatic operation to optimise the sizes in the other dims.
E.g. if I had automatic chunks of (1, 1, 100, 1000, 2000) and I then pin the 3rd dim to 3, I will get (1, 33, 3, 1000, 2000) instead, which is still "optimally" sized (33 × 3 ≈ 100, so the chunk volume stays roughly the same).
(TBD: this code needs proper testing !!)

Since that totally dimension-oriented approach is not always correct, though, you also need a way to give different settings for different variables. Hence the var-name-selective behaviour.

It then made sense to me to apply common dimension settings to individual result cubes, which results in the logic shown.
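To make that concrete, here is a minimal sketch of the intended usage, based on the API examples shown elsewhere in this PR (the file name and variable names are placeholders):

    import iris
    from iris.fileformats.netcdf import CHUNK_CONTROL

    # Pin the "pressure" dimension to single-level chunks, for every loaded variable.
    with CHUNK_CONTROL.set(pressure=1):
        cubes = iris.load("my_data.nc")

    # Or restrict a setting to one named variable only.
    with CHUNK_CONTROL.set("air_temperature", model_level_number=2):
        cube = iris.load_cube("my_data.nc", "air_temperature")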

Roads not taken

(1) I had expected to add loader kwargs, and involve #3720.
However, I instead took ideas from the unstructured data loading control to provide a context-manager approach.
One big advantage of this is that you can just invoke existing code, with maybe multiple load calls, without modifying the existing code at all.

(2) It is possible to make settings apply to a filename or file-path (pathlib.PurePath.match seems a good way).
This would obviously enable more selective usage when loading from a set of files, which might be needed if different files have vars with the same names.
I trialled this, but it only really saves you an automatic merge, while adding complexity and explanation that I'm not sure is worth it.
(Whereas we can't reasonably drop the var-name selectivity, since we can't avoid scanning multiple variables from a single file.)
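For reference, a small sketch of how that path-pattern selection could have worked, using pathlib.PurePath.match (the paths and pattern here are invented for illustration):

    from pathlib import PurePath

    def settings_apply_to(filepath, pattern):
        # True if chunking settings keyed on `pattern` should apply to `filepath`.
        return PurePath(filepath).match(pattern)

    print(settings_apply_to("/data/run1/prognostics_000.nc", "run1/*.nc"))  # True
    print(settings_apply_to("/data/run2/prognostics_000.nc", "run1/*.nc"))  # False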


pp-mo commented Feb 8, 2022

Usage examples.

This should really go in docs "somewhere"
For now, I put usage examples in some stopgap integration tests.

An example that is working for me solves the problem referred to here.

In that specific example ...

  • data was loaded from multiple 1-timestep files
    • with dims (realization: 18; pressure: 33; latitude: 960; longitude: 1280)
  • by loading multiple files -- e.g. 10 -- and with the auto-merge in load, we get a single cube
    • with dims (time:10, realization: 18; pressure: 33; latitude: 960; longitude: 1280)
    • which is chunked as (1, 1, 17, 960, 1280)
      • (the '17' is determined by the dask default chunksize).
  • the user then wants only one pressure level, so indexes out the pressure dimension (cube[:, :, 0] on the merged cube)
    • --> (time:10, realization: 18; latitude: 960; longitude: 1280)
    • chunked as (1, 1, 960, 1280)
  • but attempting to fetch that ~200Mb of data (in ~1Mb chunks) consumes several Gb of memory, and can crash the machine
  • alternatively, saving it to a file (dask streaming) works, but is extremely slow (roughly 20x slower than for similar data)

When applying CHUNK_CONTROL.set(pressure=1) to this, the chunking of the original merged cube changes
from (1, 1, 17, 960, 1280)
to (1, 17, 1, 960, 1280)
So that a chunk spans several realization points, but only one pressure level.
This then works much better.
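Putting that together, a sketch of the whole workflow, assuming the CHUNK_CONTROL API from this PR (file paths and the variable name are placeholders):

    import iris
    from iris.fileformats.netcdf import CHUNK_CONTROL

    filepaths = [f"forecast_{i:03d}.nc" for i in range(10)]  # ten 1-timestep files

    # Chunk with a single pressure level per chunk, so later per-level access is cheap.
    with CHUNK_CONTROL.set(pressure=1):
        cube = iris.load_cube(filepaths, "air_temperature")  # auto-merges to one 5-d cube

    single_level = cube[:, :, 0]   # select one pressure level
    data = single_level.data       # realising this now only reads the relevant chunks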


@TomekTrzeciak TomekTrzeciak left a comment


I basically set out to do this because I couldn't stop thinking about it.

That's always a good motivator 😉

First off, really glad to see progress on this. I think a context manager makes a lot of sense for this, given the many layers that these options would have to pass through. If I could have a wish, it would be for some way to disable Iris' logic entirely and fall back on Dask and its machinery to manage the chunking.

``dask.config.set({'array.chunk-size': '250MiB'})``.

"""
old_settings = deepcopy(self.var_dim_chunksizes)


you could use collections.ChainMap here for stacking the settings
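For illustration, a sketch of that ChainMap idea: each nested set(...) pushes a new layer instead of deep-copying the whole dict, and leaving the context just pops it again (the class and attribute names here are invented):

    from collections import ChainMap
    from contextlib import contextmanager

    class ChunkSettings:
        def __init__(self):
            self._settings = ChainMap({})  # most recent layer first

        @contextmanager
        def set(self, **dim_chunksizes):
            self._settings = self._settings.new_child(dim_chunksizes)
            try:
                yield
            finally:
                self._settings = self._settings.parents  # drop the layer on exit

    settings = ChunkSettings()
    with settings.set(pressure=1):
        with settings.set(latitude=100):
            print(dict(settings._settings))  # {'pressure': 1, 'latitude': 100}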

chunks = np.array(chunks)
dims_fixed_arr = np.array(dims_fixed)
# Reduce the target size by the fixed size of all the 'fixed' dims.
point_size_limit = point_size_limit // np.prod(chunks[dims_fixed_arr])


It would be nice to somehow allow Dask's special chunk values (-1, 'auto' and None), but I guess this is at odds with doing this logic on the Iris side. Have you tried to inquire whether Dask would be interested in adding/extending their chunking logic to support such use cases? I think it would make the most sense to boot this out of Iris, as per the note here.
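For reference, this is roughly how dask itself accepts those special values when rechunking (a standalone dask example, not Iris code):

    import dask.array as da

    x = da.zeros((10, 33, 960, 1280), chunks=(1, 17, 960, 1280))
    # -1: whole dimension in one chunk; None: leave unchanged; "auto": let dask pick
    # a size that fits its configured "array.chunk-size" target.
    y = x.rechunk({0: -1, 1: None, 2: "auto", 3: "auto"})
    print(y.chunks)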

Member Author


(background, not really an answer!)

The problem with the existing Dask built-in strategy is that it treats all dimensions as equal and tries to even out the chunking across all dims. That is pretty clearly unsuitable in most of our cases, given the known contiguity factors in file storage, and especially when the file contains 2d fields which cannot be efficiently sub-indexed.
That is why we provided the Iris version, which prioritises splitting the "outer" dims and keeping the "inner" ones together (see the sketch below).
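As a toy illustration of that "outer dims first" idea (this is not Iris's actual algorithm, just the shape of the strategy):

    def outer_dims_first_chunks(shape, point_limit):
        # Keep inner (fastest-varying) dims whole; split outer dims until a
        # chunk fits within `point_limit` array points.
        chunks = list(shape)
        for i in range(len(shape)):
            inner = 1
            for size in shape[i + 1:]:
                inner *= size
            chunks[i] = max(1, min(shape[i], point_limit // inner))
        return tuple(chunks)

    # e.g. a (10, 18, 33, 960, 1280) array with a ~160-million-point budget:
    print(outer_dims_first_chunks((10, 18, 33, 960, 1280), 160_000_000))
    # --> (1, 3, 33, 960, 1280) : latitude/longitude kept whole, outer dims split.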

But it's certainly true that we could feed back and engage, as per the note you mention.
Unfortunately though, the dask code is really complicated and engagement seems a bit of an uphill struggle IMHO.
By which I mean ... I have repeatedly attempted to engage with the dask codebase, and maybe contribute, but I've found it very hard to understand, with a lot of unexplained terminology and framework concepts.
So it seems to me that there is a problem with contributing unless you are an "expert", because there is a lot you need to know beyond what the "ordinary" user sees.

In fact, as a very relevant example, I considered addressing this same chunking-control problem by picking apart the graph of a loaded cube, to identify and adjust (or replace) the iris.fileformats.netcdf.NetCDFDataProxy objects which are in there. It's perfectly do-able, but you need to understand the structure of the dask graph and employ various assumptions about it. However, there is no public account of that, nor any apparent intention for "ordinary users" to manipulate graphs in this way.

So it would seem that a dask graph is really an "opaque" object whose internal structure might get refactored and change completely, in which case such a solution, being outside intended usage, would simply need re-writing. Effectively, AFAICT, only a dask "expert" is expected to work at that level and, crucially, such a solution only really makes sense if the code lives in the dask codebase.

Member Author

@pp-mo pp-mo Feb 24, 2022


@TomekTrzeciak It would be nice to somehow allow Dask's special chunk values (-1, 'auto' and None),

Reminder of how that is described :

Chunks also includes three special values:
   * -1: no chunking along this dimension
   * None: no change to the chunking along this dimension (useful for rechunk)
   * "auto": allow the chunking in this dimension to accommodate ideal chunk sizes

From a technical POV, I guess we could respect -1, but the others are a little tricky :
"Auto" is sort-of what we are already doing, with the dims that don't have an explicit control given.
But it isn't quite what Dask means, since we will still want to apply our dimension ordering priority.
"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present, we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.

Can you suggest what the interpretations would be -- how would this work ?


Unfortunately though, the dask code is really complicated and engagement seems a bit of an uphill struggle IMHO.
By which I mean ... I have repeatedly attempted to engage with the dask codebase, and maybe contribute, but I've found it very hard to understand, with a lot of unexplained terminology and framework concepts.

Oh, I didn't mean to suggest you make a PR to Dask, only that it may be good to have a conversation and make them aware of the issue.


From a technical POV, I guess we could respect -1, but the others are a little tricky :
"Auto" is sort-of what we are already doing, with the dims that don't have an explicit control given. But it isn't quite what Dask means, since we will still want to apply our dimension ordering priority.
"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present, we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.

Can you suggest what the interpretations would be -- how would this work ?

For None the only sensible interpretation I can think of is to keep the chunking from the netcdf file itself as you suggest, but dunno what Dask is doing in that case.

I reckon Auto would require repeating the logic that Dask is doing, but restricted only to the specified dimensions. Seems quite tricky; probably best not to support it if the implementation is going to live on the Iris side.


"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present, we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.

For None the only sensible interpretation I can think of is to keep the chunking from the netcdf file itself as you suggest, but dunno what Dask is doing in that case.

Sorry, I've re-read your comment and I think this is not what you suggest. My interpretation of None is that it behaves the same as if you had explicitly specified chunks equal to those already existing on the variable, i.e. this is their final rather than initial value.


#: A :class:`ChunkControl` object providing user-control of Dask chunking
#: when Iris loads netcdf files.
CHUNK_CONTROL: ChunkControl = ChunkControl()


Nitpick: I would go with lower case here, CHUNK_CONTROL.set(...) looks a bit shouty.

Contributor


Capitalised variables are an easy-to-notice way of indicating module-level constants, and I think it's common practice in Python; certainly it's common throughout Iris.

I agree with you it's a bit too shouty but that's overridden by the value of having this shared convention IMHO.


@TomekTrzeciak TomekTrzeciak Feb 18, 2022


Isn't this convention mostly for simple things with "fixed" value, rather than objects with complex behaviour like context managers?

Contributor


Oh. In that case I've messed up elsewhere! Look forward to seeing what @pp-mo has to say 😅

Member


I'm going to do a drive-by nitpick as I'm getting all the email alerts for this thread in my inbox :)

There should be no need for the type hint here; a Python type checker should be smart enough to know that

CHUNK_CONTROL = ChunkControl()

is indeed a ChunkControl instance.

Member Author


@TomekTrzeciak Nitpick

As regards capitalisation, I was basically mimicking iris.FUTURE.
Which makes some kind of sense IMHO as this is very much the same style of thing.

On the other hand it's possible that, when that was introduced, we were slightly conflicted about the difference between a "constant value" and a "global variable", as you suggest, since in Python there is essentially no actual distinction.


@jamesp drive-by nitpick

It wasn't to enable type hinting, but an attempt to inject the type into the docstring representation.
https://pp-iris.readthedocs.io/en/nc_chunk_control/generated/api/iris/fileformats/netcdf.html#iris.fileformats.netcdf.CHUNK_CONTROL
Hence the colon. But it doesn't seem to actually do so, presumably because Sphinx is not so configured 😞

Member Author

@pp-mo pp-mo Feb 24, 2022


I was basically mimicking iris.FUTURE.

Ooh, but then we have :
iris.fileformats.um.structured_um_loading
which is a generator-context-manager, rather than a flag-controlling object.
So it is used as with structured_um_loading(): ... ,
but internally it does with STRUCTURED_LOAD_CONTROLS.context(loads_use_structured=True): ...
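For anyone comparing the two styles, a minimal sketch (with invented names) of the difference: a plain generator-context-manager, as with structured_um_loading, versus an object whose set()/context() method returns a context manager, as CHUNK_CONTROL does:

    from contextlib import contextmanager

    @contextmanager
    def structured_loading_style():
        # used as:  with structured_loading_style(): ...
        print("enable special loading")
        try:
            yield
        finally:
            print("restore previous behaviour")

    class ControlObjectStyle:
        # used as:  with CONTROL.set(pressure=1): ...
        @contextmanager
        def set(self, **settings):
            print(f"apply {settings}")
            try:
                yield
            finally:
                print("restore previous settings")

    CONTROL = ControlObjectStyle()

    with structured_loading_style():
        pass

    with CONTROL.set(pressure=1):
        pass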

and isinstance(chunksize, int)
):
msg = (
"'dimension_chunksizes' kwargs should be an iterable "
Member Author


TODO: fix this : it's not an iterable in the API, it's a dictionary..

apply the ``dimension_chunksizes`` controls only to these variables,
or when building cubes from these data variables.
If None (the default), settings apply to all loaded variables.
dimension_chunksizes : dict: str --> int
Member Author


TODO: this might be extended to more complex chunking controls, if useful?
Probably could do nonuniform sizes within a dimension (e.g. (3, 5, 5, 2)), but not across multiple dims (not even sure how you would express that, but I think it is possible) -- see the sketch below.
Plus None/auto/-1 as already suggested
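Dask does already accept fully explicit, nonuniform block sizes per dimension, so an extension along those lines would have something to map onto. A standalone dask sketch (not part of this PR):

    import numpy as np
    import dask.array as da

    data = np.arange(15 * 4).reshape(15, 4)
    # Blocks of 3, 5, 5 and 2 along the first dimension; one block along the second.
    arr = da.from_array(data, chunks=((3, 5, 5, 2), (4,)))
    print(arr.chunks)  # ((3, 5, 5, 2), (4,))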

Comment on lines +70 to +71
with CHUNK_CONTROL.set(self.cube_varname, model_level_number=2):
cube = iris.load_cube(self.tempfile_path, self.cube_varname)


👍


pp-mo commented Oct 27, 2022

As noted on parent issue #3333, xarray already provides a (slightly less sophisticated) chunking control, and could be of use via #4994
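For comparison, the xarray control referred to is the per-dimension "chunks" mapping on open_dataset; a sketch with placeholder file and variable names:

    import xarray as xr

    ds = xr.open_dataset("my_data.nc", chunks={"pressure": 1})
    print(ds["air_temperature"].data.chunks)  # dask chunks, one pressure level each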

@trexfeathers
Contributor

We're going to make this happen! But since this PR has gone rather stale, and @pp-mo has a lot of assigned work, we're going to close this PR and aim to apply the code to the latest Iris. See #5398 for the task breakdown.
