Dask chunking control for netcdf loading. #4572
Conversation
API discussion

The API (reference docs-build here) is actually rather different from what I had expected to end up with. I reasoned that, when loading several pieces of data, it will often make sense to align the chunking according to the dimensions. Since that totally dimension-oriented approach is not always correct, though, you need a way to give different settings for different variables; hence the var-name-selective behaviour. It then made sense to me to apply common dimension settings to individual result cubes, which results in the logic shown.

Roads not taken:
(1) I had expected to add loader kwargs, and involve #3720.
(2) It is possible to make settings apply to a filename or file-path (…)
Usage examples

This should really go in docs "somewhere". An example that is working for me is to solve the problem referred to here. In that specific example ...
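For illustration, a minimal sketch of the kind of usage intended, based on the CHUNK_CONTROL API as it appears in this PR (the file, variable and dimension names here are invented):

```python
import iris
from iris.fileformats.netcdf import CHUNK_CONTROL

# Hypothetical example: chunk the named variable singly along 'time',
# and in blocks of 10 along 'model_level_number', when loading.
with CHUNK_CONTROL.set("air_temperature", time=1, model_level_number=10):
    cube = iris.load_cube("my_data.nc", "air_temperature")
```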
> I basically set out to do this because I couldn't stop thinking about it.
That's always a good motivator 😉
First off, really glad to see progress on this. I think a context manager makes a lot of sense for this, given the many layers that these options would have to pass through. If I could have a wish, it would be for some way to disable Iris' logic entirely and fall back on Dask and its machinery to manage the chunking.
```
``dask.config.set({'array.chunk-size': '250MiB'})``.

"""
old_settings = deepcopy(self.var_dim_chunksizes)
```
You could use collections.ChainMap here for stacking the settings.
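For instance, something along these lines (a minimal sketch of the suggestion, not the PR's actual code):

```python
from collections import ChainMap

# Each nested context pushes a new layer of settings; leaving it pops
# the layer, with lookups falling through to the older layers beneath.
settings = ChainMap({"time": 1})           # base settings
stacked = settings.new_child({"time": 4})  # temporary override
assert stacked["time"] == 4
assert stacked.parents["time"] == 1        # the base layer is untouched
```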
```
chunks = np.array(chunks)
dims_fixed_arr = np.array(dims_fixed)
# Reduce the target size by the fixed size of all the 'fixed' dims.
point_size_limit = point_size_limit // np.prod(chunks[dims_fixed_arr])
```
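A small worked example of what that reduction computes, with invented numbers:

```python
import numpy as np

point_size_limit = 1_000_000  # overall target chunk size, in points
chunks = np.array([1000, 500, 200])  # current size of each dim
dims_fixed_arr = np.array([True, False, False])  # dim 0 is 'fixed'

# The budget left for the free dims is the total divided by the product
# of the fixed dims' sizes: 1_000_000 // 1000 == 1000 points.
print(point_size_limit // np.prod(chunks[dims_fixed_arr]))  # 1000
```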
It would be nice to somehow allow Dask's special chunk values (-1, 'auto' and None), but I guess this is at odds with doing this logic on the Iris side. Have you tried to inquire whether Dask would be interested in adding/extending the chunking logic to support such use cases? I think it would make the most sense to boot this out of Iris, as per the note here.
(background, not really an answer!)
The problem with the existing Dask built-in strategy is that it treats all dimensions as equal and tries to even out the chunking across all dims. That is pretty clearly unsuitable in most of our cases, given the known contiguity factors in file storage, and especially when the file contains 2d fields which cannot be efficiently sub-indexed.
Which is why we provided the Iris version: to prioritise splitting of the "outer" dims and keeping the "inner" ones together.
But it's certainly true that we could feed back and engage, as per the note you mention.
Unfortunately, though, the dask code is really complicated and engagement seems a bit of an uphill struggle IMHO.
By which I mean ... I have repeatedly attempted to engage with the dask codebase, and maybe contribute, but I've found it very hard to understand, with a lot of unexplained terminology and framework concepts.
So it seems to me that there is a problem with contributing unless you are an "expert", because there is a lot that you need to know beyond what the "ordinary" user sees.
In fact, as a very relevant example, I considered addressing this same chunking-control problem by picking apart the graph of a loaded cube to identify and adjust (or replace) the iris.fileformats.netcdf.NetCDFDataProxy objects which are in there. It's perfectly doable, but you need to understand the structure of the dask graph and employ various assumptions about it. However, there is no public account of that structure, nor any apparent intention for "ordinary users" to manipulate graphs in this way. So a dask graph is effectively an "opaque" object: its internal structure might get refactored and change completely, in which case such a solution, being outside intended usage, would simply need rewriting. AFAICT, only a dask "expert" is expected to work at that level and, crucially, such a solution only really makes sense if the code lives in the dask codebase.
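For the curious, a very rough sketch of that road not taken (untested, and relying on undocumented dask graph internals, exactly as cautioned above; the file name is invented):

```python
import iris
from iris.fileformats.netcdf import NetCDFDataProxy

cube = iris.load_cube("some_file.nc")  # hypothetical file
graph = dict(cube.core_data().__dask_graph__())

# Hunt for the data proxies stashed in the graph.  This assumes the
# proxies appear directly as graph values -- an internal detail of
# dask's from_array which could change without notice.
proxies = {
    key: value
    for key, value in graph.items()
    if isinstance(value, NetCDFDataProxy)
}
# Replacing these (e.g. with proxies built for a different chunking)
# would mean rebuilding the graph: entirely unsupported territory.
```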
> @TomekTrzeciak: It would be nice to somehow allow Dask's special chunk values (-1, 'auto' and None)
Reminder of how that is described:
Chunks also includes three special values:
* -1: no chunking along this dimension
* None: no change to the chunking along this dimension (useful for rechunk)
* "auto": allow the chunking in this dimension to accommodate ideal chunk sizes
From a technical POV, I guess we could respect -1, but the others are a little tricky:
"auto" is sort-of what we are already doing with the dims that don't have an explicit control given. But it isn't quite what Dask means, since we will still want to apply our dimension ordering priority.
"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.
Can you suggest what the interpretations would be -- how would this work?
> Unfortunately, though, the dask code is really complicated and engagement seems a bit of an uphill struggle IMHO. By which I mean ... I have repeatedly attempted to engage with the dask codebase, and maybe contribute, but I've found it very hard to understand, with a lot of unexplained terminology and framework concepts.
Oh, I didn't mean to suggest you make a PR to Dask, only that it may be good to have a conversation and make them aware of the issue.
> From a technical POV, I guess we could respect -1, but the others are a little tricky: "auto" is sort-of what we are already doing with the dims that don't have an explicit control given, but it isn't quite what Dask means, since we will still want to apply our dimension ordering priority. "None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.
> Can you suggest what the interpretations would be -- how would this work?
For None, the only sensible interpretation I can think of is to keep the chunking from the netcdf file itself, as you suggest, but dunno what Dask is doing in that case.
I reckon "auto" would require repeating the logic that Dask is doing, but restricted only to the specified dimensions. Seems quite tricky; probably best not to support it if the implementation is going to live on the Iris side.
"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present, we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.
For None the only sensible interpretation I can think of is to keep the chunking from the netcdf file itself as you suggest, but dunno what Dask is doing in that case.
Sorry, I've re-read your comment and I think this is not what you suggest. My interpretation of None is that it behaves the same as if you had explicitly specified chunks equal to those already existing on the variable, i.e. this is their final rather than initial value.
```
#: A :class:`ChunkControl` object providing user-control of Dask chunking
#: when Iris loads netcdf files.
CHUNK_CONTROL: ChunkControl = ChunkControl()
```
Nitpick: I would go with lower case here; CHUNK_CONTROL.set(...) looks a bit shouty.
Capitalised variables are an easy-to-notice way of indicating module-level constants, and I think it's common practice in Python; certainly it's common throughout Iris.
I agree with you that it's a bit too shouty, but that's overridden by the value of having this shared convention IMHO.
Isn't this convention mostly for simple things with "fixed" value, rather than objects with complex behaviour like context managers?
Oh. In that case I've messed up elsewhere! Look forward to seeing what @pp-mo has to say 😅
I'm going to do a drive-by nitpick as I'm getting all the email alerts for this thread in my inbox :)
There should be no need for the type hint here; a Python type checker should be smart enough to know that CHUNK_CONTROL = ChunkControl() is indeed a ChunkControl instance.
@TomekTrzeciak, re the nitpick:
As regards capitalisation, I was basically mimicking iris.FUTURE, which makes some kind of sense IMHO as this is very much the same style of thing.
On the other hand, it's possible that, when that was introduced, we were slightly conflicted about the difference between a "constant value" and a "global variable", as you suggest, since in Python there is essentially no actual distinction.

@jamesp, re the drive-by nitpick:
It wasn't to enable type hinting, but an attempt to inject the type into the docstring representation:
https://pp-iris.readthedocs.io/en/nc_chunk_control/generated/api/iris/fileformats/netcdf.html#iris.fileformats.netcdf.CHUNK_CONTROL
Hence the colon. But it doesn't seem to actually do so, presumably because Sphinx is not so configured 😞
> I was basically mimicking iris.FUTURE.
Ooh, but then we have:
iris.fileformats.um.structured_um_loading
which is a generator-context-manager, rather than a flag-controlling object.
So it is used as with structured_um_loading(): ..., but internally it does with STRUCTURED_LOAD_CONTROLS.context(loads_use_structured=True): ...
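For comparison, a hypothetical sketch of what the equivalent wrapper would look like here (the chunk_control name is invented, not part of this PR):

```python
from contextlib import contextmanager

# Hypothetical: a generator-context-manager in the style of
# structured_um_loading, wrapping the module-level CHUNK_CONTROL object.
@contextmanager
def chunk_control(*var_names, **dimension_chunksizes):
    with CHUNK_CONTROL.set(*var_names, **dimension_chunksizes):
        yield

# Usage would then read:
#     with chunk_control("air_temperature", time=1):
#         cube = iris.load_cube("my_data.nc")
```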
```
and isinstance(chunksize, int)
):
    msg = (
        "'dimension_chunksizes' kwargs should be an iterable "
```
TODO: fix this: it's not an iterable in the API, it's a dictionary.
```
apply the ``dimension_chunksizes`` controls only to these variables,
or when building cubes from these data variables.
If None (the default), settings apply to all loaded variables.
dimension_chunksizes : dict: str --> int
```
TODO: this might be extended to more complex chunking controls, if useful?
We probably could do nonuniform sizes within a dimension (e.g. (3, 5, 5, 2)), but not across multiple dims (not even sure how you'd express that, but I think it is possible).
Plus None/auto/-1, as already suggested.
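For reference, dask itself does accept nonuniform sizes within a dimension, e.g.:

```python
import dask.array as da

# Dim 0 split into explicit blocks of 3, 5, 5 and 2 (summing to 15);
# dim 1 in uniform blocks of 5.
x = da.ones((15, 10), chunks=((3, 5, 5, 2), 5))
print(x.chunks)  # ((3, 5, 5, 2), (5, 5))
```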
```
with CHUNK_CONTROL.set(self.cube_varname, model_level_number=2):
    cube = iris.load_cube(self.tempfile_path, self.cube_varname)
```
👍
As noted on parent issue #3333, xarray already provides a (slightly less sophisticated) chunking control, which could be of use via #4994.
Aims to address #3333
N.B. there is a reference docs-build here that shows the API.
Still a bit preliminary; testing and documentation probably need more work.
But here it is for discussion. I basically set out to do this because I couldn't stop thinking about it.
@cpelley @TomekTrzeciak @hdyson
I will also add some comments in the nature of a discussion...
Some of that as pointers for things we might want changed, and some as a replacement for missing documentation (?!)