-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dask chunking control for netcdf loading. #4572
Closed
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to somehow allow Dask's special chunk values (
-1
,'auto'
andNone
), but I guess this is at odds with doing this logic on Iris side. Have you tried to inquire if Dask would be interested to add/extend chunking logic to support such use cases? I think it would make the most sense to boot this out of Iris as per the note here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(background, not really an answer!)
The problem with the existing Dask built-in strategy is that is treats all dimensions as equal and tries to even out the chunking across all dims. Which is pretty clearly unsuitable in most of our cases, given the known contiguity factors in file storage, and especially when the file contains 2d-fields which cannot be efficiently sub-indexed.
Which is why we provided the Iris version, to prioritise splitting of the "outer" dims and keeping the "inner" ones together.
But it's certainly true that we could feedback and engage, as-per the note you mention.
Unfortunately though, the dask code is really complicated and engagement seems a bit of an uphill struggle IMHO.
By which I mean ... I have repeatedly attempted to engage with the dask codebase, and maybe contribute, but I've found it very hard to understand, with a lot of unexplained terminology and framework concepts.
So it seems to me that there is a problem with contributing unless you are an "expert", because there is a lot that you need to know beyond what the "ordinary" user sees.
In fact, as a very relevant example, I considered addressing this same chunking control problem by picking apart the graph of a loaded cube to identify and adjust (or replace) the
iris.fileformats.netcdf.NetCDFDataProxy
objects which are in there. It's perfectly do-able, but you need to understand the structure of the dask graph and employ various assumptions about it : However, there is no public account of that, or any apparent intention for "ordinary users" to manipulate them in this way. So it would seem that a dask graph is really an "opaque" object, and the internal structure might get refactored and change completely, in which case such a solution, being outside intended usage, would simply need re-writing. So effectively, AFAICT only a dask "expert" is expected to work at that level and, crucially, such a solution only really makes sense if the code lives in the dask codebase.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder of how that is described :
From a technical POV, I guess we could respect -1, but the others are a little tricky :
"Auto" is sort-of what we are already doing, with the dims that don't have an explicit control given.
But it isn't quite what Dask means, since we will still want to apply our dimension ordering priority.
"None" I think could only really apply if we treat chunking specified on a file variable as the initial setting. At present, we do use that as a starting point anyway, so I'm not sure how this would differ from "auto" in this usage.
Can you suggest what the interpretations would be -- how would this work ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't mean to suggest you make a PR to Dask, only that it may be good to have a conversation and make them aware of the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For None the only sensible interpretation I can think of is to keep the chunking from the netcdf file itself as you suggest, but dunno what Dask is doing in that case.
I reckon Auto would require repeating the logic that Dask is doing but restricted only to the specified dimensions. Seem quite tricky, probably best not to support it if the implementation is going to live on the Iris' side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I've re-read your comment and I think this is not what you suggest. My interpretation of None is that it behaves the same as if you would explicitly specify chunks equal to those already existing on the variable, i.e., this is their final rather than initial value.