dask: Data.__getitem__, Data.__setitem__ #257
Conversation
Generally this looks good, works well, and the new `__getitem__` unit test is excellent. I've asked a few questions in-line (though feel free to answer those in person when we can next chat rather than writing up a response here), and there's one docstring that seems to need a minor tweak, as noted below, so ideally those can be addressed before merging.
Once this is merged I will make notes based on our previous LAMA -> Dask discussions, notably from the latest one:
- the need to manage any methods appropriately where the hardness of the mask may change (see the short illustration after this list); and
- the problem, and current best solution, for a potentially changing `.shape`, to prevent computing that unless unavoidable.
I will also update our master listing of daskified methods.
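As a hedged aside on the first point (an illustration of the mask-hardness concern, not code from this PR): with numpy's masked arrays, a hard mask silently discards assignments to masked elements, which is why any daskified method that rebuilds the underlying array has to preserve the hardness setting.

```python
import numpy as np

# Mask the value 2.0, then harden the mask
a = np.ma.masked_values([1.0, 2.0, 3.0], 2.0)
a.harden_mask()
a[1] = 99.0    # silently ignored: the element stays masked under a hard mask
print(a)       # [1.0 -- 3.0]

a.soften_mask()
a[1] = 99.0    # now the assignment unmasks and sets the element
print(a)       # [1.0 99.0 3.0]
```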
@@ -13325,3 +13579,54 @@ def _broadcast(a, shape):
    tile = shape[0 : len(shape) - len(a_shape)] + tuple(tile[::-1])

    return np.tile(a, tile)


"""Dask utilities to be called on chunks"""
Good idea adding such lines to help us organise all these (especially new) methods. 👍
Do you think these could also be in their own file, instead of at the bottom of data.py? If you've got strong thoughts on that we'll go with those.
Do you think these could also be in their own file
Definitely, even better in fact, because there are currently ~13,500 lines in that module which is far too large in my opinion. I say we go with any reasonable means to lift out methods and move them into their own module(s), unless you disagree! So please do move them, in a new PR or otherwise.
OK - I'll do that in a new PR.
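As a rough, hypothetical sketch of what that follow-up PR might do (the module name, helper name, and map_blocks call are illustrative assumptions, not code from this PR): the chunk-level utilities would move into their own module and data.py would only import them.

```python
# dask_utils.py (hypothetical module name)
"""Dask utilities to be called on chunks."""
import numpy as np


def _chunk_harden_mask(a):
    """Return a copy of the chunk with a hardened mask (illustrative helper)."""
    if np.ma.isMA(a):
        a = a.copy()
        a.harden_mask()
    return a


# In data.py, roughly:
#     from .dask_utils import _chunk_harden_mask
#     dx = dx.map_blocks(_chunk_harden_mask, dtype=dx.dtype)
```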
@@ -5914,9 +6179,13 @@ def size(self):

        """
        dx = self._get_dask()
        return dx.size
        size = dx.size
        if math.isnan(size):
I guess you have preferred `math.isnan` over `numpy.isnan` here (and in a few other places, I see) because it works in these non-array / pure number cases and takes less memory? Assuming so, I've noted down to make the same choice.
I don't know! Dask itself uses both, in various places, and I'm not sure what the difference is. `math.isnan` is ~4 times faster when size is NaN, but they're about the same speed when size is an integer. So I guess `math.isnan` is a good choice in general in such circumstances ...
Fair enough, I guess the question arose because I would go for `np.isnan` by default.

So I guess math.isnan is a good choice in general in such circumstances ...

Shall we do this for now (`math.isnan` if it covers all bases for a given case, else use `np`), and I'll make a note that towards the end of the daskification, as part of tidying work, we can quickly review which is used in any `isnan` case we end up needing? As you point out it may not be significant, but it would also take ~5 minutes to change, so why not try to choose the overall slightly faster one?
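For context on why the NaN check is there at all (a sketch under the assumption that the size in question comes from a dask array, not code from this PR): a dask array whose chunk sizes become unknown, e.g. after boolean indexing, reports nan for its size as a plain float, which `math.isnan` handles directly.

```python
import math

import dask.array as da
import numpy as np

dx = da.from_array(np.arange(10), chunks=5)
dx = dx[dx > 3]               # boolean indexing -> unknown chunk sizes

size = dx.size                # nan (a float), not an integer
if math.isnan(size):          # math.isnan is fine for scalars; np.isnan also works
    dx.compute_chunk_sizes()  # resolves the unknown sizes (triggers computation)
    size = dx.size

print(size)                   # 6
```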
.. warning:: Never change the `_cyclic` attribute in-place.
.. warning:: Never change the value of the `_cyclic` attribute
Can we forbid `_cyclic` from being changed instead of warning strongly against it? Or is it perhaps not possible to do so due to some facet of laziness or similar? Or just not a good idea?
Hmm. Good idea. We could get rid of `@setter._cyclic` and, instead of exposing some nasty internals (i.e. the custom dictionary), re-bury them in a setter method: `def _set_cyclic(self, value)`. Let's open another PR for that.
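A minimal sketch of that idea (the `_custom` dictionary and property body are assumptions for illustration, not the actual internals): with only a getter property defined, `d._cyclic = ...` raises AttributeError, and rebinding is funnelled through the explicit private method.

```python
class Data:
    def __init__(self):
        # Assumed internal store standing in for the real custom dictionary
        self._custom = {"_cyclic": frozenset()}

    @property
    def _cyclic(self):
        """Read-only: deliberately no @_cyclic.setter."""
        return self._custom["_cyclic"]

    def _set_cyclic(self, value):
        # Always rebind to a new immutable set; never mutate in place
        self._custom["_cyclic"] = frozenset(value)
```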
Sounds like a good plan to me! Do you want to do that PR, or shall I (I am happy to, but you may already have started and/or want to implement it yourself)?
All yours, thanks.
I think the outstanding items have all been shunted to new PRs. If you agree, please merge ... Thanks, Sadie, for the careful review, as ever.
I think the outstanding items have all been shunted to new PRs
Indeed, this is now the case. Good to merge! Thanks David.