Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dask: Data.__getitem__, Data.__setitem__ #257

Merged
merged 12 commits into from
Oct 4, 2021

Conversation

davidhassell
Copy link
Collaborator

No description provided.

@davidhassell davidhassell changed the title Daskify __getitem__ Data.__getitem__ Sep 6, 2021
@davidhassell davidhassell changed the title Data.__getitem__ dask: Data.__getitem__, Data.__setitem__ Sep 6, 2021
@davidhassell davidhassell added the dask Relating to the use of Dask label Sep 30, 2021
Copy link
Member

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this looks good, works well, and the new __getitem__ unit test is excellent. I've asked a few questions in-line (though feel free to answer those in person when we can next chat rather than writing up a response here) and there's one docstring that seems to need a minor tweak, as noted below, so ideally those can be addressed before merging.

Once this is merged I will make notes based on our previous LAMA -> Dask discussions, notably from the latest one:

  • the need to manage any methods appropriately where the hardness of mask may change; and
  • the problem and current best solution for a potentially changing .shape to prevent computing that unless unavoidable.

I will also update our master listing of daskified methods.

cf/data/QUESTIONS.rst Outdated Show resolved Hide resolved
cf/data/QUESTIONS.rst Outdated Show resolved Hide resolved
cf/test/test_Data.py Show resolved Hide resolved
cf/data/utils.py Show resolved Hide resolved
@@ -13325,3 +13579,54 @@ def _broadcast(a, shape):
tile = shape[0 : len(shape) - len(a_shape)] + tuple(tile[::-1])

return np.tile(a, tile)


"""Dask utilities to be called on chunks"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea adding such lines to help us organise all these (especially new) methods. 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think these could also be in their own file, instead of at the bottom of data.py? If you've got strong thoughts on that we'll go with those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think these could also be in their own file

Definitely, even better in fact, because there are currently ~13,500 lines in that module which is far too large in my opinion. I say we go with any reasonable means to lift out methods and move them into their own module(s), unless you disagree! So please do move them, in a new PR or otherwise.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I'll do that in a new PR.

@@ -5914,9 +6179,13 @@ def size(self):

"""
dx = self._get_dask()
return dx.size
size = dx.size
if math.isnan(size):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you have preferred math.isnan over numpy.isnan here (and in a few other places, I see) because it works in these non-array / pure number cases and takes less memory? Assuming so, I've noted down to make the same choice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know! dask itself uses both, in various places, and I'm not sure what the difference is, math.isnan is ~4 times faster when size is nan, but they're about the same speed when size is an integer . So I guess math.isnan is a good choice in general in such circumstances ...

Copy link
Member

@sadielbartholomew sadielbartholomew Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I guess the question arose because I would go for np.isnan by default.

So I guess math.isnan is a good choice in general in such circumstances ...

Shall we try to do this for now (math.isnan if it covers all bases for a given case, else use np) and I will make a note that towards the end of the daskification, as part of tidying work, we can quickly review which is used in any .isnan case we end up needing (as you point out it may not be significant but it would also take ~5 mins to change, so why not try to choose an overall slightly faster one)?


.. warning:: Never change the `_cyclic` attribute in-place.
.. warning:: Never change the value of the `_cyclic` attribute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we forbid _cyclic from being changed instead of warning strongly against it? Or is it perhaps not possible to do so due to some facet of laziness or similar? Or just not a good idea?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Good idea. we could get rid of @setter._cyclic, and instead of exposing some nasty internals (i.e. the custom dictionary) we could re-bury them in a setter method: def _set_cyclic(self, value). Let's open another PR for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good plan to me! Do you want to do that PR, or shall I (I am happy to, but you may already have started and/or want to implement it yourself)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All yours, thanks.

cf/data/data.py Outdated Show resolved Hide resolved
davidhassell and others added 3 commits October 4, 2021 18:00
Co-authored-by: Sadie L. Bartholomew <sadie.bartholomew@ncas.ac.uk>
Co-authored-by: Sadie L. Bartholomew <sadie.bartholomew@ncas.ac.uk>
Co-authored-by: Sadie L. Bartholomew <sadie.bartholomew@ncas.ac.uk>
@davidhassell
Copy link
Collaborator Author

davidhassell commented Oct 4, 2021

I think the outstanding items have all been shunted to new PRs. If you agree, please merge ... Thanks, Sadie, for the careful review, as ever.

Copy link
Member

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the outstanding items have all been shunted to new PRs

Indeed, this is now the case. Good to merge! Thanks David.

@sadielbartholomew sadielbartholomew merged commit 73704a5 into NCAS-CMS:lama-to-dask Oct 4, 2021
@davidhassell davidhassell deleted the dask-getitem branch November 15, 2022 09:11
@davidhassell davidhassell added this to the 3.14.0 milestone Nov 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Relating to the use of Dask
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants