dask: `Data.__getitem__`, `Data.__setitem__` #257

davidhassell · 2021-09-06T11:30:01Z

No description provided.

sadielbartholomew

Generally this looks good, works well, and the new __getitem__ unit test is excellent. I've asked a few questions in-line (though feel free to answer those in person when we can next chat rather than writing up a response here) and there's one docstring that seems to need a minor tweak, as noted below, so ideally those can be addressed before merging.

Once this is merged I will make notes based on our previous LAMA -> Dask discussions, notably from the latest one:

- the need to manage any methods appropriately where the hardness of mask may change; and
- the problem and current best solution for a potentially changing .shape to prevent computing that unless unavoidable.

I will also update our master listing of daskified methods.

cf/data/QUESTIONS.rst

cf/test/test_Data.py

cf/data/utils.py

sadielbartholomew · 2021-10-04T15:50:34Z

cf/data/data.py

@@ -13325,3 +13579,54 @@ def _broadcast(a, shape):
    tile = shape[0 : len(shape) - len(a_shape)] + tuple(tile[::-1])

    return np.tile(a, tile)
+
+
+"""Dask utilities to be called on chunks"""


Good idea adding such lines to help us organise all these (especially new) methods. 👍

Do you think these could also be in their own file, instead of at the bottom of data.py? If you've got strong thoughts on that we'll go with those.

> Do you think these could also be in their own file

Definitely, even better in fact, because there are currently ~13,500 lines in that module which is far too large in my opinion. I say we go with any reasonable means to lift out methods and move them into their own module(s), unless you disagree! So please do move them, in a new PR or otherwise.

OK - I'll do that in a new PR.

sadielbartholomew · 2021-10-04T15:59:09Z

cf/data/data.py

@@ -5914,9 +6179,13 @@ def size(self):

        """
        dx = self._get_dask()
-        return dx.size
+        size = dx.size
+        if math.isnan(size):


I guess you have preferred math.isnan over numpy.isnan here (and in a few other places, I see) because it works in these non-array / pure number cases and takes less memory? Assuming so, I've noted down to make the same choice.

I don't know! Dask itself uses both in various places, and I'm not sure what the difference is. math.isnan is ~4 times faster when size is nan, but they're about the same speed when size is an integer. So I guess math.isnan is a good choice in general in such circumstances ...

Fair enough, I guess the question arose because I would go for np.isnan by default.

> So I guess math.isnan is a good choice in general in such circumstances ...

Shall we do this for now (use math.isnan where it covers all bases for a given case, else np.isnan)? I will make a note that, towards the end of the daskification, as part of the tidying work, we can quickly review which one is used wherever we end up needing an isnan check. As you point out, it may not be significant, but it would take only ~5 mins to change, so why not choose the slightly faster one overall?
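For reference, a minimal sketch of the pattern discussed in this thread, assuming (as the diff suggests) that a size may come back as nan when it is not yet known. The helper `known_size` and its `compute_size` callback are hypothetical names used only for illustration; they are not part of cf-python or dask.

```python
import math


def known_size(size, compute_size):
    """Return `size` if it is known, else fall back to `compute_size()`.

    `size` stands in for a reported array size: an int when it is
    known, or float nan when it is not (an assumption of this sketch).
    """
    if math.isnan(size):
        # Unknown size: only now pay for the possibly expensive fallback
        return compute_size()
    return size


# math.isnan accepts plain Python ints as well as floats
print(known_size(12, lambda: -1))
print(known_size(float("nan"), lambda: 24))
```

math.isnan is enough here because the value being tested is a single plain number; an array-valued check would still need np.isnan.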

sadielbartholomew · 2021-10-04T16:14:29Z

cf/data/data.py


-        .. warning:: Never change the `_cyclic` attribute in-place.
+        .. warning:: Never change the value of the `_cyclic` attribute


Can we forbid _cyclic from being changed instead of warning strongly against it? Or is it perhaps not possible to do so due to some facet of laziness or similar? Or just not a good idea?

Hmm. Good idea. We could get rid of @_cyclic.setter and, instead of exposing some nasty internals (i.e. the custom dictionary), re-bury them in a setter method: def _set_cyclic(self, value). Let's open another PR for that.

Sounds like a good plan to me! Do you want to do that PR, or shall I (I am happy to, but you may already have started and/or want to implement it yourself)?

All yours, thanks.
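A rough sketch of what the proposal in this thread could look like, under stated assumptions: `DataSketch` is a hypothetical stand-in class, the plain dict `_custom` stands in for the custom dictionary mentioned above, and storing a frozenset is a choice made here for illustration. None of this is the actual cf-python implementation.

```python
class DataSketch:
    """Hypothetical stand-in for the real class; not the cf-python API."""

    def __init__(self, cyclic=()):
        # Internal storage stays private; a plain dict stands in for
        # the custom dictionary (an assumption of this sketch).
        self._custom = {}
        self._set_cyclic(cyclic)

    @property
    def _cyclic(self):
        """Read-only view of the cyclic axes (no property setter defined)."""
        return self._custom["_cyclic"]

    def _set_cyclic(self, value):
        """The single place where the cyclic axes may be replaced."""
        # Store an immutable copy so the value cannot be changed in place
        self._custom["_cyclic"] = frozenset(value)


d = DataSketch(cyclic={0})
d._set_cyclic({1, 2})

try:
    d._cyclic = {0}  # direct assignment is rejected: the property is read-only
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True
```

With no setter on the property, direct assignment raises AttributeError, and the frozenset has no mutating methods, so the warning in the docstring becomes something the code enforces rather than something the reader must remember.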

cf/data/data.py

Co-authored-by: Sadie L. Bartholomew <sadie.bartholomew@ncas.ac.uk>

davidhassell · 2021-10-04T19:11:41Z

I think the outstanding items have all been shunted to new PRs. If you agree, please merge ... Thanks, Sadie, for the careful review, as ever.

sadielbartholomew

> I think the outstanding items have all been shunted to new PRs

Indeed, this is now the case. Good to merge! Thanks David.

davidhassell added 3 commits September 5, 2021 12:44

cyclic axes

3acffce

getitem: cyclic axes, unit tests

900832e

remove debugging print statements

336bd68

davidhassell changed the title ~~Daskify __getitem__~~ Data.__getitem__ Sep 6, 2021

davidhassell changed the title ~~Data.__getitem__~~ dask: Data.__getitem__, Data.__setitem__ Sep 6, 2021

davidhassell added 6 commits September 6, 2021 17:26

setitem

1bfe9fe

hardmask, unknown shape

8f07e1f

information

021e068

tidy

9c1a4c4

hardmask

9d3f5a7

dask reset_mask_hardness, docs

742c3e2

davidhassell requested a review from sadielbartholomew September 13, 2021 15:26

davidhassell added the dask Relating to the use of Dask label Sep 30, 2021

sadielbartholomew approved these changes Oct 4, 2021

davidhassell and others added 3 commits October 4, 2021 18:00

Typos

59bc506

Co-authored-by: Sadie L. Bartholomew <sadie.bartholomew@ncas.ac.uk>

Typos

142556d

Co-authored-by: Sadie L. Bartholomew <sadie.bartholomew@ncas.ac.uk>

Correct docstring

aecd89b

Co-authored-by: Sadie L. Bartholomew <sadie.bartholomew@ncas.ac.uk>

sadielbartholomew approved these changes Oct 4, 2021

sadielbartholomew merged commit 73704a5 into NCAS-CMS:lama-to-dask Oct 4, 2021

sadielbartholomew mentioned this pull request Oct 6, 2021

dask: Data.__init__ #262

Merged

davidhassell mentioned this pull request Oct 7, 2021

Migrate Data.transpose from LAMA to Dask #247

Merged

davidhassell mentioned this pull request Jan 27, 2022

Replace LAMA with Dask: grouping methods to migrate #295

Closed

davidhassell deleted the dask-getitem branch November 15, 2022 09:11

davidhassell added this to the 3.14.0 milestone Nov 15, 2022
