LAMA to Dask: fixes to `Query` & `Data.digitize` #427

sadielbartholomew · 2022-08-02T02:11:38Z

Reinstate a commented-out test case, now ready since pre-requisites have been satisfied, that calls where and tests digitize and in doing so reveals a bug in the migrated state of both (:flushed:), which this PR also fixes. See below for the breakdown.

(The fixes applied may not be the most appropriate course of action, or need tweaking, so I've applied them in self-contained commits which can be reverted if necessary. The most important thing here my outline of the issues arising below, which should be addressed by some means.)

Details

Overall this PR:

Reinstates the test case in question, as-was, in f86923f.

Fixes the first issue encountered when reinstating, namely that NoneType errors were being hit:

======================================================================
ERROR: test_Data_digitize (__main__.DataTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Data.py", line 829, in test_Data_digitize
    e.where(
  File "/home/sadie/cf-python/cf/decorators.py", line 62, in precede_with_kwarg_deprecation_check
    operation_method_result = operation_method(self, *args, **kwargs)
  File "/home/sadie/cfdm/cfdm/decorators.py", line 44, in inplace_wrapper
    processed_copy = operation_method(self, *args, **kwargs)
  File "/home/sadie/cfdm/cfdm/decorators.py", line 171, in verbose_override_wrapper
    return method_with_verbose_kwarg(*args, **kwargs)
  File "/home/sadie/cf-python/cf/data/data.py", line 93, in wrapper
    return method(*args, **kwargs)
  File "/home/sadie/cf-python/cf/data/data.py", line 9941, in where
    condition = condition.evaluate(d)
AttributeError: 'NoneType' object has no attribute 'evaluate'

----------------------------------------------------------------------

ultimately due to rogue assignment to set_condition_units which returns nothing, such that it seems highly likely that calling without assigning was the intended action, and a fix has been applied as such in e450a40.

Fixes the next issue revealed in the stack trace, this one occurring because Query.set_condition_units was introducing an extra dimension for unitless values which was manifesting as:

======================================================================
ERROR: test_Data_digitize (__main__.DataTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Data.py", line 829, in test_Data_digitize
    e.where(
  File "/home/sadie/cf-python/cf/decorators.py", line 62, in precede_with_kwarg_deprecation_check
    operation_method_result = operation_method(self, *args, **kwargs)
  File "/home/sadie/cfdm/cfdm/decorators.py", line 44, in inplace_wrapper
    processed_copy = operation_method(self, *args, **kwargs)
  File "/home/sadie/cfdm/cfdm/decorators.py", line 171, in verbose_override_wrapper
    return method_with_verbose_kwarg(*args, **kwargs)
  File "/home/sadie/cf-python/cf/data/data.py", line 93, in wrapper
    return method(*args, **kwargs)
  File "/home/sadie/cf-python/cf/data/data.py", line 9944, in where
    _where_broadcastable(d, condition, "condition")
  File "/home/sadie/cf-python/cf/data/data.py", line 12033, in _where_broadcastable
    raise ValueError(
ValueError: where: Broadcasting the 'condition' parameter with 4 dimensions would change the shape of the data with 3 dimensions

----------------------------------------------------------------------

with a sensible fix applied in 1aec661 (though in this case something else may be required in the unitless case, so not sure if straight-up removing the line is suitable as the general fix, this is to be discussed...).

Fixes the final issue revealed by the reinstated test, in 5bb5766, specifically that the core assertion checking equality between the intended and actual results was failing:

======================================================================
FAIL: test_Data_digitize (__main__.DataTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Data.py", line 836, in test_Data_digitize
    self.assertTrue(e.equals(f, verbose=2))
AssertionError: False is not true

----------------------------------------------------------------------

because (after much investigation into what was actually going wrong once that key assertion was finally reached!) the Data.digitize method itself created a list called delete_bins that was not used later in the code to do anything that had an effect on the outcome, when clearly it was designed to do something...

I cross-referenced the migration PR (dask: Dask.digitize #312) and old master code after which it seemed that the following block had not been incorporated:

cf-python/cf/data/data.py

Lines 2647 to 2652 in 07cb04f

    
           if delete_bins: 
        
               for n, d in enumerate(delete_bins): 
        
                   d -= n 
        
                   array = numpy_ma_where(array == d, numpy_ma_masked, array) 
        
                   array = numpy_ma_where(array > d, array - 1, array) 
        
           # --- End: if

so I introduced it and adapted it to the new way, and this change did indeed (finally!) get the test case to pass, without causing side effects elsewhere.

So I think (4) concludes the multi-fix, but let me know what you think, @davidhassell! Thanks.

sadielbartholomew · 2022-08-10T08:36:31Z

@davidhassell I have made updates in line with our discussion from last week, notably your suggestion regarding part (4) (see my opening comment) here, in the two latest commits. As far as I am concerned this is ready to go, so please review when you are back from leave and have time. Thanks.

davidhassell

Hi Sadie - the digitize and where changes look good, but I have a question on the Query changes ...

davidhassell · 2022-08-16T11:02:15Z

cf/query.py

-            # Value has no units
-            value = Data(value, units=units)
-        else:
+        if value_units is not None:


Not sure about this - it doesn't work if value does not have a Units attribute:

>>> q = cf.lt(6) >>> q._value 6 >>> q.set_condition_units('m') >>> q._value 6 # No units here!

The orginal code does:

>>> q._value 6 >>> q.set_condition_units('m') >>> q._value <CF Data(): 6 m> # Seems right to me ...

How was the original code causing problems?

Thanks David, I'm updating this in line with our discussion in the video call today which supersedes your comment here but obviously concerns the same line change and underlying issue.

This reverts commit 1aec661.

sadielbartholomew · 2022-11-18T19:09:02Z

Back onto this, with apologies for letting it fall off my radar! @davidhassell, to continue the conversation where we left off after the first round of feedback, namely regarding the outstanding piece of feedback #427 (comment)...

The last progress made towards this PR was back in August after your initial review where we both had a brief external pair-programming type session to investigate the query aspect of this PR together. Since I made written notes rather than electronic ones I've had to dig those out, and now I've found them, let me summarise what we found (hopefully it maps roughly onto any memory you might have of this conversation!).

Notes from discussion

Overall, there is an underlying issue with the query module, ultimately manifesting in the error I stated in my opening comment (case 3), but I hadn't fixed it with 1aec661 in a correct way that accounted for all possibilities sensibly, as you pointed out for a different case.
With some debugging we realised that the issue was due fundamentally to sequences of values being put into a container array, introducing an extra dimension, which caused the test failure, when those should have been handled properly. To demonstrate this via the failing test, as shown here on the original code i.e. before any suggestions by this PR, if I add in some print statements, as follows:

diff --git a/cf/query.py b/cf/query.py
index 67932f9f8..9a38cbffe 100644
--- a/cf/query.py
+++ b/cf/query.py
@@ -788,14 +788,19 @@ class Query:
             return
 
         value_units = getattr(value, "Units", None)
+        print("\n%%%%% 0 VALUE UNITS IS", value, value_units)
         if value_units is None:
             # Value has no units
+            print("\n%%%%% 1.0 IS\n", value, value_units)
             value = Data(value, units=units)
+            print("\n%%%%% 1.1 IS\n", value, value_units, value.shape)
         else:
             # Value already has units
             try:
+                print("\n%%%%% 2 IS", value, value_units, value.shape)
                 value.Units = units
             except ValueError:
+                print("\n%%%%% 3 IS", value, value_units, value.shape)
                 raise ValueError(
                     f"Units {units!r} are not equivalent to "
                     f"query condition units {value_units!r}"

then we see by running test_Data:

test_Data_digitize (__main__.DataTest) ... 
%%%%% 0 VALUE UNITS IS [<CF Data(1, 1, 1): [[[0]]]>, <CF Data(1, 1, 1): [[[5]]]>] None

%%%%% 1.0 IS
 [<CF Data(1, 1, 1): [[[0]]]>, <CF Data(1, 1, 1): [[[5]]]>] None

%%%%% 1.1 IS
 [[[[0, 5]]]] None (2, 1, 1, 1)
ERROR

which shows that the Data call introduced an extra dimension when the value input was a list of Data elements rather than a singular Data, hence the ultimate error thrown.

In terms of how to handle the sequences of Data case causing issue as above, I didn't write down the approach we tentatively agreed to (my bad for not finalising this PR shortly after we discussed it!) but the natural approach and the one that seems to match my vague memory of the conversation for what we agreed would be sensible as a solution is to check if the value is a sequence of Data and if so concatenate the separate Data together and then apply the specified units. This has been implemented in my new commit 793e7ce which I applied after reverting the commit making the initial changes to the query module.

Note it passes, but only if the current concatenate method code is copied from the lama-to-dask branch since when I started this PR that method had not yet been migrated to Dask, so otherwise the test fails on trivial migration-related issues with the call to concatenate.

Going forward

Let me know if you agree with the above summary of the underlying issue and the suggested solution, notably see the two new commits (one just to revert the previous changes touching the query code to be clean), and either way how we can progress. Thanks.

davidhassell · 2022-11-20T16:04:55Z

cf/query.py

@@ -790,7 +791,11 @@ def set_condition_units(self, units):
        value_units = getattr(value, "Units", None)
        if value_units is None:
            # Value has no units
-            value = Data(value, units=units)
+            if isinstance(value, Iterable):  # may be a sequence of Data


Edit Hi Sadie, I think what I wrote here may well be nonsense! Please ignore, and the code snippet and I'll have another think ...

~~Hi Sadie, not sure this will work in all cases:~~

~~Data objects are Iterable~~

~~When value is a sequence, we want it to remain a sequence, not convert it into a Data object, and Data.concatenate doesn't work for non-Data objects~~

~~This code might make sense, but it's not very pretty!~~

IGNORE value_units = getattr(value, "Units", None) if value_units is None: # Value has no units if isinstance(value, Iterable) and not isinstance(value, str): new = [] for v in value: value_units = getattr(v, "Units", None) if value_units is None: # Value has no units v = Data(v, units=units) else: # Value already has units try: v.Units = units except ValueError: raise ValueError( f"Units {units!r} are not equivalent to " f"query condition units {value_units!r}" ) new.append(v) value = new else: value = Data(value, units=units) else: # Value already has units try: value.Units = units except ValueError: raise ValueError( f"Units {units!r} are not equivalent to " f"query condition units {value_units!r}" )

OK - I think I worked out what's going on. The key is that the contents of a "set", "wi" or "wo" iterable are not necessarily Data objects (or rather objects with Units). How about this? There's scope for a bit of code re-use here, but I can't quite see how, yet.

value_units = getattr(value, "Units", None) if value_units is None: # Value has no units if self.operator in ("wi", "wo", "set"): # value is a sequence of things that may or may not # already have units new = [] for v in value: v_units = getattr(v, "Units", None) if v_units is None: v = Data(v, units=units) else: try: v = v.copy() v.Units = units except ValueError: raise ValueError( f"Units {units!r} are not equivalent to " f"query condition units {v_units!r}" ) new.append(v) value = new else: value = Data(value, units=units) else: # Value already has units try: value = value.copy() value.Units = units except ValueError: raise ValueError( f"Units {units!r} are not equivalent to " f"query condition units {value_units!r}" ) self._value = value

We could do with some more units test for this, too ...

I've just added your code (putting you as a formal co-author since you wrote it, though I don't think I can specify you as 'full' author since I'm committing it!) and then consolidated it with an inner helper function. I'll add some new testing first thing tomorrow.

davidhassell · 2022-11-21T08:39:47Z

Hi Sadie - I think we have a bit of an issue here, as some of this code has already been modified in #464 - could you merge lama-to-dask into your branch and have another look :) Thanks

sadielbartholomew · 2022-11-21T14:15:38Z

could you merge lama-to-dask into your branch and have another look

Hi David, no worries. I've done as you ask but it seems to have merged cleanly, and upon investigation it looks like the one commit where I touched where code, namely e450a40, to fix one issue, had since I opened this PR been merged in to lama-to-dask elsewhere, and in fact I see you did make the exact same change in #464. So we're all good with respect to this - there's no issue!

Overall, with my new merge commit, the test_Data tests all still pass (apart from the one that is still skipped), so everything is good to go still as far as I am concerned, but I await your thoughts regarding:

I think what I wrote here may well be nonsense! Please ignore, and the code snippet and I'll have another think ...

in case changes are necessary with regards to that. Thanks!

davidhassell · 2022-11-21T14:18:48Z

Thanks, Sadie. I'm now going to stop dipping in to this piecemeal and do a proper review later today/tomorrow, Sorry for any extra confusion I have wrought!

davidhassell

Hi Sadie, I've made a suggestion for the changes to Query (and a request for some more tests), otherwise all good. A small amount of code caused a lot of thought, here!

cf/data/data.py

davidhassell · 2022-11-22T11:40:46Z

cf/query.py

@@ -790,7 +791,11 @@ def set_condition_units(self, units):
        value_units = getattr(value, "Units", None)
        if value_units is None:
            # Value has no units
-            value = Data(value, units=units)
+            if isinstance(value, Iterable):  # may be a sequence of Data


OK - I think I worked out what's going on. The key is that the contents of a "set", "wi" or "wo" iterable are not necessarily Data objects (or rather objects with Units). How about this? There's scope for a bit of code re-use here, but I can't quite see how, yet.

value_units = getattr(value, "Units", None) if value_units is None: # Value has no units if self.operator in ("wi", "wo", "set"): # value is a sequence of things that may or may not # already have units new = [] for v in value: v_units = getattr(v, "Units", None) if v_units is None: v = Data(v, units=units) else: try: v = v.copy() v.Units = units except ValueError: raise ValueError( f"Units {units!r} are not equivalent to " f"query condition units {v_units!r}" ) new.append(v) value = new else: value = Data(value, units=units) else: # Value already has units try: value = value.copy() value.Units = units except ValueError: raise ValueError( f"Units {units!r} are not equivalent to " f"query condition units {value_units!r}" ) self._value = value

We could do with some more units test for this, too ...

Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>

sadielbartholomew · 2022-11-22T12:21:27Z

The key is that the contents of a "set", "wi" or "wo" iterable are not necessarily Data objects (or rather objects with Units.

Aha, that makes so much sense now you have worked it out and summarised it! And the change block you suggest follows on sensibly from that. I'll get the equivalent code updated in line with your suggestion. Thanks!

We could do with some more units test for this, too ...

Agreed. Right-o, I'll increase the coverage in this respect as part of this PR. I'll update it shortly and tag you to let you know we're ready for a re-review.

Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>

sadielbartholomew · 2022-12-05T17:27:56Z

Sorry @davidhassell before I went on leave I think I forgot to confirm that this is ready for re-review now after I updated this in line with your suggestion and made some consolidations too. Please let me know what you think.

davidhassell

Hi Sadie - A nice solution, thanks. I've made a couple of comments, but go ahead and merge when you're ready.

cf/query.py

Out-of-date now, later comments made and raise an approval.

sadielbartholomew · 2022-12-06T17:16:47Z

Thanks for all of your very useful feedback on this @davidhassell. Final round of feedback addressed in my latest commit, so I'll merge as advised and open up one Issue as discussed.

Data.where: remove rogue assigment to fix NoneType errors

e450a40

sadielbartholomew added the dask Relating to the use of Dask label Aug 2, 2022

sadielbartholomew self-assigned this Aug 2, 2022

sadielbartholomew added 2 commits August 2, 2022 16:11

Reinstate test_Data_digitize test case now pre-reqs are satisfied

f86923f

Query.set_condition_units: stop extra dim. being added if unitless

1aec661

sadielbartholomew changed the title ~~LAMA to Dask: fixes relating to Data.where~~ LAMA to Dask: fixes to Data.where, .digitize Aug 2, 2022

sadielbartholomew changed the title ~~LAMA to Dask: fixes to Data.where, .digitize~~ LAMA to Dask: fixes to Data.where Aug 2, 2022

sadielbartholomew mentioned this pull request Aug 3, 2022

Query module: reduce equivalent encapsulated conditions #430

Open

Data.digitize: add missing logic to process delete_bins list

5bb5766

sadielbartholomew changed the title ~~LAMA to Dask: fixes to Data.where~~ LAMA to Dask: fixes to Data.where, .digitize Aug 3, 2022

sadielbartholomew requested a review from davidhassell August 3, 2022 19:28

sadielbartholomew marked this pull request as ready for review August 3, 2022 19:36

sadielbartholomew added 2 commits August 10, 2022 09:16

Data.digitize: move delete_bins processing from Dask- to cf- space

3d43251

Data.digitize: clarify nature of (pre-existing) delete_bins logic

d18a69a

davidhassell reviewed Aug 16, 2022

View reviewed changes

davidhassell added this to the 3.14.0 milestone Nov 14, 2022

sadielbartholomew added 2 commits November 18, 2022 16:15

Revert "Query.set_condition_units: stop ... if unitless"

39faa6c

This reverts commit 1aec661.

Query.set_condition_units: handle sequences of Data to fix bug

793e7ce

davidhassell reviewed Nov 20, 2022

View reviewed changes

Merge branch 'upstream/lama-to-dask' into lama-to-dask-where-fixes

de90df9

davidhassell previously requested changes Nov 22, 2022

View reviewed changes

Update cf/data/data.py by fixing comment grammar

0bf9b01

Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>

sadielbartholomew changed the title ~~LAMA to Dask: fixes to Data.where, .digitize~~ LAMA to Dask: fixes to Query & Data.digitize Nov 22, 2022

Query.set_condition_units: apply fix to handle set & range inputs

999f827

Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>

Query.set_condition_units: consolidate w/ inner helper function

33c9c21

sadielbartholomew force-pushed the lama-to-dask-where-fixes branch from 25a2184 to 33c9c21 Compare November 22, 2022 19:48

Query.set_condition_units: remove now-superfluous import

4ef9c09

davidhassell self-requested a review November 28, 2022 14:10

davidhassell reviewed Dec 6, 2022

View reviewed changes

cf/query.py Show resolved Hide resolved

davidhassell reviewed Dec 6, 2022

View reviewed changes

cf/query.py Outdated Show resolved Hide resolved

Address feedback by applying linting fixes for linter checks pass

cb7d1e8

sadielbartholomew merged commit f17e08e into NCAS-CMS:lama-to-dask Dec 6, 2022

sadielbartholomew deleted the lama-to-dask-where-fixes branch December 6, 2022 17:17

sadielbartholomew mentioned this pull request Dec 6, 2022

Potentially slow iteration for setting condition value units #511

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LAMA to Dask: fixes to `Query` & `Data.digitize` #427

LAMA to Dask: fixes to `Query` & `Data.digitize` #427

sadielbartholomew commented Aug 2, 2022 •

edited

Loading

sadielbartholomew commented Aug 10, 2022 •

edited

Loading

davidhassell left a comment

davidhassell Aug 16, 2022

sadielbartholomew Aug 16, 2022

sadielbartholomew commented Nov 18, 2022 •

edited

Loading

davidhassell Nov 20, 2022 •

edited

Loading

davidhassell Nov 22, 2022 •

edited

Loading

sadielbartholomew Nov 22, 2022 •

edited

Loading

davidhassell commented Nov 21, 2022

sadielbartholomew commented Nov 21, 2022 •

edited

Loading

davidhassell commented Nov 21, 2022

davidhassell left a comment

davidhassell Nov 22, 2022 •

edited

Loading

sadielbartholomew commented Nov 22, 2022

sadielbartholomew commented Dec 5, 2022 •

edited

Loading

davidhassell left a comment

sadielbartholomew commented Dec 6, 2022 •

edited

Loading

	if delete_bins:
	for n, d in enumerate(delete_bins):
	d -= n
	array = numpy_ma_where(array == d, numpy_ma_masked, array)
	array = numpy_ma_where(array > d, array - 1, array)
	# --- End: if

LAMA to Dask: fixes to Query & Data.digitize #427

LAMA to Dask: fixes to Query & Data.digitize #427

Conversation

sadielbartholomew commented Aug 2, 2022 • edited Loading

Details

sadielbartholomew commented Aug 10, 2022 • edited Loading

davidhassell left a comment

Choose a reason for hiding this comment

davidhassell Aug 16, 2022

Choose a reason for hiding this comment

sadielbartholomew Aug 16, 2022

Choose a reason for hiding this comment

sadielbartholomew commented Nov 18, 2022 • edited Loading

Notes from discussion

Going forward

davidhassell Nov 20, 2022 • edited Loading

Choose a reason for hiding this comment

davidhassell Nov 22, 2022 • edited Loading

Choose a reason for hiding this comment

sadielbartholomew Nov 22, 2022 • edited Loading

Choose a reason for hiding this comment

davidhassell commented Nov 21, 2022

sadielbartholomew commented Nov 21, 2022 • edited Loading

davidhassell commented Nov 21, 2022

davidhassell left a comment

Choose a reason for hiding this comment

davidhassell Nov 22, 2022 • edited Loading

Choose a reason for hiding this comment

sadielbartholomew commented Nov 22, 2022

sadielbartholomew commented Dec 5, 2022 • edited Loading

davidhassell left a comment

Choose a reason for hiding this comment

sadielbartholomew commented Dec 6, 2022 • edited Loading

LAMA to Dask: fixes to `Query` & `Data.digitize` #427

LAMA to Dask: fixes to `Query` & `Data.digitize` #427

sadielbartholomew commented Aug 2, 2022 •

edited

Loading

sadielbartholomew commented Aug 10, 2022 •

edited

Loading

sadielbartholomew commented Nov 18, 2022 •

edited

Loading

davidhassell Nov 20, 2022 •

edited

Loading

davidhassell Nov 22, 2022 •

edited

Loading

sadielbartholomew Nov 22, 2022 •

edited

Loading

sadielbartholomew commented Nov 21, 2022 •

edited

Loading

davidhassell Nov 22, 2022 •

edited

Loading

sadielbartholomew commented Dec 5, 2022 •

edited

Loading

sadielbartholomew commented Dec 6, 2022 •

edited

Loading