Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LAMA to Dask: fixes to Query & Data.digitize #427

Conversation

sadielbartholomew
Copy link
Member

@sadielbartholomew sadielbartholomew commented Aug 2, 2022

Reinstate a commented-out test case, now ready since pre-requisites have been satisfied, that calls where and tests digitize and in doing so reveals a bug in the migrated state of both (:flushed:), which this PR also fixes. See below for the breakdown.

(The fixes applied may not be the most appropriate course of action, or need tweaking, so I've applied them in self-contained commits which can be reverted if necessary. The most important thing here my outline of the issues arising below, which should be addressed by some means.)

Details

Overall this PR:

  1. Reinstates the test case in question, as-was, in f86923f.

  2. Fixes the first issue encountered when reinstating, namely that NoneType errors were being hit:

    ======================================================================
    ERROR: test_Data_digitize (__main__.DataTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test_Data.py", line 829, in test_Data_digitize
        e.where(
      File "/home/sadie/cf-python/cf/decorators.py", line 62, in precede_with_kwarg_deprecation_check
        operation_method_result = operation_method(self, *args, **kwargs)
      File "/home/sadie/cfdm/cfdm/decorators.py", line 44, in inplace_wrapper
        processed_copy = operation_method(self, *args, **kwargs)
      File "/home/sadie/cfdm/cfdm/decorators.py", line 171, in verbose_override_wrapper
        return method_with_verbose_kwarg(*args, **kwargs)
      File "/home/sadie/cf-python/cf/data/data.py", line 93, in wrapper
        return method(*args, **kwargs)
      File "/home/sadie/cf-python/cf/data/data.py", line 9941, in where
        condition = condition.evaluate(d)
    AttributeError: 'NoneType' object has no attribute 'evaluate'
    
    ----------------------------------------------------------------------

    ultimately due to rogue assignment to set_condition_units which returns nothing, such that it seems highly likely that calling without assigning was the intended action, and a fix has been applied as such in e450a40.

  3. Fixes the next issue revealed in the stack trace, this one occurring because Query.set_condition_units was introducing an extra dimension for unitless values which was manifesting as:

    ======================================================================
    ERROR: test_Data_digitize (__main__.DataTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test_Data.py", line 829, in test_Data_digitize
        e.where(
      File "/home/sadie/cf-python/cf/decorators.py", line 62, in precede_with_kwarg_deprecation_check
        operation_method_result = operation_method(self, *args, **kwargs)
      File "/home/sadie/cfdm/cfdm/decorators.py", line 44, in inplace_wrapper
        processed_copy = operation_method(self, *args, **kwargs)
      File "/home/sadie/cfdm/cfdm/decorators.py", line 171, in verbose_override_wrapper
        return method_with_verbose_kwarg(*args, **kwargs)
      File "/home/sadie/cf-python/cf/data/data.py", line 93, in wrapper
        return method(*args, **kwargs)
      File "/home/sadie/cf-python/cf/data/data.py", line 9944, in where
        _where_broadcastable(d, condition, "condition")
      File "/home/sadie/cf-python/cf/data/data.py", line 12033, in _where_broadcastable
        raise ValueError(
    ValueError: where: Broadcasting the 'condition' parameter with 4 dimensions would change the shape of the data with 3 dimensions
    
    ----------------------------------------------------------------------

    with a sensible fix applied in 1aec661 (though in this case something else may be required in the unitless case, so not sure if straight-up removing the line is suitable as the general fix, this is to be discussed...).

  4. Fixes the final issue revealed by the reinstated test, in 5bb5766, specifically that the core assertion checking equality between the intended and actual results was failing:

    ======================================================================
    FAIL: test_Data_digitize (__main__.DataTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test_Data.py", line 836, in test_Data_digitize
        self.assertTrue(e.equals(f, verbose=2))
    AssertionError: False is not true
    
    ----------------------------------------------------------------------

    because (after much investigation into what was actually going wrong once that key assertion was finally reached!) the Data.digitize method itself created a list called delete_bins that was not used later in the code to do anything that had an effect on the outcome, when clearly it was designed to do something...

    I cross-referenced the migration PR (dask: Dask.digitize #312) and old master code after which it seemed that the following block had not been incorporated:

    cf-python/cf/data/data.py

    Lines 2647 to 2652 in 07cb04f

    if delete_bins:
    for n, d in enumerate(delete_bins):
    d -= n
    array = numpy_ma_where(array == d, numpy_ma_masked, array)
    array = numpy_ma_where(array > d, array - 1, array)
    # --- End: if

    so I introduced it and adapted it to the new way, and this change did indeed (finally!) get the test case to pass, without causing side effects elsewhere.

So I think (4) concludes the multi-fix, but let me know what you think, @davidhassell! Thanks.

@sadielbartholomew sadielbartholomew added the dask Relating to the use of Dask label Aug 2, 2022
@sadielbartholomew sadielbartholomew self-assigned this Aug 2, 2022
@sadielbartholomew sadielbartholomew changed the title LAMA to Dask: fixes relating to Data.where LAMA to Dask: fixes to Data.where, .digitize Aug 2, 2022
@sadielbartholomew sadielbartholomew changed the title LAMA to Dask: fixes to Data.where, .digitize LAMA to Dask: fixes to Data.where Aug 2, 2022
@sadielbartholomew sadielbartholomew changed the title LAMA to Dask: fixes to Data.where LAMA to Dask: fixes to Data.where, .digitize Aug 3, 2022
@sadielbartholomew sadielbartholomew marked this pull request as ready for review August 3, 2022 19:36
@sadielbartholomew
Copy link
Member Author

sadielbartholomew commented Aug 10, 2022

@davidhassell I have made updates in line with our discussion from last week, notably your suggestion regarding part (4) (see my opening comment) here, in the two latest commits. As far as I am concerned this is ready to go, so please review when you are back from leave and have time. Thanks.

Copy link
Collaborator

@davidhassell davidhassell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sadie - the digitize and where changes look good, but I have a question on the Query changes ...

cf/query.py Outdated
# Value has no units
value = Data(value, units=units)
else:
if value_units is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this - it doesn't work if value does not have a Units attribute:

>>> q = cf.lt(6)
>>> q._value
6
>>> q.set_condition_units('m')
>>> q._value
6  # No units here!

The orginal code does:

>>> q._value
6
>>> q.set_condition_units('m')
>>> q._value
<CF Data(): 6 m>  # Seems right to me ...

How was the original code causing problems?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David, I'm updating this in line with our discussion in the video call today which supersedes your comment here but obviously concerns the same line change and underlying issue.

@davidhassell davidhassell added this to the 3.14.0 milestone Nov 14, 2022
@sadielbartholomew
Copy link
Member Author

sadielbartholomew commented Nov 18, 2022

Back onto this, with apologies for letting it fall off my radar! @davidhassell, to continue the conversation where we left off after the first round of feedback, namely regarding the outstanding piece of feedback #427 (comment)...

The last progress made towards this PR was back in August after your initial review where we both had a brief external pair-programming type session to investigate the query aspect of this PR together. Since I made written notes rather than electronic ones I've had to dig those out, and now I've found them, let me summarise what we found (hopefully it maps roughly onto any memory you might have of this conversation!).

Notes from discussion

  • Overall, there is an underlying issue with the query module, ultimately manifesting in the error I stated in my opening comment (case 3), but I hadn't fixed it with 1aec661 in a correct way that accounted for all possibilities sensibly, as you pointed out for a different case.
  • With some debugging we realised that the issue was due fundamentally to sequences of values being put into a container array, introducing an extra dimension, which caused the test failure, when those should have been handled properly. To demonstrate this via the failing test, as shown here on the original code i.e. before any suggestions by this PR, if I add in some print statements, as follows:
diff --git a/cf/query.py b/cf/query.py
index 67932f9f8..9a38cbffe 100644
--- a/cf/query.py
+++ b/cf/query.py
@@ -788,14 +788,19 @@ class Query:
             return
 
         value_units = getattr(value, "Units", None)
+        print("\n%%%%% 0 VALUE UNITS IS", value, value_units)
         if value_units is None:
             # Value has no units
+            print("\n%%%%% 1.0 IS\n", value, value_units)
             value = Data(value, units=units)
+            print("\n%%%%% 1.1 IS\n", value, value_units, value.shape)
         else:
             # Value already has units
             try:
+                print("\n%%%%% 2 IS", value, value_units, value.shape)
                 value.Units = units
             except ValueError:
+                print("\n%%%%% 3 IS", value, value_units, value.shape)
                 raise ValueError(
                     f"Units {units!r} are not equivalent to "
                     f"query condition units {value_units!r}"

then we see by running test_Data:

test_Data_digitize (__main__.DataTest) ... 
%%%%% 0 VALUE UNITS IS [<CF Data(1, 1, 1): [[[0]]]>, <CF Data(1, 1, 1): [[[5]]]>] None

%%%%% 1.0 IS
 [<CF Data(1, 1, 1): [[[0]]]>, <CF Data(1, 1, 1): [[[5]]]>] None

%%%%% 1.1 IS
 [[[[0, 5]]]] None (2, 1, 1, 1)
ERROR

which shows that the Data call introduced an extra dimension when the value input was a list of Data elements rather than a singular Data, hence the ultimate error thrown.

  • In terms of how to handle the sequences of Data case causing issue as above, I didn't write down the approach we tentatively agreed to (my bad for not finalising this PR shortly after we discussed it!) but the natural approach and the one that seems to match my vague memory of the conversation for what we agreed would be sensible as a solution is to check if the value is a sequence of Data and if so concatenate the separate Data together and then apply the specified units. This has been implemented in my new commit 793e7ce which I applied after reverting the commit making the initial changes to the query module.

Note it passes, but only if the current concatenate method code is copied from the lama-to-dask branch since when I started this PR that method had not yet been migrated to Dask, so otherwise the test fails on trivial migration-related issues with the call to concatenate.

Going forward

Let me know if you agree with the above summary of the underlying issue and the suggested solution, notably see the two new commits (one just to revert the previous changes touching the query code to be clean), and either way how we can progress. Thanks.

cf/query.py Outdated
@@ -790,7 +791,11 @@ def set_condition_units(self, units):
value_units = getattr(value, "Units", None)
if value_units is None:
# Value has no units
value = Data(value, units=units)
if isinstance(value, Iterable): # may be a sequence of Data
Copy link
Collaborator

@davidhassell davidhassell Nov 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit Hi Sadie, I think what I wrote here may well be nonsense! Please ignore, and the code snippet and I'll have another think ...

Hi Sadie, not sure this will work in all cases:

  • Data objects are Iterable
  • When value is a sequence, we want it to remain a sequence, not convert it into a Data object, and Data.concatenate doesn't work for non-Data objects

This code might make sense, but it's not very pretty!

IGNORE


        value_units = getattr(value, "Units", None)
        if value_units is None:
            # Value has no units
            if isinstance(value, Iterable) and not isinstance(value, str):
                new = []
                for v in value:
                    value_units = getattr(v, "Units", None)
                    if value_units is None:
                        # Value has no units
                        v = Data(v, units=units)
                    else:
                        # Value already has units
                        try:
                            v.Units = units
                        except ValueError:
                            raise ValueError(
                                f"Units {units!r} are not equivalent to "
                                f"query condition units {value_units!r}"
                            )
                        
                    new.append(v)

                value = new
            else:
                value = Data(value, units=units)        
        else:
            # Value already has units
            try:
                value.Units = units
            except ValueError:
                raise ValueError(
                    f"Units {units!r} are not equivalent to "
                    f"query condition units {value_units!r}"
                )

Copy link
Collaborator

@davidhassell davidhassell Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I think I worked out what's going on. The key is that the contents of a "set", "wi" or "wo" iterable are not necessarily Data objects (or rather objects with Units). How about this? There's scope for a bit of code re-use here, but I can't quite see how, yet.

        value_units = getattr(value, "Units", None)
        if value_units is None:
            # Value has no units
            if self.operator in ("wi", "wo", "set"):
                # value is a sequence of things that may or may not
                # already have units
                new = []
                for v in value:
                    v_units = getattr(v, "Units", None)
                    if v_units is None:
                        v = Data(v, units=units)
                    else:
                        try:
                            v = v.copy()
                            v.Units = units
                        except ValueError:
                            raise ValueError(
                                f"Units {units!r} are not equivalent to "
                                f"query condition units {v_units!r}"
                            )
                        
                    new.append(v)

                value = new
            else:
                value = Data(value, units=units)
        else:
            # Value already has units
            try:
                value = value.copy()
                value.Units = units
            except ValueError:
                raise ValueError(
                    f"Units {units!r} are not equivalent to "
                    f"query condition units {value_units!r}"
                )

        self._value = value

We could do with some more units test for this, too ...

Copy link
Member Author

@sadielbartholomew sadielbartholomew Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added your code (putting you as a formal co-author since you wrote it, though I don't think I can specify you as 'full' author since I'm committing it!) and then consolidated it with an inner helper function. I'll add some new testing first thing tomorrow.

@davidhassell
Copy link
Collaborator

Hi Sadie - I think we have a bit of an issue here, as some of this code has already been modified in #464 - could you merge lama-to-dask into your branch and have another look :) Thanks

@sadielbartholomew
Copy link
Member Author

sadielbartholomew commented Nov 21, 2022

could you merge lama-to-dask into your branch and have another look

Hi David, no worries. I've done as you ask but it seems to have merged cleanly, and upon investigation it looks like the one commit where I touched where code, namely e450a40, to fix one issue, had since I opened this PR been merged in to lama-to-dask elsewhere, and in fact I see you did make the exact same change in #464. So we're all good with respect to this - there's no issue!

Overall, with my new merge commit, the test_Data tests all still pass (apart from the one that is still skipped), so everything is good to go still as far as I am concerned, but I await your thoughts regarding:

I think what I wrote here may well be nonsense! Please ignore, and the code snippet and I'll have another think ...

in case changes are necessary with regards to that. Thanks!

@davidhassell
Copy link
Collaborator

Thanks, Sadie. I'm now going to stop dipping in to this piecemeal and do a proper review later today/tomorrow, Sorry for any extra confusion I have wrought!

Copy link
Collaborator

@davidhassell davidhassell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sadie, I've made a suggestion for the changes to Query (and a request for some more tests), otherwise all good. A small amount of code caused a lot of thought, here!

cf/data/data.py Outdated Show resolved Hide resolved
cf/query.py Outdated
@@ -790,7 +791,11 @@ def set_condition_units(self, units):
value_units = getattr(value, "Units", None)
if value_units is None:
# Value has no units
value = Data(value, units=units)
if isinstance(value, Iterable): # may be a sequence of Data
Copy link
Collaborator

@davidhassell davidhassell Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I think I worked out what's going on. The key is that the contents of a "set", "wi" or "wo" iterable are not necessarily Data objects (or rather objects with Units). How about this? There's scope for a bit of code re-use here, but I can't quite see how, yet.

        value_units = getattr(value, "Units", None)
        if value_units is None:
            # Value has no units
            if self.operator in ("wi", "wo", "set"):
                # value is a sequence of things that may or may not
                # already have units
                new = []
                for v in value:
                    v_units = getattr(v, "Units", None)
                    if v_units is None:
                        v = Data(v, units=units)
                    else:
                        try:
                            v = v.copy()
                            v.Units = units
                        except ValueError:
                            raise ValueError(
                                f"Units {units!r} are not equivalent to "
                                f"query condition units {v_units!r}"
                            )
                        
                    new.append(v)

                value = new
            else:
                value = Data(value, units=units)
        else:
            # Value already has units
            try:
                value = value.copy()
                value.Units = units
            except ValueError:
                raise ValueError(
                    f"Units {units!r} are not equivalent to "
                    f"query condition units {value_units!r}"
                )

        self._value = value

We could do with some more units test for this, too ...

Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
@sadielbartholomew
Copy link
Member Author

The key is that the contents of a "set", "wi" or "wo" iterable are not necessarily Data objects (or rather objects with Units.

Aha, that makes so much sense now you have worked it out and summarised it! And the change block you suggest follows on sensibly from that. I'll get the equivalent code updated in line with your suggestion. Thanks!

We could do with some more units test for this, too ...

Agreed. Right-o, I'll increase the coverage in this respect as part of this PR. I'll update it shortly and tag you to let you know we're ready for a re-review.

@sadielbartholomew sadielbartholomew changed the title LAMA to Dask: fixes to Data.where, .digitize LAMA to Dask: fixes to Query & Data.digitize Nov 22, 2022
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
@davidhassell davidhassell self-requested a review November 28, 2022 14:10
@sadielbartholomew
Copy link
Member Author

sadielbartholomew commented Dec 5, 2022

Sorry @davidhassell before I went on leave I think I forgot to confirm that this is ready for re-review now after I updated this in line with your suggestion and made some consolidations too. Please let me know what you think.

Copy link
Collaborator

@davidhassell davidhassell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sadie - A nice solution, thanks. I've made a couple of comments, but go ahead and merge when you're ready.

cf/query.py Outdated Show resolved Hide resolved
@sadielbartholomew sadielbartholomew dismissed davidhassell’s stale review December 6, 2022 17:15

Out-of-date now, later comments made and raise an approval.

@sadielbartholomew
Copy link
Member Author

sadielbartholomew commented Dec 6, 2022

Thanks for all of your very useful feedback on this @davidhassell. Final round of feedback addressed in my latest commit, so I'll merge as advised and open up one Issue as discussed.

@sadielbartholomew sadielbartholomew merged commit f17e08e into NCAS-CMS:lama-to-dask Dec 6, 2022
@sadielbartholomew sadielbartholomew deleted the lama-to-dask-where-fixes branch December 6, 2022 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Relating to the use of Dask
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants