Data.equals: add unit test & migrate to Dask #254

Merged

Conversation

sadielbartholomew (Member)

Fixes #251 and addresses the Data.equals method for #182.

@sadielbartholomew self-assigned this Aug 27, 2021
@sadielbartholomew marked this pull request as ready for review September 1, 2021 02:33
[Resolved review threads on cf/data/data.py]
@sadielbartholomew (Member Author)

sadielbartholomew commented Sep 3, 2021

Hi @davidhassell, re #254 (comment) and the (intermediate, towards this PR) daskification of _numpy_allclose (which I think we should rename, as per our comments in that thread): I have been investigating approaches, and the following is a summary of my thoughts so far.

On _numpy_allclose

Currently (pre-migration), the logic is as follows (a sketch of it appears after this list):

  1. Check whether a and b are MaskedArray objects.
  2. If neither is a MaskedArray: np.allclose(a, b) -> done.
  3. Else, if both are MaskedArray: compare the masks; if they differ -> False, done.
  4. Otherwise, only one is a MaskedArray: check whether it has a non-trivial (i.e. not all-False) mask; if so, the masks differ, so False -> done.
  5. Otherwise, at least one is a MaskedArray but the masks are effectively the same, so run np.ma.allclose to compare the data -> done.
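
A minimal sketch of that pre-migration logic (an editorial reconstruction for illustration only, with a hypothetical name - not the exact cf-python source):

import numpy as np

def _numpy_allclose_sketch(a, b, rtol=1e-05, atol=1e-08):
    # Hypothetical reconstruction of steps (1)-(5) above
    a_is_masked = isinstance(a, np.ma.MaskedArray)
    b_is_masked = isinstance(b, np.ma.MaskedArray)

    # (2) Neither input is masked: a plain comparison suffices
    if not a_is_masked and not b_is_masked:
        return np.allclose(a, b, rtol=rtol, atol=atol)

    # (3) Both masked: different masks mean not equal
    if a_is_masked and b_is_masked:
        if not np.array_equal(np.ma.getmaskarray(a), np.ma.getmaskarray(b)):
            return False
    else:
        # (4) Only one is masked: equal only if its mask is trivially False
        if np.any(np.ma.getmaskarray(a if a_is_masked else b)):
            return False

    # (5) The masks agree, so compare the data with the mask-aware allclose
    return np.ma.allclose(a, b, rtol=rtol, atol=atol)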

To migrate:

I am not yet sure whether this suggestion is the best approach, particularly as I am just getting to grips with all of the possibilities Dask provides and with which are more efficient and advisable to use in which cases, but it is at least a starting point...

  1. Check the type of the underlying array of a and b (this is type(dx._meta), reported in the repr under the name chunktype) to determine whether either is a numpy.MaskedArray rather than a numpy.ndarray (those are the only two options here, right, since the method is only applied to numpy arrays, I guess?).
  2. If one is a MaskedArray and the other an ndarray, the result is False -> done.
  3. If both are numpy.ndarray, just use da.allclose -> done.
  4. If both are MaskedArray, it's a bit more complicated, but something along the lines of the following does seem to work, and the visualized graph (included below) seems promising, though of course at both the equal and isclose steps it will have to wait on both possible parallel task routes, which is not ideal. I guess we would want to do the following blockwise, though that is not incorporated into the demo code:
In [1]: import dask.array as da

In [2]: import numpy as np

In [3]: x = np.ma.array([1, 2, 3], mask=[0, 1, 0])

In [4]: y = np.ma.array([999, 2, 3], mask=[1, 0, 0])

In [5]: dx = da.from_array(x)

In [6]: dy = da.from_array(y)

In [7]: dx_mask = da.ma.getmaskarray(dx)

In [8]: dy_mask = da.ma.getmaskarray(dy)

In [9]: dx_data = da.ma.getdata(dx)

In [10]: dy_data = da.ma.getdata(dy)

In [11]: mask_comparison = da.equal(dx_mask, dy_mask)

In [13]: data_comparison = da.isclose(dx_data, dy_data)

In [14]: result = da.all(da.logical_and(mask_comparison, data_comparison))

In [15]: result.visualize(filename="test007.png")
Out[15]: <IPython.core.display.Image object>

In [16]: result.compute()
Out[16]: False

In [17]: def compare(f, g):
    ...:     mask_comparison = da.equal(da.ma.getmaskarray(f), da.ma.getmaskarray(g))
    ...:     data_comparison = da.isclose(da.ma.getdata(f), da.ma.getdata(g))
    ...:     result = da.all(da.logical_and(mask_comparison, data_comparison))
    ...:     return result
    ...: 

In [19]: compare(np.ma.array([1, 2, 3], mask=[0, 1, 0]), np.ma.array([1, 2, 3],
    ...: mask=[0, 1, 0])).compute()
Out[19]: True

In [20]: compare(np.ma.array([1, 2, 3], mask=[0, 1, 0]), np.ma.array([1, 2, 3],
    ...: mask=[0, 1, 1])).compute()
Out[20]: False

In [21]: compare(np.ma.array([1, 2, 3], mask=[0, 1, 0]), np.ma.array([5, 2, 3],
    ...: mask=[0, 1, 0])).compute()
Out[21]: False

In [22]: compare(np.ma.array([1, 2, 3]), np.ma.array([5, 2, 3])).compute()
Out[22]: False

In [23]: compare(np.ma.array([1, 2, 3]), np.ma.array([1, 2, 3])).compute()
Out[23]: True

As the outputs indicate, the process given in (4) could also apply in the general case, i.e. to any combination of ndarray and MaskedArray inputs, but given the need to process both a mask array and a data array it presumably won't be as efficient, even under a Dask task workflow.

I should add that Dask doesn't (yet) have a da.ma.allclose method available that could be used instead of something like the above.
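
For reference, a minimal sketch of the chunk-type dispatch described in steps (1)-(3) above (hypothetical code with a hypothetical name, using only the public Dask API; note it returns a plain bool on one branch and a lazy 0-d array on the others, which a real implementation would make consistent):

import dask.array as da
import numpy as np

def _sketch_data_equals(dx, dy, rtol=1e-05, atol=1e-08):
    # (1) Inspect the underlying chunk type via each array's meta;
    # type(dx._meta) is the "chunktype" reported in a dask array's repr
    dx_masked = isinstance(dx._meta, np.ma.MaskedArray)
    dy_masked = isinstance(dy._meta, np.ma.MaskedArray)

    # (2) One masked, one plain ndarray: treat as not equal
    if dx_masked != dy_masked:
        return False

    # (3) Both plain ndarray chunks: a lazy allclose is enough
    if not dx_masked:
        return da.allclose(dx, dy, rtol=rtol, atol=atol)

    # (4) Both masked: compare masks and data separately, as in the demo above
    mask_comparison = da.equal(da.ma.getmaskarray(dx), da.ma.getmaskarray(dy))
    data_comparison = da.isclose(da.ma.getdata(dx), da.ma.getdata(dy),
                                 rtol=rtol, atol=atol)
    return da.all(da.logical_and(mask_comparison, data_comparison))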

Task graph from the above process for step (4):

[image: test007.png]

@davidhassell added the "dask" (Relating to the use of Dask) label Sep 30, 2021
@sadielbartholomew (Member Author)

(FYI @davidhassell, as you may have seen from notifications, I have pushed some new commits here, but please don't look at this until I tag you again to let you know it is ready. I have prepared review aids that I will share, but I am thinking over some aspects before I finalise everything and open this for review.)

@sadielbartholomew (Member Author)

Hi again @davidhassell, all ready now. Depending on your thoughts on the approach to take regarding the laziness of cf.Data (see the questions and comments I have added to our developer notes in 7a01cbe), I may want to revert the penultimate commit, which converts the overall approach to Case 1 (as covered in those notes), whereas previously I was abiding by Case 2.

Please also see below for notes conveying the context of the PR and our discussions, since they took place a while ago and it is good to note down our thoughts.

Note also that I will push up a commit adding a test for the new _da_ma_allclose internal method in a moment, but otherwise I am done here, bar any changes relating to the discussion of the approach to laziness, as above.


Notes for context

After our September discussion we realised we should amend the task graph from the one suggested in #254 (comment) to the following, where I have drawn a red box outlining the key part, i.e. what differs from the previous graph:

[image: amended task graph, 20211109_001606]

and I have implemented that. The overall logic corresponds to this interactive example, in case you wish to play around with it:

In [1]: import dask.array as da
   ...: import numpy as np
   ...: x = np.ma.array([1, 2, 3], mask=[1, 0, 0])
   ...: y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
   ...: dx = da.from_array(x)
   ...: dy = da.from_array(y)

In [2]: dx_mask = da.ma.getmaskarray(dx)
   ...: dy_mask = da.ma.getmaskarray(dy)
   ...: mask_comparison = da.equal(dx_mask, dy_mask)

In [3]: def da_ma_allclose(x, y, masked_equal=True, rtol=1e-05, atol=1e-08):
   ...:     x = da.asanyarray(x)
   ...:     y = da.asanyarray(y)
   ...:     return da.map_blocks(
   ...:         np.ma.allclose, x, y, masked_equal=masked_equal, rtol=rtol,
   ...:         atol=atol,
   ...:     )
   ...: 

In [4]: data_comparison = da_ma_allclose(dx, dy, masked_equal=False)

In [5]: result = da.all(da.logical_and(mask_comparison, data_comparison))

In [6]: result.visualize(filename="cf_equals_test_003.png")
Out[6]: <IPython.core.display.Image object>

In [7]: result.compute()
Out[7]: False

Notice the graph from the above is as desired:

[image: cf_equals_test_003.png]

@davidhassell (Collaborator)

Hi Sadie - I think the logic is sound, but unfortunately it doesn't work for me if dx and dy have different chunk sizes:

dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))

which is a limitation/feature of map_blocks, I think.
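
(Editorial aside: a quick illustration of the mismatch being described, and of the rechunk escape hatch that the solution below avoids needing - note that rechunking adds tasks to the graph, and that np.ma.allclose returning a scalar per block still needs care once there is more than one block, which is exactly what the block-wise reduction below handles:)

import dask.array as da
import numpy as np

dx = da.from_array(np.ma.array([1, 2, 3], mask=[1, 0, 0]), chunks=(1, 2))
dy = da.from_array(np.ma.array([999, 2, 3], mask=[1, 0, 0]), chunks=(3,))

print(dx.chunks, dy.chunks)  # ((1, 2),) ((3,),) - the block shapes differ

# Align dy's block boundaries with dx's before any per-block mapping
dy = dy.rechunk(dx.chunks)
print(dy.chunks)  # ((1, 2),)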

After some messing about I came up with:

import dask.array as da
import numpy as np

def allclose(a_blocks, b_blocks, rtol=1e-05, atol=1e-08):
    # Reduce over the lists of blocks that da.blockwise delivers when the
    # output index is '': every corresponding block pair must be allclose
    result = True
    for a, b in zip(a_blocks, b_blocks):
        result &= np.ma.allclose(a, b, rtol=rtol, atol=atol, masked_equal=True)

    return result

# -----------------------------------------
# Case 1
# -----------------------------------------
x = np.ma.array([1,   2, 3], mask=[1, 0, 0])
y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))

dx_mask = da.ma.getmaskarray(dx)
dy_mask = da.ma.getmaskarray(dy)
mask_comparison = da.allclose(dx_mask, dy_mask)

axes = tuple(range(dx.ndim))

data_comparison = da.blockwise(
    allclose, '', dx, axes, dy, axes, dtype=bool,
    rtol=1e-05, atol=1e-08
)

result = mask_comparison & data_comparison

print('CASE 1')
print('mask equal?', mask_comparison.compute())
print('data equal?', data_comparison.compute())
print('     equal?', result.compute())

# -----------------------------------------
# Case 2
# -----------------------------------------
x = np.ma.array([1,   999, 3], mask=[0, 1, 0])   # Different
y = np.ma.array([999,   2, 3], mask=[1, 0, 0])
dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))

dx_mask = da.ma.getmaskarray(dx)
dy_mask = da.ma.getmaskarray(dy)
mask_comparison = da.allclose(dx_mask, dy_mask)

axes = tuple(range(dx.ndim))

data_comparison = da.blockwise(
    allclose, '', dx, axes, dy, axes, dtype=bool,
    rtol=1e-05, atol=1e-08
)

result = mask_comparison & data_comparison

print('\nCASE 2')
print('mask equal?', mask_comparison.compute())
print('data equal?', data_comparison.compute())
print('     equal?', result.compute())

which produces, as required:

CASE 1
mask equal? True
data equal? True
     equal? True

CASE 2
mask equal? False
data equal? True
     equal? False

This assumes that dx and dy have the same shape, but we'll already have checked that and short-circuited if they haven't.

What do you think?
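
Two details of da.blockwise seem worth recording here (my reading of the Dask documentation, so treat as hedged rather than authoritative): first, because the output index is the empty string, every input dimension is contracted and each argument reaches the function as a list of blocks, which is why the allclose helper iterates over zip(a_blocks, b_blocks) and reduces with &=; second, blockwise unifies the operands' chunks before applying the function (its align_arrays argument defaults to True), which is presumably why the mismatched chunkings are tolerated. A small illustration of both points, with a hypothetical helper name:

import dask.array as da
import numpy as np

def count_block_pairs(a_blocks, b_blocks):
    # With an output index of '', each argument arrives as a list of blocks
    return len(list(zip(a_blocks, b_blocks)))

dx = da.from_array(np.arange(3), chunks=(1, 2))
dy = da.from_array(np.arange(3), chunks=(3,))

n = da.blockwise(count_block_pairs, '', dx, (0,), dy, (0,), dtype=int)

# Both operands are first re-chunked to the unified ((1, 2),) chunking,
# so two block pairs are delivered together:
print(n.compute())  # 2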

@sadielbartholomew (Member Author)

Hi @davidhassell, thanks for your detailed (re-)review, as ever. Good spot regarding the chunk sizes - I had hoped Dask would handle all of that, but I guess with our custom Dask method we need to be careful. I will add a test for the chunk sizes akin to your examples, and a proper unit test for the new allclose method, first of all, then try out your suggested solution - thanks for investigating and sharing it.

I'll start on all that after my dinner. I am also about to push up further developer notes, as quite a lot of other important aspects were discussed in relation to this PR!

@davidhassell left a comment (Collaborator)

Hi Sadie - I think we're pretty much there, excellent work.

I have "commented" rather than "requested changes", as we might need to sign off on a couple of things (perhaps the string data type question?).

Thanks,
David

cf/data/utils.py:
# top-level dtype in the NumPy dtype hierarchy; see the
# 'Hierarchy of type objects' figure diagram under:
# https://numpy.org/doc/stable/reference/arrays.scalars.html#scalars
return np.issubdtype(dtype, np.number) or np.issubdtype(dtype, np.bool_)
@davidhassell (Collaborator):
I think this conversation is now resolved in the light of subsequent conversations, and the agreement to introduce new keywords (such as equal_nan).
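
As a quick illustration of what that predicate admits (a hypothetical demo with a hypothetical name, not code from the PR):

import numpy as np

def is_numeric_dtype(dtype):
    # Mirrors the check quoted above: numbers and booleans pass,
    # while strings, datetimes and other non-numeric dtypes do not
    return np.issubdtype(dtype, np.number) or np.issubdtype(dtype, np.bool_)

print(is_numeric_dtype(np.dtype('float32')))  # True
print(is_numeric_dtype(np.dtype('int64')))    # True
print(is_numeric_dtype(np.dtype('bool')))     # True
print(is_numeric_dtype(np.dtype('U5')))       # False (fixed-width string)
print(is_numeric_dtype(np.dtype('M8[s]')))    # False (datetime64)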

cf/test/test_Data.py:
d2 = cf.Data(a.astype(np.float32), "m") # different datatype to d
self.assertTrue(d2.equals(d2))
with self.assertLogs(level=cf.log_level().value) as catch:
self.assertFalse(d2.equals(d))
@davidhassell (Collaborator):

OK - I agree that d2.equals(d) is False!

[Further resolved review threads on cf/test/test_Data.py, cf/data/dask_utils.py and cf/test/test_Data_utils.py]
@sadielbartholomew (Member Author)

Thanks once again for the feedback, @davidhassell. All should be resolved now. Note that, for the issue outlined in the thread here, I have adjusted the test so that it tests for the agreed behaviour, but commented out the line with the assertion in question, which currently fails, with a note that the correct behaviour will be added via cfdm:

https://github.com/NCAS-CMS/cf-python/pull/254/files#diff-bc5ca24d93eb1dce7c459430c3decfc73dd85fded3face2985e76b9565cf6881R284-R287

So it isn't resolved in the sense of being implemented, but it is ready in terms of this PR: I will open a follow-up issue on cfdm to cover it.

@davidhassell (Collaborator)

Just had a thought - it'd be good to set all of the dask arrays in all of the tests to have multiple dask chunks:

d = cf.Data(a, "m", chunks=(2, 2))

@sadielbartholomew (Member Author)

> Just had a thought - it'd be good to set all of the dask arrays in all of the tests to have multiple dask chunks:

Right-o, good idea. One moment and I'll push a commit doing that.

@sadielbartholomew (Member Author)

sadielbartholomew commented Jan 20, 2022

@davidhassell since my last comment I've:

  • added dask chunks on all cf.Data calls in test_Data_equals, and I will add chunks to the other Data-module tests as I go along with the daskification (and can throw chunks into the other already-ready tests in a lone commit to push post-merge);
  • handled the usage of the daskified decorator to reduce the verbosity of the daskification monitoring messages;
  • resolved most of the linting failures, despite many of them being unrelated to this PR, e.g. residual from the recent merge of master or from your initial LAMA-to-Dask work. However, I couldn't fully resolve the linting to get flake8 to pass via pre-commit run --all-files, because I wasn't sure of the nature of two undefined variables (see the detail given below): please could you clarify, so I can rectify it and the linting tests will pass?

Overall, I think we are ready to merge (ideally after we address the more general linting failures so the CI jobs can run cleanly).

Outstanding linting failures

What are/were is_small and is_very_small? Can they go or does some logic need to be (re-)added?

$ pre-commit run --all-files
Check python ast.........................................................Passed
Debug Statements (Python)................................................Passed
Fix End of Files.........................................................Passed
Trim Trailing Whitespace.................................................Passed
black....................................................................Passed
docformatter.............................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

cf/data/creation.py:137:12: F821 undefined name 'is_small'
cf/data/creation.py:196:12: F821 undefined name 'is_very_small'
cf/data/creation.py:234:12: F821 undefined name 'is_small'
cf/data/creation.py:240:12: F821 undefined name 'is_small'
cf/data/creation.py:266:20: F821 undefined name 'is_small'

isort (python)...........................................................Passed
isort (cython).......................................(no files to check)Skipped
isort (pyi)..........................................(no files to check)Skipped

@davidhassell (Collaborator)

Hi Sadie,

> What are/were is_small and is_very_small? Can they go or does some logic need to be (re-)added?

I think "were"! The easiest thing is probably just to comment out those lines - that won't stop any (relevant) unit tests from passing, and they'll get wiped anyway when #297 is merged.

@sadielbartholomew (Member Author)

sadielbartholomew commented Jan 21, 2022

> I think "were"!

Sorry, I realise now that I worded that badly - obviously they have gone, as they are not defined, but by "were" I meant to ask whether the logic has gone AWOL and, if so, how to re-add it 🙂

> and the easiest thing is probably just to comment out those lines - that won't stop any (relevant) unit tests from passing, and they'll get wiped anyway when #297 is merged.

Aha, sure - I didn't realise #297 would touch them. Will remove those, and then we might be ready to merge...

@sadielbartholomew (Member Author)

sadielbartholomew commented Jan 21, 2022

OK, just going to trigger the CI jobs via an open-close, which should now pass, as everything is good locally, including the linting...

@davidhassell left a comment (Collaborator)

All done!
