Data.equals: add unit test & migrate to Dask #254
Hi @davidhassell, RE #254 (comment) and the (intermediate, towards this PR) daskification of On
Also fix some issues in the functions module so flake8 passes
aa3524b to 309e91d
(FYI @davidhassell, as you may have seen from notifications, I have pushed some new commits here, but please don't look at this until I tag you again to let you know it is ready - I have prepared review aids that I will share but I am thinking over some aspects before I finalise everything and open this for review.)
Hi again @davidhassell, all ready now, though depending on your thoughts regarding the approach to take with regards to laziness of
Please also see below for notes conveying the context of the PR and our discussions; they took place a while back now, so it is good to note down our thoughts. Note also I will push up a commit to add a test for the new
Notes for context
After our September discussion we realised we should amend the task graph from that suggested in #254 (comment) to the following, where I have drawn a red box to outline the key part, which is also the part that differs from the previous graph:
and I have implemented that. The overall logic corresponds to this interactive example, in case you wish to play around with it:
In [1]: import dask.array as da
...: import numpy as np
...: x = np.ma.array([1, 2, 3], mask=[1, 0, 0])
...: y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
...: dx = da.from_array(x)
...: dy = da.from_array(y)
In [2]: dx_mask = da.ma.getmaskarray(dx)
...: dy_mask = da.ma.getmaskarray(dy)
...: mask_comparison = da.equal(dx_mask, dy_mask)
In [3]: def da_ma_allclose(x, y, masked_equal=True, rtol=1e-05, atol=1e-08):
...:     x = da.asanyarray(x)
...:     y = da.asanyarray(y)
...:     return da.map_blocks(
...:         np.ma.allclose, x, y, masked_equal=masked_equal, rtol=rtol,
...:         atol=atol,
...:     )
...:
In [4]: data_comparison = da_ma_allclose(dx, dy, masked_equal=False)
In [5]: result = da.all(da.logical_and(mask_comparison, data_comparison))
In [6]: result.visualize(filename="cf_equals_test_003.png")
Out[6]: <IPython.core.display.Image object>
In [7]: result.compute()
Out[7]: False
Notice the graph from the above is as desired:
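For readers who want to see why the session above ends in False even though the two masks agree, here is a minimal NumPy-only sketch using the same x and y. It only exercises np.ma.allclose and np.ma.getmaskarray; the variable names masks_match, strict and lenient are introduced here for illustration and are not part of the PR.

```python
import numpy as np

# Same inputs as the interactive session: first element masked in both,
# with different underlying values hidden beneath the mask.
x = np.ma.array([1, 2, 3], mask=[1, 0, 0])
y = np.ma.array([999, 2, 3], mask=[1, 0, 0])

# The masks themselves are identical.
masks_match = bool(
    np.array_equal(np.ma.getmaskarray(x), np.ma.getmaskarray(y))
)

# masked_equal=False: masked positions count as unequal.
strict = bool(np.ma.allclose(x, y, masked_equal=False))

# masked_equal=True: masked positions count as equal.
lenient = bool(np.ma.allclose(x, y, masked_equal=True))

print(masks_match, strict, lenient)  # True False True
```

With masked_equal=False the data comparison alone is False, so combining it with the (all True) mask comparison via a logical AND gives False, matching Out[7].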
Hi Sadie - I think the logic is sound, but unfortunately it doesn't work for me if
dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))
which is a limitation/feature of
After some messing about I came up with:
import dask.array as da
import numpy as np
def allclose(a_blocks, b_blocks, rtol=1e-05, atol=1e-08):
    # AND together the per-block results, treating masked values as equal
    result = True
    for a, b in zip(a_blocks, b_blocks):
        result &= np.ma.allclose(a, b, rtol=rtol, atol=atol, masked_equal=True)
    return result
# -----------------------------------------
# Case 1
# -----------------------------------------
x = np.ma.array([1, 2, 3], mask=[1, 0, 0])
y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))
dx_mask = da.ma.getmaskarray(dx)
dy_mask = da.ma.getmaskarray(dy)
mask_comparison = da.allclose(dx_mask, dy_mask)
axes = tuple(range(dx.ndim))
data_comparison = da.blockwise(
    allclose, '', dx, axes, dy, axes, dtype=bool,
    rtol=1e-05, atol=1e-08
)
result = mask_comparison & data_comparison
print('CASE 1')
print('mask equal?', mask_comparison.compute())
print('data equal?', data_comparison.compute())
print(' equal?', result.compute())
# -----------------------------------------
# Case 2
# -----------------------------------------
x = np.ma.array([1, 999, 3], mask=[0, 1, 0]) # Different
y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))
dx_mask = da.ma.getmaskarray(dx)
dy_mask = da.ma.getmaskarray(dy)
mask_comparison = da.allclose(dx_mask, dy_mask)
axes = tuple(range(dx.ndim))
data_comparison = da.blockwise(
    allclose, '', dx, axes, dy, axes, dtype=bool,
    rtol=1e-05, atol=1e-08
)
result = mask_comparison & data_comparison
print('\nCASE 2')
print('mask equal?', mask_comparison.compute())
print('data equal?', data_comparison.compute())
print(' equal?', result.compute())
which produces, as required:
CASE 1
mask equal? True
data equal? True
equal? True
CASE 2
mask equal? False
data equal? True
equal? False
This assumes that
What do you think?
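The reduction that the allclose helper above performs over lists of blocks can be imitated without Dask, which may help when reasoning about it. The following is a rough sketch only: allclose_over_blocks and split are hypothetical helpers written for this illustration, the arrays are one-dimensional, and the blocks are assumed to be already aligned (which is what Dask arranges before calling the function).

```python
import numpy as np


def allclose_over_blocks(a_blocks, b_blocks, rtol=1e-05, atol=1e-08):
    """AND together np.ma.allclose over pairs of aligned blocks."""
    return all(
        bool(np.ma.allclose(a, b, rtol=rtol, atol=atol, masked_equal=True))
        for a, b in zip(a_blocks, b_blocks)
    )


def split(arr, boundaries):
    """Slice a 1-d array into consecutive blocks at the given indices."""
    edges = [0, *boundaries, arr.shape[0]]
    return [arr[i:j] for i, j in zip(edges[:-1], edges[1:])]


x = np.ma.array([1, 2, 3], mask=[1, 0, 0])
y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
z = np.ma.array([1, 2, 4], mask=[1, 0, 0])  # differs in an unmasked value

# Comparing the whole arrays and comparing block by block should agree.
whole = bool(np.ma.allclose(x, y, masked_equal=True))
by_blocks = allclose_over_blocks(split(x, [1]), split(y, [1]))
differs = allclose_over_blocks(split(x, [1]), split(z, [1]))

print(whole, by_blocks, differs)  # True True False
```

Because each block contributes a single boolean and the results are combined with AND, the answer should not depend on where the block boundaries fall, provided both inputs are cut at the same places.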
Hi @davidhassell, thanks for your detailed (re-)review, as ever. Good spot regarding the chunk sizes - I hoped Dask would handle all of that, but I guess with our custom Dask method we need to be careful. I will add in a test for the chunk sizes akin to your examples, and a proper unit test for the new
I'll start all of that after my dinner. I am also about to push up further developer notes, as quite a lot of other important aspects were discussed in relation to this PR!
Hi Sadie - I think we're pretty much there, excellent work.
I have "commented" rather than "requested changes", as we might need to sign off on a couple of things (perhaps the string data type question?).
Thanks,
David
# top-level dtype in the NumPy dtype hierarchy; see the
# 'Hierarchy of type objects' figure diagram under:
# https://numpy.org/doc/stable/reference/arrays.scalars.html#scalars
return np.issubdtype(dtype, np.number) or np.issubdtype(dtype, np.bool_)
I think this conversation is now resolved in the light of subsequent conversations, and the agreement to introduce new keywords (such as equal_nan).
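As a quick illustration of what the reviewed np.issubdtype expression accepts, here is a small sketch. The wrapper name is_numeric_or_bool is hypothetical (the function name in the PR is not shown in the excerpt); only the return expression is taken from the diff.

```python
import numpy as np


def is_numeric_or_bool(dtype):
    """Hypothetical wrapper around the expression from the reviewed diff."""
    return np.issubdtype(dtype, np.number) or np.issubdtype(dtype, np.bool_)


# Numeric and boolean dtypes pass; string and object dtypes do not.
for dt in ("float32", "int64", "complex128", "bool", "U3", "O"):
    print(dt, is_numeric_or_bool(np.dtype(dt)))
```

Note that np.bool_ sits outside np.number in the NumPy scalar hierarchy, which is why the second np.issubdtype call is needed.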
cf/test/test_Data.py
Outdated
d2 = cf.Data(a.astype(np.float32), "m")  # different datatype to d
self.assertTrue(d2.equals(d2))
with self.assertLogs(level=cf.log_level().value) as catch:
    self.assertFalse(d2.equals(d))
OK - I agree that d2.equals(d) is False!
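The behaviour agreed on here (arrays with equal values but different data types are not equal by default) can be sketched in plain NumPy. This is an illustration under assumptions, not the cf implementation: arrays_equal is a hypothetical helper, and the ignore_data_type keyword name is chosen for this example only.

```python
import numpy as np


def arrays_equal(a, b, ignore_data_type=False, rtol=1e-05, atol=1e-08):
    """Hypothetical strict comparison: shape, then dtype, then values."""
    a = np.asanyarray(a)
    b = np.asanyarray(b)
    if a.shape != b.shape:
        return False
    # By default, a dtype mismatch alone makes the arrays unequal.
    if not ignore_data_type and a.dtype != b.dtype:
        return False
    return bool(np.allclose(a, b, rtol=rtol, atol=atol))


a64 = np.arange(6.0).reshape(2, 3)  # float64
a32 = a64.astype(np.float32)        # same values, different dtype

print(arrays_equal(a32, a32))                         # True
print(arrays_equal(a32, a64))                         # False
print(arrays_equal(a32, a64, ignore_data_type=True))  # True
```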
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
Thanks once again for the feedback, @davidhassell. All should be resolved now. Note that for the issue outlined in the thread here, I have adjusted the test so that it tests for the agreed behaviour, but commented out the line with the assertion in question, which currently fails, with a note that the correct behaviour will be added via cfdm:
So it isn't resolved in the sense of being implemented, but it is ready in terms of this PR: I will open a follow-up issue on cfdm to cover it.
Just had a thought - it'd be good to set all of the dask arrays in all of the tests to have multiple dask chunks:
d = cf.Data(a, "m", chunks=(2, 2))
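To show what multiple chunks means at the Dask level, here is a minimal sketch using dask.array directly rather than cf.Data. The 3 x 4 shape of a is an assumption made for the example; the array used in the actual tests is not shown in this thread.

```python
import dask.array as da
import numpy as np

# Assumed test array; the real shape in test_Data.py may differ.
a = np.arange(12.0).reshape(3, 4)

single = da.from_array(a, chunks=-1)     # one chunk spanning the array
multi = da.from_array(a, chunks=(2, 2))  # several chunks per axis

print(single.numblocks)  # (1, 1)
print(multi.numblocks)   # (2, 2)
print(multi.chunks)      # ((2, 1), (2, 2))

# Element-wise comparison works across different chunkings because
# Dask aligns the chunks of the two operands first.
same = bool(da.all(single == multi).compute())
print(same)  # True
```

Testing with more than one chunk per axis exercises the code paths that combine per-chunk results, which a single-chunk array never reaches.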
Right-o, good idea. One moment and I'll push a commit doing that.
@davidhassell since my last comment I've:
Overall, I think we are ready to merge (ideally after we address the more general linting failures so the CI jobs can run cleanly).
Outstanding linting failures
What are/were
$ pre-commit run --all-files
Check python ast.........................................................Passed
Debug Statements (Python)................................................Passed
Fix End of Files.........................................................Passed
Trim Trailing Whitespace.................................................Passed
black....................................................................Passed
docformatter.............................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1
cf/data/creation.py:137:12: F821 undefined name 'is_small'
cf/data/creation.py:196:12: F821 undefined name 'is_very_small'
cf/data/creation.py:234:12: F821 undefined name 'is_small'
cf/data/creation.py:240:12: F821 undefined name 'is_small'
cf/data/creation.py:266:20: F821 undefined name 'is_small'
isort (python)...........................................................Passed
isort (cython).......................................(no files to check)Skipped
isort (pyi)..........................................(no files to check)Skipped
Hi Sadie,
I think "were"! And the easiest thing is probably just to comment out those lines - that won't stop any (relevant) unit tests from passing, and they'll get wiped anyway when #297 is merged.
Sorry, I realise now I had worded that badly - obviously they have gone as they are not defined, but by 'were' I meant to question whether it has gone AWOL and if so how to re-add it 🙂
Aha, sure - I didn't realise #297 would touch them. Will remove those and then we might be ready to merge...
OK just going to trigger the CI jobs via open-close, which should now pass as everything is good locally, including the linting...
All done!
Fix #251 & address the Data.equals method for #182.