Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting equality checking in a way xarray understands #29

Closed
TomNicholas opened this issue Mar 14, 2024 · 1 comment
Closed

Supporting equality checking in a way xarray understands #29

TomNicholas opened this issue Mar 14, 2024 · 1 comment
Labels
xarray Requires changes to xarray upstream

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Mar 14, 2024

It would be useful (especially for testing purposes) to be able to compare xarray datasets containing ManifestArrays for equality.

It's clear when two ManifestArray objects should be considered equivalent - they have the exact same chunkmanifest and zarray info. However, just simply returning this condition from .__eq__ has a few issues:

  1. The array API standard expects .__eq__ to return an entire array of booleans rather than a single scalar. I could make that array, but it seems like a weird thing to do, as you can't actually access the individual elements (only indirect representations of individual chunks). Actually, I could even go further and define equality on a per-chunk basis, returning a numpy array of booleans that would have False at any locations corresponding to non-equivalent ChunkEntry objects in the manifest... But you can't then use this mask to index the array. It also seems like the array API standard expects the returned array type to be Self, i.e. another ManifestArray, not np.ndarray.

  2. Xarray checks equivalence of wrapped arrays in a complicated lazy NaN-aware way that gets tripped up by ManifestArray objects. Following the stack down from xarray.testing.assert_equal, it goes through Variable.equals, duck_array_ops.array_equiv, and calls duck_array_ops.isnull. At this point as ManifestArray doesn't define any form of isnull, an AttributeError is raised, which is caught in Variable.equals here, and False is always returned as the result of the equality assertion.

The problematic section is

flag_array = (arr1 == arr2) | (isnull(arr1) & isnull(arr2))
return bool(flag_array.all())

If this condition were just (arr1 == arr2) then we would be fine...

Inside duck_array_ops.isnull theres a weird part that I don't understand. Copied verbatim:

scalar_type = data.dtype.type
...
elif issubclass(scalar_type, (np.bool_, np.integer, np.character, np.void)):
    # these types cannot represent missing values
    return full_like(data, dtype=bool, fill_value=False)
else:
    # at this point, array should have dtype=object
    if isinstance(data, np.ndarray):
        return pandas_isnull(data)
    else:
        # Not reachable yet, but intended for use with other duck array
        # types. For full consistency with pandas, we should accept None as
        # a null value as well as NaN, but it isn't clear how to do this
        # with duck typing.
        return data != data

This is going to behave differently for float vs integer dtypes. To support both of these cases, we presumably want to implement both full_like and isnan, and have them both return arrays of False on all ManifestArrays...

@TomNicholas TomNicholas added the xarray Requires changes to xarray upstream label Mar 14, 2024
@TomNicholas
Copy link
Member Author

Mostly closed by #30

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
xarray Requires changes to xarray upstream
Projects
None yet
Development

No branches or pull requests

1 participant