Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xr.testing.assert_equal does not test for dtype #4727

Open
mathause opened this issue Dec 23, 2020 · 5 comments
Open

xr.testing.assert_equal does not test for dtype #4727

mathause opened this issue Dec 23, 2020 · 5 comments

Comments

@mathause
Copy link
Collaborator

In #4622 @toddrjen points out that xr.testing.assert_equal does not test for the dtype, only for the value. Therefore the following does not raise an error:

import numpy as np
import xarray as xr
import pandas as pd

xr.testing.assert_equal(
    xr.DataArray(np.array(1, dtype=int)), xr.DataArray(np.array(1, dtype=float))
)
xr.testing.assert_equal(
    xr.DataArray(np.array(1, dtype=int)), xr.DataArray(np.array(1, dtype=object))
)
xr.testing.assert_equal(
    xr.DataArray(np.array("a", dtype=str)), xr.DataArray(np.array("a", dtype=object))
)

This comes back to numpy, i.e. the following is True:

np.array(1, dtype=int) == np.array(1, dtype=float)

Depending on the situation one or the other is desirable or not. Thus, I would suggest to add a check_dtype argument to xr.testing.assert_equal and also to DataArray.equals (and Dataset and Variable and identical). I have not seen such an option in numpy, but pandas has it (e.g. pd.testing.assert_series_equal(left, right, check_dtype=True, ...) . I would not change __eq__.

  • Thoughts?
  • What should the default be? We could try True first and see how many failures this creates?
  • What to do with coords and indexes? pd.testing.assert_series_equal has a check_index_type keyword. Probably we need check_coords_type as well? This makes the whole thing much more complicated... Also Coordinate dtype changing to object after xr.concat #4543
@dcherian
Copy link
Contributor

+1 on adding check_dtype. I think it should be default False for assert_equal.

It could be default True for assert_identical but this would be very backward incompatible.

@keewis
Copy link
Collaborator

keewis commented Dec 23, 2020

this has come up in #3706 (comment) before.

@toddrjen
Copy link
Contributor

My concern with assert_identical is the name. It implies, to me, that there is no difference at all between the two objects. It was highly unexpected for me that it didn't do that. I think at the very least it should be clarified in the documentation that it doesn't do that.

If the default for assert_identical isn't change, I wonder whether a new function might be worthwhile. I am concerned having to append check_dtype=True for every test would hurt test clarity. And there is also the problem with

Also, just checking dtype won't be sufficient in all cases. Consider this:

import numpy as np 
import xarray as xr 

a = xr.DataArray(np.array(1.0, dtype=np.object))                                                                                                                                                                                          
b = xr.DataArray(np.array(1, dtype=np.object))                                                                                                                                                                                            

xr.testing.assert_identical(a, b)  

I think for the purpose of testing being able to make sure the result is exactly what you expect is important.

@mathause
Copy link
Collaborator Author

mathause commented Dec 24, 2020

There seems to be a subtle difference between comparing Dataset and DataArray.

DataArray compares coords:

return utils.dict_equiv(self.coords, other.coords, compat=compat) and compat(

while Dataset compares _variables:

xarray/xarray/core/dataset.py

Lines 1457 to 1459 in 03d8d56

return self._coord_names == other._coord_names and utils.dict_equiv(
self._variables, other._variables, compat=compat
)

Thus once we compare xarray.core.variable.IndexVariable and once we compare xarray.core.dataarray.DataArray and the comparison takes different code paths...

import xarray as xr
air = xr.tutorial.open_dataset("air_temperature")

type(air._variables["lat"])
# xarray.core.variable.IndexVariable

type(air.air.coords["lat"])
# xarray.core.dataarray.DataArray

edit: typo

@max-sixty
Copy link
Collaborator

I agree with @toddrjen that the words of assert_identical strongly suggest that the dtypes would be checked.

I would suggest we either:

  • Fail on different dtypes in 0.17
  • If there are lots of breaks / downstream libraries that may break: add a FutureWarning on different dtypes and change in the future

Happy to help on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants