Failure in pandas TestDataFrameToXArray.test_to_xarray_index_types #9661

Open
shoyer opened this issue Oct 22, 2024 · 17 comments

@shoyer
Member

shoyer commented Oct 22, 2024

It appears that #9520 may have broken some upstream pandas tests, specifically testing round-trips with various index types:
https://github.com/pandas-dev/pandas/blob/e78ebd3f845c086af1d71c0604701ec49df97228/pandas/tests/generic/test_to_xarray.py#L32

Here's a minimal test case:

import pandas as pd
import numpy as np

cat = pd.Categorical(list("abcd"))
df = pd.DataFrame({"f": cat}, index=cat)
restored = df.to_xarray().to_dataframe()
print(restored.index)  # Index(['a', 'b', 'c', 'd'], dtype='object', name='index')
print(df.index)  # CategoricalIndex(['a', 'b', 'c', 'd'], categories=['a', 'b', 'c', 'd'], ordered=False, dtype='category')

I'm not sure if this is a pandas or xarray issue, but it's one or the other!

(My guess is that most of these tests in pandas should probably live in xarray instead, given that we implement all the conversion logic.)

Originally posted by @shoyer in #9520 (comment)

@shoyer
Member Author

shoyer commented Oct 22, 2024

Here's the error message from pandas's TestDataFrameToXArray.test_to_xarray_index_types[string]:

AssertionError: Attributes of DataFrame.iloc[:, 5] (column name="f") are different

Attribute "dtype" are different
[left]:  CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False, categories_dtype=object)
[right]: object
self = <pandas.tests.generic.test_to_xarray.TestDataFrameToXArray object at 0x13d4fa7cbe90>
index_flat = Index(['pandas_0', 'pandas_1', 'pandas_2', 'pandas_3', 'pandas_4', 'pandas_5',
       'pandas_6', 'pandas_7', 'pandas_...pandas_93', 'pandas_94', 'pandas_95',
       'pandas_96', 'pandas_97', 'pandas_98', 'pandas_99'],
      dtype='object')
df = bar       a  b  c    d      e  f          g                         h
foo                                             ....0   True  c 2013-01-03 2013-01-03 00:00:00-05:00
pandas_3  d  4  6  7.0  False  d 2013-01-04 2013-01-04 00:00:00-05:00
using_infer_string = False

    def test_to_xarray_index_types(self, index_flat, df, using_infer_string):
        index = index_flat
        # MultiIndex is tested in test_to_xarray_with_multiindex
        if len(index) == 0:
            pytest.skip("Test doesn't make sense for empty index")
    
        from xarray import Dataset
    
        df.index = index[:4]
        df.index.name = "foo"
        df.columns.name = "bar"
        result = df.to_xarray()
        assert result.sizes["foo"] == 4
        assert len(result.coords) == 1
        assert len(result.data_vars) == 8
        tm.assert_almost_equal(list(result.coords.keys()), ["foo"])
        assert isinstance(result, Dataset)
    
        # idempotency
        # datetimes w/tz are preserved
        # column names are lost
        expected = df.copy()
        expected["f"] = expected["f"].astype(
            object if not using_infer_string else "string[pyarrow_numpy]"
        )
        expected.columns.name = None
>       tm.assert_frame_equal(result.to_dataframe(), expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 5] (column name="f") are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False, categories_dtype=object)
E       [right]: object

tests/generic/test_to_xarray.py:58: AssertionError

@shoyer
Member Author

shoyer commented Oct 22, 2024

cc @ilan-gold

@shoyer shoyer added the bug label Oct 22, 2024
@ilan-gold
Contributor

ilan-gold commented Oct 23, 2024

On it! More generally, @shoyer, with this extension array stuff, I would be happy to have a Zoom call to go over what all the various pandas adapters in the codebase do (I think they can be cut down somewhat, since a lot of the code has to do with numpy conversion) and/or sound out running the pandas integration tests in this repo. We are doing that now: https://github.com/scverse/integration-testing/pull/1/files where we check out everyone's repo and then test it against the core data structure on main of both. We could do something similar here if you wanted! That would minimize friction, I think (i.e., no need to migrate tests).

@ilan-gold
Contributor

@shoyer This issue is too tied up with datetimes, see: #9618. I will need to redo what I've done to work off that branch now. The issue is that pandas>2.0 implements its datetime handling as extension arrays, so if we start letting categorical indices into our indexing adapter, we let everything in, which breaks almost all of the datetime conversion.

@dcherian
Contributor

dcherian commented Oct 23, 2024

Can we explicitly cast DatetimeArray to datetime64[ns] for now? This won't always work, but we can just error out in that case.
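
A minimal sketch of the explicit cast suggested above, assuming a hypothetical helper named `cast_datetime_array` (it is not part of xarray). It converts timezone-naive pandas `DatetimeArray` values to `datetime64[ns]` and raises for timezone-aware data, since a NumPy dtype cannot carry the timezone:

```python
import numpy as np
import pandas as pd


def cast_datetime_array(values):
    """Hypothetical helper (not xarray API): cast a DatetimeArray to datetime64[ns]."""
    if not isinstance(values, pd.arrays.DatetimeArray):
        return values  # leave every other array type untouched
    if values.tz is not None:
        # The timezone cannot be stored in a NumPy dtype, so error out.
        raise TypeError("cannot cast timezone-aware datetimes to datetime64[ns]")
    return np.asarray(values).astype("datetime64[ns]")


naive = pd.date_range("2022-08-15", periods=3, freq="7D").array
print(cast_datetime_array(naive).dtype)  # datetime64[ns]
```

Raising a `TypeError` for timezone-aware input is one possible reading of "error out in that case"; the thread does not specify the exception type.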

@nataziel

This is definitely causing problems on v2024.10.0. I'm now getting an error when going from DataFrame -> Dataset -(error here)> DataArray. I'm starting with a DataFrame with a datetime index and 20ish columns. Relevant parts of the error trace:

/usr/local/lib/python3.10/dist-packages/xarray/core/dataset.py:7274: in to_dataarray
    data = duck_array_ops.stack([b.data for b in broadcast_vars], axis=0)
/usr/local/lib/python3.10/dist-packages/xarray/core/duck_array_ops.py:384: in stack
    return xp.stack(as_shared_dtype(arrays, xp=xp), axis=axis)
/usr/local/lib/python3.10/dist-packages/xarray/core/extension_array.py:100: in __array_function__
    res = HANDLED_EXTENSION_ARRAY_FUNCTIONS[func](*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = [<FloatingArray>
[-0.7418799885470463, -0.8171209666969853, -0.8805057639294221]
Length: 3, dtype: Float64, <FloatingA...: Float64, <FloatingArray>
[0.8056743037815669, 0.9631290662236986, 0.9960250670661343]
Length: 3, dtype: Float64, ...]
axis = 0

    @implements(np.stack)
    def __extension_duck_array__stack(arr: T_ExtensionArray, axis: int):
>       raise NotImplementedError("Cannot stack 1d-only pandas categorical array.")
E       NotImplementedError: Cannot stack 1d-only pandas categorical array.

/usr/local/lib/python3.10/dist-packages/xarray/core/extension_array.py:41: NotImplementedError

@ilan-gold
Contributor

@nataziel Can you send the dataframe you were using? Or ideally a minimal reproducer?

@shoyer I was not aware xarray "stacked" pandas 1D objects - we can just make it do this with .to_numpy or similar and hope for the best.

@nataziel

nataziel commented Oct 25, 2024

df = pd.DataFrame(
    {
        "sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221],
        "date": [
            Timestamp("2022-08-15 00:00:00"),
            Timestamp("2022-08-22 00:00:00"),
            Timestamp("2022-08-29 00:00:00"),
        ],
    },
)
df = df.astype("Float64", errors="ignore")

mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()

the above works in 2024.9.0 but not 2024.10.0

@nataziel

nataziel commented Dec 6, 2024

I was just testing the above with xarray==2024.11.0 and realised it's not as easily reproducible as it could be so try this:

import pandas as pd
import xarray as xr
from pandas import Timestamp

def main():
    df = pd.DataFrame(
        {
            "sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221],
            "date": [
                Timestamp("2022-08-15 00:00:00"),
                Timestamp("2022-08-22 00:00:00"),
                Timestamp("2022-08-29 00:00:00"),
            ],
        },
    )
    df = df.astype("Float64", errors="ignore")

    mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()


if __name__ == "__main__":
    main()

still failing in 2024.11.0 with this error

Traceback (most recent call last):
  File "C:\Users\me\git\xarray_test\main.py", line 22, in <module>
    main()
    ~~~~^^
  File "C:\Users\me\git\xarray_test\main.py", line 18, in main
    mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\dataset.py", line 7374, in to_array
    return self.to_dataarray(dim=dim, name=name)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\dataset.py", line 7357, in to_dataarray
    data = duck_array_ops.stack([b.data for b in broadcast_vars], axis=0)
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\duck_array_ops.py", line 397, in stack
    return xp.stack(as_shared_dtype(arrays, xp=xp), axis=axis)
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\extension_array.py", line 100, in __array_function__
    res = HANDLED_EXTENSION_ARRAY_FUNCTIONS[func](*args, **kwargs)
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\extension_array.py", line 41, in __extension_duck_array__stack
    raise NotImplementedError("Cannot stack 1d-only pandas categorical array.")
NotImplementedError: Cannot stack 1d-only pandas categorical array.

@ilan-gold
Contributor

ilan-gold commented Feb 4, 2025

@nataziel That makes sense to me - you're trying to make a numpy array out of heterogeneous extension array types, no? You tell the dataframe "give me Float64", which converts sin_order_1_year to a pandas extension array instead of a plain numpy array (see https://pandas.pydata.org/docs/dev/reference/api/pandas.Float64Dtype.html) - you can see this by calling df["sin_order_1_year"].values before and after your astype call.

I would convert those two "columns" (or the equivalent on the xarray object) to something that is internally-coherent and numpy-compatible and then call to_array (so not calling astype). The upshot of this whole extension array stuff is that your xarray object keeps its types (hooray). But the downside is that you now have to manage this stuff explicitly (sad) instead of xarray blackmagic. But that means things are easier for maintainers to reason about (hooray).
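
The before/after check described above can be sketched as follows. This uses only pandas and NumPy; the variable names and the shortened column data are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sin_order_1_year": [-0.7418799885470463, -0.8171209666969853]})

before = df["sin_order_1_year"].values
after = df.astype("Float64")["sin_order_1_year"].values

print(type(before).__name__)  # ndarray
print(type(after).__name__)  # FloatingArray

# Convert back to a plain NumPy array, mapping pd.NA to nan.
plain = df.astype("Float64")["sin_order_1_year"].to_numpy(dtype="float64", na_value=np.nan)
print(plain.dtype)  # float64
```

Calling `to_numpy` with an explicit `dtype` and `na_value` is one way to get the "internally-coherent and numpy-compatible" data mentioned above before stacking.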

@nataziel

nataziel commented Feb 5, 2025

@ilan-gold it's a contrived example for reproducibility, but the DataFrame is similar to something that was naturally produced in my codebase. I use the np.sin and np.cos functions to generate some arrays, which are of type Float64, then throw them into DataFrames. In a weird coincidence I was working on it yesterday before seeing your reply and fixed it on my end pretty much exactly as you described by manually wrangling the types. I see now and agree that it's not due to the datetime index but the dtype of the columns. That said, the reproducible error still exists.

If I have a Dataset, full of DataArrays with exclusively Float64 dtypes, I'd expect to be able to convert that to an array with .to_dataarray() or .to_array(). The example code works in xarray == "2024.9.0", but doesn't in versions up to 2025.01.2. I acknowledge that this just might not be a supported use case and there are workarounds, but at the very least it's a change in behaviour.

Complete speculation without looking at the xarray internals, but maybe you could check if the xarray dtypes to be stacked are homogeneous and allow stacking in that case? The error message stating that it's a categorical array is also a bit strange.

raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this issue Feb 5, 2025
Tests failing with newer xarray and/or pyreadstat

Author: Richard Shadrach, Rebecca N. Palmer <rebecca_palmer@zoho.com>
Origin: partly pandas-dev/pandas#60109
Bug: partly pydata/xarray#9661
Bug-Debian: https://bugs.debian.org/1088988
Forwarded: no


Gbp-Pq: Name 1088988_xarray_pyreadstat_compat.patch
@ilan-gold
Contributor

@nataziel To me it seems the previous behavior was probably not optimal (or even really sensible - I would not think a column name should become a coordinate).

<xarray.DataArray (variable: 1, date: 3)>
array([[-0.74187999, -0.81712097, -0.88050576]])
Coordinates:
  * date      (date) datetime64[ns] 2022-08-15 2022-08-22 2022-08-29
  * variable  (variable) object 'sin_order_1_year'

It's tough to say why this worked but it definitely feels wrong

@nataziel

nataziel commented Feb 5, 2025

The column name becoming the coordinate is the expected result no? In the 'fixed' version I'm using now with float64 arrays instead of Float64 arrays it works exactly that way. Dirty reproduction with xarray == '2025.01.2':

>>> import pandas as pd
>>> import xarray as xr
>>> import numpy as np
>>> arrays = np.sin([5, 3]*2)
>>> arrays
array([-0.95892427,  0.14112001, -0.95892427,  0.14112001])
>>> array2 = np.sin([9,5]*2)
>>> array2
array([ 0.41211849, -0.95892427,  0.41211849, -0.95892427])
>>> df = pd.DataFrame({'col1': arrays, 'col2': array2})
>>> df
       col1      col2
0 -0.958924  0.412118
1  0.141120 -0.958924
2 -0.958924  0.412118
3  0.141120 -0.958924
>>> df.dtypes
col1    float64
col2    float64
dtype: object
>>> dset = xr.Dataset.from_dataframe(df)
>>> dset
<xarray.Dataset> Size: 96B
Dimensions:  (index: 4)
Coordinates:
  * index    (index) int64 32B 0 1 2 3
Data variables:
    col1     (index) float64 32B -0.9589 0.1411 -0.9589 0.1411
    col2     (index) float64 32B 0.4121 -0.9589 0.4121 -0.9589
>>> dset.dtypes
Frozen({'col1': dtype('float64'), 'col2': dtype('float64')})
>>> darray = dset.to_array()
>>> darray
<xarray.DataArray (variable: 2, index: 4)> Size: 64B
array([[-0.95892427,  0.14112001, -0.95892427,  0.14112001],
       [ 0.41211849, -0.95892427,  0.41211849, -0.95892427]])
Coordinates:
  * index     (index) int64 32B 0 1 2 3
  * variable  (variable) object 16B 'col1' 'col2'
>>> darray.dtype
dtype('float64')

@ilan-gold
Contributor

Ok @nataziel I see what that function does now. I leave it up to the maintainers (@shoyer etc.) to decide what to do: whether extension arrays should be cast to numpy first in a to_array call, or whether they should error out as they do now, pushing that decision to the user.

The fact that this used to work was a symptom of a bug I also fixed: e649e13 in which we were not preserving correct pandas dtypes. But your issue is a different API - should to_array coerce pandas dtypes or not? That is not really up to me IMO. If anything, I would lean towards "no" given the fact that to_array is deprecated and it is now the more explicit to_dataarray in which case this function (despite its previous name) has no mention of numpy casting.

@shoyer
Member Author

shoyer commented Feb 7, 2025

I agree that xarray should probably be converting pandas's Float64 into np.float64. Xarray uses nan for missing values in array operations, so mixing that with NA from pandas is likely to lead to confusion.

It's less clear to me if we should automatically convert Int64. Potentially we could check if there are any missing values before deciding the NumPy dtype, but that is a somewhat expensive operation.
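
A rough sketch of the conversion policy discussed above, using a hypothetical helper named `masked_to_numpy` (not an xarray function). Falling back to `float64` for `Int64` data that contains missing values is an assumption made for illustration; the thread does not settle on a policy:

```python
import numpy as np
import pandas as pd


def masked_to_numpy(values):
    """Hypothetical helper (not xarray API): pick a NumPy dtype for a pandas masked array."""
    if isinstance(values.dtype, pd.Float64Dtype):
        # pd.NA becomes nan, matching xarray's missing-value convention.
        return values.to_numpy(dtype="float64", na_value=np.nan)
    if isinstance(values.dtype, pd.Int64Dtype):
        if values.isna().any():  # O(n) scan: the cost mentioned above
            # Assumption: fall back to float64 so missing values can be nan.
            return values.to_numpy(dtype="float64", na_value=np.nan)
        return values.to_numpy(dtype="int64")
    raise TypeError(f"unsupported dtype: {values.dtype}")


print(masked_to_numpy(pd.array([1, 2, 3], dtype="Int64")).dtype)  # int64
print(masked_to_numpy(pd.array([1, None, 3], dtype="Int64")).dtype)  # float64
```

The `isna().any()` call is the extra pass over the data that makes the check "somewhat expensive" for large arrays.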

@nataziel

nataziel commented Feb 9, 2025

I think we might be talking about different things here and I'm describing a different problem. To be 100% clear, the problem I'm describing is that I have a Dataset with homogeneous PandasExtensionArray types across the contained DataArrays, and since xarray==2024.10.0 turning that Dataset into a DataArray with Dataset.to_dataarray() fails.

import pandas as pd
import numpy as np
import xarray as xr


def main():
    df1 = pd.DataFrame(
        {
            "val1": [1, 2, 3],
            "val2": [4, 5, 6],
        }
    )

    print(df1)
    print(df1.dtypes)

    dset1 = xr.Dataset.from_dataframe(df1)

    print(dset1)
    print(dset1.dtypes)  # int64

    darray1 = dset1.to_dataarray()  # works fine

    df2 = df1.astype("Int64")

    print(df2.dtypes)

    dset2 = xr.Dataset.from_dataframe(df2)

    print(dset2) # Pandas "Int64" extension arrays
    print(dset2.dtypes) 

    darray2 = dset2.to_dataarray()  # error here


if __name__ == "__main__":
    main()

How I'd like this to work is that if the Dataset contains homogeneous types of DataArrays, then no type conversion occurs and they just get stacked together. This worked before 2024.10.0.

@ilan-gold
Contributor

@shoyer Why the difference in approach between the two? It seems like both could have NA in pandas, no?

Maybe to_dataarray could take a keyword arg like homogenize where the promise is that we will just put everything together (maybe subject to numpy type promotion rules), and this arg is default False. I think being super clear about data types and behavior is good.

If there's a precedent i.e., some other function for "xarray will take your data and convert to numpy without any indication", then maybe doing this by default is ok.
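
To make the `homogenize` idea concrete, here is a small sketch written against pandas and NumPy only. The function `stack_columns` is a hypothetical stand-in for the proposed keyword, not xarray's implementation, and casting every extension array to `float64` is a simplifying assumption in place of full type promotion:

```python
import numpy as np
import pandas as pd


def stack_columns(columns, homogenize=False):
    """Hypothetical stand-in for the proposed keyword; not xarray's implementation."""
    is_ext = [isinstance(c, pd.api.extensions.ExtensionArray) for c in columns]
    if any(is_ext) and not homogenize:
        # Default behaviour: refuse to guess, leaving the decision to the user.
        raise TypeError("extension arrays present; pass homogenize=True to cast to NumPy")
    converted = [
        # Assumption: numeric extension arrays are cast to float64, NA -> nan.
        c.to_numpy(dtype="float64", na_value=np.nan) if ext else np.asarray(c)
        for c, ext in zip(columns, is_ext)
    ]
    return np.stack(converted, axis=0)


df = pd.DataFrame({"val1": [1, 2, 3], "val2": [4, 5, 6]}).astype("Int64")
cols = [df[name].array for name in df.columns]
print(stack_columns(cols, homogenize=True).shape)  # (2, 3)
```

With the default `homogenize=False`, the same call raises, which mirrors the "be explicit about data types" behaviour argued for above.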

raspbian-autopush pushed a commit to raspbian-packages/pandas that referenced this issue Feb 13, 2025