Failure in pandas TestDataFrameToXArray.test_to_xarray_index_types #9661

shoyer · 2024-10-22T18:01:19Z

It appears that #9520 may have broken some upstream pandas tests, specifically testing round-trips with various index types:
https://github.com/pandas-dev/pandas/blob/e78ebd3f845c086af1d71c0604701ec49df97228/pandas/tests/generic/test_to_xarray.py#L32

Here's a minimal test case:

import pandas as pd
import numpy as np

cat = pd.Categorical(list("abcd"))
df = pd.DataFrame({"f": cat}, index=cat)
restored = df.to_xarray().to_dataframe()
print(restored.index)  # Index(['a', 'b', 'c', 'd'], dtype='object', name='index')
print(df.index)  # CategoricalIndex(['a', 'b', 'c', 'd'], categories=['a', 'b', 'c', 'd'], ordered=False, dtype='category')

I'm not sure if this is a pandas or xarray issue, but it's one or the other!

(My guess is that most of these tests in pandas should probably live in xarray instead, given that we implement all the conversion logic.)

Originally posted by @shoyer in #9520 (comment)

shoyer · 2024-10-22T18:01:40Z

Here's the error message from pandas's TestDataFrameToXArray.test_to_xarray_index_types[string]:

AssertionError: Attributes of DataFrame.iloc[:, 5] (column name="f") are different

Attribute "dtype" are different
[left]:  CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False, categories_dtype=object)
[right]: object
self = <pandas.tests.generic.test_to_xarray.TestDataFrameToXArray object at 0x13d4fa7cbe90>
index_flat = Index(['pandas_0', 'pandas_1', 'pandas_2', 'pandas_3', 'pandas_4', 'pandas_5',
       'pandas_6', 'pandas_7', 'pandas_...pandas_93', 'pandas_94', 'pandas_95',
       'pandas_96', 'pandas_97', 'pandas_98', 'pandas_99'],
      dtype='object')
df = bar       a  b  c    d      e  f          g                         h
foo                                             ....0   True  c 2013-01-03 2013-01-03 00:00:00-05:00
pandas_3  d  4  6  7.0  False  d 2013-01-04 2013-01-04 00:00:00-05:00
using_infer_string = False

    def test_to_xarray_index_types(self, index_flat, df, using_infer_string):
        index = index_flat
        # MultiIndex is tested in test_to_xarray_with_multiindex
        if len(index) == 0:
            pytest.skip("Test doesn't make sense for empty index")
    
        from xarray import Dataset
    
        df.index = index[:4]
        df.index.name = "foo"
        df.columns.name = "bar"
        result = df.to_xarray()
        assert result.sizes["foo"] == 4
        assert len(result.coords) == 1
        assert len(result.data_vars) == 8
        tm.assert_almost_equal(list(result.coords.keys()), ["foo"])
        assert isinstance(result, Dataset)
    
        # idempotency
        # datetimes w/tz are preserved
        # column names are lost
        expected = df.copy()
        expected["f"] = expected["f"].astype(
            object if not using_infer_string else "string[pyarrow_numpy]"
        )
        expected.columns.name = None
>       tm.assert_frame_equal(result.to_dataframe(), expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 5] (column name="f") are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False, categories_dtype=object)
E       [right]: object

tests/generic/test_to_xarray.py:58: AssertionError

shoyer · 2024-10-22T18:02:03Z

cc @ilan-gold

ilan-gold · 2024-10-23T09:49:21Z

On it! More generally, @shoyer, with this extension array stuff I would be happy to have a Zoom call to go over what all the various pandas adapters in the codebase do (I think they can be cut down somewhat, since a lot of the code has to do with numpy conversion) and/or to sound out running the pandas integration tests in this repo. We are doing that now: https://github.com/scverse/integration-testing/pull/1/files, where we check out everyone's repo and then test it against the core data structure on main of both. We could do something similar here if you wanted! That would minimize friction, I think (i.e., no need to migrate tests).

ilan-gold · 2024-10-23T13:40:07Z

@shoyer This issue is too tied up with datetimes, see: #9618. I will need to redo what I've done to work off that branch now. The issue is that pandas>2.0 has their datetime handling as extension arrays - so if we start letting in categorical indices in our indexing adapter, we let everything in, which means we break almost all converting of the datetime stuff.
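To illustrate the point, here is a minimal pandas-only sketch (not xarray's adapter code): a categorical column and a plain tz-naive datetime column are both backed by `ExtensionArray` subclasses, so an `isinstance` check on its own cannot admit one and exclude the other. The dtype-based filter at the end is only an assumption about how a narrower check might look.

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray

cat = pd.Series(pd.Categorical(list("abcd")))
dates = pd.Series(pd.date_range("2022-08-15", periods=4, freq="7D"))

# `.array` exposes the pandas-level backing array of each Series.
is_ext = {
    "categorical": isinstance(cat.array, ExtensionArray),
    "datetime": isinstance(dates.array, ExtensionArray),
}
print(is_ext)

# A narrower (hypothetical) filter would have to inspect the dtype instead.
is_categorical = {
    "categorical": isinstance(cat.dtype, pd.CategoricalDtype),
    "datetime": isinstance(dates.dtype, pd.CategoricalDtype),
}
print(is_categorical)
```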

dcherian · 2024-10-23T16:38:08Z

Can we explicitly cast DatetimeArray to datetime64[ns] for now? This won't always work, but we can just error out in that case.
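A rough sketch of what such a cast could look like, using only pandas and NumPy. The helper name `cast_datetime_array` is hypothetical and not part of xarray; it assumes that tz-aware data is the case that should error out, since it has no lossless datetime64 representation.

```python
import numpy as np
import pandas as pd


def cast_datetime_array(arr):
    """Cast a pandas DatetimeArray to a datetime64[ns] NumPy array.

    Hypothetical helper: raises for tz-aware data instead of silently
    dropping the time zone.
    """
    if isinstance(arr.dtype, pd.DatetimeTZDtype):
        raise TypeError(f"cannot cast tz-aware dtype {arr.dtype} to datetime64[ns]")
    # np.asarray yields the underlying datetime64 data; astype fixes the unit.
    return np.asarray(arr).astype("datetime64[ns]", copy=False)


naive = pd.date_range("2022-08-15", periods=3, freq="7D").array
naive_result = cast_datetime_array(naive)
print(naive_result.dtype)

aware = pd.date_range("2022-08-15", periods=3, freq="7D", tz="UTC").array
try:
    cast_datetime_array(aware)
    aware_failed = False
except TypeError as err:
    aware_failed = True
    print(err)
```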

nataziel · 2024-10-25T04:52:43Z

This is definitely causing problems on v2024.10.0: I'm now getting an error when going from DataFrame -> Dataset -(error here)> DataArray. I'm starting with a DataFrame with a datetime index and 20ish columns. Relevant parts of the error trace:

/usr/local/lib/python3.10/dist-packages/xarray/core/dataset.py:7274: in to_dataarray
    data = duck_array_ops.stack([b.data for b in broadcast_vars], axis=0)
/usr/local/lib/python3.10/dist-packages/xarray/core/duck_array_ops.py:384: in stack
    return xp.stack(as_shared_dtype(arrays, xp=xp), axis=axis)
/usr/local/lib/python3.10/dist-packages/xarray/core/extension_array.py:100: in __array_function__
    res = HANDLED_EXTENSION_ARRAY_FUNCTIONS[func](*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = [<FloatingArray>
[-0.7418799885470463, -0.8171209666969853, -0.8805057639294221]
Length: 3, dtype: Float64, <FloatingA...: Float64, <FloatingArray>
[0.8056743037815669, 0.9631290662236986, 0.9960250670661343]
Length: 3, dtype: Float64, ...]
axis = 0

    @implements(np.stack)
    def __extension_duck_array__stack(arr: T_ExtensionArray, axis: int):
>       raise NotImplementedError("Cannot stack 1d-only pandas categorical array.")
E       NotImplementedError: Cannot stack 1d-only pandas categorical array.

/usr/local/lib/python3.10/dist-packages/xarray/core/extension_array.py:41: NotImplementedError

ilan-gold · 2024-10-25T05:36:58Z

@nataziel Can you send the dataframe you were using? Or ideally a minimal reproducer?

@shoyer I was not aware that xarray "stacked" pandas 1D objects - we can just make it do this with .to_numpy or similar and hope for the best.

nataziel · 2024-10-25T12:08:31Z

df = pd.DataFrame(
    {
        "sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221],
        "date": [
            Timestamp("2022-08-15 00:00:00"),
            Timestamp("2022-08-22 00:00:00"),
            Timestamp("2022-08-29 00:00:00"),
        ],
    },
)
df = df.astype("Float64", errors="ignore")

mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()

the above works in 2024.9.0 but not 2024.10.0

nataziel · 2024-12-06T01:20:07Z

I was just testing the above with xarray==2024.11.0 and realised it's not as easily reproducible as it could be so try this:

import pandas as pd
import xarray as xr
from pandas import Timestamp

def main():
    df = pd.DataFrame(
        {
            "sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221],
            "date": [
                Timestamp("2022-08-15 00:00:00"),
                Timestamp("2022-08-22 00:00:00"),
                Timestamp("2022-08-29 00:00:00"),
            ],
        },
    )
    df = df.astype("Float64", errors="ignore")

    mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()


if __name__ == "__main__":
    main()

still failing in 2024.11.0 with this error

Traceback (most recent call last):
  File "C:\Users\me\git\xarray_test\main.py", line 22, in <module>
    main()
    ~~~~^^
  File "C:\Users\me\git\xarray_test\main.py", line 18, in main
    mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\dataset.py", line 7374, in to_array
    return self.to_dataarray(dim=dim, name=name)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\dataset.py", line 7357, in to_dataarray
    data = duck_array_ops.stack([b.data for b in broadcast_vars], axis=0)
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\duck_array_ops.py", line 397, in stack
    return xp.stack(as_shared_dtype(arrays, xp=xp), axis=axis)
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\extension_array.py", line 100, in __array_function__
    res = HANDLED_EXTENSION_ARRAY_FUNCTIONS[func](*args, **kwargs)
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\extension_array.py", line 41, in __extension_duck_array__stack
    raise NotImplementedError("Cannot stack 1d-only pandas categorical array.")
NotImplementedError: Cannot stack 1d-only pandas categorical array.

ilan-gold · 2025-02-04T10:36:48Z

@nataziel That makes sense to me - you're trying to make a numpy array out of heterogeneous extension array types, no? You tell the dataframe "give me Float64", which converts sin_order_1_year to a pandas extension array instead of just a numpy array (see https://pandas.pydata.org/docs/dev/reference/api/pandas.Float64Dtype.html) - you can see this by calling df["sin_order_1_year"].values before and after your astype call.
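The check described above can be spelled out as follows (pandas only; the single-column frame is a cut-down version of the reproducer, so the datetime column is left out):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221]}
)

before = df["sin_order_1_year"].values
after = df.astype("Float64")["sin_order_1_year"].values

print(type(before).__name__, before.dtype)  # ndarray float64
print(type(after).__name__, after.dtype)    # FloatingArray Float64
```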

I would convert those two "columns" (or the equivalent on the xarray object) to something that is internally-coherent and numpy-compatible and then call to_array (so not calling astype). The upshot of this whole extension array stuff is that your xarray object keeps its types (hooray). But the downside is that you now have to manage this stuff explicitly (sad) instead of xarray blackmagic. But that means things are easier for maintainers to reason about (hooray).
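One possible shape for that manual conversion, as a sketch rather than a recommended API: the helper name `masked_floats_to_numpy` is made up for illustration, and it assumes that turning `pd.NA` into `np.nan` is acceptable for the data at hand.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_float_dtype


def masked_floats_to_numpy(df):
    """Return a copy of df whose masked float columns are plain float64.

    Hypothetical helper: pd.NA becomes np.nan, other columns are untouched.
    """
    out = df.copy()
    for name, dtype in df.dtypes.items():
        if is_extension_array_dtype(dtype) and is_float_dtype(dtype):
            out[name] = df[name].to_numpy(dtype="float64", na_value=np.nan)
    return out


df = pd.DataFrame(
    {
        "sin_order_1_year": pd.array(
            [-0.7418799885470463, None, -0.8805057639294221], dtype="Float64"
        ),
        "date": pd.to_datetime(["2022-08-15", "2022-08-22", "2022-08-29"]),
    }
)
clean = masked_floats_to_numpy(df)
print(clean.dtypes)
```

The cleaned frame can then be handed to xr.Dataset.from_dataframe(clean.set_index("date")), which should take the plain NumPy path for the float column.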

nataziel · 2025-02-05T02:07:20Z

@ilan-gold it's a contrived example for reproducibility, but the DataFrame is similar to something that was naturally produced in my codebase. I use the np.sin and np.cos functions to generate some arrays, which are of type Float64, then throw them into DataFrames. In a weird coincidence I was working on it yesterday before seeing your reply and fixed it on my end pretty much exactly as you described by manually wrangling the types. I see now and agree that it's not due to the datetime index but the dtype of the columns. That said, the reproducible error still exists.

If I have a Dataset, full of DataArrays with exclusively Float64 dtypes, I'd expect to be able to convert that to an array with .to_dataarray() or .to_array(). The example code works in xarray == "2024.9.0", but doesn't in versions up to 2025.01.2. I acknowledge that this just might not be a supported use case and there are workarounds, but at the very least it's a change in behaviour.

Complete speculation without looking at the xarray internals, but maybe you could check if the xarray dtypes to be stacked are homogeneous and allow stacking in that case? The error message stating that it's a categorical array is also a bit strange.
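A homogeneity check of this kind could be as small as the sketch below. `shared_dtype` is a hypothetical name, not an xarray function; it takes any name-to-dtype mapping, such as `df.dtypes.to_dict()` or the mapping that `Dataset.dtypes` returns.

```python
import pandas as pd


def shared_dtype(dtypes):
    """Return the dtype shared by every entry of a mapping, else None."""
    values = list(dtypes.values())
    if not values:
        return None
    first = values[0]
    return first if all(d == first for d in values[1:]) else None


same = pd.DataFrame(
    {
        "val1": pd.array([1.0, 2.0], dtype="Float64"),
        "val2": pd.array([3.0, 4.0], dtype="Float64"),
    }
)
mixed = pd.DataFrame(
    {
        "val1": pd.array([1.0, 2.0], dtype="Float64"),
        "val2": pd.array([3, 4], dtype="Int64"),
    }
)

homogeneous = shared_dtype(same.dtypes.to_dict())
heterogeneous = shared_dtype(mixed.dtypes.to_dict())
print(homogeneous, heterogeneous)
```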

Tests failing with newer xarray and/or pyreadstat Author: Richard Shadrach, Rebecca N. Palmer <rebecca_palmer@zoho.com> Origin: partly pandas-dev/pandas#60109 Bug: partly pydata/xarray#9661 Bug-Debian: https://bugs.debian.org/1088988 Forwarded: no Gbp-Pq: Name 1088988_xarray_pyreadstat_compat.patch

ilan-gold · 2025-02-05T09:32:27Z

@nataziel To me it seems the previous behavior was probably not optimal (or even really sensible - I would not think a column name should become a coordinate).

<xarray.DataArray (variable: 1, date: 3)>
array([[-0.74187999, -0.81712097, -0.88050576]])
Coordinates:
  * date      (date) datetime64[ns] 2022-08-15 2022-08-22 2022-08-29
  * variable  (variable) object 'sin_order_1_year'

It's tough to say why this worked but it definitely feels wrong

nataziel · 2025-02-05T23:09:22Z

The column name becoming the coordinate is the expected result no? In the 'fixed' version I'm using now with float64 arrays instead of Float64 arrays it works exactly that way. Dirty reproduction with xarray == '2025.01.2':

>>> import pandas as pd
>>> import xarray as xr
>>> import numpy as np
>>> arrays = np.sin([5, 3]*2)
>>> arrays
array([-0.95892427,  0.14112001, -0.95892427,  0.14112001])
>>> array2 = np.sin([9,5]*2)
>>> array2
array([ 0.41211849, -0.95892427,  0.41211849, -0.95892427])
>>> df = pd.DataFrame({'col1': arrays, 'col2': array2})
>>> df
       col1      col2
0 -0.958924  0.412118
1  0.141120 -0.958924
2 -0.958924  0.412118
3  0.141120 -0.958924
>>> df.dtypes
col1    float64
col2    float64
dtype: object
>>> dset = xr.Dataset.from_dataframe(df)
>>> dset
<xarray.Dataset> Size: 96B
Dimensions:  (index: 4)
Coordinates:
  * index    (index) int64 32B 0 1 2 3
Data variables:
    col1     (index) float64 32B -0.9589 0.1411 -0.9589 0.1411
    col2     (index) float64 32B 0.4121 -0.9589 0.4121 -0.9589
>>> dset.dtypes
Frozen({'col1': dtype('float64'), 'col2': dtype('float64')})
>>> darray = dset.to_array()
>>> darray
<xarray.DataArray (variable: 2, index: 4)> Size: 64B
array([[-0.95892427,  0.14112001, -0.95892427,  0.14112001],
       [ 0.41211849, -0.95892427,  0.41211849, -0.95892427]])
Coordinates:
  * index     (index) int64 32B 0 1 2 3
  * variable  (variable) object 16B 'col1' 'col2'
>>> darray.dtype
dtype('float64')

ilan-gold · 2025-02-07T09:51:27Z

Ok @nataziel I see what that function does now. I leave it up to the maintainers (@shoyer etc.) to decide what to do: whether extension arrays should be cast to numpy first in the case of a to_array call, or whether they should error out as they do now, pushing that decision to the user.

The fact that this used to work was a symptom of a bug I also fixed: e649e13 in which we were not preserving correct pandas dtypes. But your issue is a different API - should to_array coerce pandas dtypes or not? That is not really up to me IMO. If anything, I would lean towards "no" given the fact that to_array is deprecated and it is now the more explicit to_dataarray in which case this function (despite its previous name) has no mention of numpy casting.

shoyer · 2025-02-07T18:10:37Z

I agree that xarray should probably be converting pandas's Float64 into np.float64. Xarray uses nan for missing values in array operations, so mixing that with NA from pandas is likely to lead to confusion.
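A small pandas/NumPy sketch of the kind of mismatch meant here (illustrative only, not taken from xarray): the same comparison and the same reduction give different answers depending on whether the missing value is `pd.NA` or `nan`.

```python
import numpy as np
import pandas as pd

masked = pd.array([1.0, None], dtype="Float64")  # missing value stored as pd.NA
plain = np.array([1.0, np.nan])                  # missing value stored as nan

masked_gt = masked > 0  # NA propagates: [True, <NA>]
plain_gt = plain > 0    # nan compares False: [True, False]

masked_sum = masked.sum()  # skips NA by default
plain_sum = plain.sum()    # nan propagates

print(masked_gt, plain_gt)
print(masked_sum, plain_sum)
```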

It's less clear to me if we should automatically convert Int64. Potentially we could check if there are any missing values before deciding the NumPy dtype, but that is a somewhat expensive operation.
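The check-then-convert idea might look roughly like this. Both function names are hypothetical, and falling back to float64 with `nan` when values are missing is an assumption about the desired rule, not something decided in this thread.

```python
import numpy as np
import pandas as pd


def numpy_dtype_for(arr):
    """Pick a NumPy dtype for a pandas Int64 array (hypothetical rule)."""
    # isna().any() has to scan the whole mask: the "somewhat expensive" step.
    return np.dtype("float64") if arr.isna().any() else np.dtype("int64")


def int64_to_numpy(arr):
    """Convert to int64 when complete, otherwise to float64 with nan."""
    dtype = numpy_dtype_for(arr)
    if dtype.kind == "f":
        return arr.to_numpy(dtype=dtype, na_value=np.nan)
    return arr.to_numpy(dtype=dtype)


complete = int64_to_numpy(pd.array([1, 2, 3], dtype="Int64"))
with_gap = int64_to_numpy(pd.array([1, None, 3], dtype="Int64"))
print(complete.dtype, with_gap.dtype)
```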

nataziel · 2025-02-09T13:00:57Z

I think we might be talking about different things here and I'm describing a different problem. To be 100% clear, the problem I'm describing is that I have a Dataset with homogeneous PandasExtensionArray types across the contained DataArrays, and since xarray==2024.10.0 turning that Dataset into a DataArray with Dataset.to_dataarray() fails.

import pandas as pd
import numpy as np
import xarray as xr


def main():
    df1 = pd.DataFrame(
        {
            "val1": [1, 2, 3],
            "val2": [4, 5, 6],
        }
    )

    print(df1)
    print(df1.dtypes)

    dset1 = xr.Dataset.from_dataframe(df1)

    print(dset1)
    print(dset1.dtypes)  # int64

    darray1 = dset1.to_dataarray()  # works fine

    df2 = df1.astype("Int64")

    print(df2.dtypes)

    dset2 = xr.Dataset.from_dataframe(df2)

    print(dset2)  # Pandas "Int64" extension arrays
    print(dset2.dtypes)

    darray2 = dset2.to_dataarray()  # error here


if __name__ == "__main__":
    main()

How I'd like this to work is that if the Dataset contains DataArrays of homogeneous types, then no type conversion occurs and they just get stacked together. This worked before 2024.10.0.

ilan-gold · 2025-02-11T12:44:49Z

@shoyer Why the difference in approach between the two? It seems like both could have NA in pandas, no?

Maybe to_dataarray could take a keyword arg like homogenize where the promise is that we will just put everything together (maybe subject to numpy type promotion rules), and this arg is default False. I think being super clear about data types and behavior is good.
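To make the proposal concrete, here is a sketch of the opt-in behaviour on plain arrays. `stack_variables` and its `homogenize` flag are hypothetical stand-ins, not an existing xarray API, and converting every extension array to float64 is a simplifying assumption that only suits numeric data.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


def stack_variables(arrays, homogenize=False):
    """Stack equally long 1-D arrays along a new leading axis.

    Hypothetical stand-in for the proposed keyword: extension arrays are
    rejected unless the caller opts in to conversion.
    """
    has_extension = any(isinstance(a, ExtensionArray) for a in arrays)
    if has_extension and not homogenize:
        raise TypeError("extension arrays present; pass homogenize=True to convert")
    converted = [
        a.to_numpy(dtype="float64", na_value=np.nan)
        if isinstance(a, ExtensionArray)
        else np.asarray(a)
        for a in arrays
    ]
    # np.stack applies NumPy type promotion across the converted inputs.
    return np.stack(converted, axis=0)


val1 = pd.array([1, 2, 3], dtype="Int64")
val2 = pd.array([4, 5, 6], dtype="Int64")

stacked = stack_variables([val1, val2], homogenize=True)
print(stacked.shape, stacked.dtype)

try:
    stack_variables([val1, val2])
    default_raised = False
except TypeError:
    default_raised = True
```

Keeping the default at False matches the "be super clear about data types" argument: conversion only happens when the caller asks for it.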

If there's a precedent i.e., some other function for "xarray will take your data and convert to numpy without any indication", then maybe doing this by default is ok.

shoyer added the bug label Oct 22, 2024

ilan-gold mentioned this issue Oct 23, 2024: (fix): allow all extension array data types in pandas adapters kmuehlbauer/xarray#1 (Open, 4 tasks)

ilan-gold mentioned this issue Oct 24, 2024: (fix): extension array indexers #9671 (Open, 4 tasks)

rhshadrach mentioned this issue Oct 26, 2024: CI/TST: Update pyreadstat tests and pin xarray on CI pandas-dev/pandas#60109 (Merged, 5 tasks)
