Failure in pandas TestDataFrameToXArray.test_to_xarray_index_types #9661
Comments
Here's the error message from pandas's
|
cc @ilan-gold |
On it! More generally, @shoyer, with this extension array stuff, I would be happy to have a Zoom call to go over what all the various pandas adapters in the codebase do (I think they can be cut down somewhat, since a lot of the code has to do with numpy conversion) and/or to sound out running the pandas integration tests in this repo. We are doing that now: https://github.com/scverse/integration-testing/pull/1/files where we check out everyone's repo and then test it against the core data structure on |
@shoyer This issue is too tied up with datetimes, see: #9618. I will need to redo what I've done to work off that branch now. The issue is that pandas>2.0 has their datetime handling as extension arrays - so if we start letting in categorical indices in our indexing adapter, we let everything in, which means we break almost all converting of the datetime stuff. |
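A minimal sketch of why a blanket "is it an extension array?" check cannot separate categoricals from datetimes. It only inspects public pandas attributes; the actual check used in xarray's indexing adapter is not shown in this thread, so this is an illustration rather than xarray's logic.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

datetime_index = pd.DatetimeIndex(["2022-08-15", "2022-08-22", "2022-08-29"])
categorical_index = pd.CategoricalIndex(["a", "b", "a"])

# Name of the array class that backs each index.
backing = {
    "datetime": type(datetime_index.array).__name__,
    "categorical": type(categorical_index.array).__name__,
}

# Both backing arrays are ExtensionArray subclasses, so an isinstance
# check on the array admits datetimes along with categoricals.
both_extension_backed = all(
    isinstance(idx.array, ExtensionArray)
    for idx in (datetime_index, categorical_index)
)

# The dtypes differ: a tz-naive datetime index has a plain NumPy dtype,
# while the categorical index has a pandas extension dtype.
datetime_has_numpy_dtype = isinstance(datetime_index.dtype, np.dtype)
categorical_has_extension_dtype = isinstance(
    categorical_index.dtype, pd.CategoricalDtype
)

print(backing, both_extension_backed)
print(datetime_has_numpy_dtype, categorical_has_extension_dtype)
```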
Can we explicitly cast |
This is definitely causing problems on v2024.10.0, I'm now getting an error when going from DataFrame -> Dataset -(error here)> DataArray. I'm starting with a DataFrame with a DatetimeIndex and 20ish columns. Relevant parts of the error trace:
|
df = pd.DataFrame(
    {
        "sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221],
        "date": [
            Timestamp("2022-08-15 00:00:00"),
            Timestamp("2022-08-22 00:00:00"),
            Timestamp("2022-08-29 00:00:00"),
        ],
    },
)
df = df.astype("Float64", errors="ignore")
mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()

the above works in |
I was just testing the above with xarray==2024.11.0 and realised it's not as easily reproducible as it could be, so try this:

import pandas as pd
import xarray as xr
from pandas import Timestamp


def main():
    df = pd.DataFrame(
        {
            "sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221],
            "date": [
                Timestamp("2022-08-15 00:00:00"),
                Timestamp("2022-08-22 00:00:00"),
                Timestamp("2022-08-29 00:00:00"),
            ],
        },
    )
    df = df.astype("Float64", errors="ignore")
    mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()


if __name__ == "__main__":
    main()

still failing in
|
@nataziel That makes sense to me - you're trying to make a numpy array out of heterogeneous extension array types, no? You tell the dataframe "give me Float64" which converts the I would convert those two "columns" (or the equivalent on the xarray object) to something that is internally-coherent and numpy-compatible and then call |
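The suggestion above can be sketched roughly as follows. `to_numpy_backed` is a hypothetical helper, not part of pandas or xarray, and it assumes every extension-typed column is numeric (for example `Float64` or `Int64`), so that missing values can be mapped to NaN. The index is set before conversion so only value columns are touched.

```python
import numpy as np
import pandas as pd
import xarray as xr


def to_numpy_backed(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: swap nullable numeric columns for float64 NumPy columns."""
    out = frame.copy()
    for name in out.columns:
        if pd.api.types.is_extension_array_dtype(out[name].dtype):
            # pd.NA has no NumPy equivalent, so map it to NaN explicitly.
            out[name] = out[name].to_numpy(dtype="float64", na_value=np.nan)
    return out


index = pd.date_range("2022-08-15", periods=3, freq="7D", name="date")
df = pd.DataFrame(
    {"val1": [1.0, 2.0, None], "val2": [4.0, 5.0, 6.0]}, index=index
).astype("Float64")

# Every variable is now float64, so stacking them into one array is unambiguous.
darray = xr.Dataset.from_dataframe(to_numpy_backed(df)).to_dataarray()
print(darray.dims, darray.dtype)
```

The cost of this approach is that the nullable dtypes are lost: integers become floats and `pd.NA` becomes NaN.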
@ilan-gold it's a contrived example for reproducibility, but the DataFrame is similar to something that was naturally produced in my codebase. I use the If I have a Complete speculation without looking at the xarray internals but maybe you could check if the xarray dtypes to be stacked are homogenous and allow stacking in that case? The error message stating that it's a |
Tests failing with newer xarray and/or pyreadstat
Author: Richard Shadrach, Rebecca N. Palmer <rebecca_palmer@zoho.com>
Origin: partly pandas-dev/pandas#60109
Bug: partly pydata/xarray#9661
Bug-Debian: https://bugs.debian.org/1088988
Forwarded: no
Gbp-Pq: Name 1088988_xarray_pyreadstat_compat.patch
@nataziel To me it seems the previous behavior was probably not optimal (or even really sensible - I would not think a column name should become a coordinate).

<xarray.DataArray (variable: 1, date: 3)>
array([[-0.74187999, -0.81712097, -0.88050576]])
Coordinates:
  * date      (date) datetime64[ns] 2022-08-15 2022-08-22 2022-08-29
  * variable  (variable) object 'sin_order_1_year'

It's tough to say why this worked but it definitely feels wrong |
The column name becoming the coordinate is the expected result no? In the 'fixed' version I'm using now with

>>> import pandas as pd
>>> import xarray as xr
>>> import numpy as np
>>> arrays = np.sin([5, 3]*2)
>>> arrays
array([-0.95892427,  0.14112001, -0.95892427,  0.14112001])
>>> array2 = np.sin([9,5]*2)
>>> array2
array([ 0.41211849, -0.95892427,  0.41211849, -0.95892427])
>>> df = pd.DataFrame({'col1': arrays, 'col2': array2})
>>> df
       col1      col2
0 -0.958924  0.412118
1  0.141120 -0.958924
2 -0.958924  0.412118
3  0.141120 -0.958924
>>> df.dtypes
col1    float64
col2    float64
dtype: object
>>> dset = xr.Dataset.from_dataframe(df)
>>> dset
<xarray.Dataset> Size: 96B
Dimensions:  (index: 4)
Coordinates:
  * index    (index) int64 32B 0 1 2 3
Data variables:
    col1     (index) float64 32B -0.9589 0.1411 -0.9589 0.1411
    col2     (index) float64 32B 0.4121 -0.9589 0.4121 -0.9589
>>> dset.dtypes
Frozen({'col1': dtype('float64'), 'col2': dtype('float64')})
>>> darray = dset.to_array()
>>> darray
<xarray.DataArray (variable: 2, index: 4)> Size: 64B
array([[-0.95892427,  0.14112001, -0.95892427,  0.14112001],
       [ 0.41211849, -0.95892427,  0.41211849, -0.95892427]])
Coordinates:
  * index     (index) int64 32B 0 1 2 3
  * variable  (variable) object 16B 'col1' 'col2'
>>> darray.dtype
dtype('float64')
|
Ok @nataziel I see what that function does now. I leave it up to the maintainers @shoyer etc. to decide what to do, if extension arrays should be cast to numpy first in the case of a The fact that this used to work was a symptom of a bug I also fixed: e649e13 in which we were not preserving correct pandas dtypes. But your issue is a different API - should |
I agree that xarray should probably be converting pandas's It's less clear to me if we should automatically convert |
I think we might be talking about different things here and I'm describing a different problem. To be 100% clear, the problem I'm describing is that I have a Dataset with homogenous PandasExtensionArray types across the contained DataArrays and since

import pandas as pd
import numpy as np
import xarray as xr


def main():
    df1 = pd.DataFrame(
        {
            "val1": [1, 2, 3],
            "val2": [4, 5, 6],
        }
    )
    print(df1)
    print(df1.dtypes)
    dset1 = xr.Dataset.from_dataframe(df1)
    print(dset1)
    print(dset1.dtypes)  # int64
    darray1 = dset1.to_dataarray()  # works fine
    df2 = df1.astype("Int64")
    print(df2.dtypes)
    dset2 = xr.Dataset.from_dataframe(df2)
    print(dset2)  # Pandas "Int64" extension arrays
    print(dset2.dtypes)
    darray2 = dset2.to_dataarray()  # error here


if __name__ == "__main__":
    main()

How I'd like this to work is that if the Dataset contains homogenous types of DataArrays then no type conversion occurs and they just get stacked together. This worked before |
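A rough sketch of the homogeneity check proposed here. `has_homogeneous_dtypes` is a hypothetical helper, not an xarray function; it takes any name-to-dtype mapping, which is assumed to cover both `DataFrame.dtypes.to_dict()` and `Dataset.dtypes`. Whether xarray should skip conversion when the check passes is the open question in this thread.

```python
import pandas as pd


def has_homogeneous_dtypes(dtypes) -> bool:
    """Hypothetical helper: True when all entries of a name -> dtype mapping match."""
    return len(set(dtypes.values())) <= 1


df_int = pd.DataFrame({"val1": [1, 2, 3], "val2": [4, 5, 6]}).astype("Int64")
df_mixed = df_int.astype({"val2": "Float64"})

# All columns share the nullable Int64 dtype.
homogeneous = has_homogeneous_dtypes(df_int.dtypes.to_dict())
# One Int64 column and one Float64 column.
mixed = has_homogeneous_dtypes(df_mixed.dtypes.to_dict())

print(homogeneous, mixed)
```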
@shoyer Why the difference in approach between the two? It seems like both could have NA in pandas, no? Maybe If there's a precedent i.e., some other function for " |
It appears that #9520 may have broken some upstream pandas tests, specifically testing round-trips with various index types:
https://github.com/pandas-dev/pandas/blob/e78ebd3f845c086af1d71c0604701ec49df97228/pandas/tests/generic/test_to_xarray.py#L32
Here's a minimal test case:
I'm not sure if this is a pandas or xarray issue, but it's one or the other!
(My guess is that most of these tests in pandas should probably live in xarray instead, given that we implement all the conversion logic.)
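For readers who want to probe the round trip themselves, here is a small sketch. It is not the test case referred to above and not the pandas test itself; `roundtrip_preserves_index` and the index name `foo` are hypothetical, and the result for each index type depends on the installed pandas and xarray versions.

```python
import pandas as pd


def roundtrip_preserves_index(index: pd.Index) -> bool:
    """Hypothetical check: DataFrame -> xarray -> DataFrame keeps the index values."""
    df = pd.DataFrame({"x": range(len(index))}, index=index)
    try:
        result = df.to_xarray().to_dataframe()
    except Exception:
        # Treat any conversion error as a failed round trip.
        return False
    return bool(result.index.equals(df.index))


candidates = {
    "int": pd.Index([0, 1, 2], name="foo"),
    "datetime": pd.date_range("2022-08-15", periods=3, freq="7D", name="foo"),
    "categorical": pd.CategoricalIndex(["a", "b", "c"], name="foo"),
}

# Maps each label to True or False for the installed library versions.
report = {label: roundtrip_preserves_index(idx) for label, idx in candidates.items()}
print(report)
```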
Originally posted by @shoyer in #9520 (comment)