Description
Code example
from datetime import datetime
import pyarrow as pa
string_arr = pa.array(["a", "bc", None])
# Use a distinct name so the imported `datetime` class is not shadowed.
datetime_arr = pa.array([datetime(2022, 1, 1), datetime(2022, 1, 2), datetime(2022, 1, 3)])
categorical_arr = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0]), pa.array(["a", "b"]))
df = pa.Table.from_arrays(
    [string_arr, datetime_arr, categorical_arr],
    names=["string", "datetime", "categorical"],
)
dfi = df.__dataframe__()
col = dfi.get_column_by_name("string")
print(col.dtype)
# (<DtypeKind.STRING: 21>, 8, 'u', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.STRING: 21>, 8, 'u', '=') -> SHOULD BE: (<DtypeKind.UINT: 1>, 8, 'C', '=')
col = dfi.get_column_by_name("datetime")
print(col.dtype)
# (<DtypeKind.DATETIME: 22>, 64, 'tsu:', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.DATETIME: 22>, 64, 'tsu:', '=') -> SHOULD BE: (<DtypeKind.INT: 0>, 64, 'l', '=')
col = dfi.get_column_by_name("categorical")
print(col.dtype)
# (<DtypeKind.CATEGORICAL: 23>, 64, 'L', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.INT: 0>, 64, 'l', '=') -> CORRECT!
Issue description
As you can see, for the string and datetime columns the dtype of the data buffer is the same as the dtype of the column. This is only correct for integers and floats. Categoricals, strings, and datetime types have some integer type as their physical representation. The data buffer should have this physical data type associated with it.
The dtype of the Column object should provide information on how to interpret the various buffers. The dtype associated with each buffer should be the dtype of the actual data in that buffer. This is the second part of the issue: the implementation of from_dataframe is incorrect; it should use the column dtype rather than the data buffer dtype.
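To make the expected behaviour concrete, here is a rough sketch of the mapping from a column dtype to the dtype its data buffer should report. It uses only the standard library; `expected_data_buffer_dtype` is a hypothetical helper of mine, not part of pyarrow or the protocol, and `DtypeKind` is a local copy of the protocol's enum values. Categoricals are left out because their data buffer dtype is already correct.

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    # Local copy of the interchange protocol's kind codes (for illustration only).
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


def expected_data_buffer_dtype(column_dtype):
    """Hypothetical helper: the physical dtype a column's data buffer should report."""
    kind, bit_width, fmt, endianness = column_dtype
    if kind == DtypeKind.STRING:
        # String data is a buffer of UTF-8 code units, i.e. unsigned 8-bit integers.
        return (DtypeKind.UINT, 8, "C", endianness)
    if kind == DtypeKind.DATETIME:
        # Temporal values are stored as signed integers of the same width.
        int_format = "l" if bit_width == 64 else "i"
        return (DtypeKind.INT, bit_width, int_format, endianness)
    # Assumption: other kinds pass through unchanged in this sketch.
    return column_dtype


print(expected_data_buffer_dtype((DtypeKind.STRING, 8, "u", "=")))
print(expected_data_buffer_dtype((DtypeKind.DATETIME, 64, "tsu:", "=")))
```

The two calls correspond to the "SHOULD BE" comments in the code example above.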
Fix
Fixing the get_buffers implementation should be relatively simple. However, this will break any from_dataframe implementation (including those in other libraries) that relies on the data buffer having the column dtype.
So fixing this should ideally go in three steps:
- Fix the from_dataframe implementation to use the column dtype rather than the data buffer dtype to interpret the buffers.
- Make sure other libraries have also updated their from_dataframe implementation. See "BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect" (pandas-dev/pandas#54781) for the pandas issue.
- Fix the data buffer dtypes.
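For the first step, a consumer could look roughly like the sketch below: it decodes a raw timestamp buffer using only the column dtype, so it keeps working whether the data buffer reports the column dtype or the physical int64 dtype. This is standard-library Python, not the actual pyarrow or pandas code; `decode_timestamp_buffer` and the unit table are hypothetical names, and only the `ts<unit>:<tz>` format strings for seconds, milliseconds, and microseconds are handled.

```python
import struct
from datetime import datetime, timedelta

# Microseconds per unit for the timestamp format strings this sketch supports.
_MICROS_PER_UNIT = {"s": 1_000_000, "m": 1_000, "u": 1}
_EPOCH = datetime(1970, 1, 1)


def decode_timestamp_buffer(raw, column_dtype):
    """Hypothetical consumer: interpret raw bytes via the *column* dtype."""
    _kind, bit_width, fmt, endianness = column_dtype
    if not fmt.startswith("ts") or bit_width != 64:
        raise ValueError(f"not a 64-bit timestamp format: {fmt!r}")
    unit = fmt[2]  # "tsu:" -> "u" (microseconds)
    if unit not in _MICROS_PER_UNIT:
        raise ValueError(f"unit {unit!r} is not handled in this sketch")
    count = len(raw) // 8
    # The physical storage is int64 regardless of what the buffer dtype claims.
    values = struct.unpack(f"{endianness}{count}q", raw)
    scale = _MICROS_PER_UNIT[unit]
    return [_EPOCH + timedelta(microseconds=v * scale) for v in values]


# Build a fake data buffer holding two microsecond timestamps.
stamps = [datetime(2022, 1, 1), datetime(2022, 1, 2)]
micros = [(s - _EPOCH) // timedelta(microseconds=1) for s in stamps]
raw = struct.pack("=2q", *micros)
decoded = decode_timestamp_buffer(raw, (22, 64, "tsu:", "="))
print(decoded == stamps)
```

Because nothing here reads the data buffer's own dtype, a consumer written this way would not break when the third step lands.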
Tagging @AlenkaF as I know you've been working on the protocol for pyarrow.
Component(s)
Python