[Python] Interchange object data buffer has the wrong dtype / from_dataframe incorrect #37598

@stinodego

Describe the bug, including details regarding any error messages, version, and platform.

Code example

from datetime import datetime
import pyarrow as pa

# One column per dtype whose physical storage differs from its logical dtype.
# The variables are suffixed with _arr so they do not shadow the datetime import.
string_arr = pa.array(["a", "bc", None])
datetime_arr = pa.array([datetime(2022, 1, 1), datetime(2022, 1, 2), datetime(2022, 1, 3)])
categorical_arr = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0]), pa.array(["a", "b"]))

df = pa.Table.from_arrays(
    [string_arr, datetime_arr, categorical_arr],
    names=["string", "datetime", "categorical"],
)

dfi = df.__dataframe__()

col = dfi.get_column_by_name("string")
print(col.dtype)
# (<DtypeKind.STRING: 21>, 8, 'u', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.STRING: 21>, 8, 'u', '=') -> SHOULD BE: (<DtypeKind.UINT: 1>, 8, 'C', '=')

col = dfi.get_column_by_name("datetime")
print(col.dtype)
# (<DtypeKind.DATETIME: 22>, 64, 'tsu:', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.DATETIME: 22>, 64, 'tsu:', '=') -> SHOULD BE: (<DtypeKind.INT: 0>, 64, 'l', '=')

col = dfi.get_column_by_name("categorical")
print(col.dtype)
# (<DtypeKind.CATEGORICAL: 23>, 64, 'L', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.INT: 0>, 64, 'l', '=') -> CORRECT!

Issue description

As you can see, the dtype of the data buffer is the same as the dtype of the column for strings and datetimes. That is only correct for integers and floats, where the logical and physical types coincide. Categoricals, strings, and datetime types have an integer as their physical representation, and the data buffer should report that physical data type.

The dtype of the Column object should provide information on how to interpret the various buffers. The dtype associated with each buffer should be the dtype of the actual data in that buffer. This is the second part of the issue: the implementation of from_dataframe is incorrect - it should use the column dtype rather than the data buffer dtype.
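To make the expected relationship concrete, here is a minimal sketch of the mapping from a column dtype to the data buffer dtype, based on the "SHOULD BE" values in the code example above. The `DtypeKind` enum below is a local stand-in that mirrors the protocol's values, and `expected_data_buffer_dtype` is a hypothetical helper, not part of pyarrow. It also assumes categorical codes are signed integers of the column's bit width, which matches the example but may not hold for every producer.

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    # Local stand-in mirroring the interchange protocol's DtypeKind values.
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# Arrow C format strings for signed integers, keyed by bit width.
_SIGNED_FORMATS = {8: "c", 16: "s", 32: "i", 64: "l"}


def expected_data_buffer_dtype(column_dtype):
    """Return the physical dtype the data buffer should report (hypothetical helper)."""
    kind, bit_width, fmt, byteorder = column_dtype
    if kind == DtypeKind.STRING:
        # String data is a buffer of UTF-8 bytes, i.e. uint8.
        return (DtypeKind.UINT, 8, "C", byteorder)
    if kind in (DtypeKind.DATETIME, DtypeKind.CATEGORICAL):
        # Assumption: timestamps and category codes are stored as signed
        # integers with the same bit width as the column dtype.
        return (DtypeKind.INT, bit_width, _SIGNED_FORMATS[bit_width], byteorder)
    # Integers, floats, and booleans: logical and physical dtype coincide.
    return (DtypeKind(kind), bit_width, fmt, byteorder)


print(expected_data_buffer_dtype((DtypeKind.STRING, 8, "u", "=")))
print(expected_data_buffer_dtype((DtypeKind.DATETIME, 64, "tsu:", "=")))
```
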

Fix

Fixing the get_buffers implementation should be relatively simple. However, this will break any from_dataframe implementation (including those in other libraries) that relies on the data buffer having the column dtype.

So fixing this should ideally go in three steps:

  1. Fix the from_dataframe implementation to use the column dtype rather than the data buffer dtype to interpret the buffers.
  2. Make sure other libraries have also updated their from_dataframe implementation. See pandas-dev/pandas#54781 for the pandas issue.
  3. Fix the data buffer dtypes.
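Step 1 could look roughly like the sketch below: a consumer that decodes a datetime data buffer by reading the unit from the column dtype and treating the buffer itself as plain 64-bit integers. This is a pure-Python illustration, not the pyarrow implementation. `decode_datetime_buffer` is a hypothetical name, the raw bytes stand in for a real protocol buffer, and null handling via the validity buffer is left out.

```python
import struct
from datetime import datetime, timedelta

DATETIME_KIND = 22  # value of DtypeKind.DATETIME in the interchange protocol

# Microseconds per tick for the Arrow timestamp units handled here.
# Nanoseconds ("n") are omitted because datetime has microsecond resolution.
_US_PER_UNIT = {"s": 1_000_000, "m": 1_000, "u": 1}


def decode_datetime_buffer(raw, column_dtype):
    """Decode raw int64 bytes using the column dtype (hypothetical helper)."""
    kind, bit_width, fmt, byteorder = column_dtype
    if kind != DATETIME_KIND or not fmt.startswith("ts"):
        raise ValueError(f"not a timestamp column dtype: {column_dtype!r}")
    if bit_width != 64:
        raise ValueError("this sketch only handles 64-bit timestamps")
    unit = fmt[2]  # format string looks like "ts<unit>:<timezone>"
    count = len(raw) // 8
    # The buffer is read as signed 64-bit integers, regardless of what
    # dtype the producer attached to the buffer itself.
    ticks = struct.unpack(f"{byteorder}{count}q", raw)
    epoch = datetime(1970, 1, 1)
    return [epoch + timedelta(microseconds=t * _US_PER_UNIT[unit]) for t in ticks]


# Two microsecond timestamps: the epoch, and 1.5 seconds after it.
raw = struct.pack("=2q", 0, 1_500_000)
print(decode_datetime_buffer(raw, (DATETIME_KIND, 64, "tsu:", "=")))
```

Because the function never looks at the buffer's own dtype, it keeps working both before and after the buffer dtypes are corrected in step 3.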

Tagging @AlenkaF as I know you've been working on the protocol for pyarrow.

Component(s)

Python
