Description
Ported over from #2217
In [8]: import numpy as np
...: import pandas as pd
...: import pyarrow as arw
In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
...: df
Out[9]:
A B
0 a 0
1 b 1
2 c 2
In [10]: schema = arw.schema([
...: arw.field('A', arw.string()),
...: arw.field('B', arw.int32()),
...: ])
In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
...: tbl
Out[11]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
b', "pandas_version": "0.23.1"}'}
In [12]: tbl.to_pandas().equals(df)
Out[12]: True
...so if the schema matches the pandas datatypes all is well - we can roundtrip the DataFrame.
Now, say we have some bad data such that column 'B' is of type float64. The datatypes of the DataFrame no longer match the explicitly supplied schema object, but rather than raising a TypeError, the data is silently truncated: the roundtripped DataFrame doesn't match our input DataFrame, and not even a warning is raised!
In [13]: df['B'].iloc[0] = 1.23
...: df
Out[13]:
A B
0 a 1.23
1 b 1.00
2 c 2.00
In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
...: tbl
Out[14]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
b'}], "pandas_version": "0.23.1"}'}
In [15]: tbl.to_pandas() # <-- SILENT TRUNCATION!!!
Out[15]:
A B
0 a 1
1 b 1
2 c 2
To be clear, I would really like Table.from_pandas to raise a TypeError if the DataFrame types don't match an explicitly supplied schema and would hope this current behaviour would be considered a bug.
Reporter: Dave Hirschfeld / @dhirschfeld
Assignee: Krisztian Szucs / @kszucs
Related issues:
- [Python/C++] Add option to Array.from_pandas and pyarrow.array to perform unsafe casts (depends upon)
- [C++] Handle float truncation during casting (depends upon)
PRs and other links:
Note: This issue was originally created as ARROW-2799. Please see the migration documentation for further details.