[Python] Add safe option to Table.from_pandas to avoid unsafe casts #19180

Ported over from #2217

In [8]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as arw

In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
   ...: df
Out[9]:
   A  B
0  a  0
1  b  1
2  c  2

In [10]: schema = arw.schema([
    ...:     arw.field('A', arw.string()),
    ...:     arw.field('B', arw.int32()),
    ...: ])

In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
    ...: tbl
Out[11]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
            b', "pandas_version": "0.23.1"}'}

In [12]: tbl.to_pandas().equals(df)
Out[12]: True

So if the schema matches the pandas dtypes, all is well: we can round-trip the DataFrame.

Now, say we have some bad data such that column 'B' is of type float64. The dtypes of the DataFrame no longer match the explicitly supplied schema, but rather than raising a TypeError, the data is silently truncated: the round-tripped DataFrame doesn't match our input DataFrame, and no warning is raised!

In [13]: df['B'] = [1.23, 1.0, 2.0]  # column 'B' is now float64
    ...: df
Out[13]:
   A     B
0  a  1.23
1  b  1.00
2  c  2.00

In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
    ...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
    ...: tbl
Out[14]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
            b'}], "pandas_version": "0.23.1"}'}

In [15]: tbl.to_pandas()  # <-- SILENT TRUNCATION!!!
Out[15]:
   A  B
0  a  1
1  b  1
2  c  2
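
The difference between the truncation seen above and the behaviour being asked for can be sketched without any library at all. The following is only an illustration in plain Python, not how pyarrow implements its casts; `truncating_cast` and `checked_int32_cast` are hypothetical helper names introduced for this example.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def truncating_cast(values):
    """Unchecked cast: the fractional part is silently dropped."""
    return [int(v) for v in values]


def checked_int32_cast(values):
    """Checked cast: refuse any value that int32 cannot represent exactly."""
    result = []
    for v in values:
        as_int = int(v)
        if as_int != v:
            # e.g. 1.23 -> 1 would lose the fractional part
            raise TypeError(f"{v!r} would be truncated to {as_int}")
        if not INT32_MIN <= as_int <= INT32_MAX:
            raise TypeError(f"{v!r} does not fit in int32")
        result.append(as_int)
    return result


print(truncating_cast([1.23, 1.0, 2.0]))    # fractional part is lost
print(checked_int32_cast([0.0, 1.0, 2.0]))  # integral floats pass through
```

Calling `checked_int32_cast([1.23, 1.0, 2.0])` raises instead of returning altered data, which is the kind of failure mode requested here.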

To be clear, I would really like Table.from_pandas to raise a TypeError if the DataFrame dtypes don't match an explicitly supplied schema, and I would hope the current behaviour is considered a bug.

Reporter: Dave Hirschfeld / @dhirschfeld
Assignee: Krisztian Szucs / @kszucs

Note: This issue was originally created as ARROW-2799. Please see the migration documentation for further details.
