Pandas casting int64 to float64, misrepresenting value #8225

betodealmeida · 2019-09-14T20:11:37Z

I have the following data being returned by Presto (single column, 6 rows):

[(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]

Due to the missing data (None), Pandas infers the type as float64, converting the value to a wrong id:

>>> import pandas as pd
>>> column_names = ['organization_lyft_id']
>>> data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0                   NaN
1          1.239162e+18
2                   NaN
3                   NaN
4                   NaN
5                   NaN
>>> print(df.dtypes)
organization_lyft_id    float64
dtype: object

The number then shows up as 1239162456494753800 in SQL Lab.

Here's the Pandas documentation on this:

... pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers. (emphasis mine)
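The precision loss described in the quote can be reproduced without pandas at all. The following is a minimal sketch using only built-in Python numbers; the variable names are illustrative and do not come from Superset or pandas.

```python
# The id from the Presto result above.
original_id = 1239162456494753670

# float64 has a 53-bit significand, so only integers up to 2**53
# are guaranteed to be exactly representable.
max_exact = 2 ** 53

# Converting to float rounds to the nearest representable double.
as_float = float(original_id)
round_tripped = int(as_float)

# Spacing between adjacent doubles at this magnitude.
spacing = 2 ** (original_id.bit_length() - 53)

print(original_id > max_exact)          # the id is outside the exact range
print(round_tripped == original_id)     # the round trip does not preserve it
print(abs(round_tripped - original_id), "off; doubles here are", spacing, "apart")
```

Any integer identifier above 2**53 is at risk of this kind of silent rounding once its column is cast to float64.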

Note that if the missing data is filtered the value is inferred as an int64, and it shows up correctly in SQL Lab:

>>> column_names = ['organization_lyft_id']
>>> data = [(1239162456494753670,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0   1239162456494753670
>>> print(df.dtypes)
organization_lyft_id    int64
dtype: object

The solution is to pass a dtype argument when creating the Pandas data frame, built from the cursor description. I'm working on a fix for this.
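As a rough illustration of that idea (not the actual change made in the fix), the sketch below builds each column with an explicit dtype derived from a DB-API style cursor description. The `description` tuple, the `TYPE_MAP` dictionary, and the type name "bigint" are assumptions made for this example; the nullable "Int64" dtype requires pandas 0.24 or later.

```python
import pandas as pd

# Hypothetical DB-API style cursor description: (name, type_code, ...) per column.
description = [("organization_lyft_id", "bigint", None, None, None, None, None)]
data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]

# Assumed mapping from database type names to pandas dtypes.
# "Int64" (capital I) is the nullable integer extension dtype.
TYPE_MAP = {"bigint": "Int64", "integer": "Int64"}

# Transpose the row tuples into per-column tuples.
column_values = list(zip(*data))

# Build each column with an explicit dtype so pandas never infers float64.
df = pd.DataFrame(
    {
        name: pd.Series(list(values), dtype=TYPE_MAP.get(type_code, "object"))
        for (name, type_code, *_rest), values in zip(description, column_values)
    }
)

print(df.dtypes)
print(df["organization_lyft_id"][1])
```

Because the integers are never routed through a float array, the id keeps all of its digits while the missing rows are stored as nulls.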


issue-label-bot · 2019-09-14T20:11:39Z

Issue-Label Bot is automatically applying the label #bug to this issue, with a confidence of 0.57. Please mark this comment with 👍 or 👎 to give our bot feedback!


villebro · 2019-09-14T21:02:54Z

I've been wrestling with something similar lately (unrelated data wrangling), and ended up having to bypass pandas completely, as having Nones in columns messed up the dtypes (I got exceptions when trying to force them in afterwards). Not sure if this will be solvable with the current stable pandas version (apparently they introduced better null support in 0.24), but I'm keeping my fingers crossed that these things get better support soon (I'm guessing Arrow will replace pandas in the long term for this type of data processing).

betodealmeida · 2019-09-14T21:30:59Z

I was able to fix this by passing a dtype constructed based on the cursor description, but then PyArrow fails to serialize the resulting Pandas dataframe, sigh:

apache/arrow#4168

robdiciuccio · 2019-09-30T21:48:09Z

This is still an issue for non-Presto databases.

robdiciuccio · 2019-11-15T05:56:07Z

Support for PyArrow serialization of Pandas Int64 dtypes is currently merged to master in both repos, but not yet released on PyPI:

pandas-dev/pandas@34fff1f
apache/arrow@7f4165c

It also requires converting the pandas DataFrame to an Arrow Table prior to serialization:

import pyarrow as pa

# cdf.raw_df is the underlying pandas DataFrame (cdf is defined elsewhere)
table = pa.Table.from_pandas(cdf.raw_df)
data = (
    pa.default_serialization_context()
    .serialize(table)
    .to_buffer()
    .to_pybytes()
)

john-bodley · 2019-12-03T17:54:13Z

We noticed an issue with the NumPy reshaping logic in SQL Lab. Here labels is an ARRAY<STRING> column: it renders correctly if multiple columns are selected, but it is incorrectly reshaped if it's the only column.

issue-label-bot bot added the !deprecated-label:bug Deprecated label - Use #bug instead label Sep 14, 2019

betodealmeida mentioned this issue Sep 14, 2019

Handle int64 columns with missing data in SQL Lab #8226

Merged

12 tasks

betodealmeida closed this as completed in #8226 Sep 17, 2019

mistercrunch reopened this Sep 30, 2019

mistercrunch added the .pinned Draws attention label Sep 30, 2019

robdiciuccio mentioned this issue Oct 19, 2019

pyarrow does not know how to serialize objects of type #8396

Closed

3 tasks

robdiciuccio mentioned this issue Nov 15, 2019

[SECURITY] Bump pyarrow to 0.15.1 due to CVE #8583

Merged

12 tasks

robdiciuccio mentioned this issue Dec 3, 2019

Replace pandas.DataFrame with PyArrow.Table for nullable int typing #8733

Merged

15 tasks

mistercrunch closed this as completed in #8733 Jan 3, 2020
