Pandas casting int64 to float64, misrepresenting value #8225

Closed
betodealmeida opened this issue Sep 14, 2019 · 6 comments · Fixed by #8226 or #8733
Labels
!deprecated-label:bug Deprecated label - Use #bug instead .pinned Draws attention

Comments

@betodealmeida
Member

betodealmeida commented Sep 14, 2019

I have the following data being returned by Presto (single column, 6 rows):

[(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]

Because of the missing data (None), Pandas infers the column type as float64, which converts the value to a wrong id:

>>> import pandas as pd
>>> column_names = ['organization_lyft_id']
>>> data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0                   NaN
1          1.239162e+18
2                   NaN
3                   NaN
4                   NaN
5                   NaN
>>> print(df.dtypes)
organization_lyft_id    float64
dtype: object

The number then shows up as 1239162456494753800 in SQL Lab.

Here's the Pandas documentation on this:

... pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers. (emphasis mine)
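To make the quoted point concrete, here is a minimal sketch using only plain Python floats (which are float64). It assumes nothing beyond the id value from this issue; the variable names are just illustrative. A float64 has a 53-bit significand, so integers above 2**53 are not all exactly representable, and this id is well above that limit.

```python
# The id from this issue, as an exact Python int.
original = 1239162456494753670

# Python floats are IEEE 754 float64, the same representation pandas
# falls back to when an integer column contains missing values.
as_float = float(original)
round_tripped = int(as_float)

# Above 2**53, float64 cannot represent every integer exactly.
largest_contiguous_int = 2 ** 53

print("original:     ", original)
print("round-tripped:", round_tripped)
print("difference:   ", round_tripped - original)
print("above 2**53:  ", original > largest_contiguous_int)
```

The round-tripped value differs from the original, which is the same kind of loss that later surfaces in SQL Lab.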

Note that if the missing data is filtered out, the value is inferred as an int64 and shows up correctly in SQL Lab:

>>> column_names = ['organization_lyft_id']
>>> data = [(1239162456494753670,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0   1239162456494753670
>>> print(df.dtypes)
organization_lyft_id    int64
dtype: object

The solution is to pass a dtype argument when creating the Pandas data frame, built from the cursor description. I'm working on a fix for this.
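A rough sketch of that idea, not the actual Superset fix: it assumes pandas 0.24 or later (for the nullable `Int64` extension dtype) and assumes the set of integer columns has already been derived from the cursor description. The helper `build_frame` and its `integer_columns` parameter are hypothetical names used only for illustration.

```python
import pandas as pd


def build_frame(rows, column_names, integer_columns):
    """Build a DataFrame, using nullable Int64 for the named integer columns.

    `integer_columns` is a hypothetical input: in practice it would be
    derived from the type codes in the DB-API cursor description.
    """
    columns = {}
    for position, name in enumerate(column_names):
        values = [row[position] for row in rows]
        if name in integer_columns:
            # Nullable integer array: None becomes <NA>, ints stay exact.
            columns[name] = pd.array(values, dtype="Int64")
        else:
            # Let pandas infer the dtype for everything else.
            columns[name] = values
    return pd.DataFrame(columns, columns=column_names)


data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]
df = build_frame(data, ["organization_lyft_id"], {"organization_lyft_id"})
print(df.dtypes)
print(int(df["organization_lyft_id"].iloc[1]))
```

Because the column never passes through float64, the id should keep all of its digits even though the other rows are missing.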

@villebro
Member

villebro commented Sep 14, 2019

I've been wrestling with something similar lately (unrelated data wrangling) and ended up having to bypass pandas completely, because having Nones in columns messed up the dtypes (I got exceptions when trying to force the dtypes in afterwards). I'm not sure this is solvable with the current stable pandas version (apparently better null support was introduced in 0.24), but I'm keeping my fingers crossed that these things get better support soon. My guess is that Arrow will replace pandas in the long term for this type of data processing.

@betodealmeida
Member Author

I was able to fix this by passing a dtype constructed based on the cursor description, but then PyArrow fails to serialize the resulting Pandas dataframe, sigh:

apache/arrow#4168

@robdiciuccio
Member

This is still an issue for non-Presto databases.

@robdiciuccio
Member

Support for PyArrow serialization of Pandas Int64 dtypes is currently merged to master in both repos, but not yet released on PyPI:

pandas-dev/pandas@34fff1f
apache/arrow@7f4165c

This also requires converting the pandas DataFrame to an Arrow Table prior to serialization:

import pyarrow as pa

# cdf.raw_df is the underlying pandas DataFrame (cdf is defined elsewhere)
table = pa.Table.from_pandas(cdf.raw_df)
data = (
    pa.default_serialization_context()
    .serialize(table)
    .to_buffer()
    .to_pybytes()
)

@john-bodley
Member

We noticed an issue with the NumPy reshaping logic in SQL Lab. Here labels is an ARRAY<STRING>; it renders correctly if multiple columns are selected, but it is incorrectly reshaped if it's the only column.

[Screenshot: Screen Shot 2019-12-03 at 9 46 23 AM]

[Screenshot: Screen Shot 2019-12-03 at 9 45 49 AM]
