deephaven.pandas.to_pandas does not properly convert string columns #4810

Closed
chipkent opened this issue Nov 10, 2023 · 2 comments · Fixed by #4815
Labels: bug, python, python-server-side


@chipkent (Member)

A user reported problems round-tripping data between Deephaven (DH) and pandas when columns contain strings. I have confirmed that DH does not set string column types properly when converting to pandas. See below:

import pandas as pd
from deephaven import empty_table
from deephaven.pandas import to_pandas, to_table

# A Deephaven table with a string column
a = empty_table(5).update_view('A=`value`')
meta1 = a.meta_table  # java.lang.String

# A pandas DataFrame with a plain string column, converted to a DH table
df = pd.DataFrame({'a': ['a', 'b', 'c']})
print(df.dtypes)
t = to_table(df)
meta2 = t.meta_table  # PyObject

# See https://pandas.pydata.org/docs/user_guide/text.html
s_a = pd.Series(["a", "b", "c"])
s_b = pd.Series(["a", "b", "c"], dtype="string")
print(s_a)
print(s_b)
df2 = pd.DataFrame({'a': s_a, 'b': s_b})
print(df2.dtypes)
t2 = to_table(df2)
meta3 = t2.meta_table

# Now look at the original case with a twist
df['b'] = df['a'].astype("string")
print(df.dtypes)
t3 = to_table(df)
meta4 = t3.meta_table

# Test DH to_pandas conversion
df_a = to_pandas(a)
print(df_a.dtypes)
@chipkent added the bug, triage, python, python-server-side, and devrel-watch labels on Nov 10, 2023
@chipkent added this to the November 2023 milestone on Nov 10, 2023
@chipkent (Member, Author)

As a workaround, to_pandas(a, dtype_backend="numpy_nullable") can be used.

@jmao-denver (Contributor)

dtype_backend is a feature introduced only recently, in pandas 2.0 (the latest release is 2.1.2), and all of our paying customers are already on 2.0+. To avoid this kind of confusion in the future, and to meet users' expectation that data round-trips cleanly between pandas and Deephaven, we should default this parameter to numpy_nullable. It should not be a breaking change. There might be a performance penalty, but the benefits of clean null conversion and precise type mapping clearly outweigh it here.
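
For readers unfamiliar with the nullable backend, here is a minimal pandas-only sketch of the type mapping being discussed. It does not call Deephaven at all; it uses pandas' own `DataFrame.convert_dtypes(dtype_backend="numpy_nullable")` (pandas 2.0+) as a stand-in to show the difference between an object column and a nullable string column. The column names and data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only. dtype=object is forced so the starting point
# is the same regardless of pandas version.
df = pd.DataFrame({
    "name": pd.Series(["a", None, "c"], dtype=object),
    "count": pd.Series([10, np.nan, 20], dtype="float64"),
})

# Requires pandas 2.0+: ask for the NumPy-backed nullable extension dtypes.
nullable = df.convert_dtypes(dtype_backend="numpy_nullable")

print(df.dtypes)        # object / float64
print(nullable.dtypes)  # nullable string / nullable integer dtypes

# Missing values survive the conversion as pd.NA instead of None/NaN.
print(nullable.isna().sum())
```

With the nullable backend, the string column carries an explicit string dtype rather than `object`, and the integer-valued column no longer has to be stored as floats just to hold a missing value.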
