-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace pandas.DataFrame with PyArrow.Table for nullable int typing #8733
Conversation
One last issue with |
@robdiciuccio an issue was reported recently at Airbnb where nullable booleans weren’t being reported correctly, i.e., Is this something which will be resolved by using |
@john-bodley I'm hoping to wrap up the remaining timezone issue on this PR later this weekend or Monday (I'm out today). |
{"col1": 1, "col2": 1, "col3": pd.Timestamp("2017-10-19 23:39:16.660000")}, | ||
) | ||
|
||
def test_mssql_engine_spec_odbc(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored and added some additional tests around this. Should be good to go.
return 100 * success / total | ||
|
||
@staticmethod | ||
def is_date(np_dtype, db_type_str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mistercrunch curious on your opinion of the date detection changes, mainly because I don't understand the context behind #5634
superset/dataframe.py
Outdated
# TODO: refactor this | ||
for d in data: | ||
for k, v in list(d.items()): | ||
# if an int is too big for Java Script to handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: JavaScript
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit pending
Codecov Report
@@ Coverage Diff @@
## master #8733 +/- ##
=======================================
Coverage 58.97% 58.97%
=======================================
Files 359 359
Lines 11333 11333
Branches 2787 2787
=======================================
Hits 6684 6684
Misses 4471 4471
Partials 178 178 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great to to be moving from pandas to pyarrow 👍 Apologies for the nit-fest!
superset/dataframe.py
Outdated
column.pop("agg", None) | ||
columns.append(column) | ||
return columns | ||
def df_to_dict(dframe: pd.DataFrame) -> Dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: similar to how pd
generally refers to Pandas
, df
commonly refers to DataFrame
(as in the method name). Also add types to Dict, e.g. Dict[str, Any]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using dframe
here due to a linter complaining about df
as an argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In reference to my comment below about df_to_dict
being expected to return a dict
, shouldn't this in fact return a List[Dict[str, Any]]
? Looks like mypy missed this as data
wasn't typed below, and is hence implicitly typed Any
.
superset/dataframe.py
Outdated
if isinstance(v, int): | ||
if abs(v) > JS_MAX_INTEGER: | ||
d[k] = str(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isintance(v, int) abs(v) > JS_MAX_INTEGER:
d[k] = str(v)
superset/table.py
Outdated
return new_l | ||
|
||
|
||
class SupersetTable(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
py3 nit: remove (object)
superset/sql_lab.py
Outdated
selected_columns: list = result_table.columns | ||
expanded_columns: list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm guessing these are List[str]
superset/sql_lab.py
Outdated
# expand when loading data from results backend | ||
all_columns, expanded_columns = (selected_columns, []) | ||
else: | ||
data = cdf.data or [] | ||
df = result_table.to_pandas_df() | ||
data = df_to_dict(df) or [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be df_to_dict(df) or {}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One would think, but this is calling to_dict(orient="records")
which returns list of dicts:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be renamed df_to_dicts
or df_to_records
for clarity
superset/table.py
Outdated
return None | ||
|
||
@staticmethod | ||
def convert_pa_dtype(pa_dtype: pa.DataType): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return type Optional[str]
superset/table.py
Outdated
return "DATETIME" | ||
return None | ||
|
||
def to_pandas_df(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> pd.DataFrame
superset/table.py
Outdated
return self.pa_table_to_df(self.table) | ||
|
||
@property | ||
def pa_table(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> pa.Table
?
superset/table.py
Outdated
return self.table | ||
|
||
@property | ||
def size(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> int
superset/table.py
Outdated
return self.table.num_rows | ||
|
||
@property | ||
def columns(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> List[Dict[str, Any]]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few last minor nits, otherwise LGTM
superset/dataframe.py
Outdated
column.pop("agg", None) | ||
columns.append(column) | ||
return columns | ||
def df_to_dict(dframe: pd.DataFrame) -> Dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In reference to my comment below about df_to_dict
being expected to return a dict
, shouldn't this in fact return a List[Dict[str, Any]]
? Looks like mypy missed this as data
wasn't typed below, and is hence implicitly typed Any
.
superset/dataframe.py
Outdated
# TODO: refactor this | ||
for d in data: | ||
for k, v in list(d.items()): | ||
# if an int is too big for Java Script to handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit pending
Profiling info added in the description (good news!). Unless there are any additional concerns (particularly around date detection) or testing against additional analytics databases, I think this should be ready to go. |
Spoke too soon, investigating some dashboard discrepancies with the example data. |
Dashboard discrepancies were due to stale cache. Retested dashboards and SQL Lab with fresh example data in postgres and mysql. LGTM. |
@mistercrunch @willbarrett I propose merging this; any comments or good to go? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 2 super minor comments, otherwise LGTM. It seems like type detection that relied on pandas magic in the past could be affected, but there's really no way to tell.
superset/table.py
Outdated
return new_l | ||
|
||
|
||
class SupersetTable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Table" can be confusing, how about SupersetResultSet
superset/sql_lab.py
Outdated
# expand when loading data from results backend | ||
all_columns, expanded_columns = (selected_columns, []) | ||
else: | ||
data = cdf.data or [] | ||
df = result_table.to_pandas_df() | ||
data = df_to_dict(df) or [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be renamed df_to_dicts
or df_to_records
for clarity
…typing (apache#8733)" This reverts commit 6537d5e
Hi @robdiciuccio I am trying to deploy this feature to airbnb production. But I got many errors from Presto queries: The full stack trace:
So i have to revert this feature from airbnb's release branch. Please take a look. We should also consider revert this PR from master branch. |
@graceguo-supercat thanks for flagging. This should be fixed in #8946 |
…typing (apache#8733)" This reverts commit 6537d5e.
…typing (apache#8733)" This reverts commit 6537d5e.
CATEGORY
Choose one
SUMMARY
Load raw query result data into a PyArrow.Table structure, which handles nullable integer columns correctly. Convert Table to pandas.DataFrame when necessary for data manipulation.
Fixes #8225
Note: this PR also (re)sets the default configuration
RESULTS_BACKEND_USE_MSGPACK = True
based on the benchmarks below, and the coupling withPyArrow
.TODO
.columns
method toSupersetTable
dtype
logic fromPrestoEngineSpec
TEST PLAN
ADDITIONAL INFORMATION
REVIEWERS
@betodealmeida @john-bodley
BENCHMARKS (2019-12-30)
All metrics below are averages over at least three runs on a local Superset installation running with a Postgres metadata and analytics DB (Macbook Pro 2.6 GHz, 32GB). The queries run here are selecting 100K rows from the birth_names table in the example datasets.
SupersetTable
object in SQL Lab was slightly faster thanSupersetDataFrame
at 99.72444661ms vs 124.0594076ms, respectively.master
, serialization/deserialization of the dataframe viajson
takes an average of ~7750msmsgpack
disabled) reduces this cycle to ~2850msmsgpack
in the serialization workflow further reduces this to ~1700ms, representing a ~78% performance improvement.