Replace pandas.DataFrame with PyArrow.Table for nullable int typing #8733

Merged (25 commits) on Jan 3, 2020

@robdiciuccio (Member) commented Dec 3, 2019

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

Load raw query result data into a PyArrow.Table structure, which handles nullable integer columns correctly. Convert Table to pandas.DataFrame when necessary for data manipulation.

Fixes #8225

Note: this PR also (re)sets the default configuration RESULTS_BACKEND_USE_MSGPACK = True based on the benchmarks below, and the coupling with PyArrow.

TODO

  • Finish porting .columns method to SupersetTable
  • Remove explicit dtype logic from PrestoEngineSpec
  • Fix failing tests

TEST PLAN

  • Ensure correct query operations in SQL Lab and Explore views, via synchronous and async queries.
  • Test against columns including mixed integer + NULL values.

ADDITIONAL INFORMATION

REVIEWERS

@betodealmeida @john-bodley

BENCHMARKS (2019-12-30)

All metrics below are averages over at least three runs on a local Superset installation using Postgres for both the metadata and analytics databases (MacBook Pro, 2.6 GHz, 32 GB). Each query selects 100K rows from the birth_names table in the example datasets.

  • Instantiation of the new SupersetTable object in SQL Lab was slightly faster than SupersetDataFrame at 99.72444661ms vs 124.0594076ms, respectively.
  • Memory usage for the PyArrow.Table (5659776 bytes) was significantly lower than the Pandas DataFrame (22259209 bytes).
  • Serialization for async queries is where the biggest performance gains are. PyArrow performed better in nearly all metrics, with the following standouts:
    • On master, serialization/deserialization of the dataframe via JSON takes an average of ~7750ms
    • PyArrow (with msgpack disabled) reduces this cycle to ~2850ms
    • Enabling msgpack in the serialization workflow further reduces this to ~1700ms, representing a ~78% performance improvement.
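
The percentages implied by these figures can be checked with a few lines of arithmetic. This is only a worked calculation over the numbers quoted above; the helper name `percent_reduction` is made up for the example.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return 100 * (before - after) / before


# Figures copied from the benchmark list above.
json_ms, arrow_ms, arrow_msgpack_ms = 7750, 2850, 1700
df_bytes, table_bytes = 22259209, 5659776

print(round(percent_reduction(json_ms, arrow_ms)))          # 63
print(round(percent_reduction(json_ms, arrow_msgpack_ms)))  # 78
print(round(percent_reduction(df_bytes, table_bytes)))      # 75
```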

@robdiciuccio
Member Author

One last issue with datetime timezone support to address.

@john-bodley
Member

@robdiciuccio an issue was reported recently at Airbnb where nullable booleans weren’t being reported correctly, i.e., NULL was being cast to false per here.

Is this something which will be resolved by using PyArrow?

cc: @etr2460 @graceguo-supercat

@robdiciuccio
Member Author

@john-bodley PyArrow serialization appears to handle this case correctly. Test case added in a6e6b79.

I'm hoping to wrap up the remaining timezone issue on this PR later this weekend or Monday (I'm out today).

@robdiciuccio robdiciuccio changed the title WIP: Replace pandas.DataFrame with PyArrow.Table for nullable int typing Replace pandas.DataFrame with PyArrow.Table for nullable int typing Dec 10, 2019
{"col1": 1, "col2": 1, "col3": pd.Timestamp("2017-10-19 23:39:16.660000")},
)

def test_mssql_engine_spec_odbc(self):
Member Author

@robdiciuccio Dec 10, 2019

Was thinking when I ripped this out that it was no longer necessary due to transformations here, but it looks like additional transformation may be required. cc @chinhngt

Member Author

Refactored and added some additional tests around this. Should be good to go.

return 100 * success / total

@staticmethod
def is_date(np_dtype, db_type_str):
Member Author

@mistercrunch curious on your opinion of the date detection changes, mainly because I don't understand the context behind #5634

# TODO: refactor this
for d in data:
    for k, v in list(d.items()):
        # if an int is too big for Java Script to handle
Member

nit: JavaScript

Member

Nit pending

@codecov-io

codecov-io commented Dec 10, 2019

Codecov Report

Merging #8733 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #8733   +/-   ##
=======================================
  Coverage   58.97%   58.97%           
=======================================
  Files         359      359           
  Lines       11333    11333           
  Branches     2787     2787           
=======================================
  Hits         6684     6684           
  Misses       4471     4471           
  Partials      178      178

Last update 4b95c1f...e094e84.

Member

@villebro left a comment

Really great to be moving from pandas to pyarrow 👍 Apologies for the nit-fest!

column.pop("agg", None)
columns.append(column)
return columns
def df_to_dict(dframe: pd.DataFrame) -> Dict:
Member

nit: similar to how pd generally refers to Pandas, df commonly refers to DataFrame (as in the method name). Also add types to Dict, e.g. Dict[str, Any].

Member Author

Using dframe here due to a linter complaining about df as an argument.

Member

In reference to my comment below about df_to_dict being expected to return a dict, shouldn't this in fact return a List[Dict[str, Any]]? Looks like mypy missed this as data wasn't typed below, and is hence implicitly typed Any.

Comment on lines 33 to 35
if isinstance(v, int):
    if abs(v) > JS_MAX_INTEGER:
        d[k] = str(v)
Member

if isinstance(v, int) and abs(v) > JS_MAX_INTEGER:
    d[k] = str(v)
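
The suggestion above can be sketched as a self-contained function. The function name `stringify_big_ints` is hypothetical, and the value of `JS_MAX_INTEGER` is assumed here to be 2**53 - 1 (JavaScript's `Number.MAX_SAFE_INTEGER`); the thread itself does not show how the constant is defined.

```python
from typing import Any, Dict, List

# Assumption: 2**53 - 1, i.e. JavaScript's Number.MAX_SAFE_INTEGER.
JS_MAX_INTEGER = 2**53 - 1


def stringify_big_ints(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Replace ints that JavaScript cannot represent exactly with strings (in place)."""
    for d in records:
        # list() takes a snapshot so values can be reassigned while looping.
        for k, v in list(d.items()):
            if isinstance(v, int) and abs(v) > JS_MAX_INTEGER:
                d[k] = str(v)
    return records


rows = [{"id": 9007199254740993, "n": 42, "name": "a"}]
print(stringify_big_ints(rows))
# [{'id': '9007199254740993', 'n': 42, 'name': 'a'}]
```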

return new_l


class SupersetTable(object):
Member

py3 nit: remove (object)

Comment on lines 273 to 274
selected_columns: list = result_table.columns
expanded_columns: list
Member

nit: I'm guessing these are List[str]

# expand when loading data from results backend
all_columns, expanded_columns = (selected_columns, [])
else:
data = cdf.data or []
df = result_table.to_pandas_df()
data = df_to_dict(df) or []
Member

shouldn't this be df_to_dict(df) or {}?

Member Author

One would think, but this is calling to_dict(orient="records") which returns list of dicts:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html
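
To illustrate the point about `to_dict(orient="records")`, here is a minimal sketch with invented data. The helper name `df_to_records_sketch` is hypothetical; the annotation shows the list-of-dicts return type.

```python
from typing import Any, Dict, List

import pandas as pd


def df_to_records_sketch(dframe: pd.DataFrame) -> List[Dict[str, Any]]:
    """Return one dict per row, keyed by column name."""
    return dframe.to_dict(orient="records")


df = pd.DataFrame({"name": ["a", "b"], "num": [1, 2]})
print(df_to_records_sketch(df))
# [{'name': 'a', 'num': 1}, {'name': 'b', 'num': 2}]

# An empty frame yields an empty list, so `or []` is the matching fallback.
print(df_to_records_sketch(pd.DataFrame()) or [])   # []
```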

Member

Could be renamed df_to_dicts or df_to_records for clarity

return None

@staticmethod
def convert_pa_dtype(pa_dtype: pa.DataType):
Member

return type Optional[str]

return "DATETIME"
return None

def to_pandas_df(self):
Member

-> pd.DataFrame

return self.pa_table_to_df(self.table)

@property
def pa_table(self):
Member

-> pa.Table?

return self.table

@property
def size(self):
Member

-> int

return self.table.num_rows

@property
def columns(self):
Member

-> List[Dict[str, Any]]?

Member

@villebro left a comment

A few last minor nits, otherwise LGTM

@robdiciuccio
Member Author

Profiling info added in the description (good news!). Unless there are any additional concerns (particularly around date detection) or testing against additional analytics databases, I think this should be ready to go.

@robdiciuccio
Member Author

Spoke too soon, investigating some dashboard discrepancies with the example data.

@robdiciuccio
Member Author

Dashboard discrepancies were due to stale cache. Retested dashboards and SQL Lab with fresh example data in postgres and mysql. LGTM.

@villebro
Member

@mistercrunch @willbarrett I propose merging this; any comments or good to go?

Member

@mistercrunch left a comment

I have 2 super minor comments, otherwise LGTM. It seems like type detection that relied on pandas magic in the past could be affected, but there's really no way to tell.

return new_l


class SupersetTable:
Member

"Table" can be confusing, how about SupersetResultSet

@mistercrunch mistercrunch merged commit 6537d5e into apache:master Jan 3, 2020
@mistercrunch mistercrunch deleted the rd/pyarrow-to-pandas branch January 3, 2020 16:55
graceguo-supercat pushed a commit to graceguo-supercat/superset that referenced this pull request Jan 8, 2020
@graceguo-supercat

graceguo-supercat commented Jan 9, 2020

Hi @robdiciuccio, I am trying to deploy this feature to Airbnb production, but I got many errors from Presto queries:
Unserializable object ['silver.braavos_staging.revenue_validation_detail_metrics'] of type <class 'numpy.ndarray'>
or
Not implemented type for list in DataFrameBlock: struct<author_type: string, locale: string>

The full stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.6/dist-packages/flask_appbuilder/security/decorators.py", line 168, in wraps
    return f(self, *args, **kwargs)
  File "/home/grace_guo/incubator-superset/superset/utils/log.py", line 59, in wrapper
    value = f(*args, **kwargs)
  File "/home/grace_guo/incubator-superset/superset/views/core.py", line 2291, in results
    return self.results_exec(key)
  File "/home/grace_guo/incubator-superset/superset/views/core.py", line 2342, in results_exec
    json.dumps(obj, default=utils.json_iso_dttm_ser, ignore_nan=True)
  File "/usr/local/lib/python3.6/dist-packages/simplejson/__init__.py", line 399, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.6/dist-packages/simplejson/encoder.py", line 296, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.6/dist-packages/simplejson/encoder.py", line 378, in iterencode
    return _iterencode(o, 0)
  File "/home/grace_guo/incubator-superset/superset/utils/core.py", line 383, in json_iso_dttm_ser
    "Unserializable object {} of type {}".format(obj, type(obj))
TypeError: Unserializable object ['silver.information_schema.tables'] of type <class 'numpy.ndarray'>

So I had to revert this feature from Airbnb's release branch. Please take a look. We should also consider reverting this PR from the master branch.

@mistercrunch @willbarrett
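
The failure mode in the traceback above can be reproduced in isolation: a JSON encoder's `default` hook receives a `numpy.ndarray` it does not know how to serialize. The sketch below uses the standard-library `json` module instead of simplejson, an invented payload, and a hypothetical handler name; converting the array with `tolist()` is just one generic way to make such a value serializable and is not necessarily how the project resolved it.

```python
import json

import numpy as np


def json_default_sketch(obj):
    """Fallback serializer: turn ndarrays into plain lists, reject anything else."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError("Unserializable object {} of type {}".format(obj, type(obj)))


# Invented payload: an array-valued cell, similar in shape to the error message.
payload = {"tables": np.array(["schema.table_name"])}

try:
    json.dumps(payload)
except TypeError as exc:
    print("encoder rejects ndarray:", exc)

print(json.dumps(payload, default=json_default_sketch))
# {"tables": ["schema.table_name"]}
```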

@robdiciuccio
Member Author

@graceguo-supercat thanks for flagging. This should be fixed in #8946

michellethomas pushed a commit to airbnb/superset-fork that referenced this pull request Jan 15, 2020
john-bodley pushed a commit to airbnb/superset-fork that referenced this pull request Jan 24, 2020
etr2460 pushed a commit to etr2460/incubator-superset that referenced this pull request Feb 1, 2020
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.36.0 labels Feb 28, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/XXL 🚢 0.36.0
Development

Successfully merging this pull request may close these issues.

Pandas casting int64 to float64, misrepresenting value
8 participants