Replace pandas.DataFrame with PyArrow.Table for nullable int typing #8733

robdiciuccio · 2019-12-03T19:15:45Z

SUMMARY

Load raw query result data into a PyArrow.Table structure, which handles nullable integer columns correctly. Convert Table to pandas.DataFrame when necessary for data manipulation.

Fixes #8225

Note: this PR also (re)sets the default configuration RESULTS_BACKEND_USE_MSGPACK = True based on the benchmarks below, and the coupling with PyArrow.

TODO

Finish porting .columns method to SupersetTable
Remove explicit dtype logic from PrestoEngineSpec
Fix failing tests

TEST PLAN

Ensure correct query operations in SQL Lab and Explore views, via synchronous and async queries.
Test against columns including mixed integer + NULL values.

ADDITIONAL INFORMATION

Has associated issue: Pandas casting int64 to float64, misrepresenting value #8225
Changes UI
Requires DB Migration.
Confirm DB Migration upgrade and downgrade tested.
Introduces new feature or API
Removes existing feature or API

REVIEWERS

@betodealmeida @john-bodley

BENCHMARKS (2019-12-30)

All metrics below are averages over at least three runs on a local Superset installation using Postgres for both the metadata and analytics databases (MacBook Pro, 2.6 GHz, 32 GB). Each query selects 100K rows from the birth_names table in the example datasets.

Instantiation of the new SupersetTable object in SQL Lab was slightly faster than SupersetDataFrame: 99.72444661 ms vs. 124.0594076 ms, respectively.
Memory usage for the PyArrow.Table (5659776 bytes) was significantly lower than for the pandas DataFrame (22259209 bytes), roughly a quarter of the size.
Serialization for async queries shows the biggest performance gains. PyArrow performed better in nearly all metrics, with the following standouts:
- On master, serialization/deserialization of the dataframe via JSON takes an average of ~7750 ms.
- PyArrow (with msgpack disabled) reduces this cycle to ~2850 ms.
- Enabling msgpack in the serialization workflow further reduces this to ~1700 ms, a ~78% reduction relative to master.
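
The percentages implied by these figures can be checked with a few lines of arithmetic. This is only a sketch over the numbers quoted above; the helper name `pct_reduction` is made up for the example:

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100 * (before - after) / before


# Serialization round-trip times quoted in the benchmarks, in milliseconds.
json_master_ms = 7750
pyarrow_ms = 2850
pyarrow_msgpack_ms = 1700

print(round(pct_reduction(json_master_ms, pyarrow_ms)))          # 63
print(round(pct_reduction(json_master_ms, pyarrow_msgpack_ms)))  # 78

# Memory footprint quoted in the benchmarks, in bytes.
dataframe_bytes = 22259209
table_bytes = 5659776
print(round(dataframe_bytes / table_bytes, 1))  # 3.9
```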

robdiciuccio · 2019-12-04T22:29:03Z

One last issue with datetime timezone support to address.

john-bodley · 2019-12-06T15:13:10Z

@robdiciuccio an issue was reported recently at Airbnb where nullable booleans weren’t being reported correctly, i.e., NULL was being cast to false per here.

Is this something which will be resolved by using PyArrow?

cc: @etr2460 @graceguo-supercat

robdiciuccio · 2019-12-06T18:17:06Z

@john-bodley PyArrow serialization appears to handle this case correctly. Test case added in a6e6b79.

I'm hoping to wrap up the remaining timezone issue on this PR later this weekend or Monday (I'm out today).

robdiciuccio · 2019-12-10T22:05:27Z

tests/core_tests.py

-            {"col1": 1, "col2": 1, "col3": pd.Timestamp("2017-10-19 23:39:16.660000")},
-        )
-
-    def test_mssql_engine_spec_odbc(self):


Was thinking when I ripped this out that it was no longer necessary due to transformations here, but it looks like additional transformation may be required. cc @chinhngt

Refactored and added some additional tests around this. Should be good to go.

robdiciuccio · 2019-12-10T22:11:23Z

superset/dataframe.py

-        return 100 * success / total
-
-    @staticmethod
-    def is_date(np_dtype, db_type_str):


@mistercrunch curious on your opinion of the date detection changes, mainly because I don't understand the context behind #5634

willbarrett · 2019-12-10T22:11:23Z

superset/dataframe.py

+    # TODO: refactor this
+    for d in data:
+        for k, v in list(d.items()):
+            # if an int is too big for Java Script to handle


nit: JavaScript

Nit pending

codecov-io · 2019-12-10T23:20:57Z

Codecov Report

Merging #8733 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #8733   +/-   ##
=======================================
  Coverage   58.97%   58.97%           
=======================================
  Files         359      359           
  Lines       11333    11333           
  Branches     2787     2787           
=======================================
  Hits         6684     6684           
  Misses       4471     4471           
  Partials      178      178

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b95c1f...e094e84. Read the comment docs.

villebro

Really great to be moving from pandas to pyarrow 👍 Apologies for the nit-fest!

villebro · 2019-12-17T20:29:48Z

superset/dataframe.py

-                column.pop("agg", None)
-            columns.append(column)
-        return columns
+def df_to_dict(dframe: pd.DataFrame) -> Dict:


nit: similar to how pd generally refers to Pandas, df commonly refers to DataFrame (as in the method name). Also add types to Dict, e.g. Dict[str, Any].

Using dframe here due to a linter complaining about df as an argument.

In reference to my comment below about df_to_dict being expected to return a dict, shouldn't this in fact return a List[Dict[str, Any]]? Looks like mypy missed this as data wasn't typed below, and is hence implicitly typed Any.

villebro · 2019-12-17T20:29:54Z

superset/dataframe.py

+            if isinstance(v, int):
+                if abs(v) > JS_MAX_INTEGER:
+                    d[k] = str(v)


if isinstance(v, int) and abs(v) > JS_MAX_INTEGER: d[k] = str(v)
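
A self-contained sketch of the combined check, for illustration. The function name `stringify_big_ints` is hypothetical, and the constant is assumed to match JavaScript's `Number.MAX_SAFE_INTEGER` (2**53 - 1); neither is taken from the diff above:

```python
from typing import Any, Dict, List

# Assumption: mirrors JavaScript's Number.MAX_SAFE_INTEGER (2**53 - 1).
JS_MAX_INTEGER = 2**53 - 1


def stringify_big_ints(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Replace ints JavaScript cannot represent exactly with strings (in place)."""
    for row in records:
        # Iterate over a copy of the items because values are reassigned.
        for key, value in list(row.items()):
            if isinstance(value, int) and abs(value) > JS_MAX_INTEGER:
                row[key] = str(value)
    return records


rows = [{"id": 2**53 + 1, "count": 7}]
print(stringify_big_ints(rows))  # [{'id': '9007199254740993', 'count': 7}]
```

Folding both conditions into one `if` removes a nesting level without changing behaviour, since `and` short-circuits when the value is not an int.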

villebro · 2019-12-17T20:33:39Z

superset/table.py

+    return new_l
+
+
+class SupersetTable(object):


py3 nit: remove (object)

villebro · 2019-12-17T20:34:25Z

superset/sql_lab.py

+    selected_columns: list = result_table.columns
    expanded_columns: list


nit: I'm guessing these are List[str]

villebro · 2019-12-17T20:35:41Z

superset/sql_lab.py

        # expand when loading data from results backend
        all_columns, expanded_columns = (selected_columns, [])
    else:
-        data = cdf.data or []
+        df = result_table.to_pandas_df()
+        data = df_to_dict(df) or []


shouldn't this be df_to_dict(df) or {}?

One would think, but this is calling to_dict(orient="records"), which returns a list of dicts:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html

Could be renamed df_to_dicts or df_to_records for clarity
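
A short sketch of why `[]` is the matching fallback, assuming `pandas` is installed. The name `df_to_records` follows the rename suggested above and the sample frame is made up; this is not the function body from the PR:

```python
from typing import Any, Dict, List

import pandas as pd


def df_to_records(dframe: pd.DataFrame) -> List[Dict[str, Any]]:
    """Hypothetical helper: one dict per row, keyed by column name."""
    return dframe.to_dict(orient="records")


df = pd.DataFrame({"name": ["a", "b"], "num": [1, 2]})
print(df_to_records(df))

# An empty frame yields an empty list, so `or []` keeps the type consistent.
print(df_to_records(pd.DataFrame()) or [])
```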

villebro · 2019-12-17T20:47:46Z

superset/table.py

+        return None
+
+    @staticmethod
+    def convert_pa_dtype(pa_dtype: pa.DataType):


return type Optional[str]

villebro · 2019-12-17T20:48:28Z

superset/table.py

+            return "DATETIME"
+        return None
+
+    def to_pandas_df(self):


-> pd.DataFrame

villebro · 2019-12-17T20:48:56Z

superset/table.py

+        return self.pa_table_to_df(self.table)
+
+    @property
+    def pa_table(self):


-> pa.Table?

villebro · 2019-12-17T20:49:10Z

superset/table.py

+        return self.table
+
+    @property
+    def size(self):


villebro · 2019-12-17T20:50:22Z

superset/table.py

+        return self.table.num_rows
+
+    @property
+    def columns(self):


-> List[Dict[str, Any]]?

villebro

A few last minor nits, otherwise LGTM

villebro · 2019-12-30T19:40:27Z

superset/dataframe.py

-                column.pop("agg", None)
-            columns.append(column)
-        return columns
+def df_to_dict(dframe: pd.DataFrame) -> Dict:


In reference to my comment below about df_to_dict being expected to return a dict, shouldn't this in fact return a List[Dict[str, Any]]? Looks like mypy missed this as data wasn't typed below, and is hence implicitly typed Any.

villebro · 2019-12-30T19:41:16Z

superset/dataframe.py

+    # TODO: refactor this
+    for d in data:
+        for k, v in list(d.items()):
+            # if an int is too big for Java Script to handle


Nit pending

robdiciuccio · 2019-12-30T20:48:16Z

Profiling info added in the description (good news!). Unless there are any additional concerns (particularly around date detection) or testing against additional analytics databases, I think this should be ready to go.

robdiciuccio · 2019-12-30T20:57:19Z

Spoke too soon, investigating some dashboard discrepancies with the example data.

robdiciuccio · 2019-12-30T22:34:36Z

Dashboard discrepancies were due to stale cache. Retested dashboards and SQL Lab with fresh example data in postgres and mysql. LGTM.

villebro · 2019-12-31T06:37:43Z

@mistercrunch @willbarrett I propose merging this; any comments or good to go?

mistercrunch

I have 2 super minor comments, otherwise LGTM. It seems like type detection that relied on pandas magic in the past could be affected, but there's really no way to tell.

mistercrunch · 2020-01-02T18:27:10Z

superset/table.py

+    return new_l
+
+
+class SupersetTable:


"Table" can be confusing; how about SupersetResultSet?

mistercrunch · 2020-01-02T18:31:55Z

superset/sql_lab.py

        # expand when loading data from results backend
        all_columns, expanded_columns = (selected_columns, [])
    else:
-        data = cdf.data or []
+        df = result_table.to_pandas_df()
+        data = df_to_dict(df) or []


Could be renamed df_to_dicts or df_to_records for clarity

graceguo-supercat · 2020-01-09T00:03:47Z

Hi @robdiciuccio I am trying to deploy this feature to airbnb production. But I got many errors from Presto queries:
Unserializable object ['silver.braavos_staging.revenue_validation_detail_metrics'] of type <class 'numpy.ndarray'>
or
Not implemented type for list in DataFrameBlock: struct<author_type: string, locale: string>

The full stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.6/dist-packages/flask_appbuilder/security/decorators.py", line 168, in wraps
    return f(self, *args, **kwargs)
  File "/home/grace_guo/incubator-superset/superset/utils/log.py", line 59, in wrapper
    value = f(*args, **kwargs)
  File "/home/grace_guo/incubator-superset/superset/views/core.py", line 2291, in results
    return self.results_exec(key)
  File "/home/grace_guo/incubator-superset/superset/views/core.py", line 2342, in results_exec
    json.dumps(obj, default=utils.json_iso_dttm_ser, ignore_nan=True)
  File "/usr/local/lib/python3.6/dist-packages/simplejson/__init__.py", line 399, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.6/dist-packages/simplejson/encoder.py", line 296, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.6/dist-packages/simplejson/encoder.py", line 378, in iterencode
    return _iterencode(o, 0)
  File "/home/grace_guo/incubator-superset/superset/utils/core.py", line 383, in json_iso_dttm_ser
    "Unserializable object {} of type {}".format(obj, type(obj))
TypeError: Unserializable object ['silver.information_schema.tables'] of type <class 'numpy.ndarray'>

So I have to revert this feature from Airbnb's release branch. Please take a look. We should also consider reverting this PR from the master branch.

@mistercrunch @willbarrett
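
A minimal reproduction sketch of this class of error, assuming `numpy` is installed. It uses the standard-library `json` module rather than `simplejson`, the row contents are made up, and the `default` hook shown is only one possible workaround, not necessarily what #8946 does:

```python
import json

import numpy as np

# Hypothetical row: an array-typed column surfaced as a numpy array.
row = {"tables": np.array(["schema_a.table_1"])}

try:
    json.dumps(row)
except TypeError as exc:
    # The encoder has no built-in handling for numpy.ndarray.
    print(f"serialization failed: {exc}")


def ndarray_to_list(obj):
    """Hypothetical `default` hook: convert numpy arrays to plain lists."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Unserializable object {obj} of type {type(obj)}")


print(json.dumps(row, default=ndarray_to_list))  # {"tables": ["schema_a.table_1"]}
```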

robdiciuccio · 2020-01-10T18:26:18Z

@graceguo-supercat thanks for flagging. This should be fixed in #8946

robdiciuccio added 4 commits December 2, 2019 16:46

Use PyArrow Table for query result serialization

2246f3a

Cleanup dev comments

a57a2be

Additional cleanup

e526dac

WIP: tests

b6c952f

pull-request-size bot added the size/L label Dec 3, 2019

robdiciuccio added 3 commits December 3, 2019 11:20

Remove explicit dtype logic from db_engine_specs

d32758a

Remove obsolete column property

f9f8f4b

SupersetTable column types

0db911c

dpgaspar added the preset-io label Dec 4, 2019

Port SupersetDataFrame methods to SupersetTable

927c4f8

Add test for nullable boolean columns

a6e6b79

Support datetime values with timezone offsets

c32f999

robdiciuccio added 3 commits December 10, 2019 09:39

Merge branch master

f956ed3

Black formatting

6fba964

Pylint

7107122

robdiciuccio changed the title ~~WIP: Replace pandas.DataFrame with PyArrow.Table for nullable int typing~~ Replace pandas.DataFrame with PyArrow.Table for nullable int typing Dec 10, 2019

pull-request-size bot added size/XL and removed size/L labels Dec 10, 2019

willbarrett reviewed Dec 10, 2019

More linting/formatting

9315309

villebro reviewed Dec 17, 2019

robdiciuccio added 2 commits December 19, 2019 17:20

Resolve issue with timezones not appearing in results

afc8ccc

Merge branch master

ce5bbd1

robdiciuccio added 4 commits December 29, 2019 19:59

Enable running of tests in tests/db_engine_specs

b291d01

Resolve application context errors

9e604cc

Refactor and add tests for pyodbc.Row conversion

b6abbb4

Appease isort, regardless of isort:skip

c5edbb3

robdiciuccio mentioned this pull request Dec 30, 2019

Enable running of tests in tests/db_engine_specs #8902

Merged

12 tasks

villebro approved these changes Dec 30, 2019

pull-request-size bot added size/XXL and removed size/XL labels Dec 30, 2019

Re-enable RESULTS_BACKEND_USE_MSGPACK default based on benchmarks

eedc044

Dataframe typing and nits

387c3da

Merge branch master

06dc6b5

mistercrunch approved these changes Jan 2, 2020

Renames to reduce ambiguity

e094e84

mistercrunch merged commit 6537d5e into apache:master Jan 3, 2020

mistercrunch deleted the rd/pyarrow-to-pandas branch January 3, 2020 16:55

graceguo-supercat pushed a commit to graceguo-supercat/superset that referenced this pull request Jan 8, 2020

Revert "Replace pandas.DataFrame with PyArrow.Table for nullable int …

a1cee6c

…typing (apache#8733)" This reverts commit 6537d5e

robdiciuccio mentioned this pull request Jan 9, 2020

pyarrow does not know how to serialize objects of type #8396

Closed

3 tasks

michellethomas pushed a commit to airbnb/superset-fork that referenced this pull request Jan 15, 2020

Revert "Replace pandas.DataFrame with PyArrow.Table for nullable int …

f2312a3

…typing (apache#8733)" This reverts commit 6537d5e.

robdiciuccio mentioned this pull request Jan 23, 2020

Failure adding TZ to timezone unaware postgres column. #8785

Closed

3 tasks

john-bodley pushed a commit to airbnb/superset-fork that referenced this pull request Jan 24, 2020

Revert "Replace pandas.DataFrame with PyArrow.Table for nullable int …

51daa8a

…typing (apache#8733)

etr2460 pushed a commit to etr2460/incubator-superset that referenced this pull request Feb 1, 2020

Revert "Replace pandas.DataFrame with PyArrow.Table for nullable int …

1a4e826

…typing (apache#8733)" This reverts commit 6537d5e.

mistercrunch added the 🏷️ bot and 🚢 0.36.0 labels Feb 28, 2024
