-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SQL Lab] Async query results serialization with MessagePack and PyArrow #8069
[SQL Lab] Async query results serialization with MessagePack and PyArrow #8069
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8069 +/- ##
==========================================
+ Coverage 65.89% 65.94% +0.04%
==========================================
Files 485 485
Lines 22917 22961 +44
Branches 2537 2537
==========================================
+ Hits 15102 15142 +40
- Misses 7683 7687 +4
Partials 132 132
Continue to review full report at Codecov.
|
cc @etr2460 on SQL Lab backend performance work |
@@ -214,6 +214,7 @@ def index(self): | |||
security_manager = appbuilder.sm | |||
|
|||
results_backend = app.config.get("RESULTS_BACKEND") | |||
results_backend_use_msgpack = app.config.get("RESULTS_BACKEND_USE_MSGPACK") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, I think we like to add the config var to the config.py
file so that it's easier to see all the configurations available. That said, I'm wondering if this should be a config var at all if it has such good performance improvements. If you're worried about it being globally enabled without testing, maybe make this a feature flag that we can default off to begin with, but then remove and enable everywhere once we're sure everything is good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, all configs should be set as default in superset/config.py
and commented/documented there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feature flag feels a bit heavy for this case, since it's a global setting and not needed on the frontend. My main concern about pushing this change without a config flag is the lack of testing with data sources other than Postgres, but I do think we should push that testing forward by perhaps defaulting RESULTS_BACKEND_USE_MSGPACK = True
, and providing an escape hatch to disable should problems crop up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe I have a misunderstanding of the semantics around making something a feature flag
. I understood it as gating a new feature that we would want to roll out to 100% of users in the future. So while testing people can enable/disable the feature, but with the expectation of in the future it being enabled by default. Feature flags are then necessary to implement on the frontend to support that goal, but aren't required to be both frontend and backend changes. Maybe @mistercrunch can clarify which understanding is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of the feature flag framework is limited, so open to feedback here. Since the change is more middleware than feature, and adding the flag to the JS payload isn't necessary, it feels more like config, but curious what others think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Config keys are static, environment-wide scoped.
Feature flags can be dynamic and currently they all flow to the frontend. Being dynamic, they can be used to do progressive rollouts or A/B testing.
This current flag in this PR seems more like the former to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall as noted inline I'm thinking msgpack + pyarrow is superior and should be the new default. The feature flag rolls with a sub-optimal default and two code paths to maintain. I vote for pushing this forward.
Closing & reopening to re-trigger Travis build |
Rebasing might solve the JS build issue, been discussed on Superset Slack #airbnb-lyft-sprint channel |
CATEGORY
Choose one
SUMMARY
Async query performance in SQL Lab, particularly with large result sets, is fairly poor due to how data is serialized and stored. This PR introduces a
RESULTS_BACKEND_USE_MSGPACK
config option to use PyArrow to serialize the pandas DataFrame directly, and MessagePack for serializing the results payload. Compared to the existing JSON serialization, Arrow and MessagePack provide improved performance and result in much smaller payloads sent to S3 or other cache backends.Benchmarks with 100K rows of
birth_names
examples data (multiple runs)JSON
avg serialization: 573ms
avg deserialization: 200ms
total: 773ms
compressed payload size: 816718
peak memory usage: 281.1 MiB
Arrow/msgpack
avg serialization: 70ms
avg deserialization: 40ms
total: 110ms
compressed payload size: 452634
peak memory usage: 266.2 MiB
Benchmarks were performed on a Macbook Pro 2.6 GHz i7, 32GB running macOS 10.14.5 and Python 3.6.8. memory-profiler was used for memory usage stats.
TEST PLAN
Testing thus far has been limited to Postgres, mainly on Superset examples data. For full compatibility testing:
RESULTS_BACKEND_USE_MSGPACK = True
insuperset_config.py
ADDITIONAL INFORMATION
REVIEWERS
@mistercrunch