Speed up presto queries! #10139

Closed
tooptoop4 opened this issue Jun 23, 2020 · 14 comments · Fixed by #10191
Labels
change:backend Requires changing the backend enhancement:request Enhancement request submitted by anyone from the community

Comments

tooptoop4 (Contributor) commented Jun 23, 2020

"SELECT 1" - the simplest of presto queries takes over 3 seconds (3.18-3.4 seconds for me) when run from superset but only 0.2 seconds when run using presto JDBC.

The 1-second sleep at this line is too long:
https://github.com/apache/incubator-superset/blob/0.36.0/superset/db_engine_specs/presto.py#L760

See dropbox/PyHive#337 for another culprit line.

I changed both to 0.1 seconds and now I get 0.47 seconds from Superset!

I can also run select count(1) from sometable in 0.86 seconds from Superset after the two changes (it was 3.5 seconds before).
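
Roughly, the loop in question looks like this (a simplified sketch of handle_cursor, not the exact code; cursor.poll() returns None once the query has finished):

```python
import time

def handle_cursor(cursor):
    # Simplified sketch: keep polling Presto until the query finishes.
    polled = cursor.poll()      # None once the query is done
    while polled:
        time.sleep(1)           # the fixed 1-second sleep flagged above
        polled = cursor.poll()
```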

@tooptoop4 tooptoop4 added the enhancement:request Enhancement request submitted by anyone from the community label Jun 23, 2020

etr2460 (Member) commented Jun 24, 2020

@john-bodley @villebro @mistercrunch, is this a viable solution? What are the performance/load implications of polling a cursor 10 times more often? Is there a way to avoid polling the cursor entirely and instead use some async magic to await a response?

@etr2460 etr2460 added the change:backend Requires changing the backend label Jun 24, 2020
ktmud (Member) commented Jun 24, 2020

We could probably adjust the polling interval according to the number of retries, i.e., use a longer interval the longer we have already waited:

time.sleep(min(0.2 + 0.5 * num_retries, 5))
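
Dropped into the polling loop, that would need a retry counter; a minimal sketch (num_retries here is just a hypothetical local variable):

```python
import time

def poll_with_linear_backoff(cursor):
    # Start at 0.2s between polls, add 0.5s per attempt, cap at 5s.
    num_retries = 0
    polled = cursor.poll()
    while polled:
        time.sleep(min(0.2 + 0.5 * num_retries, 5))
        num_retries += 1
        polled = cursor.poll()
```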

villebro (Member) commented Jun 24, 2020

Good idea to use some form of exponential or other backoff. However, perhaps leverage the backoff package, as it's already in the Python deps. If Presto can handle shorter polling intervals, why not. But that discussion needs to be resolved on PyHive. @bkyryliuk thoughts?

Having said all this, while unnecessary waiting should be avoided, I'm personally of the inclination that it's OK for OLAP-type queries to take a few seconds. Since I expect people to be running fairly expensive queries through SQL Lab, waiting a few seconds for something that completed in 0.1 seconds, or conversely waiting 32 seconds for something that took 30.1 seconds, usually isn't that big a deal for me, especially if it can spare a busy DB some unnecessary load.
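
For reference, a minimal sketch of what that could look like with the backoff package, assuming PyHive's cursor.poll() returns None once the query has finished (the function name is made up):

```python
import backoff

@backoff.on_predicate(backoff.fibo, lambda status: status is not None, max_value=5)
def wait_for_query(cursor):
    # Retried with Fibonacci backoff (capped at 5s between attempts) while
    # poll() keeps returning a status dict; returns once poll() yields None.
    return cursor.poll()
```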

etr2460 (Member) commented Jun 24, 2020

Echoing @villebro and @ktmud, some sort of backoff sounds good, so that a 0.1 second query comes back in 0.2 or 0.4 seconds, but the 30 second query comes back in 32.

I will note that performance for queries in the 1-to-5-second range does appear especially bad on Superset's side, because of all the steps required to run the query:

  1. Receive the request
  2. Schedule a task for a worker
  3. Pick up the worker task
  4. Run the query (polling every 2 seconds)
  5. Write the result to the results backend (S3)
  6. Wait for the next poll from the client side (also every 2 seconds, I think)
  7. Download the results from S3

This means that a query that takes a second or two on the CLI can take 2-3 times as long in SQL Lab. Any improvements we can make here without overloading services seem worth considering.

In the long term, moving to the approach outlined in #9190 will solve a lot of these issues too, but I think updating the cursor wait time here is a valid place to start.

tooptoop4 (Contributor, Author) commented Jun 24, 2020

I reckon there is no harm in polling every 0.1 seconds; the Presto JDBC driver does that anyway (the website says "Presto is an open source distributed SQL query engine for running interactive analytic queries"). I've heard of the 3-second rule for keeping attention (https://paceadv.com/the-3-second-rule/): if your query takes 3.5 seconds you get a bit of disengagement, but if it's a snappy < 1.5 seconds, users stay hooked. If others are concerned about increasing load on Presto, perhaps allow these intervals to be set in the data source's config?

ktmud (Member) commented Jun 24, 2020

  • Write the result to the results backend (s3)

Is it required for Superset to write the query results cache to S3? Why not Redis? And why does it need to download the cache from S3 again?

bkyryliuk (Member) commented

@ktmud it is configurable in the Superset config; S3 is more scalable if you have many users and large result sets.

@villebro it seems the poll interval is configurable: https://github.com/dropbox/PyHive/blob/8eb0aeab8ca300f3024655419b93dad926c1a351/pyhive/presto.py#L93. I don't have capacity to make the change, but I can review the code if someone wants to expose it as a SQLAlchemy connection string parameter.
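
For what it's worth, that parameter can already be passed when connecting directly through PyHive (a sketch; host, port and catalog are placeholders):

```python
from pyhive import presto

# poll_interval is forwarded from the connection kwargs to the cursor.
conn = presto.connect("presto-host", port=8080, catalog="hive", poll_interval=0.1)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
```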

john-bodley (Member) commented

SQLAlchemy supports setting this (see here), so no changes to PyHive are necessary.
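
Something along these lines should work, since connect_args is forwarded to PyHive's connect() and from there to the cursor (a sketch; the URL is a placeholder):

```python
from sqlalchemy import create_engine

engine = create_engine(
    "presto://presto-host:8080/hive",
    connect_args={"poll_interval": 0.1},  # picked up by PyHive's cursor
)
```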

etr2460 (Member) commented Jun 25, 2020

@ktmud perhaps I was a bit unclear in my description. When the query runs on the async worker, the results get written to a results backend once it completes; we use S3 for the reasons @bkyryliuk mentioned. We then mark the query as successfully completed and add the cache key to the query record.

Then, we need to get the results to the client. This is done by polling the queries endpoint every 2 seconds. Once the query is labeled as success in that response, we call Superset again with the cache key on the query object to actually download the results

ktmud (Member) commented Jun 25, 2020

Thanks for the detailed explanation, @etr2460 and @bkyryliuk .

etr2460 (Member) commented Jun 26, 2020

I started working on this today (bf0025f), but I don't think it makes a lot of sense to add a backoff within Superset if PyHive only supports constant polling intervals. Therefore, I see a few options, ranked in my order of preference:

  1. Update the sleep within handle_cursor to equal the poll_interval configured in the Presto database's engine_params object. Then both params will be aligned and fully configurable by the admin (see the sketch below).
  2. Open a PR against PyHive to support a generator function as well as a constant value for poll_interval, then update Superset to default to some variety of Fibonacci backoff for both.
  3. Manually update the constant sleep value in Superset to something smaller than 1 second (maybe 0.5 for now), and default the Presto engine params to send 0.5 as well. This isn't configurable, but it would have immediate performance improvements.
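
For illustration, option 1 could look roughly like this inside handle_cursor (a sketch only; the way the configured connect_args are read here is approximated, not actual Superset code):

```python
import time

def handle_cursor(cursor, query):
    # Option 1 sketch: reuse the admin-configured PyHive poll_interval
    # (engine_params -> connect_args) instead of the hardcoded 1-second sleep.
    connect_args = (
        query.database.get_extra()
        .get("engine_params", {})
        .get("connect_args", {})
    )
    poll_interval = connect_args.get("poll_interval", 1)  # keep 1s as the fallback
    polled = cursor.poll()
    while polled:
        time.sleep(poll_interval)
        polled = cursor.poll()
```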

@john-bodley @bkyryliuk @ktmud @villebro @tooptoop4, thoughts?

bkyryliuk (Member) commented

I like approach #1.
Honestly speaking, getting the query results within 2 seconds vs 0.1 seconds doesn't seem like a big user win, which is why I would prefer the simplest approach here.

etr2460 (Member) commented Jun 30, 2020

@tooptoop4 if you use the master branch, you should be able to set the poll_interval param in the database extra => engine_params => connect_args, and that'll control both PyHive's and Superset's poll interval at the same time. Hope this works for you!
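
Concretely, the database's Extra field would contain something along these lines (a sketch; 0.2 is just an example value):

```json
{
    "engine_params": {
        "connect_args": {
            "poll_interval": 0.2
        }
    }
}
```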
