Speed up presto queries! #10139

Closed
tooptoop4 opened this issue Jun 23, 2020 · 14 comments · Fixed by #10191
Labels
change:backend Requires changing the backend enhancement:request Enhancement request submitted by anyone from the community

Comments

tooptoop4 (Contributor) commented Jun 23, 2020

"SELECT 1" - the simplest of presto queries takes over 3 seconds (3.18-3.4 seconds for me) when run from superset but only 0.2 seconds when run using presto JDBC.

The 1-second sleep at this line is too long:
https://github.com/apache/incubator-superset/blob/0.36.0/superset/db_engine_specs/presto.py#L760

See dropbox/PyHive#337 for another culprit line.

I changed both to 0.1 seconds and now I get 0.47 seconds from Superset!

I can also run select count(1) from sometable in 0.86 seconds from Superset after the two changes (it was 3.5 seconds before).
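
Roughly, the loop in question looks like this (a simplified sketch of handle_cursor, not the exact code; cursor.poll() returns None once the query has finished):

```python
import time

def handle_cursor(cursor):
    # Simplified sketch: keep polling Presto until the query finishes.
    polled = cursor.poll()      # None once the query is done
    while polled:
        time.sleep(1)           # the fixed 1-second sleep flagged above
        polled = cursor.poll()
```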

@tooptoop4 tooptoop4 added the enhancement:request Enhancement request submitted by anyone from the community label Jun 23, 2020

etr2460 (Member) commented Jun 24, 2020

@john-bodley @villebro @mistercrunch, is this a viable solution? What are the performance/load implications of polling a cursor 10 times more often? Is there a way to avoid polling the cursor entirely and instead use some async magic to await a response?

@etr2460 etr2460 added the change:backend Requires changing the backend label Jun 24, 2020
ktmud (Member) commented Jun 24, 2020

We could probably adjust the polling interval according to the number of retries, i.e., use a longer interval the longer we have already waited:

time.sleep(min(0.2 + 0.5 * num_retries, 5))
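
Dropped into the polling loop, that would need a retry counter; a minimal sketch (num_retries here is just a hypothetical local variable):

```python
import time

def poll_with_linear_backoff(cursor):
    # Start at 0.2s between polls, add 0.5s per attempt, cap at 5s.
    num_retries = 0
    polled = cursor.poll()
    while polled:
        time.sleep(min(0.2 + 0.5 * num_retries, 5))
        num_retries += 1
        polled = cursor.poll()
```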

villebro (Member) commented Jun 24, 2020

Good idea to use some form of exponential or other backoff. However, perhaps leverage the backoff package, as it's already in the Python deps. If Presto can handle shorter polling intervals, why not. But that discussion needs to be resolved on PyHive. @bkyryliuk thoughts?

Having said all this, while unnecessary waiting should be avoided, I'm personally of the inclination that it's OK for OLAP-type queries to take a few seconds. Since I expect people to be running fairly expensive queries through SQL Lab, waiting a few seconds for something that completed in 0.1 seconds, or conversely waiting 32 seconds for something that took 30.1 seconds, usually isn't that big a deal for me, especially if it can spare a busy DB some unnecessary load.
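
For reference, a minimal sketch of what that could look like with the backoff package, assuming PyHive's cursor.poll() returns None once the query has finished (the function name is made up):

```python
import backoff

@backoff.on_predicate(backoff.fibo, lambda status: status is not None, max_value=5)
def wait_for_query(cursor):
    # Retried with Fibonacci backoff (capped at 5s between attempts) while
    # poll() keeps returning a status dict; returns once poll() yields None.
    return cursor.poll()
```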

etr2460 (Member) commented Jun 24, 2020

Echoing @villebro and @ktmud, some sort of backoff sounds good, so that a 0.1 second query comes back in 0.2 or 0.4 seconds, but the 30 second query comes back in 32.

I will note that performance for queries in the 1-to-5-second range does appear especially bad on Superset's side, because of all the steps required to run the query:

  1. Receive the request
  2. Schedule a task for a worker
  3. Pick up the worker task
  4. Run the query (polling every 2 seconds)
  5. Write the result to the results backend (S3)
  6. Wait for the next poll from the client side (also every 2 seconds, I think)
  7. Download the results from S3

This means that a query that takes a second or two on the CLI can take 2-3 times as long in SQL Lab. Any improvements we can make here without overloading services seem worth considering.

In the long term, moving to the approach outlined in #9190 will solve a lot of these issues too, but I think updating the cursor wait time here is a valid place to start.

tooptoop4 (Contributor, Author) commented Jun 24, 2020

I reckon there is no harm in polling every 0.1 seconds; the Presto JDBC driver does that anyway (the website says "Presto is an open source distributed SQL query engine for running interactive analytic queries"). I've heard of the 3-second rule for keeping attention (https://paceadv.com/the-3-second-rule/): if your query takes 3.5 seconds you get a bit of disengagement, but if it's a snappy < 1.5 seconds, users stay hooked. If others are concerned about increasing load on Presto, perhaps allow these intervals to be set in the data source's config?

ktmud (Member) commented Jun 24, 2020

  • Write the result to the results backend (s3)

Is it required for Superset to write the query results cache to S3? Why not Redis? And why does it need to download the cache from S3 again?

bkyryliuk (Member) commented

@ktmud it is configurable in the Superset config; S3 is more scalable if you have many users and large result sets.

@villebro it seems the poll interval is configurable: https://github.com/dropbox/PyHive/blob/8eb0aeab8ca300f3024655419b93dad926c1a351/pyhive/presto.py#L93. I don't have capacity to make the change, but I can review the code if someone wants to expose it as a SQLAlchemy connection string parameter.
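
For what it's worth, that parameter can already be passed when connecting directly through PyHive (a sketch; host, port and catalog are placeholders):

```python
from pyhive import presto

# poll_interval is forwarded from the connection kwargs to the cursor.
conn = presto.connect("presto-host", port=8080, catalog="hive", poll_interval=0.1)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
```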

john-bodley (Member) commented

SQLAlchemy supports setting this (see here), so no changes to PyHive are necessary.
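
Something along these lines should work, since connect_args is forwarded to PyHive's connect() and from there to the cursor (a sketch; the URL is a placeholder):

```python
from sqlalchemy import create_engine

engine = create_engine(
    "presto://presto-host:8080/hive",
    connect_args={"poll_interval": 0.1},  # picked up by PyHive's cursor
)
```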

etr2460 (Member) commented Jun 25, 2020

@ktmud perhaps I was a bit unclear in my description. When the query runs on the async worker, the results get written to a results backend once it completes; we use S3 for the reasons @bkyryliuk mentioned. We then mark the query as successfully completed and add the cache key to the query record.

Then, we need to get the results to the client. This is done by polling the queries endpoint every 2 seconds. Once the query is labeled as success in that response, we call Superset again with the cache key on the query object to actually download the results

ktmud (Member) commented Jun 25, 2020

Thanks for the detailed explanation, @etr2460 and @bkyryliuk .

etr2460 (Member) commented Jun 26, 2020

I started working on this today (bf0025f), but I don't think it makes a lot of sense to add a backoff within Superset if PyHive only supports constant polling intervals. Therefore, I see a few options, ranked in my order of preference:

  1. Update the sleep within handle_cursor to equal the poll_interval configured in the Presto database's engine_params object. Then both params will be aligned and fully configurable by the admin (see the sketch below).
  2. Open a PR against PyHive to support a generator function as well as a constant value for poll_interval, then update Superset to default to some variety of Fibonacci backoff for both.
  3. Manually update the constant sleep value in Superset to something smaller than 1 second (maybe 0.5 for now), and default the Presto engine params to send 0.5 as well. This isn't configurable, but it would have immediate performance improvements.
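
For illustration, option 1 could look roughly like this inside handle_cursor (a sketch only; the way the configured connect_args are read here is approximated, not actual Superset code):

```python
import time

def handle_cursor(cursor, query):
    # Option 1 sketch: reuse the admin-configured PyHive poll_interval
    # (engine_params -> connect_args) instead of the hardcoded 1-second sleep.
    connect_args = (
        query.database.get_extra()
        .get("engine_params", {})
        .get("connect_args", {})
    )
    poll_interval = connect_args.get("poll_interval", 1)  # keep 1s as the fallback
    polled = cursor.poll()
    while polled:
        time.sleep(poll_interval)
        polled = cursor.poll()
```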

@john-bodley @bkyryliuk @ktmud @villebro @tooptoop4, thoughts?

bkyryliuk (Member) commented

I like approach #1.
Honestly speaking, getting the query results within 2 seconds vs 0.1 seconds doesn't seem like a big user win, which is why I would prefer the simplest approach here.

etr2460 (Member) commented Jun 30, 2020

@tooptoop4 if you use the master branch, you should be able to set the poll_interval param in the database extra => engine_params => connect_args, and that'll control both PyHive's and Superset's poll interval at the same time. Hope this works for you!
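
Concretely, the database's Extra field would contain something along these lines (a sketch; 0.2 is just an example value):

```json
{
    "engine_params": {
        "connect_args": {
            "poll_interval": 0.2
        }
    }
}
```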
