[SPARK-18107][SQL][FOLLOW-UP] Insert overwrite statement runs much slower in spark-sql than it does in hive-client #15726
Conversation
cc @snodawn Would you like to test this patch for dynamic partitions? Thanks.
Test build #67945 has finished for PR 15726 at commit
cc @ericl Do we need to do this for data source tables?
In datasource tables we already delete the partition beforehand, so this …
@viirya Ok, I will try it soon.
I have tested the new patch for dynamic partitions. It still takes a long time to run the overwrite statement, the same as with hive 1.2.1. The execution logs show that when running with dynamic partitions it moves each file of the partition to .Trash instead of the whole partition directory, which may cost a lot of time.
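To make that cost difference concrete, here is a minimal sketch using the Hadoop FileSystem and Trash APIs; the object and method names are illustrative and this is not Spark's actual code path. Trashing the whole partition directory is a single move, while trashing each file under it is one move per file, which is what the logs above suggest the dynamic-partition path was doing.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, Trash}

object TrashCostSketch {
  // Static-partition style: one trash move for the whole partition directory.
  def trashWholePartition(fs: FileSystem, partitionDir: Path, conf: Configuration): Unit = {
    Trash.moveToAppropriateTrash(fs, partitionDir, conf)
  }

  // Dynamic-partition style as observed in the logs: one trash move per file,
  // so a partition with thousands of files pays thousands of filesystem calls.
  def trashEachFile(fs: FileSystem, partitionDir: Path, conf: Configuration): Unit = {
    fs.listStatus(partitionDir).foreach { status =>
      Trash.moveToAppropriateTrash(fs, status.getPath, conf)
    }
  }
}
```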
@snodawn Thanks for reporting this. One thing I want to make sure of is how you tested that. Did you insert into the partition first and then overwrite the existing partition? Or did you just use insert overwrite to write to a new partition, i.e., not actually overwriting anything?
@snodawn OK. I got the reason why dynamic partition insertion is still much slower than Hive 2.0. There is another patch that optimizes dynamic partitions, apache/hive@d297b51. Basically it turns the sequential dynamic partition insertion into many asynchronous tasks running on an executor pool. We can also do it in …
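A rough sketch of that idea, assuming a hypothetical `loadPartition` callback (none of these names come from Hive or Spark): submit one asynchronous task per partition on a fixed-size pool instead of loading partitions sequentially.

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ParallelPartitionLoadSketch {
  // `partitions` are partition specs, e.g. Map("dt" -> "2016-11-02");
  // `loadPartition` is a placeholder for whatever loads a single partition.
  def loadAll(partitions: Seq[Map[String, String]],
              loadPartition: Map[String, String] => Unit,
              poolSize: Int = 8): Unit = {
    val pool = Executors.newFixedThreadPool(poolSize)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // One asynchronous task per partition instead of a sequential loop.
      val tasks = partitions.map(spec => Future(loadPartition(spec)))
      Await.result(Future.sequence(tasks), Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }
}
```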
IIUC, you would have to call …
@ericl Thanks. Looks like we have more than one level of lock (at least two in …). Although it is still possible to have a workaround by having customized methods to wrap those …
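To illustrate the kind of wrapping being discussed (all names here are hypothetical, not Spark's HiveClient internals): with a single client lock, per-call wrappers serialize every metastore operation, so the executor-pool approach above gains little; a coarser wrapper that batches operations under one acquisition is one conceivable workaround, but only if there really is just one lock level.

```scala
object ClientLockSketch {
  // Hypothetical single lock guarding a shared Hive client.
  private val clientLock = new Object

  // Per-call wrapper: every metastore operation re-acquires the lock,
  // so tasks submitted from an executor pool still run one at a time.
  def withClient[T](op: => T): T = clientLock.synchronized { op }

  // Batch wrapper: acquire the lock once for a group of operations.
  // This only helps if there is a single lock level to work around.
  def withClientBatch[T](ops: Seq[() => T]): Seq[T] =
    clientLock.synchronized { ops.map(_.apply()) }
}
```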
@viirya I tested both insert overwrite into a new partition and into an existing partition. Of course, insert overwrite into a new partition runs faster.
Yeah, this sounds like more complexity than it's worth. We should probably fix the hive client locking issue first.
@ericl About the hive client locking issue, is there anything you can suggest?
Test build #68102 has finished for PR 15726 at commit
Hm, iirc the issue is not super hard to fix, but basically since the hive …
@ericl yeah, I have checked the code with …
@viirya I think there are a few options there, either …
@ericl Currently I prefer the first one, let …
@viirya that makes sense to me
@ericl I have been thinking about this recently. What I am not sure about is whether this multiple-hive-client approach is safe to use in a multi-threaded environment. E.g., for now, because we synchronize on the single hive client, we run hive operations in sequence. Once we have multiple hive clients, would concurrent hive operations conflict with each other? My first thought is that because the hive operations use the metastore, these operations would need to acquire some locks on the items (e.g., tables) in the metastore before running. Is my guess correct or not?
@yhuai do you know if it would be safe to have multiple concurrent Hive operations in …
I will close this for now and maybe reopen it when we get an answer from @yhuai.
@viirya Is it possible to upgrade the built-in hive-exec to resolve this problem? We are facing the same problem: insert overwrite with dynamic partitions is extremely slow; the data is written within 5 minutes, but the following action takes more than 1 hour. I believe the built-in hive-exec is this https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
@viirya @yuananf a quick check of recent Spark releases shows this fix is not in. Any suggested workarounds in the meantime for dynamic partition insert overwrites? It sounds like if the user deletes the necessary partitions before running the dynamic insert overwrite query, then Hive will go down the "happy" performant path. This requires calculating the dynamic partitions before running the insert query, but if you can do that, then this workaround should work, right?
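For what it's worth, one way to spell out that workaround in Spark SQL, with hypothetical table and column names (`target_tbl`, `src`, `dt`); this is a sketch of the idea in the comment above, not a verified fix:

```scala
import org.apache.spark.sql.SparkSession

object PreDropPartitionsWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Compute the partition values the insert is about to touch.
    val partitionValues =
      spark.sql("SELECT DISTINCT dt FROM src").collect().map(_.getString(0))

    // Drop those partitions up front so the overwrite does not walk the old files.
    partitionValues.foreach { dt =>
      spark.sql(s"ALTER TABLE target_tbl DROP IF EXISTS PARTITION (dt='$dt')")
    }

    // Then run the dynamic partition insert overwrite as usual.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("INSERT OVERWRITE TABLE target_tbl PARTITION (dt) SELECT value, dt FROM src")
  }
}
```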
Do we have any solutions so far to resolve or work around this issue? Spark 2.4.3 also hits this problem.
Hmm, since the Spark community is working on upgrading the Hive version in Spark, I think once that is done this shouldn't be an issue anymore.
What changes were proposed in this pull request?
As reported on the jira, the insert overwrite statement runs much slower in Spark than it does in hive-client.
We have addressed this issue for static partitions at #15667. This is a follow-up PR for #15667 to address dynamic partitions.
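For context, the two flavors of the statement involved, with illustrative table and column names: the static-partition form handled by #15667, and the dynamic-partition form this follow-up targets.

```scala
import org.apache.spark.sql.SparkSession

object InsertOverwriteFlavors {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Static partition: the partition value is fixed in the statement (#15667).
    spark.sql(
      "INSERT OVERWRITE TABLE target_tbl PARTITION (dt='2016-11-02') SELECT value FROM src")

    // Dynamic partition: the partition values come from the query result (this PR).
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql(
      "INSERT OVERWRITE TABLE target_tbl PARTITION (dt) SELECT value, dt FROM src")
  }
}
```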
How was this patch tested?
Jenkins tests.
There are existing tests using insert overwrite statements; those tests should pass. I added a new test specifically for insert overwrite into a dynamic partition.
Regarding the performance issue, as I don't have a Hive 2.0 environment, this needs the reporter to verify it. Please refer to the jira.