
Conversation

@viirya
Member

@viirya viirya commented Nov 2, 2016

What changes were proposed in this pull request?

As reported on the JIRA, an insert overwrite statement runs much slower in Spark than in hive-client.

We addressed this issue for static partitions in #15667. This is a follow-up PR to #15667 that addresses dynamic partitions.

How was this patch tested?

Jenkins tests.

There are existing tests that use insert overwrite statements; those tests should pass. I also added a new test specifically for insert overwrite into dynamic partitions.

For the performance issue, as I don't have a Hive 2.0 environment, this needs the reporter to verify. Please refer to the JIRA.
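For reference, here is a minimal sketch of the kind of statement under discussion, assuming a SparkSession named `spark`; the table and column names are hypothetical:

```scala
// Enable dynamic partitioning and create a partitioned Hive table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
  "CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) " +
  "PARTITIONED BY (dt STRING)")

// Overwriting partitions that already exist is where the slowdown
// relative to hive-client shows up.
spark.sql(
  "INSERT OVERWRITE TABLE sales PARTITION (dt) " +
  "SELECT id, amount, dt FROM staging_sales")
```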


@viirya
Member Author

viirya commented Nov 2, 2016

cc @snodawn Would you like to test this patch for dynamic partitions? Thanks.

@SparkQA

SparkQA commented Nov 2, 2016

Test build #67945 has finished for PR 15726 at commit eae8f1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 2, 2016

cc @ericl

Do we need to do this for data source tables?

@ericl
Contributor

ericl commented Nov 2, 2016

For data source tables we already delete the partition beforehand, so this should not be needed (we also don't follow the Hive insert path, so I don't know if the perf regression exists there).


@snodawn

snodawn commented Nov 2, 2016

@viirya Ok, I will try it soon.

@snodawn

snodawn commented Nov 2, 2016

I have tested the new patch for dynamic partitions. The overwrite statement still takes a long time to run, the same as with Hive 1.2.1. The execution logs show that with dynamic partitions it moves each file of the partition to .Trash individually instead of moving the whole partition at once, which likely accounts for the long runtime.

@viirya
Member Author

viirya commented Nov 2, 2016

@snodawn Thanks for reporting this.

One thing I want to confirm is how you tested it. Did you insert into the partition first and then overwrite the existing partition? Or did you just use insert overwrite to write to a new partition, i.e., not actually overwrite anything?

@viirya
Member Author

viirya commented Nov 2, 2016

@snodawn OK, I found the reason why dynamic partition insertion is still much slower than in Hive 2.0.

There is another patch that optimizes dynamic partitions: apache/hive@d297b51.

Basically, it turns the sequential dynamic partition insertion into many asynchronous tasks running on an executor pool.

We could do the same in InsertIntoHiveTable for dynamic partitions. I'm not sure whether it's worth it. @ericl @rxin What do you think?
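For illustration, a rough sketch of that idea, where `loadPartition` is a stand-in for the real per-partition Hive client call (not Spark's actual API):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Run the per-partition loads as asynchronous tasks on a fixed pool,
// mirroring the idea in apache/hive@d297b51.
def loadAllPartitions(
    partitionSpecs: Seq[Map[String, String]],
    loadPartition: Map[String, String] => Unit): Unit = {
  val pool = Executors.newFixedThreadPool(8)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    val futures = partitionSpecs.map(spec => Future(loadPartition(spec)))
    futures.foreach(f => Await.result(f, Duration.Inf))
  } finally {
    pool.shutdown()
  }
}
```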

@ericl
Contributor

ericl commented Nov 2, 2016

IIUC, you would have to call loadPartition in parallel for each newly created partition instead of calling loadDynamicPartitions once? There might be an issue there, since each Hive client operation in Spark currently holds a global lock, so it would all be serialized anyway.
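For context, the serialization described here boils down to a pattern like the following simplified sketch (not Spark's actual code):

```scala
object HiveClientLockSketch {
  // Every client operation funnels through one shared lock, so even
  // loadPartition calls issued from many threads run one at a time.
  private val clientLock = new Object

  def retryLocked[A](f: => A): A = clientLock.synchronized {
    f // the real retryLocked in HiveClientImpl also retries transient errors
  }
}
```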

@viirya
Member Author

viirya commented Nov 3, 2016

@ericl Thanks. It looks like we have more than one level of locking (at least two, in HiveExternalCatalog and HiveClientImpl). This might be hard to tackle.

Although it would still be possible to work around it with customized methods that wrap those loadPartition calls as one task which obtains the locks, I think the work may not be worth it.

@ericl @rxin Do you agree?

@snodawn

snodawn commented Nov 3, 2016

@viirya I tested both overwriting a new partition and overwriting an existing partition. Of course, overwriting a new partition runs faster.

@ericl
Contributor

ericl commented Nov 3, 2016

Yeah, this sounds like more complexity than it's worth. We should probably fix the Hive client locking issue first.

@viirya
Member Author

viirya commented Nov 4, 2016

@ericl About the Hive client locking issue, is there anything you can suggest?

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68102 has finished for PR 15726 at commit 4624f1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl
Contributor

ericl commented Nov 4, 2016

Hm, IIRC the issue is not super hard to fix, but basically, since the Hive Thrift client is not thread-safe, only one caller can use it at a time. We would need some sort of Hive client pool to solve the issue (search for retryLocked in the Hive client management code to see the global lock).
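A minimal sketch of the pool idea, assuming a generic client type that is safe to use from one thread at a time (hypothetical, not Spark's actual API):

```scala
import java.util.concurrent.ArrayBlockingQueue

// Callers borrow a client, use it exclusively, and return it, so
// independent operations no longer contend on a single connection.
class HiveClientPool[C](size: Int, newClient: () => C) {
  private val clients = new ArrayBlockingQueue[C](size)
  (1 to size).foreach(_ => clients.put(newClient()))

  def withClient[A](f: C => A): A = {
    val c = clients.take() // blocks until a client is free
    try f(c) finally clients.put(c)
  }
}
```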


@viirya
Member Author

viirya commented Nov 4, 2016

@ericl Yeah, I have checked the code around retryLocked in HiveClientImpl. Do you mean we could create multiple Hive clients in a pool and serve requests concurrently?

@ericl
Contributor

ericl commented Nov 4, 2016

@viirya I think there are a few options there: either HiveClientImpl can create multiple internal Thrift clients (see private def client there), or the external catalog could create multiple clients.

@viirya
Member Author

viirya commented Nov 5, 2016

@ericl Currently I prefer the first one, letting HiveClientImpl create multiple internal Thrift clients, since I'd rather not change the external catalog for this.
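A sketch of what option 1 might look like, using a ThreadLocal so each thread gets its own underlying handle; `MetastoreClient` and `connect()` are placeholders, not real Spark names:

```scala
// Placeholder for the real Thrift client type.
class MetastoreClient

object ThreadLocalClientSketch {
  private def connect(): MetastoreClient = new MetastoreClient

  // One underlying client per thread, created lazily on first use.
  private val localClient = new ThreadLocal[MetastoreClient] {
    override def initialValue(): MetastoreClient = connect()
  }

  def client: MetastoreClient = localClient.get()
}
```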

@ericl
Contributor

ericl commented Nov 5, 2016

@viirya that makes sense to me

@viirya
Member Author

viirya commented Nov 21, 2016

@ericl I have been thinking about this recently. What I am not sure about is whether this multiple-Hive-client approach is safe in a multi-threaded environment. For example, right now, because we synchronize on the single Hive client, we run Hive operations in sequence. Once we have multiple Hive clients, would concurrent Hive operations conflict with each other?

My first thought is that, because Hive operations go through the metastore, these operations would need to acquire locks on the relevant items (e.g., tables) in the metastore before running. Is my guess correct?

@ericl
Contributor

ericl commented Nov 22, 2016

@yhuai Do you know if it would be safe to have multiple concurrent Hive operations in HiveClientImpl? From a cursory audit of the code, it seems that only thread-local state is mutated in withHiveState, so maybe it's no different from having multiple Spark clusters connect to the same metastore.

@viirya
Member Author

viirya commented Jan 5, 2017

I will close this for now and maybe reopen it when we get a definitive answer from @yhuai.

@viirya viirya closed this Jan 5, 2017
@yuananf

yuananf commented Apr 6, 2017

@viirya Is it possible to upgrade the built-in hive-exec to resolve this problem? We are facing the same issue: insert overwrite with dynamic partitions is extremely slow. Writing the data finishes in 5 minutes, but the following action takes more than 1 hour.

I believe the built-in hive-exec is this: https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
Can we just upgrade it to release-2.1.1-spark2 or something else?

@grantnicholas

@viirya @yuananf A quick check of recent Spark releases shows this fix is not in. Any suggested workarounds in the meantime for dynamic partition insert overwrites?

It sounds like if the user deletes the necessary partitions before running the dynamic insert overwrite query, then Hive will go down the "happy", performant path. This requires calculating the dynamic partitions before running the insert query, but if you can do that, this workaround should work, right? (A sketch follows.)
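A sketch of that workaround, assuming a SparkSession named `spark`; the table and column names are hypothetical:

```scala
// Compute the dynamic partition values up front, drop those partitions,
// then insert so the write takes the faster "new partition" path.
val dts = spark.sql("SELECT DISTINCT dt FROM staging_sales")
  .collect()
  .map(_.getString(0))

dts.foreach { dt =>
  spark.sql(s"ALTER TABLE sales DROP IF EXISTS PARTITION (dt='$dt')")
}

spark.sql(
  "INSERT OVERWRITE TABLE sales PARTITION (dt) " +
  "SELECT id, amount, dt FROM staging_sales")
```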

@kaiseu

kaiseu commented May 16, 2019

Do we have any solutions so far to resolve or work around this issue? Spark 2.4.3 also hits this problem.

@viirya
Member Author

viirya commented May 16, 2019

Hmm, since the Spark community is working on upgrading the Hive version in Spark, I think this shouldn't be an issue once that is done.
