
Conversation

@ulysses-you
Contributor

What changes were proposed in this pull request?

Unify the getPrefix function of UTF8String and ByteArray.

Why are the changes needed?

When executing the sort operator, we first compare prefixes. However, the getPrefix function of ByteArray is slow: we use the first 8 bytes as the prefix, so we may call Platform.getByte up to 8 times, which is slower than a single call to Platform.getInt or Platform.getLong.
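To illustrate the cost difference, here is a minimal, self-contained sketch (hypothetical class and method names; `java.nio.ByteBuffer` stands in for Spark's internal `Platform` accessor) showing that one bulk 8-byte read produces the same big-endian prefix as eight single-byte reads:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PrefixSketch {
    // Byte-by-byte prefix assembly (the slow path this PR replaces):
    static long prefixByteByByte(byte[] bytes) {
        long p = 0;
        for (int i = 0; i < Math.min(bytes.length, 8); i++) {
            p |= (bytes[i] & 0xFFL) << (56 - 8 * i); // big-endian byte order
        }
        return p;
    }

    // Single bulk read (the fast path), assuming at least 8 bytes are available:
    static long prefixBulk(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(prefixByteByByte(data) == prefixBulk(data)); // true
    }
}
```

The byte-by-byte loop performs eight shifts, masks, and memory accesses per prefix, while the bulk path is a single read plus (on little-endian hardware) one byte swap.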

Does this PR introduce any user-facing change?

no

How was this patch tested?

Passes `org.apache.spark.util.collection.unsafe.sort.PrefixComparatorsSuite`.

@github-actions github-actions bot added the SQL label Oct 13, 2021
@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48649/

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48649/

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48653/

@SparkQA

SparkQA commented Oct 13, 2021

Test build #144171 has finished for PR 34267 at commit 7f5c441.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48653/

@SparkQA

SparkQA commented Oct 13, 2021

Test build #144175 has finished for PR 34267 at commit 5b7fb3e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 59 to 89
Contributor

I see this code is just copied from UTF8String, but it seems like we could make it much more concise, and emphasize the differences between big/little endian more, by combining the branches in a different manner:

```java
final long p;
final long mask;
if (numBytes >= 8) {
  p = Platform.getLong(base, offset);
  mask = 0;
} else if (numBytes > 4) {
  p = Platform.getLong(base, offset);
  mask = (1L << (8 - numBytes) * 8) - 1;
} else if (numBytes > 0) {
  long pRaw = Platform.getInt(base, offset);
  p = IS_LITTLE_ENDIAN ? pRaw : (pRaw << 32);
  mask = (1L << (8 - numBytes) * 8) - 1;
} else {
  p = 0;
  mask = 0;
}
final long pBigEndian = IS_LITTLE_ENDIAN ? java.lang.Long.reverseBytes(p) : p;
return pBigEndian & ~mask;
```
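To make the mask arithmetic concrete: for `numBytes = 5`, `(1L << (8 - 5) * 8) - 1` sets the low 24 bits, and `& ~mask` zeroes the 3 trailing bytes of the big-endian word. A tiny standalone check (hypothetical `MaskDemo` class, independent of Spark):

```java
public class MaskDemo {
    // Keep the first numBytes bytes of a big-endian long, zeroing the rest,
    // mirroring the mask logic in the suggestion above.
    static long maskedPrefix(long bigEndianWord, int numBytes) {
        if (numBytes <= 0) return 0;
        long mask = numBytes >= 8 ? 0 : (1L << (8 - numBytes) * 8) - 1;
        return bigEndianWord & ~mask;
    }

    public static void main(String[] args) {
        long w = 0x1122334455667788L;
        // Keeps the top 5 bytes, zeroes the trailing 3:
        System.out.println(Long.toHexString(maskedPrefix(w, 5))); // 1122334455000000
    }
}
```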

Contributor Author

Thank you @xkrogen for the review. If you don't mind, I'll leave it as-is for now to help other reviewers easily follow this PR's original idea, since this is a code simplification. Also cc @srowen, @cloud-fan, @kiszk

Member

I agree we can make this more concise like so

Member

I agree, too

Contributor Author

OK, updated it. Thank you @xkrogen @srowen @kiszk!

Member

Rather than exposing this, just let each class have its own private copy.

Comment on lines 59 to 89
Member

I agree we can make this more concise like so

Member

package-private?

Member

nit: 0 -> 1

@kiszk
Member

kiszk commented Oct 15, 2021

After all the code changes are finished, could you please compare the results of SortBenchmark with and without this PR?
Sort is one of the critical operations in SQL.

@ulysses-you
Contributor Author

@kiszk, I looked at SortBenchmark; it doesn't seem to cover any code path changed by this PR. So I ran this benchmark instead:

```scala
val count = 100 * 1000 * 1000
benchmark.addTimerCase("byte array get prefix") { timer =>
  val array = Array.tabulate[Byte](8) { i => i.toByte }
  timer.startTiming()
  (1 to count).foreach(_ => BinaryPrefixComparator.computePrefix(array))
  timer.stopTiming()
}
```

Before this PR:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_271-b09 on Mac OS X 10.16
[info] Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[info] rows: 100000000:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] byte array get prefix                              1414           1543         182         70.7          14.1       1.0X
```

After this PR:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_271-b09 on Mac OS X 10.16
[info] Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[info] rows: 100000000:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] byte array get prefix                               237            246           7        421.4           2.4       1.0X
```


@cloud-fan
Contributor

looks fine

@SparkQA

SparkQA commented Oct 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48778/

@SparkQA

SparkQA commented Oct 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48778/

@SparkQA

SparkQA commented Oct 15, 2021

Test build #144299 has finished for PR 34267 at commit 91ce4a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```java
  return getPrefix(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length);
}

protected static long getPrefix(Object base, long offset, int numBytes) {
```
Member

This can be package-private (remove `protected`), but it doesn't matter much; it's a final class anyway.

Contributor Author

Removed it, as you preferred.

Contributor

@JoshRosen JoshRosen left a comment

LGTM.

I was curious to know if there was a reason why these codepaths were originally separate, so I dug into the commit history to find out:

  • Sorting of binary type data was originally introduced in #2617. This implemented an ordering based on signed byte array comparisons.
  • PrefixComparators support for BinaryType was added in #7676. There was some discussion of copying the approach from UTF8String but we didn't do that because of the need to use signed comparisons.
  • Later in #18571 we realized that the use of signed comparison for byte arrays was incorrect and switched to unsigned comparisons.

This explains why those paths historically diverged. I'm glad that this PR cleaned things up and got a performance benefit along the way.
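The signed-vs-unsigned issue from #18571 is easy to reproduce: a byte with the high bit set compares as negative under signed comparison, inverting the lexicographic order. A minimal demonstration (plain Java, no Spark dependencies):

```java
public class UnsignedCompare {
    public static void main(String[] args) {
        byte a = (byte) 0x7F; // 127 in both interpretations
        byte b = (byte) 0x80; // -128 signed, 128 unsigned

        // Signed comparison says a > b, which breaks lexicographic byte order:
        System.out.println(Byte.compare(a, b) > 0); // true

        // Masking with 0xFF gives the correct unsigned order, a < b:
        System.out.println(Integer.compare(a & 0xFF, b & 0xFF) < 0); // true
    }
}
```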

@kiszk
Member

kiszk commented Oct 16, 2021

@ulysses-you looks good.

I believe that, since JVMs are either 4-byte or 8-byte aligned, the byte-size checks are correct now.
For future SW and HW platforms, can we add simple test cases for ByteArray that access 1, 2, ..., 7, 8 bytes? It would help future porting to new platforms.
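A sketch of the kind of test being requested, checking every length from 1 to 8 against a byte-by-byte reference (hypothetical helper names; `ByteBuffer` is used in place of `Platform` so the snippet is self-contained):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PrefixLengthTest {
    // Reference: assemble the prefix byte by byte, big-endian, zero-padded.
    static long reference(byte[] bytes) {
        long p = 0;
        for (int i = 0; i < Math.min(bytes.length, 8); i++) {
            p |= (bytes[i] & 0xFFL) << (56 - 8 * i);
        }
        return p;
    }

    // Candidate: pad to 8 bytes and do one bulk big-endian read.
    static long bulk(byte[] bytes) {
        byte[] padded = new byte[8];
        System.arraycopy(bytes, 0, padded, 0, Math.min(bytes.length, 8));
        return ByteBuffer.wrap(padded).order(ByteOrder.BIG_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        // Exercise every length 1..8, the cases alignment bugs would hit.
        for (int len = 1; len <= 8; len++) {
            byte[] arr = new byte[len];
            for (int i = 0; i < len; i++) arr[i] = (byte) (0xF0 + i);
            if (reference(arr) != bulk(arr)) throw new AssertionError("len=" + len);
        }
        System.out.println("ok");
    }
}
```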

@SparkQA

SparkQA commented Oct 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48803/

@SparkQA

SparkQA commented Oct 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48803/

@SparkQA

SparkQA commented Oct 16, 2021

Test build #144324 has finished for PR 34267 at commit 9fcc06b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@kiszk kiszk left a comment

Thank you

@srowen
Member

srowen commented Oct 16, 2021

Merged to master

@srowen srowen closed this in f9cc7fb Oct 16, 2021
@ulysses-you
Contributor Author

Thank you all!

@ulysses-you ulysses-you deleted the binary-prefix branch October 17, 2021 12:00
bcmcmill pushed a commit to bcmcmill/spark that referenced this pull request Oct 17, 2021
…nction of UTF8String and ByteArray

### What changes were proposed in this pull request?

Unify the getPrefix function of `UTF8String` and `ByteArray`.

### Why are the changes needed?

When executing the sort operator, we first compare prefixes. However, the getPrefix function of byte array is slow: we use the first 8 bytes as the prefix, so we may call `Platform.getByte` up to 8 times, which is slower than a single call to `Platform.getInt` or `Platform.getLong`.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass `org.apache.spark.util.collection.unsafe.sort.PrefixComparatorsSuite`

Closes apache#34267 from ulysses-you/binary-prefix.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
wangyum pushed a commit that referenced this pull request May 26, 2023
* [SPARK-36992][SQL] Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray

### What changes were proposed in this pull request?

Unify the getPrefix function of `UTF8String` and `ByteArray`.

### Why are the changes needed?

When executing the sort operator, we first compare prefixes. However, the getPrefix function of byte array is slow: we use the first 8 bytes as the prefix, so we may call `Platform.getByte` up to 8 times, which is slower than a single call to `Platform.getInt` or `Platform.getLong`.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass `org.apache.spark.util.collection.unsafe.sort.PrefixComparatorsSuite`

Closes #34267 from ulysses-you/binary-prefix.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-37037][SQL] Improve byte array sort by unify compareTo function of UTF8String and ByteArray

### What changes were proposed in this pull request?

Unify the compare function of `UTF8String` and `ByteArray`.

### Why are the changes needed?

`BinaryType` uses `TypeUtils.compareBinary` to compare two byte arrays, however it's slow since it compares the arrays byte by byte using unsigned int comparison.

We can compare them using `Platform.getLong` with unsigned long comparison when they have more than 8 bytes. And here is some history about this `TODO`: https://github.com/apache/spark/pull/6755/files#r32197461

The benchmark result should be same with `UTF8String`, can be found in #19180 (#19180 (comment))
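A self-contained sketch of the technique (using `ByteBuffer` rather than Spark's `Platform.getLong`, and ignoring alignment concerns): reading big-endian words makes `Long.compareUnsigned` agree with unsigned lexicographic byte order, so whole 8-byte chunks can be compared at once.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class CompareSketch {
    // Compare two byte arrays 8 bytes at a time with unsigned long comparison,
    // falling back to byte-wise unsigned comparison for the tail.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            long la = ByteBuffer.wrap(a, i, 8).order(ByteOrder.BIG_ENDIAN).getLong();
            long lb = ByteBuffer.wrap(b, i, 8).order(ByteOrder.BIG_ENDIAN).getLong();
            if (la != lb) return Long.compareUnsigned(la, lb);
        }
        for (; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF); // unsigned byte comparison
            if (d != 0) return d;
        }
        return a.length - b.length; // shorter prefix sorts first
    }

    public static void main(String[] args) {
        System.out.println(compare("apple".getBytes(), "apply".getBytes()) < 0); // true
    }
}
```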

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Move test from `TypeUtilsSuite` to `ByteArraySuite`

Closes #34310 from ulysses-you/SPARK-37037.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-37341][SQL] Avoid unnecessary buffer and copy in full outer sort merge join

### What changes were proposed in this pull request?

FULL OUTER sort merge join (non-code-gen path) [copies join keys and buffers input rows, even when rows from both sides do not have matched keys](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641). This is unnecessary, as we can just output the row with smaller join keys, and only buffer when both sides have matched keys. This would save us from unnecessary copy and buffer, when both join sides have a lot of rows not matched with each other.

### Why are the changes needed?

Improve query performance for FULL OUTER sort merge join when code-gen is disabled.
This would benefit query when both sides have a lot of rows not matched, and join key is big in terms of size (e.g. string type).

Example micro benchmark:

```
  def sortMergeJoin(): Unit = {
    val N = 2 << 20
    codegenBenchmark("sort merge join", N) {
      val df1 = spark.range(N).selectExpr(s"cast(id * 15485863 as string) as k1")
      val df2 = spark.range(N).selectExpr(s"cast(id * 15485867 as string) as k2")
      val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
      df.noop()
    }
  }
```

Seeing run-time improvement over 60%:

```
Running benchmark: sort merge join
  Running case: sort merge join without optimization
  Stopped after 5 iterations, 10026 ms
  Running case: sort merge join with optimization
  Stopped after 5 iterations, 5954 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
sort merge join:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
sort merge join without optimization               1807           2005         157          1.2         861.4       1.0X
sort merge join with optimization                  1135           1191          62          1.8         541.1       1.6X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests e.g. `OuterJoinSuite.scala`.

Closes #34612 from c21/smj-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val

### What changes were proposed in this pull request?

This PR adds caching to `LogicalPlan.isStreaming()`: the default implementation's result will now be cached in a `private lazy val`.

### Why are the changes needed?

This improves the performance of the `DeduplicateRelations` analyzer rule.

The default implementation of `isStreaming` recursively visits every node in the tree. `DeduplicateRelations.renewDuplicatedRelations` is recursively invoked on every node in the tree and each invocation calls `isStreaming`. This leads to `O(n^2)` invocations of `isStreaming` on leaf nodes.

Caching `isStreaming` avoids this performance problem.
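The pattern can be sketched outside Spark with a toy tree node (hypothetical `Node` class, not Spark's API): the first `isStreaming()` call walks the subtree once, and later calls return the cached result, which is what turns the O(n^2) analyzer behavior into O(n):

```java
import java.util.ArrayList;
import java.util.List;

public class Node {
    final List<Node> children = new ArrayList<>();
    final boolean selfStreaming;
    private Boolean cachedIsStreaming; // lazily computed, like a Scala lazy val

    Node(boolean selfStreaming) { this.selfStreaming = selfStreaming; }

    boolean isStreaming() {
        if (cachedIsStreaming == null) {
            // First call: traverse the subtree once and remember the answer.
            boolean result = selfStreaming;
            for (Node c : children) result |= c.isStreaming();
            cachedIsStreaming = result;
        }
        return cachedIsStreaming;
    }

    public static void main(String[] args) {
        Node root = new Node(false);
        root.children.add(new Node(true));
        System.out.println(root.isStreaming()); // true
    }
}
```

This assumes the tree is immutable after construction, which holds for Spark's logical plans.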

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Correctness should be covered by existing tests.

This significantly improved `DeduplicateRelations` performance in local microbenchmarking with large query plans (~20% reduction in that rule's runtime in one of my tests).

Closes #34691 from JoshRosen/cache-LogicalPlan.isStreaming.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37530][CORE] Spark reads many paths very slow though newAPIHadoopFile

### What changes were proposed in this pull request?

Same as #18441, we parallelize FileInputFormat.listStatus for newAPIHadoopFile
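The idea can be sketched generically (hypothetical `listOnePath` placeholder standing in for the per-path Hadoop listing work; this is not the actual Spark patch): fan out one listing task per input path and merge the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelList {
    // Submit one listing task per path, then merge all results.
    static List<String> listAll(List<String> paths, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<List<String>>> futures = new ArrayList<>();
            for (String p : paths) {
                futures.add(CompletableFuture.supplyAsync(() -> listOnePath(p), pool));
            }
            List<String> all = new ArrayList<>();
            for (CompletableFuture<List<String>> f : futures) all.addAll(f.join());
            return all;
        } finally {
            pool.shutdown();
        }
    }

    static List<String> listOnePath(String path) {
        // Placeholder: a real implementation would call the Hadoop FileSystem here.
        return Arrays.asList(path + "/part-0", path + "/part-1");
    }

    public static void main(String[] args) {
        System.out.println(listAll(Arrays.asList("/a", "/b"), 2).size()); // 4
    }
}
```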

### Why are the changes needed?

![image](https://user-images.githubusercontent.com/8326978/144562490-d8005bf2-2052-4b50-9a5d-8b253ee598cc.png)

Spark can be slow when accessing external storage at driver side, improve perf by parallelizing

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

passing GA

Closes #34792 from yaooqinn/SPARK-37530.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>

* [SPARK-37592][SQL] Improve performance of `JoinSelection`

When I was reading the AQE implementation, I found that the join-selection-with-hint logic contains a lot of cumbersome code.

Join hints have a relatively high learning curve for users, so in most cases SQL queries do not contain join hints.

Improve performance of `JoinSelection`.

No.
Just changes the internal implementation.

Jenkins test.

Closes #34844 from beliefer/SPARK-37592-new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37646][SQL] Avoid touching Scala reflection APIs in the lit function

### What changes were proposed in this pull request?

This PR proposes to avoid touching Scala reflection APIs in the lit function.

### Why are the changes needed?

Currently `lit` calls `typedlit[Any]` and touches Scala reflection APIs unnecessarily. As Scala reflection APIs touch multiple global locks and they are pretty slow when the parallelism is pretty high.

This PR inlines `typedlit` to `lit` and replaces `Literal.create` with `Literal.apply` to avoid touching Scala reflection APIs. There is no behavior change.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- New unit tests.
- Manually ran the test in https://issues.apache.org/jira/browse/SPARK-37646 and saw no difference between `new Column(Literal(0L))` and `lit(0L)`.

Closes #34901 from zsxwing/SPARK-37646.

Lead-authored-by: Shixiong Zhu <zsxwing@gmail.com>
Co-authored-by: Shixiong Zhu <shixiong@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

* [SPARK-37689][SQL] Expand should be supported in PropagateEmptyRelation

We met a case where, even though there is an empty relation, HashAggregateExec is still triggered and returns an empty result. That is unnecessary.
![image](https://user-images.githubusercontent.com/46485123/146725110-27496536-f1f7-4fac-ae2c-0f6f81159bba.png)
It's caused by an `Expand(EmptyLocalRelation())` that is not propagated; this PR supports propagating `Expand` over an empty LocalRelation.
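As a toy illustration of the rule (hypothetical mini plan classes, not Spark's optimizer API): a node whose child is an empty relation can itself be rewritten to an empty relation.

```java
public class PropagateEmpty {
    static abstract class Plan {}
    static class Empty extends Plan {}
    static class Expand extends Plan {
        final Plan child;
        Expand(Plan child) { this.child = child; }
    }

    // Toy rewrite analogous to the rule: Expand over an empty relation is empty.
    static Plan optimize(Plan p) {
        if (p instanceof Expand) {
            Plan child = optimize(((Expand) p).child);
            if (child instanceof Empty) return new Empty();
            return new Expand(child);
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(optimize(new Expand(new Empty())) instanceof Empty); // true
    }
}
```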

Avoid unnecessary execution.

No

Added UT

Closes #34954 from AngersZhuuuu/SPARK-37689.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36406][CORE] Avoid unnecessary file operations before delete a write failed file held by DiskBlockObjectWriter

We always do a file truncate operation before deleting a write-failed file held by `DiskBlockObjectWriter`; a typical process is as follows:

```
if (!success) {
  // This code path only happens if an exception was thrown above before we set success;
  // close our stuff and let the exception be thrown further
  writer.revertPartialWritesAndClose()
  if (file.exists()) {
    if (!file.delete()) {
      logWarning(s"Error deleting ${file}")
    }
  }
}
```
The `revertPartialWritesAndClose` method reverts writes that haven't been committed yet, but that doesn't seem necessary in this scenario.

So this PR adds a new method to `DiskBlockObjectWriter` named `closeAndDelete()`; the new method just reverts the write metrics and deletes the write-failed file.

Avoids unnecessary file operations.

Adds a new method to `DiskBlockObjectWriter` named `closeAndDelete()`.

Passes Jenkins or GitHub Actions.

Closes #33628 from LuciferYang/SPARK-36406.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>

* [SPARK-37462][CORE] Avoid unnecessary calculating the number of outstanding fetch requests and RPCS

Avoid unnecessarily calculating the number of outstanding fetch requests and RPCs.

It is unnecessary to calculate the number of outstanding fetch requests and RPCs when the IdleStateEvent is not IDLE or the last request has not timed out.

No.
Existing unit tests.

Closes #34711 from weixiuli/SPARK-37462.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

Co-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: Cheng Su <chengsu@fb.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Shixiong Zhu <zsxwing@gmail.com>
Co-authored-by: Shixiong Zhu <shixiong@databricks.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: weixiuli <weixiuli@jd.com>