[SPARK-9066][SQL] Improve cartesian performance #7417
SparkStrategies.scala

```diff
@@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.plans.logical.{BroadcastHint, LogicalPlan}
 import org.apache.spark.sql.catalyst.plans.physical._
 import org.apache.spark.sql.columnar.{InMemoryColumnarTableScan, InMemoryRelation}
 import org.apache.spark.sql.execution.datasources.{CreateTableUsing, CreateTempTableUsing, DescribeCommand => LogicalDescribeCommand, _}
+import org.apache.spark.sql.execution.joins.BuildSide
 import org.apache.spark.sql.execution.{DescribeCommand => RunnableDescribeCommand}
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.{SQLContext, Strategy, execution}
@@ -274,12 +275,30 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
   }

   object CartesianProduct extends Strategy {
+    def getSmallSide(left: LogicalPlan, right: LogicalPlan): BuildSide = {
+      if (right.statistics.sizeInBytes < left.statistics.sizeInBytes) {
+        joins.BuildRight
+      } else {
+        joins.BuildLeft
+      }
+    }
+
     def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
+      // If a side can be broadcast we use BroadcastNestedLoopJoin, since an inner join
+      // with a true condition is the same as a cartesian product.
+      case logical.Join(CanBroadcast(left), right, joinType, condition) =>
+        execution.joins.BroadcastNestedLoopJoin(
+          planLater(left), planLater(right), joins.BuildLeft, joinType, condition) :: Nil
+      case logical.Join(left, CanBroadcast(right), joinType, condition) =>
+        execution.joins.BroadcastNestedLoopJoin(
+          planLater(left), planLater(right), joins.BuildRight, joinType, condition) :: Nil
       case logical.Join(left, right, _, None) =>
-        execution.joins.CartesianProduct(planLater(left), planLater(right)) :: Nil
+        execution.joins.CartesianProduct(planLater(left), planLater(right),
+          getSmallSide(left, right)) :: Nil
       case logical.Join(left, right, Inner, Some(condition)) =>
         execution.Filter(condition,
-          execution.joins.CartesianProduct(planLater(left), planLater(right))) :: Nil
+          execution.joins.CartesianProduct(planLater(left), planLater(right),
+            getSmallSide(left, right))) :: Nil
       case _ => Nil
     }
   }
```

Contributor
instead of passing a …

Contributor
Actually I am a little concerned about switching sides based on the statistics, as I commented previously. And also, as @cloud-fan pointed out: what we actually care about is the … Probably we'd better add more statistics to the logical plan, say the partition number or the average file size of each partition. In order not to make further improvement confusing, I think we'd better remove this optimization rule for cartesian join; that's why I didn't do it in #8652. What do you think?

Contributor
Good point! This optimization should depend on record numbers, not data size.
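For what the heuristic actually compares: `getSmallSide` looks only at the estimated byte size of each side. Here is a minimal standalone sketch of that decision, together with the reviewers' objection that byte size and record count can disagree. `PlanStats` and the sample sizes are made up for illustration; this is not Catalyst code.

```scala
// Standalone sketch of the size-based build-side choice (not Spark code).
// PlanStats stands in for LogicalPlan.statistics; all names and numbers are hypothetical.
sealed trait BuildSide
case object BuildLeft extends BuildSide
case object BuildRight extends BuildSide

case class PlanStats(sizeInBytes: BigInt)

def getSmallSide(left: PlanStats, right: PlanStats): BuildSide =
  if (right.sizeInBytes < left.sizeInBytes) BuildRight else BuildLeft

object HeuristicDemo extends App {
  // The reviewers' concern: one huge row vs. many tiny rows.
  val left  = PlanStats(sizeInBytes = BigInt(100) * 1024 * 1024) // 1 row of ~100 MB
  val right = PlanStats(sizeInBytes = BigInt(10000000) * 8)      // 10M rows of 8 bytes

  // Picks BuildRight (right is fewer bytes), although right has far more records.
  println(getSmallSide(left, right))
}
```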
CartesianProduct.scala

```diff
@@ -28,7 +28,10 @@ import org.apache.spark.sql.execution.metric.SQLMetrics
  * :: DeveloperApi ::
  */
 @DeveloperApi
-case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNode {
+case class CartesianProduct(
+    left: SparkPlan,
+    right: SparkPlan,
+    buildSide: BuildSide) extends BinaryNode {
   override def output: Seq[Attribute] = left.output ++ right.output

   override private[sql] lazy val metrics = Map(
```

Contributor
I am not so sure the change is necessary, as if either side of the table is small enough, we will resort to the …

Contributor
@Sephiroth-Lin Can you explain the reason that we need …

Contributor (Author)
@yhuai buildSide is just used to know which side is small, and to decide whether we need to change the order.
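Why the order matters at all: when one RDD drives the outer loop of a cartesian product, the other side's partition iterator is effectively re-created once per outer record, so the side with fewer records should come first. A rough standalone model of that nested iteration follows; it is plain Scala with made-up names, a simplification of what `RDD.cartesian` does, not the actual CartesianRDD code.

```scala
// Rough model of the nested iteration inside a cartesian product (illustrative only).
object CartesianOrderDemo extends App {
  var rightScans = 0

  // Simulates re-reading the "right" partition; each call counts as one scan.
  def rightPartition(): Iterator[Int] = {
    rightScans += 1
    Iterator(1, 2, 3)
  }

  // The right partition is re-opened for every record of the left partition.
  def cartesian(left: Seq[Int]): Iterator[(Int, Int)] =
    left.iterator.flatMap(x => rightPartition().map(y => (x, y)))

  cartesian(Seq(10, 20)).size
  println(s"2 left records    -> right scanned $rightScans times") // 2

  rightScans = 0
  cartesian(1 to 1000).size
  println(s"1000 left records -> right scanned $rightScans times") // 1000
}
```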
```diff
@@ -50,11 +53,25 @@ case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNode {
       row.copy()
     }

-    leftResults.cartesian(rightResults).mapPartitions { iter =>
+    val (smallResults, bigResults) = buildSide match {
+      case BuildRight => (rightResults, leftResults)
+      case BuildLeft => (leftResults, rightResults)
+    }
+
+    // Use the small size RDD as the cartesian left RDD.
+    smallResults.cartesian(bigResults).mapPartitions { iter =>
       val joinedRow = new JoinedRow
-      iter.map { r =>
-        numOutputRows += 1
-        joinedRow(r._1, r._2)
+      buildSide match {
+        case BuildLeft =>
+          iter.map { r =>
+            numOutputRows += 1
+            joinedRow(r._1, r._2)
+          }
+        case BuildRight =>
+          iter.map { r =>
+            numOutputRows += 1
+            joinedRow(r._2, r._1)
+          }
       }
     }
   }
```
Contributor
Quick question. Why not use …

Contributor
Yes, using partition size here is not accurate. Consider an RDD with 100 partitions, each holding one record, and an RDD with 10 partitions, each holding 100 million records: the method above will cause more scans from HDFS.

Contributor (Author)
@hvanhovell Yes, using sizeInBytes is better, but it also has a problem: if leftResults has only one record and that record is big, while rightResults has many records whose total size is small, then this scenario will cause worse performance. The best way would be to check the total record count per partition, but we cannot get that right now.
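One detail worth calling out in the second hunk above: `output` stays `left.output ++ right.output`, so when `BuildRight` swaps which side drives the iteration, every produced pair must be flipped back before it is emitted. A tuple-based sketch of that invariant in standalone Scala (not the operator itself):

```scala
// Sketch of the row-order flip: the operator's output schema is always
// (left columns, right columns), even when the smaller side drives iteration.
object RowOrderDemo extends App {
  val left  = Seq("L1", "L2", "L3") // the logically-left rows
  val right = Seq("R1")             // the logically-right (and smaller) side

  // Physically iterate small x big, i.e. right x left here (the BuildRight case),
  // then flip each pair back to (leftRow, rightRow):
  val joined = for (r <- right; l <- left) yield (l, r)

  println(joined) // List((L1,R1), (L2,R1), (L3,R1)) -- schema order preserved
}
```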
I am not sure comparing the total size will give us a very useful heuristic, because the size of a table does not imply anything about the size of a partition. Also, I am inclined to get https://github.com/apache/spark/pull/8652/files merged first and then add any other optimization in a follow-up PR. What do you think?

OK, no problem.

Agree with @yhuai and @chenghao-intel. Just would like to provide a concrete example: the compression ratio of Parquet can easily be ten times that of JSON. When joining a Parquet table and a JSON table, you may even end up in a much worse situation.
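To put rough numbers on that Parquet/JSON example (the ratios and sizes below are illustrative only, not measured):

```scala
// Illustrative numbers only: how compression can skew a byte-size heuristic.
object CompressionSkew extends App {
  val parquetOnDisk = 100L << 20 // 100 MB of Parquet on disk...
  val parquetRatio  = 10.0       // ...at an assumed ~10x compression ratio
  val jsonOnDisk    = 300L << 20 // 300 MB of JSON on disk...
  val jsonRatio     = 1.0        // ...assumed roughly uncompressed

  val parquetLogical = parquetOnDisk * parquetRatio // ~1 GB of actual row data
  val jsonLogical    = jsonOnDisk * jsonRatio       // ~300 MB of actual row data

  // A sizeInBytes comparison picks the Parquet side as "small", even though
  // it carries roughly 3x more row data than the JSON side.
  println(parquetOnDisk < jsonOnDisk)   // true
  println(parquetLogical > jsonLogical) // true
}
```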