[SPARK-34026][SQL] Inject repartition and sort nodes to satisfy required distribution and ordering #31083
New file added by this diff (`@@ -0,0 +1,106 @@`):

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources.v2

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.Resolver
import org.apache.spark.sql.catalyst.expressions.{Ascending, Descending, Expression, NamedExpression, NullOrdering, NullsFirst, NullsLast, SortDirection, SortOrder}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RepartitionByExpression, Sort}
import org.apache.spark.sql.connector.distributions.{ClusteredDistribution, OrderedDistribution, UnspecifiedDistribution}
import org.apache.spark.sql.connector.expressions.{Expression => V2Expression, FieldReference, IdentityTransform, NullOrdering => V2NullOrdering, SortDirection => V2SortDirection, SortValue}
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}
import org.apache.spark.sql.internal.SQLConf

object DistributionAndOrderingUtils {

  def prepareQuery(write: Write, query: LogicalPlan, conf: SQLConf): LogicalPlan = write match {
    case write: RequiresDistributionAndOrdering =>
      val resolver = conf.resolver

      val distribution = write.requiredDistribution match {
        case d: OrderedDistribution =>
          d.ordering.map(e => toCatalyst(e, query, resolver))
        case d: ClusteredDistribution =>
          d.clustering.map(e => toCatalyst(e, query, resolver))
        case _: UnspecifiedDistribution =>
          Array.empty[Expression]
      }

      val queryWithDistribution = if (distribution.nonEmpty) {
        val numShufflePartitions = conf.numShufflePartitions
```

The review thread below is attached to the `conf.numShufflePartitions` line above.
Contributor

This looks like a limitation for data sources: with DSv1 they could inject arbitrary operations, including calling the repartition methods that Dataset provides. Most repartition methods take a "numPartitions" parameter, and the same is true for the repartitionByRange methods.
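For comparison, here is a minimal sketch of what that flexibility looks like in DSv1 (the provider class name, the "targetPartitions" option, and the `key` column are hypothetical): the source receives the incoming DataFrame and can repartition and sort it to whatever layout it needs before writing.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

// Hypothetical DSv1 provider: it receives the DataFrame and decides the parallelism itself.
class StateRewriteProvider extends CreatableRelationProvider {
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // e.g. match a fixed number of state store partitions, or a user-supplied rescale target
    val targetPartitions = parameters.getOrElse("targetPartitions", "200").toInt
    val prepared = data
      .repartition(targetPartitions, data("key"))  // exact partition count chosen by the source
      .sortWithinPartitions("key")                 // rows ordered within each partition
    // ... write `prepared` out, one task per (already sorted) partition ...
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType = data.schema
    }
  }
}
```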
Contributor

For example, I have a data source written against DSv1 that provides the ability to read the state store of a streaming query and rewrite it. While the number of partitions in the state store is determined by the number of shuffle partitions in the streaming query, that value is not guaranteed to be the same across applications. Furthermore, the data source supports rescaling, which requires repartitioning to an arbitrary number of partitions. It would be odd to have to tell users, "You should change the Spark configuration to set the target number of partitions."
Contributor

I have a commit that addresses this (HeartSaVioR@485c56b), but you deserve the full credit for this PR and I don't want to take part of it. That said, I'd prefer to handle this in a follow-up JIRA issue (I'll submit a new PR once this one is merged), but I'm also OK with addressing it in this PR (I'd submit a PR against your repo) if we prefer that.
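The linked commit is not reproduced here, but the general idea would presumably be an optional partition-count hint on the write side, with Spark falling back to its own shuffle setting when the source has no preference. A rough, self-contained sketch of that idea (the trait, method name, and default are assumptions, not the API in this PR):

```scala
// Hypothetical mixin: a write may report a target partition count, <= 0 meaning "no preference".
trait HasRequiredNumPartitions {
  def requiredNumPartitions: Int = 0
}

object NumPartitionsResolution {
  // confNumShufflePartitions would correspond to conf.numShufflePartitions in prepareQuery;
  // the write's preference wins only when it actually states one.
  def pick(write: HasRequiredNumPartitions, confNumShufflePartitions: Int): Int =
    if (write.requiredNumPartitions > 0) write.requiredNumPartitions
    else confNumShufflePartitions
}
```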
Contributor (Author)

@HeartSaVioR, I think it is going to be useful for some data sources. I did not want to cover this in the first PRs, as there was no consensus on the dev list around controlling the number of tasks. That's why I added this topic to the non-goals of the design doc. One point to think about is who should control the parallelism. I'd say the parallelism should depend on the incoming data volume in most cases (except when the number of requested partitions is static, as in the case mentioned above). Without statistics about the number of incoming records or their shape, it will be hard for a data source to determine the right number of partitions. That being said, making that number optional, as in your change, can be a reasonable starting point. However, I'd like us to think about how this will look in the future. Should Spark report stats about the incoming batch so that data sources can make a better estimate? What will that API look like?
Contributor

I agree with @aokolnychyi that Spark should decide these physical details (like numShufflePartitions) for better performance; it's an anti-pattern to let the data source decide. BTW, why do we use
Contributor (Author)

I guess I missed the change from @viirya that made it optional, @cloud-fan. I can switch to that.
Contributor

My concern is mostly the "static partitions" case I gave as an example, like the state data source. It's not a matter of whether it's an anti-pattern or not, because in that case the ability to restrict the number of partitions is not optional but required: the data must be partitioned exactly the same way Spark partitions rows for a hash shuffle, and a partition shouldn't be written concurrently. I don't think end users should have to repartition manually in their queries to avoid breaking things. That is easy to achieve in DSv1 (I have an implementation based on DSv1 and want to migrate it to DSv2), since Spark hands the DataFrame to the data source on write. While I don't expect the same flexibility from DSv2 (that behavior seems too open), I'm not sure this is a case we want to define as "not supported in DSv2, live with DSv1".
Contributor

Even without static partitioning, there may be more to think about: a data source may know physical details about the actual storage, which could help optimize writes or, in the opposite direction, throttle parallelism so we don't effectively DDoS the storage ourselves. I guess this is beyond the scope of this PR, but I wanted to bring it up as food for thought.
Contributor (Author)

I knew this would require discussion, so I propose to do it in a follow-up, as @HeartSaVioR suggested. I don't feel strongly about this point, and it would be great to accommodate all use cases to drive the adoption of DSv2. However, we should be careful, and I think we should keep discussing. I still believe that in the general case the parallelism should depend on data volume; that's why this will only be useful once Spark propagates stats about the incoming batch.
The diff continues:

```scala
        // the conversion to catalyst expressions above produces SortOrder expressions
        // for OrderedDistribution and generic expressions for ClusteredDistribution
        // this allows RepartitionByExpression to pick either range or hash partitioning
        RepartitionByExpression(distribution, query, numShufflePartitions)
      } else {
        query
      }

      val ordering = write.requiredOrdering.toSeq
        .map(e => toCatalyst(e, query, resolver))
        .asInstanceOf[Seq[SortOrder]]

      val queryWithDistributionAndOrdering = if (ordering.nonEmpty) {
        Sort(ordering, global = false, queryWithDistribution)
      } else {
        queryWithDistribution
      }

      queryWithDistributionAndOrdering

    case _ =>
      query
  }

  private def toCatalyst(
      expr: V2Expression,
      query: LogicalPlan,
      resolver: Resolver): Expression = {
    // we cannot perform the resolution in the analyzer since we need to optimize expressions
    // in nodes like OverwriteByExpression before constructing a logical write
    def resolve(ref: FieldReference): NamedExpression = {
      query.resolve(ref.parts, resolver) match {
        case Some(attr) => attr
        case None => throw new AnalysisException(s"Cannot resolve '$ref' using ${query.output}")
      }
    }

    expr match {
      case SortValue(child, direction, nullOrdering) =>
        val catalystChild = toCatalyst(child, query, resolver)
        SortOrder(catalystChild, toCatalyst(direction), toCatalyst(nullOrdering), Seq.empty)
      case IdentityTransform(ref) =>
        resolve(ref)
      case ref: FieldReference =>
        resolve(ref)
      case _ =>
        throw new AnalysisException(s"$expr is not currently supported")
    }
  }

  private def toCatalyst(direction: V2SortDirection): SortDirection = direction match {
    case V2SortDirection.ASCENDING => Ascending
    case V2SortDirection.DESCENDING => Descending
  }

  private def toCatalyst(nullOrdering: V2NullOrdering): NullOrdering = nullOrdering match {
    case V2NullOrdering.NULLS_FIRST => NullsFirst
    case V2NullOrdering.NULLS_LAST => NullsLast
  }
}
```
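For orientation, here is a rough sketch of the connector side that this utility consumes. The class name and column names are made up, and the `Distributions`/`Expressions` factory calls are my assumption about the public connector API that accompanies this mixin (the exact factories may differ): a connector's Write mixes in RequiresDistributionAndOrdering and declares the distribution and ordering it needs.

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, NullOrdering, SortDirection, SortOrder}
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}

// Hypothetical connector write: cluster rows by `date`, and sort them by `ts`
// within each output partition.
class ClusteredSortedWrite extends Write with RequiresDistributionAndOrdering {

  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.identity("date")))

  override def requiredOrdering(): Array[SortOrder] =
    Array(Expressions.sort(
      Expressions.identity("ts"), SortDirection.ASCENDING, NullOrdering.NULLS_FIRST))
}
```

With such a write, prepareQuery would wrap the incoming query in a RepartitionByExpression on the clustering expressions (using conf.numShufflePartitions) and then in a non-global Sort, so each write task receives only its share of `date` values with rows already ordered by `ts`.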