[SPARK-28939][SQL] Propagate SQLConf for plans executed by toRdd #25643


Closed
wants to merge 5 commits

Conversation

mgaido91
Contributor

@mgaido91 mgaido91 commented Sep 1, 2019

What changes were proposed in this pull request?

The PR proposes to create a custom RDD which propagates the SQLConf also in cases not tracked by SQL execution, e.g. when a Dataset is converted to an RDD using .rdd or .queryExecution.toRdd and actions are then invoked on the returned RDD.

In this way, SQL configs take effect in these cases too, whereas previously they were ignored.

Why are the changes needed?

Without this patch, whenever .rdd or .queryExecution.toRdd is used, all the SQL configs that have been set are ignored. A minimal reproducer:

  withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key -> "false") {
    val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
    df.createOrReplaceTempView("spark64kb")
    val data = spark.sql("select * from spark64kb limit 10")
    // Subexpression elimination is used here, even though it should have been disabled
    data.describe()
  }

Does this PR introduce any user-facing change?

When a user calls .queryExecution.toRdd, a SQLExecutionRDD wrapping the RDD produced by the executed plan is returned. When .rdd is used, an additional SQLExecutionRDD is present in the hierarchy.

How was this patch tested?

added UT
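The core idea of the patch can be sketched outside Spark with plain Scala (the names `ConfPropagation`, `withConf`, and `CapturingWrapper` below are illustrative, not Spark's actual API): the wrapper snapshots the thread's conf when it is created on the driver, and re-installs that snapshot on the thread computing a partition, unless a tracked execution id is already present.

```scala
object ConfPropagation {
  // Thread-local stand-in for SQLConf.get (illustrative only).
  private val active = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  def current: Map[String, String] = active.get()

  // Scoped conf override, restored even if the body throws.
  def withConf[T](conf: Map[String, String])(body: => T): T = {
    val prev = active.get()
    active.set(conf)
    try body finally active.set(prev)
  }

  // Mimics the role of SQLExecutionRDD.compute: the conf is captured when the
  // wrapper is created (driver side) and applied only when no execution id
  // local property is set on the task.
  class CapturingWrapper[T](compute: () => T) {
    private val captured = current // snapshot at creation time
    def run(executionId: Option[String]): T =
      if (executionId.isDefined) compute()   // tracked execution: conf already propagated
      else withConf(captured)(compute())     // untracked: re-install the snapshot
  }
}
```

Created inside a `withConf` scope, the wrapper reproduces that conf later even when no execution id is around, which is exactly the gap `.rdd` / `.queryExecution.toRdd` previously fell into.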

@mgaido91
Contributor Author

mgaido91 commented Sep 1, 2019

cc @cloud-fan

@SparkQA

SparkQA commented Sep 1, 2019

Test build #109994 has finished for PR 25643 at commit 1c55d7c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SQLExecutionRDD(

// If we are in the context of a tracked SQL operation, `SQLExecution.EXECUTION_ID_KEY` is set
// and we have nothing to do here. Otherwise, we use the `SQLConf` captured at the creation of
// this RDD.
if (context.getLocalProperty("spark.sql.execution.id") == null) {
Contributor

nit: use SQLExecution.EXECUTION_ID_KEY instead of hardcode

Contributor Author

I cannot, because SQLExecution is in core and here we are in catalyst, so we are missing the dependency.

}
}

override def clearDependencies() {
Contributor

is this really needed? sqlRDD is a constructor parameter which won't be kept as a member variable.

@@ -105,7 +105,7 @@ class QueryExecution(
* Given QueryExecution is not a public class, end users are discouraged to use this: please
* use `Dataset.rdd` instead where conversion will be applied.
*/
lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
lazy val toRdd: RDD[InternalRow] = new SQLExecutionRDD(executedPlan.execute(), SQLConf.get)
Contributor

can we get the conf from the spark session?

}

case class FakeQueryExecution(spark: SparkSession, physicalPlan: SparkPlan)
extends QueryExecution(spark, LocalRelation()) {
Contributor

2 space indentation

@cloud-fan
Contributor

seems like a good idea, also cc @hvanhovell

* @param sqlRDD the `RDD` generated by the SQL plan
* @param conf the `SQLConf` to apply to the execution of the SQL plan
*/
class SQLExecutionRDD(
Contributor

Why is this in catalyst and not in core?

Contributor Author

thanks, I'm moving it

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110005 has finished for PR 25643 at commit 5674d6c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110015 has finished for PR 25643 at commit 6baec8a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// this RDD.
if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
synchronized {
if (tasksRunning == 0) {
Member

Just a question; in case that an executor with a single core has multiple assigned tasks, is this property setup executed per task?

Contributor Author

yes, I'd say so.

@tgravescs
Contributor

I assume this only applies to programmatically set confs? If you set them on app launch they still apply?

when you do a bunch of dataframe/dataset operations and then do toRdd, it is definitely annoying that we never indicate to the user that catalyst stuff still happened (e.g. the SQL UI isn't present)

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2019

@tgravescs all SQL configs, however they have been set, are applied.

when you do a bunch of dataframe/dataset operations and then do toRdd, it is definitely annoying we never indicate to the user catalyst stuff still happened (like SQL ui isn't present )

yes, I quite agree on this. Unfortunately I don't have a good idea how to solve it, as the actions are performed outside the "SQL" world, i.e. they are performed on the RDD by the user, so it's hard to track them within a SQL execution ID. Honestly I have no good idea for that at the moment; if you have one, I think it would be great.

* @param conf the `SQLConf` to apply to the execution of the SQL plan
*/
class SQLExecutionRDD(
var sqlRDD: RDD[InternalRow], @transient conf: SQLConf) extends RDD[InternalRow](sqlRDD) {
Member

@viirya viirya Sep 5, 2019

Do we need to pass in a SQLConf? I think we just always capture current SQLConf, so just private val sqlConfigs = SQLConf.get.getAllConfs?

Contributor Author

I prefer the current way, as the SQLConf is passed directly from the spark session as per @cloud-fan 's comment: #25643 (comment)

Member

Ok. I'm fine with it.

// and we have nothing to do here. Otherwise, we use the `SQLConf` captured at the creation of
// this RDD.
if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
SQLConf.withExistingConf(sqlConfExecutorSide) {
Member

oh, it got better.
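The `withExistingConf` mechanism can be illustrated with a small, self-contained sketch (the `ConfRegistry` object and its names here are hypothetical, not Spark's real `SQLConf`): a scoped override takes precedence over the normal fallback lookup and is automatically restored when the scope exits.

```scala
import scala.util.DynamicVariable

// Hypothetical stand-in for SQLConf's lookup logic (illustrative names only).
object ConfRegistry {
  // A conf explicitly installed for the current scope, if any.
  private val overrideConf = new DynamicVariable[Option[Map[String, String]]](None)

  // What the lookup falls back to on executors: job local properties,
  // modeled here as a fixed map.
  private def fromLocalProperties: Map[String, String] = Map("source" -> "local-properties")

  // Like SQLConf.get: prefer a conf installed by withExistingConf.
  def get: Map[String, String] = overrideConf.value.getOrElse(fromLocalProperties)

  // Scoped override; DynamicVariable restores the previous value on exit.
  def withExistingConf[T](conf: Map[String, String])(body: => T): T =
    overrideConf.withValue(Some(conf))(body)
}
```

Inside the scope, `get` returns the captured conf; outside it, the fallback is used again, which matches the behavior described in the quoted doc comment above.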

@SparkQA

SparkQA commented Sep 6, 2019

Test build #110200 has finished for PR 25643 at commit 860b658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -115,7 +115,8 @@ object SQLConf {
* Returns the active config object within the current scope. If there is an active SparkSession,
* the proper SQLConf associated with the thread's active session is used. If it's called from
* tasks in the executor side, a SQLConf will be created from job local properties, which are set
* and propagated from the driver side.
* and propagated from the driver side, unless a has been set in the scope by `withExistingConf`
Contributor

a has been -> a SQLConf has been

@SparkQA

SparkQA commented Sep 9, 2019

Test build #110339 has finished for PR 25643 at commit 11848f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3d6b33a Sep 9, 2019
@mgaido91
Contributor Author

mgaido91 commented Sep 9, 2019

thanks @cloud-fan and thank you all for the reviews.

Just one question: what about backporting to 2.4?

@cloud-fan
Contributor

yea feel free to open a backport PR.

@hvanhovell
Contributor

@mgaido91 please add a feature flag for the backport. People might rely on the current 'buggy' behavior.

@mgaido91
Contributor Author

mgaido91 commented Sep 9, 2019

@hvanhovell I feel weird having such a flag only in Spark 2.4.* and not in 3.x. I mean, if we introduce such a flag, we might want to have it in 3.x too?

@hvanhovell
Contributor

@mgaido91 people currently using 2.4 in production might rely on the old behavior, so they should be able to move back to it if this breaks something for them. For 3.0 this is not an issue since it is not released yet and is under active development; in that case we are allowed to change behavior.

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 9, 2019

Hi, All.
This seems to break the JDK11 build (compilation). I'm looking at this. Please hold off on backporting for a little bit.

@dongjoon-hyun
Member

This seems to be an old Scala bug on JDK9+.

[ERROR] [Error] /Users/dongjoon/PRS/SPARK-JDK11-FIX/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala:42: ambiguous reference to overloaded definition,
both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
match argument types (java.util.Map[String,String])
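The ambiguity arises because JDK 9+ declares a `putAll` overload directly on `Properties` in addition to the one inherited from `Hashtable`, and Scala's overload resolution cannot pick one for a `java.util.Map[String, String]` argument. One unambiguous workaround (the helper name `toProperties` is illustrative, not necessarily what the follow-up PR used) is to copy the entries explicitly:

```scala
import java.util.Properties

// On JDK 9+, `props.putAll(javaMap)` is ambiguous from Scala because both
// Properties and its parent Hashtable declare a matching `putAll` overload.
// Copying entries via `setProperty` sidesteps overload resolution entirely.
def toProperties(confs: Map[String, String]): Properties = {
  val props = new Properties()
  confs.foreach { case (k, v) => props.setProperty(k, v) }
  props
}
```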

@dongjoon-hyun
Member

I'll make a follow-up soon.

@dongjoon-hyun
Member

#25738 is ready.

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
Closes apache#25643 from mgaido91/SPARK-28939.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>