[SPARK-28939][SQL] Propagate SQLConf for plans executed by toRdd #25643


Closed
wants to merge 5 commits

Conversation

mgaido91
Contributor

@mgaido91 mgaido91 commented Sep 1, 2019

What changes were proposed in this pull request?

The PR proposes to create a custom RDD which propagates the SQLConf also in cases not tracked by SQL execution, e.g. when a Dataset is converted to an RDD using .rdd or .queryExecution.toRdd and actions are then invoked on the returned RDD.

In this way, SQL configs take effect in these cases too, whereas previously they were ignored.

Why are the changes needed?

Without this patch, whenever .rdd or .queryExecution.toRdd is used, all the SQL configs that have been set are ignored. A minimal reproducer:

  withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key -> "false") {
    val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
    df.createOrReplaceTempView("spark64kb")
    val data = spark.sql("select * from spark64kb limit 10")
    // Subexpression elimination is used here, even though it should have been disabled
    data.describe()
  }

Does this PR introduce any user-facing change?

When a user calls .queryExecution.toRdd, a SQLExecutionRDD wrapping the RDD produced by the executed plan is returned. When .rdd is used, an additional SQLExecutionRDD is present in the hierarchy.

How was this patch tested?

added UT
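The core idea of the patch can be sketched outside Spark with plain Scala (the names `ConfPropagation`, `withConf`, and `CapturingWrapper` below are illustrative, not Spark's actual API): the wrapper snapshots the thread's conf when it is created on the driver, and re-installs that snapshot on the thread computing a partition, unless a tracked execution id is already present.

```scala
object ConfPropagation {
  // Thread-local stand-in for SQLConf.get (illustrative only).
  private val active = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  def current: Map[String, String] = active.get()

  // Scoped conf override, restored even if the body throws.
  def withConf[T](conf: Map[String, String])(body: => T): T = {
    val prev = active.get()
    active.set(conf)
    try body finally active.set(prev)
  }

  // Mimics the role of SQLExecutionRDD.compute: the conf is captured when the
  // wrapper is created (driver side) and applied only when no execution id
  // local property is set on the task.
  class CapturingWrapper[T](compute: () => T) {
    private val captured = current // snapshot at creation time
    def run(executionId: Option[String]): T =
      if (executionId.isDefined) compute()   // tracked execution: conf already propagated
      else withConf(captured)(compute())     // untracked: re-install the snapshot
  }
}
```

Created inside a `withConf` scope, the wrapper reproduces that conf later even when no execution id is around, which is exactly the gap `.rdd` / `.queryExecution.toRdd` previously fell into.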

@mgaido91
Contributor Author

mgaido91 commented Sep 1, 2019

cc @cloud-fan

@SparkQA

SparkQA commented Sep 1, 2019

Test build #109994 has finished for PR 25643 at commit 1c55d7c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SQLExecutionRDD(

// If we are in the context of a tracked SQL operation, `SQLExecution.EXECUTION_ID_KEY` is set
// and we have nothing to do here. Otherwise, we use the `SQLConf` captured at the creation of
// this RDD.
if (context.getLocalProperty("spark.sql.execution.id") == null) {
Contributor

nit: use SQLExecution.EXECUTION_ID_KEY instead of hardcode

Contributor Author

I cannot, because SQLExecution is in core and here we are in catalyst, so we are missing the dependency.

}
}

override def clearDependencies() {
Contributor

is this really needed? sqlRDD is a constructor parameter which won't be kept as a member variable.

@@ -105,7 +105,7 @@ class QueryExecution(
* Given QueryExecution is not a public class, end users are discouraged to use this: please
* use `Dataset.rdd` instead where conversion will be applied.
*/
lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
lazy val toRdd: RDD[InternalRow] = new SQLExecutionRDD(executedPlan.execute(), SQLConf.get)
Contributor

can we get the conf from the spark session?

}

case class FakeQueryExecution(spark: SparkSession, physicalPlan: SparkPlan)
extends QueryExecution(spark, LocalRelation()) {
Contributor

2 space indentation

@cloud-fan
Contributor

seems like a good idea, also cc @hvanhovell

* @param sqlRDD the `RDD` generated by the SQL plan
* @param conf the `SQLConf` to apply to the execution of the SQL plan
*/
class SQLExecutionRDD(
Contributor

Why is this in catalyst and not in core?

Contributor Author

thanks, I'm moving it

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110005 has finished for PR 25643 at commit 5674d6c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110015 has finished for PR 25643 at commit 6baec8a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// this RDD.
if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
synchronized {
if (tasksRunning == 0) {
Member

Just a question; in case that an executor with a single core has multiple assigned tasks, is this property setup executed per task?

Contributor Author

yes, I'd say so.

@tgravescs
Contributor

I assume this only applies to programmatically set confs? If you set them on app launch they still apply?

when you do a bunch of dataframe/dataset operations and then do toRdd, it is definitely annoying that we never indicate to the user that catalyst stuff still happened (e.g. the SQL UI isn't present)

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2019

@tgravescs all SQL configs, however they have been set, are applied.

when you do a bunch of dataframe/dataset operations and then do toRdd, it is definitely annoying we never indicate to the user catalyst stuff still happened (like SQL ui isn't present )

yes, I quite agree on this. Unfortunately I don't have a good idea how to solve it, as the actions are performed outside the "SQL" world, i.e. they are performed on the RDD by the user, so it's hard to track them within a SQL execution ID. Honestly I have no good idea for that at the moment; if you have one, I think it would be great.

* @param conf the `SQLConf` to apply to the execution of the SQL plan
*/
class SQLExecutionRDD(
var sqlRDD: RDD[InternalRow], @transient conf: SQLConf) extends RDD[InternalRow](sqlRDD) {
Member

@viirya viirya Sep 5, 2019

Do we need to pass in a SQLConf? I think we just always capture current SQLConf, so just private val sqlConfigs = SQLConf.get.getAllConfs?

Contributor Author

I prefer the current way, as the SQLConf is passed directly from the spark session as per @cloud-fan 's comment: #25643 (comment)

Member

Ok. I'm fine with it.

// and we have nothing to do here. Otherwise, we use the `SQLConf` captured at the creation of
// this RDD.
if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
SQLConf.withExistingConf(sqlConfExecutorSide) {
Member

oh, it got better.
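The `withExistingConf` mechanism can be illustrated with a small, self-contained sketch (the `ConfRegistry` object and its names here are hypothetical, not Spark's real `SQLConf`): a scoped override takes precedence over the normal fallback lookup and is automatically restored when the scope exits.

```scala
import scala.util.DynamicVariable

// Hypothetical stand-in for SQLConf's lookup logic (illustrative names only).
object ConfRegistry {
  // A conf explicitly installed for the current scope, if any.
  private val overrideConf = new DynamicVariable[Option[Map[String, String]]](None)

  // What the lookup falls back to on executors: job local properties,
  // modeled here as a fixed map.
  private def fromLocalProperties: Map[String, String] = Map("source" -> "local-properties")

  // Like SQLConf.get: prefer a conf installed by withExistingConf.
  def get: Map[String, String] = overrideConf.value.getOrElse(fromLocalProperties)

  // Scoped override; DynamicVariable restores the previous value on exit.
  def withExistingConf[T](conf: Map[String, String])(body: => T): T =
    overrideConf.withValue(Some(conf))(body)
}
```

Inside the scope, `get` returns the captured conf; outside it, the fallback is used again, which matches the behavior described in the quoted doc comment above.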

@SparkQA

SparkQA commented Sep 6, 2019

Test build #110200 has finished for PR 25643 at commit 860b658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -115,7 +115,8 @@ object SQLConf {
* Returns the active config object within the current scope. If there is an active SparkSession,
* the proper SQLConf associated with the thread's active session is used. If it's called from
* tasks in the executor side, a SQLConf will be created from job local properties, which are set
* and propagated from the driver side.
* and propagated from the driver side, unless a has been set in the scope by `withExistingConf`
Contributor

a has been -> a SQLConf has been

@SparkQA

SparkQA commented Sep 9, 2019

Test build #110339 has finished for PR 25643 at commit 11848f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3d6b33a Sep 9, 2019
@mgaido91
Contributor Author

mgaido91 commented Sep 9, 2019

thanks @cloud-fan and thank you all for the reviews.

Just one question: what about backporting to 2.4?

@cloud-fan
Contributor

yea feel free to open a backport PR.

@hvanhovell
Contributor

@mgaido91 please add a feature flag for the backport. People might rely on the current 'buggy' behavior.

@mgaido91
Contributor Author

mgaido91 commented Sep 9, 2019

@hvanhovell I feel weird having such a flag only in Spark 2.4.* and not in 3.x. I mean, if we introduce such a flag, we might want to have it in 3.x too?

@hvanhovell
Contributor

@mgaido91 people currently using 2.4 in production might rely on the old behavior, so they should be able to move back to it if this breaks something for them. For 3.0 this is not an issue since it is not released yet and is under active development; in that case we are allowed to change behavior.

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 9, 2019

Hi, All.
This seems to break the JDK11 build (compilation). I'm looking at this. Please hold off on backporting for a little bit.

@dongjoon-hyun
Member

This seems to be an old Scala bug on JDK9+.

[ERROR] [Error] /Users/dongjoon/PRS/SPARK-JDK11-FIX/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala:42: ambiguous reference to overloaded definition,
both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
match argument types (java.util.Map[String,String])
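The ambiguity arises because JDK 9+ declares a `putAll` overload directly on `Properties` in addition to the one inherited from `Hashtable`, and Scala's overload resolution cannot pick one for a `java.util.Map[String, String]` argument. One unambiguous workaround (the helper name `toProperties` is illustrative, not necessarily what the follow-up PR used) is to copy the entries explicitly:

```scala
import java.util.Properties

// On JDK 9+, `props.putAll(javaMap)` is ambiguous from Scala because both
// Properties and its parent Hashtable declare a matching `putAll` overload.
// Copying entries via `setProperty` sidesteps overload resolution entirely.
def toProperties(confs: Map[String, String]): Properties = {
  val props = new Properties()
  confs.foreach { case (k, v) => props.setProperty(k, v) }
  props
}
```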

@dongjoon-hyun
Member

I'll make a follow-up soon.

@dongjoon-hyun
Member

#25738 is ready.

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
Closes apache#25643 from mgaido91/SPARK-28939.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>