[SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comment #20419

kiszk · 2018-01-29T00:48:35Z

What changes were proposed in this pull request?

This PR always adds codegenStageId in comment of the generated class.

/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
...

How was this patch tested?

Existing tests

kiszk · 2018-01-29T00:49:19Z

ping @rednaxelafx

viirya · 2018-01-29T01:26:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala

        s"""Codegend pipeline for stage (id=$codegenStageId)
           |${this.treeString.trim}""".stripMargin)}
      final class $className extends ${classOf[BufferedRowIterator].getName} {
+        ${ctx.registerComment(s"codegenStageId=$codegenStageId", true)}


Instead of putting this comment inside the class, how about let it on top of it? Like:

/* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=1 /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; ...

SparkQA · 2018-01-29T04:19:34Z

Test build #86745 has finished for PR 20419 at commit 6ee9106.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rednaxelafx · 2018-01-29T06:58:58Z

LGTM, and +1 on @viirya 's idea. I like it better for the comment to be on top of the class declaration instead of inside it; but I'm okay either way if others have strong opinion otherwise. As long as the comment line is right next to the class declaration line, that's good enough for easy pattern matching.

BTW, to save a few characters from the generated code, maybe we don't need to generate this comment when the spark.sql.codegen.useIdInClassName is turned on, since the codegenStageId is in the class name already.
But for the ease of having a unified way to access the ID from the comment, I like having this comment always there. I guess @kiszk and @viirya are on the same side on this one for always having the comment, right? In that case I'd agree and let's get this in :-)

kiszk · 2018-01-29T17:52:15Z

I prefer to leave the comment regardless of the spark.sql.codegen.useIdInClassName for a unified way to access the ID from the comment.

rednaxelafx · 2018-01-29T19:33:11Z

@kiszk SGTM and LGTM. Let's ship it!

One more question on the side: with the forceComment = true, are we fully sure that won't affect the equality of CodeAndComment?
The whole point of this PR is to have a way to embed the ID into the non-executable code of the generated code so that it won't affect the codegen cache hit, right?

Could you please post an example of either the generated code or the metrics, e.g. a SortMergeJoin on two identical spark.range(3)s, and confirm that even when the two range()s are codegen's into different codegenStageIds, with the spark.sql.codegen.useIdInClassName turned off, the two stages would still hit the codegen cache? Basically to verify the example I gave in #20224 (comment) still hits the codegen cache when spark.sql.codegen.useIdInClassName=false.

SparkQA · 2018-01-29T21:12:18Z

Test build #86774 has finished for PR 20419 at commit 4a09eac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-01-31T04:31:56Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

   * Register a comment and return the corresponding place holder
   */
-  def registerComment(text: => String): String = {
+  def registerComment(text: => String, forceComment: Boolean = false): String = {


Nit: rename it to force?

Also adding the text?

@param force whether to force registering the comments

gatorsmile · 2018-01-31T04:36:36Z

@rednaxelafx Does the following codes address your concern?

rednaxelafx · 2018-01-31T04:46:22Z

@gatorsmile Not directly. The CodeAndComment case class is just a "container", it doesn't handle what gets into the body field. When we force embed a comment, it'll leave a comment as a placeholder in the generated code, which is in the body (the actual comment contents are in the comment map). I was just curious if any reviews knows off the top of their heads whether or not this placeholder may affect equality.

gatorsmile · 2018-01-31T05:31:48Z

Then, we should add a test case to ensure it will not be broken.

SparkQA · 2018-01-31T18:09:58Z

Test build #86882 has finished for PR 20419 at commit f07dc2d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rednaxelafx · 2018-01-31T23:35:44Z

I did the experiment that I asked about at #20419 (comment) , and verified that under the current implementation, this PR will not affect the codegen cache hit behavior.

But the way it's working is a bit brittle. The body field of the CodeAndComment returned from WholeStageCodegenExec.doCodeGen() looks like the following, when spark.sql.codegen.comments=false && spark.sql.codegen.useIdInClassName=false:

/*wholestagecodegen_c*/
final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {

Notice the /*wholestagecodegen_c*/ placeholder comment. It's the artifact produced by the ctx.registerComment() call, where wholestagecodegen_ is a prefix automatically derived from the calling class, and c is from a ctx.freshName("c"), subject to an integer suffix when names collide.

The way it's used right now, because from WholeStageCodegenExec we're only making very few calls to ctx.registerComment() and all of them happen in a deterministic order, that's how this identifier embedded in the placeholder comment would be the same no matter what codegen stage it is.

But let's assume if someone doesn't know about this, and then accidentally added code in WholeStageCodegenExec that conditionally calls ctx.registerComment(), then this placeholder comment is NOT guaranteed to be the same, and that will affect the codegen cache hit behavior.

I'd say functional wise this PR is good to go; maybe we want to add a comment somewhere in WholeStageCodegenExec to note the subtlety, or maybe think about an alternative sometime in the future.

rednaxelafx · 2018-02-01T01:12:12Z

I gave it a bit more thought. Here's an alternative proposal: instead of using a "force comment" mechanism in the current form (which still gets a ctx.freshName("c") that the caller has no control over), why don't we provide an alternative API that also forces comment, but allows the caller to specify a (within the CodegenContext) unique ID for the placeholder comment?

In this case, instead of using /*wholestagecodegen_c*/ as the placeholder, which can be brittle if someone adds new calls to ctx.registerComment() in WholeStageCodegenExec, we can call ctx.registerComment(comment, placeholderId = "wsc_codegenStageId"), which we can be sure that it'll end up creating a placeholder comment of /*wsc_codegenStageId*/, stable across all whole-stage generated main classes, that would never affect the codegen cache behavior.

@kiszk WDYT?

kiszk · 2018-02-01T17:02:13Z

@rednaxelafx I understand your concern when ctx.registerComment() is conditionally called.

If we call ctx.registerComment() with the specific identified multiple times, how do we handle? (e.g. Expression.genCode() may be called multiple times from different contexts.

I am sorry that I will slowly response next few days due to a short vacation.

rednaxelafx · 2018-02-01T20:21:09Z

@kiszk For this specific kind of usage, I don't think using a hardcoded stable ID will be a problem.
The comment we're talking about is the kind the can only appear once in a single whole-stage codegen unit -- there's only one GeneratedIterator class with the processNext() method as the main entry point. Providing such a mechanism allows codegen developers to mark important information without risking to affect codegen cache behavior.

For safety, we can add a runtime check. Let's assume this new method is called ctx.registerCommentWithId(), then inside this method we can implement something very similar to ctx.freshName(), but instead of producing a new name with a potential integer suffix appended, we can either log a warning message or throw an exception. That way the developer would have an easy way to catch this.

I am sorry that I will slowly response next few days due to a short vacation.

Have fun on your vacation!

SparkQA · 2018-02-05T21:28:51Z

Test build #87076 has finished for PR 20419 at commit cb7a16b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-02-06T02:07:44Z

Retest this please

SparkQA · 2018-02-06T05:24:40Z

Test build #87084 has finished for PR 20419 at commit cb7a16b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-02-07T01:27:44Z

ping @rednaxelafx

kiszk · 2018-02-13T02:07:17Z

ping @rednaxelafx

rednaxelafx

LGTM with minor nit. Thanks @kiszk !

Sorry for the delay of reviewing this PR. I got caught up in some other projects that fully consumed my bandwidth.

rednaxelafx · 2018-02-13T06:59:22Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

  /**
   * Register a comment and return the corresponding place holder
+   *
+   * @param placeholderId a string for a place holder


Nit: can we rephrase this ScalaDoc a bit, maybe like:

/** * ... * @param placeholderId an optionally specified identifier for the comment's placeholder. The caller should make sure this identifier is unique within the compilation unit. If this argument is not specified, a fresh identifier will be automatically created and used as the placeholder. * ... */

cloud-fan · 2018-02-13T12:49:20Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

-    if (SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {
-      val name = freshName("c")
+    if (force ||
+      SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {


unrelated, but should this be a SQL conf?

Now, the question is whether we can use SQLConf.get in this case. If not, we might need to keep it like this

IIUC, the answer seems to be No. See this discussion.

gatorsmile · 2018-02-13T17:55:32Z

@kiszk This is unable to merge to 2.3. Could you open a new JIRA for this?

kiszk · 2018-02-13T18:30:31Z

I see. I will open a new JIRA tomorrow.

SparkQA · 2018-02-13T19:29:12Z

Test build #87405 has finished for PR 20419 at commit 0ad1e72.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

initial commit

6ee9106

viirya reviewed Jan 29, 2018

View reviewed changes

address review comment

4a09eac

gatorsmile reviewed Jan 31, 2018

View reviewed changes

address review comment

f07dc2d

address review comment

cb7a16b

rednaxelafx reviewed Feb 13, 2018

View reviewed changes

cloud-fan reviewed Feb 13, 2018

View reviewed changes

address review comment

0ad1e72

kiszk mentioned this pull request Feb 14, 2018

[SPARK-23424][SQL]Add codegenStageId in comment #20612

Closed

asfgit closed this in 95e25ed Feb 21, 2018

[SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comment #20419

[SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comment #20419

Uh oh!

Conversation

kiszk commented Jan 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

kiszk commented Jan 29, 2018

Uh oh!

viirya Jan 29, 2018

Choose a reason for hiding this comment

Uh oh!

kiszk Jan 29, 2018

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jan 29, 2018

Uh oh!

rednaxelafx commented Jan 29, 2018

Uh oh!

kiszk commented Jan 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rednaxelafx commented Jan 29, 2018

Uh oh!

SparkQA commented Jan 29, 2018

Uh oh!

gatorsmile Jan 31, 2018

Choose a reason for hiding this comment

Uh oh!

gatorsmile commented Jan 31, 2018

Uh oh!

rednaxelafx commented Jan 31, 2018

Uh oh!

gatorsmile commented Jan 31, 2018

Uh oh!

SparkQA commented Jan 31, 2018

Uh oh!

rednaxelafx commented Jan 31, 2018

Uh oh!

rednaxelafx commented Feb 1, 2018

Uh oh!

kiszk commented Feb 1, 2018

Uh oh!

rednaxelafx commented Feb 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SparkQA commented Feb 5, 2018

Uh oh!

kiszk commented Feb 6, 2018

Uh oh!

SparkQA commented Feb 6, 2018

Uh oh!

kiszk commented Feb 7, 2018

Uh oh!

kiszk commented Feb 13, 2018

Uh oh!

rednaxelafx left a comment

Choose a reason for hiding this comment

Uh oh!

rednaxelafx Feb 13, 2018

Choose a reason for hiding this comment

Uh oh!

cloud-fan Feb 13, 2018

Choose a reason for hiding this comment

Uh oh!

gatorsmile Feb 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kiszk Feb 13, 2018

Choose a reason for hiding this comment

Uh oh!

gatorsmile commented Feb 13, 2018

Uh oh!

kiszk commented Feb 13, 2018

Uh oh!

SparkQA commented Feb 13, 2018

Uh oh!

kiszk commented Jan 29, 2018 •

edited

Loading

kiszk commented Jan 29, 2018 •

edited

Loading

rednaxelafx commented Feb 1, 2018 •

edited

Loading

gatorsmile Feb 13, 2018 •

edited

Loading