Skip to content

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented Jan 29, 2018

What changes were proposed in this pull request?

This PR always adds codegenStageId in comment of the generated class.

/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
...

How was this patch tested?

Existing tests

@kiszk
Copy link
Member Author

kiszk commented Jan 29, 2018

ping @rednaxelafx

s"""Codegend pipeline for stage (id=$codegenStageId)
|${this.treeString.trim}""".stripMargin)}
final class $className extends ${classOf[BufferedRowIterator].getName} {
${ctx.registerComment(s"codegenStageId=$codegenStageId", true)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of putting this comment inside the class, how about let it on top of it? Like:

/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, done

@SparkQA
Copy link

SparkQA commented Jan 29, 2018

Test build #86745 has finished for PR 20419 at commit 6ee9106.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx
Copy link
Contributor

LGTM, and +1 on @viirya 's idea. I like it better for the comment to be on top of the class declaration instead of inside it; but I'm okay either way if others have strong opinion otherwise. As long as the comment line is right next to the class declaration line, that's good enough for easy pattern matching.

BTW, to save a few characters from the generated code, maybe we don't need to generate this comment when the spark.sql.codegen.useIdInClassName is turned on, since the codegenStageId is in the class name already.
But for the ease of having a unified way to access the ID from the comment, I like having this comment always there. I guess @kiszk and @viirya are on the same side on this one for always having the comment, right? In that case I'd agree and let's get this in :-)

@kiszk
Copy link
Member Author

kiszk commented Jan 29, 2018

I prefer to leave the comment regardless of the spark.sql.codegen.useIdInClassName for a unified way to access the ID from the comment.

@rednaxelafx
Copy link
Contributor

@kiszk SGTM and LGTM. Let's ship it!

One more question on the side: with the forceComment = true, are we fully sure that won't affect the equality of CodeAndComment?
The whole point of this PR is to have a way to embed the ID into the non-executable code of the generated code so that it won't affect the codegen cache hit, right?

Could you please post an example of either the generated code or the metrics, e.g. a SortMergeJoin on two identical spark.range(3)s, and confirm that even when the two range()s are codegen's into different codegenStageIds, with the spark.sql.codegen.useIdInClassName turned off, the two stages would still hit the codegen cache? Basically to verify the example I gave in #20224 (comment) still hits the codegen cache when spark.sql.codegen.useIdInClassName=false.

@SparkQA
Copy link

SparkQA commented Jan 29, 2018

Test build #86774 has finished for PR 20419 at commit 4a09eac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Register a comment and return the corresponding place holder
*/
def registerComment(text: => String): String = {
def registerComment(text: => String, forceComment: Boolean = false): String = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rename it to force?

Also adding the text?

@param force whether to force registering the comments

@gatorsmile
Copy link
Member

@rednaxelafx Does the following codes address your concern?

@rednaxelafx
Copy link
Contributor

@gatorsmile Not directly. The CodeAndComment case class is just a "container", it doesn't handle what gets into the body field. When we force embed a comment, it'll leave a comment as a placeholder in the generated code, which is in the body (the actual comment contents are in the comment map). I was just curious if any reviews knows off the top of their heads whether or not this placeholder may affect equality.

@gatorsmile
Copy link
Member

Then, we should add a test case to ensure it will not be broken.

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86882 has finished for PR 20419 at commit f07dc2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx
Copy link
Contributor

I did the experiment that I asked about at #20419 (comment) , and verified that under the current implementation, this PR will not affect the codegen cache hit behavior.

But the way it's working is a bit brittle. The body field of the CodeAndComment returned from WholeStageCodegenExec.doCodeGen() looks like the following, when spark.sql.codegen.comments=false && spark.sql.codegen.useIdInClassName=false:

/*wholestagecodegen_c*/
final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {

Notice the /*wholestagecodegen_c*/ placeholder comment. It's the artifact produced by the ctx.registerComment() call, where wholestagecodegen_ is a prefix automatically derived from the calling class, and c is from a ctx.freshName("c"), subject to an integer suffix when names collide.

The way it's used right now, because from WholeStageCodegenExec we're only making very few calls to ctx.registerComment() and all of them happen in a deterministic order, that's how this identifier embedded in the placeholder comment would be the same no matter what codegen stage it is.

But let's assume if someone doesn't know about this, and then accidentally added code in WholeStageCodegenExec that conditionally calls ctx.registerComment(), then this placeholder comment is NOT guaranteed to be the same, and that will affect the codegen cache hit behavior.

I'd say functional wise this PR is good to go; maybe we want to add a comment somewhere in WholeStageCodegenExec to note the subtlety, or maybe think about an alternative sometime in the future.

@rednaxelafx
Copy link
Contributor

I gave it a bit more thought. Here's an alternative proposal: instead of using a "force comment" mechanism in the current form (which still gets a ctx.freshName("c") that the caller has no control over), why don't we provide an alternative API that also forces comment, but allows the caller to specify a (within the CodegenContext) unique ID for the placeholder comment?

In this case, instead of using /*wholestagecodegen_c*/ as the placeholder, which can be brittle if someone adds new calls to ctx.registerComment() in WholeStageCodegenExec, we can call ctx.registerComment(comment, placeholderId = "wsc_codegenStageId"), which we can be sure that it'll end up creating a placeholder comment of /*wsc_codegenStageId*/, stable across all whole-stage generated main classes, that would never affect the codegen cache behavior.

@kiszk WDYT?

@kiszk
Copy link
Member Author

kiszk commented Feb 1, 2018

@rednaxelafx I understand your concern when ctx.registerComment() is conditionally called.

If we call ctx.registerComment() with the specific identified multiple times, how do we handle? (e.g. Expression.genCode() may be called multiple times from different contexts.

I am sorry that I will slowly response next few days due to a short vacation.

@rednaxelafx
Copy link
Contributor

rednaxelafx commented Feb 1, 2018

@kiszk For this specific kind of usage, I don't think using a hardcoded stable ID will be a problem.
The comment we're talking about is the kind the can only appear once in a single whole-stage codegen unit -- there's only one GeneratedIterator class with the processNext() method as the main entry point. Providing such a mechanism allows codegen developers to mark important information without risking to affect codegen cache behavior.

For safety, we can add a runtime check. Let's assume this new method is called ctx.registerCommentWithId(), then inside this method we can implement something very similar to ctx.freshName(), but instead of producing a new name with a potential integer suffix appended, we can either log a warning message or throw an exception. That way the developer would have an easy way to catch this.

I am sorry that I will slowly response next few days due to a short vacation.

Have fun on your vacation!

@SparkQA
Copy link

SparkQA commented Feb 5, 2018

Test build #87076 has finished for PR 20419 at commit cb7a16b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Feb 6, 2018

Retest this please

@SparkQA
Copy link

SparkQA commented Feb 6, 2018

Test build #87084 has finished for PR 20419 at commit cb7a16b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Feb 7, 2018

ping @rednaxelafx

1 similar comment
@kiszk
Copy link
Member Author

kiszk commented Feb 13, 2018

ping @rednaxelafx

Copy link
Contributor

@rednaxelafx rednaxelafx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor nit. Thanks @kiszk !

Sorry for the delay of reviewing this PR. I got caught up in some other projects that fully consumed my bandwidth.

/**
* Register a comment and return the corresponding place holder
*
* @param placeholderId a string for a place holder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we rephrase this ScalaDoc a bit, maybe like:

/**
 * ...
 * @param placeholderId an optionally specified identifier for the comment's placeholder. The caller should make sure this identifier is unique within the compilation unit. If this argument is not specified, a fresh identifier will be automatically created and used as the placeholder.
 * ...
 */

if (SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {
val name = freshName("c")
if (force ||
SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated, but should this be a SQL conf?

Copy link
Member

@gatorsmile gatorsmile Feb 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, the question is whether we can use SQLConf.get in this case. If not, we might need to keep it like this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the answer seems to be No. See this discussion.

@gatorsmile
Copy link
Member

@kiszk This is unable to merge to 2.3. Could you open a new JIRA for this?

@kiszk
Copy link
Member Author

kiszk commented Feb 13, 2018

I see. I will open a new JIRA tomorrow.

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87405 has finished for PR 20419 at commit 0ad1e72.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants