Skip to content

Conversation

@ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Jan 5, 2021

What changes were proposed in this pull request?

This PR aims to strip auto-generated cast. The main logic is:

  1. Add tag if Cast is specified by user.
  2. Wrap PrettyAttribute in usePrettyExpression.

Why are the changes needed?

Make sql consistent with dsl. Here is an inconsistent example before this PR:

-- output field name: FLOOR(1)
spark.emptyDataFrame.select(floor(lit(1)))

-- output field name: FLOOR(CAST(1 AS DOUBLE))
spark.sql("select floor(1)")

Note that, we don't remove the Cast so the auto-generated Cast can still work. The only changed place is usePrettyExpression, we use PrettyAttribute replace Cast to give a better sql string.

Does this PR introduce any user-facing change?

Yes, the default field name may change.

How was this patch tested?

Add test and pass exists test.

@ulysses-you
Copy link
Contributor Author

Seems the change is quite big but mostly are golden files cc @maropu @cloud-fan

@SparkQA
Copy link

SparkQA commented Jan 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38256/

@SparkQA
Copy link

SparkQA commented Jan 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38256/

@SparkQA
Copy link

SparkQA commented Jan 5, 2021

Test build #133667 has finished for PR 31034 at commit 7b00641.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jan 5, 2021

(I'm neutral on this though) why is the name consistency important? Is there the concrete scenario where the name inconsistency can cause any issue?

@github-actions github-actions bot added the SQL label Jan 5, 2021
@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38286/

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38286/

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Test build #133698 has finished for PR 31034 at commit a160d62.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// back to SQL query string.
case _: ArrayType | _: MapType | _: StructType => child.sql
case _ => s"CAST(${child.sql} AS ${dataType.sql})"
override def sql: String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not change the sql method.

ResolveAlias calls toPrettySQL to generate the alias name, and we should strip auto-generated alias in toPrettySQL.

@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38386/

@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38386/

@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Test build #133796 has finished for PR 31034 at commit 3823289.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38516/

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38516/

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Test build #133924 has finished for PR 31034 at commit ab827fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Test build #133929 has finished for PR 31034 at commit e5294fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

I like this idea to exclude auto-added cast in the auto-generated alias. This should be a safe change as the auto-generated alias is mostly for display purposes. Let's get more feedback though. cc @maropu @viirya @dongjoon-hyun @HyukjinKwon

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 13, 2021

I'm okay. I wonder if it's better and possible to have that cast when we explain alone, and use the stripped output for column names but I don't feel strongly on this., oh, I took a look again. That's already what this PR does.

@SparkQA
Copy link

SparkQA commented Jan 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38575/

@SparkQA
Copy link

SparkQA commented Jan 13, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38575/

@SparkQA
Copy link

SparkQA commented Jan 13, 2021

Test build #133987 has finished for PR 31034 at commit 3c23778.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Copy link
Contributor Author

How about add a config like extended or simple to decide whether show the Cast or not ? That sounds more like a debug config which maybe can help user. But after another thought, it seems a little redundent since we have already given the analyzed output data type in extended explain.

case r: RuntimeReplaceable =>
PrettyAttribute(r.mkString(r.exprsReplaced.map(toPrettySQL)), r.dataType)
case c: CastBase if !c.getTagValue(Cast.USER_SPECIFIED_CAST).getOrElse(false) =>
PrettyAttribute(usePrettyExpression(c.child).sql, c.dataType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the purpose is to strip Cast only from toPrettySQL? Only for changing column name but the actual Cast expression still works?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate it clearly in the PR description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's the only place which we changed. Update the description.

@viirya
Copy link
Member

viirya commented Jan 13, 2021

I like this idea to exclude auto-added cast in the auto-generated alias. This should be a safe change as the auto-generated alias is mostly for display purposes. Let's get more feedback though. cc @maropu @viirya @dongjoon-hyun @HyukjinKwon

If the change is restricted to only auto-generated alias, it sounds okay. But this still sounds like a breaking change as the column name is changed? Do we need to update migration guide for this?

@github-actions github-actions bot added the DOCS label Jan 13, 2021
@cloud-fan
Copy link
Contributor

Do we need to update migration guide for this?

yea better to add. But I do wonder if people will do

df.select("`CAST(a AS INT)`")

to select that column...

@SparkQA
Copy link

SparkQA commented Jan 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38596/

@SparkQA
Copy link

SparkQA commented Jan 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38596/

@SparkQA
Copy link

SparkQA commented Jan 13, 2021

Test build #134005 has finished for PR 31034 at commit 3217d5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 13, 2021

Test build #134008 has finished for PR 31034 at commit 2d13353.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 92e5cfd Jan 14, 2021
@HyukjinKwon
Copy link
Member

i took a close look too. LGTM.

@ulysses-you
Copy link
Contributor Author

thanks all !

@ulysses-you ulysses-you deleted the SPARK-33989 branch January 15, 2021 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants