[SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java #22030
Changes to `RelationalGroupedDataset`:

```diff
@@ -31,6 +31,7 @@ import org.apache.spark.sql.catalyst.expressions.aggregate._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.util.toPrettySQL
 import org.apache.spark.sql.execution.aggregate.TypedAggregateExpression
+import org.apache.spark.sql.functions.{lit, struct}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.{NumericType, StructType}
```
```diff
@@ -72,7 +73,8 @@ class RelationalGroupedDataset protected[sql](
       case RelationalGroupedDataset.PivotType(pivotCol, values) =>
         val aliasedGrps = groupingExprs.map(alias)
         Dataset.ofRows(
-          df.sparkSession, Pivot(Some(aliasedGrps), pivotCol, values, aggExprs, df.logicalPlan))
+          df.sparkSession,
+          Pivot(Some(aliasedGrps), pivotCol, values.map(_.expr), aggExprs, df.logicalPlan))
     }
   }
```
```diff
@@ -335,7 +337,7 @@ class RelationalGroupedDataset protected[sql](
    * @since 1.6.0
    */
   def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
-    pivot(Column(pivotColumn), values)
+    pivot(Column(pivotColumn), values.map(lit))
   }

   /**
```
```diff
@@ -359,7 +361,7 @@ class RelationalGroupedDataset protected[sql](
    * @since 1.6.0
    */
   def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = {
-    pivot(Column(pivotColumn), values)
+    pivot(Column(pivotColumn), values.asScala.map(lit))
   }

   /**
```
```diff
@@ -371,6 +373,12 @@ class RelationalGroupedDataset protected[sql](
    *   df.groupBy($"year").pivot($"course").sum($"earnings");
    * }}}
    *
+   * For pivoting by multiple columns, use the `struct` function to combine the columns:
+   *
+   * {{{
+   *   df.groupBy($"year").pivot(struct($"course", $"training")).agg(sum($"earnings"))
+   * }}}
+   *
    * @param pivotColumn the column to pivot.
    * @since 2.4.0
    */
```
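The struct-based multi-column pivot documented above can be illustrated without Spark. This pure-Python sketch (with hypothetical data shaped like the `courseSales` example used in the tests) groups rows by year and turns each distinct `(course, training)` pair into an output cell, which is the semantics the new API exposes:

```python
from collections import defaultdict

# Hypothetical rows: (year, course, training, earnings).
rows = [
    (2012, "dotNET", "Experts", 10000.0),
    (2012, "Java",   "Experts", 20000.0),
    (2012, "dotNET", "Dummies", 5000.0),
    (2013, "Java",   "Experts", 30000.0),
    (2013, "dotNET", "Experts", 48000.0),
]

def pivot_sum(rows, pivot_values):
    """Group by year, pivot on the (course, training) pair, aggregate with sum.
    Combinations without data stay at 0.0 here (Spark would produce null)."""
    table = defaultdict(lambda: {v: 0.0 for v in pivot_values})
    for year, course, training, earnings in rows:
        key = (course, training)
        if key in table[year]:  # only the requested pivot values become columns
            table[year][key] += earnings
    return dict(table)

# Restricting the output columns, analogous to
# pivot(struct($"course", $"training"), Seq(struct(lit(...), lit(...)))).
result = pivot_sum(rows, [("dotNET", "Experts"), ("Java", "Experts")])
print(result[2012][("dotNET", "Experts")])  # 10000.0
```

Passing an explicit list of pivot values, as in the second overload below, both fixes the output schema and avoids a scan to collect the distinct values.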
```diff
@@ -384,6 +392,10 @@ class RelationalGroupedDataset protected[sql](
       .sort(pivotColumn) // ensure that the output columns are in a consistent logical order
       .collect()
       .map(_.get(0))
+      .collect {
+        case row: GenericRow => struct(row.values.map(lit): _*)
+        case value => lit(value)
+      }
       .toSeq

     if (values.length > maxValues) {
```

Contributor (on `.collect {`): Use "map"?

Contributor (on the `GenericRow` case): I suspect this will not work for nested struct types, or, say, multiple pivot columns with a nested type. Could you please add a test for that? And can we also check if it works for other complex nested types, like Array(Struct(...))?
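The value-mapping step above turns each collected distinct pivot value into a literal expression, and a multi-column key (a row of pivot columns) into a struct of literals. A plain-Python sketch of that idea (hypothetical helper, rendering expressions as strings rather than building Spark expressions):

```python
from typing import Any

def to_literal_expr(value: Any) -> str:
    """Render a collected pivot value as an expression string. Tuples stand in
    for a GenericRow of pivot columns and become struct literals, mirroring
    `case row: GenericRow => struct(row.values.map(lit): _*)`; everything else
    mirrors `case value => lit(value)`."""
    if isinstance(value, tuple):
        # Recursing here is what handles nested structs, the case the
        # reviewer above asks to cover with a test.
        return "struct(" + ", ".join(to_literal_expr(v) for v in value) + ")"
    return repr(value)

print(to_literal_expr(("Java", "Experts")))  # struct('Java', 'Experts')
print(to_literal_expr(2012))                 # 2012
```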
```diff
@@ -403,20 +415,29 @@ class RelationalGroupedDataset protected[sql](
    *
    * {{{
    *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
+   *   df.groupBy($"year").pivot($"course", Seq(lit("dotNET"), lit("Java"))).sum($"earnings")
    * }}}
    *
+   * For pivoting by multiple columns, use the `struct` function to combine the columns and values:
+   *
+   * {{{
+   *   df
+   *     .groupBy($"year")
+   *     .pivot(struct($"course", $"training"), Seq(struct(lit("java"), lit("Experts"))))
+   *     .agg(sum($"earnings"))
+   * }}}
+   *
    * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
    * @since 2.4.0
    */
-  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset = {
     groupType match {
       case RelationalGroupedDataset.GroupByType =>
         new RelationalGroupedDataset(
           df,
           groupingExprs,
-          RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
+          RelationalGroupedDataset.PivotType(pivotColumn.expr, values))
       case _: RelationalGroupedDataset.PivotType =>
         throw new UnsupportedOperationException("repeated pivots are not supported")
       case _ =>
```

Review thread on changing this signature from `Seq[Any]` to `Seq[Column]`:

Member: @HyukjinKwon I think this change is better than what #21699 did.

Member: Hm, wouldn't we better allow this? BTW, we should document this in the param and describe the difference clearly in the documentation. Otherwise, the current API change makes the usage potentially quite confusing to me.

Member: I am okay if that's what you all think. It should really be clearly documented if we go ahead with the current way.

Member: Yea, I didn't mean to add another signature. My only worry is that I was thinking we should allow both cases for both APIs. Otherwise, it can be confusing, can't it? These differences should really be clarified.

Contributor: I think @MaxGekk's intention was to keep the old signature as it is, but somehow used "lit".

Member: I think #22030 (comment) makes perfect sense. We really don't need to make it complicated. My question was whether this is your speculation or actual feedback from users, since the original interface has existed for a few years and I haven't seen complaints about this so far as far as I can tell. It's okay if we clearly document this with some examples. It wouldn't necessarily make a difference between the same overloaded APIs.

Member (Author): This is actual feedback from our users who want to do pivoting by multiple columns. They have to use external systems (even Microsoft Excel does it better) for pivoting by many columns for now, because Spark doesn't allow this.

Member: I think the point is whether users really get confused by this or not.
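One way to read the overload debate above: the `Seq[Any]` variant wraps plain values as literals, while the `Seq[Column]` variant takes expressions directly. A small Python sketch (hypothetical `Column`/`lit` stand-ins, not Spark's classes) shows how accepting both and normalizing with a `lit`-style wrapper would make the two call styles equivalent:

```python
from dataclasses import dataclass
from typing import Any, List, Union

@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for an expression column."""
    expr: str

def lit(value: Any) -> Column:
    """Wrap a plain literal value as a Column, like Spark's functions.lit."""
    return Column(expr=repr(value))

def pivot(values: List[Union[Any, Column]]) -> List[Column]:
    """Accept both plain values and Columns; normalize everything to Columns
    so a single implementation can serve both call styles."""
    return [v if isinstance(v, Column) else lit(v) for v in values]

# Both call styles produce the same normalized pivot values.
assert pivot(["dotNET", "Java"]) == pivot([lit("dotNET"), lit("Java")])
```

The PR as written instead keeps the two signatures distinct and has the `Seq[Any]` overload delegate via `values.map(lit)`, which is the behavior the reviewers ask to have documented clearly.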
```diff
@@ -433,7 +454,7 @@ class RelationalGroupedDataset protected[sql](
    * @param values List of values that will be translated to columns in the output DataFrame.
    * @since 2.4.0
    */
-  def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: java.util.List[Column]): RelationalGroupedDataset = {
     pivot(pivotColumn, values.asScala)
   }
```
```diff
@@ -561,5 +582,5 @@ private[sql] object RelationalGroupedDataset {
   /**
    * To indicate it's the PIVOT
    */
-  private[sql] case class PivotType(pivotCol: Expression, values: Seq[Literal]) extends GroupType
+  private[sql] case class PivotType(pivotCol: Expression, values: Seq[Column]) extends GroupType
 }
```
Changes to the Java test suite:
```diff
@@ -306,6 +306,22 @@ public void pivot() {
     Assert.assertEquals(30000.0, actual.get(1).getDouble(2), 0.01);
   }

+  @Test
+  public void pivotColumnValues() {
+    Dataset<Row> df = spark.table("courseSales");
+    List<Row> actual = df.groupBy("year")
+      .pivot(col("course"), Arrays.asList(lit("dotNET"), lit("Java")))
+      .agg(sum("earnings")).orderBy("year").collectAsList();
+
+    Assert.assertEquals(2012, actual.get(0).getInt(0));
+    Assert.assertEquals(15000.0, actual.get(0).getDouble(1), 0.01);
+    Assert.assertEquals(20000.0, actual.get(0).getDouble(2), 0.01);
+
+    Assert.assertEquals(2013, actual.get(1).getInt(0));
+    Assert.assertEquals(48000.0, actual.get(1).getDouble(1), 0.01);
+    Assert.assertEquals(30000.0, actual.get(1).getDouble(2), 0.01);
+  }
+
   private String getResource(String resource) {
     try {
       // The following "getResource" has different behaviors in SBT and Maven.
```

Member (on the `.pivot(...)` chain): nit: 2 space indentation
Further review thread on the `pivot(String, Seq[Any])` overload:

- This is going to allow `pivot(String, Seq[Any])` to also take `Column`. Did I misread the code?
- Yes, you did. This "old" interface only takes in a single named column (say, "a", but not "a+1") by its name, but we turn it into a `Column` just to reuse the same implementation.
- So, we did allow only literals but not generic columns before, right?
- Yes, with `Seq[Any]` we only allow literal values, not `Column`s.
- I think using "lit" here is causing the confusion then (perhaps @MaxGekk was not aware of that?). We should keep the current behavior of this signature as it is. Using `Column(Literal.create(value))` would do.
- Hm? I think we should allow it and then make the assumption in #22030 (comment) stay true.