[SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values #32783

eejbyfeldt · 2021-06-04T14:37:11Z

What changes were proposed in this pull request?

Use the key/value LambdaFunction to convert the elements instead of
using CatalystTypeConverters.createToScalaConverter. This is how it is
done in MapObjects and that correctly handles Arrays with case classes.

Why are the changes needed?

Before these changes the added test cases would fail with the following:

[info] - encode/decode for map with case class as value: Map(1 -> IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds)
[info]   Encoded/Decoded data does not match input data
[info]   
[info]   in:  Map(1 -> IntAndString(1,a))
[info]   out: Map(1 -> [1,a])
[info]   types: scala.collection.immutable.Map$Map1
[info]   
[info]   Encoded Data: [org.apache.spark.sql.catalyst.expressions.UnsafeMapData@5ecf5d9e]
[info]   Schema: value#823
[info]   root
[info]   -- value: map (nullable = true)
[info]       |-- key: integer
[info]       |-- value: struct (valueContainsNull = true)
[info]       |    |-- i: integer (nullable = false)
[info]       |    |-- s: string (nullable = true)
[info]   
[info]   
[info]   fromRow Expressions:
[info]   catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, map<int,struct<i:int,s:string>>, true], interface scala.collection.immutable.Map
[info]   :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178)
[info]   :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178)
[info]   :- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)
[info]   :- if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString)
[info]   :  :- isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))
[info]   :  :  +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)
[info]   :  :- null
[info]   :  +- newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString)
[info]   :     :- assertnotnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i)
[info]   :     :  +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i
[info]   :     :     +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)
[info]   :     +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s.toString
[info]   :        +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s
[info]   :           +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)
[info]   +- input[0, map<int,struct<i:int,s:string>>, true] (ExpressionEncoderSuite.scala:627)

So using a map with case classes as keys or values on the interpreted path would incorrectly deserialize data from the Catalyst representation.
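To illustrate the mismatch outside of Spark, here is a minimal sketch. All names in it (`RowLike`, `genericConvert`, `valueLambda`, `ConversionSketch`) are hypothetical stand-ins and not Spark APIs; `IntAndString` only mirrors the shape of the test class. It assumes a struct can be modelled as a plain sequence of field values, and it shows why a schema-driven converter can only yield a generic row while a deserializer-driven function can rebuild the case class.

```scala
// Hypothetical stand-ins for illustration; none of these are Spark classes.
final case class IntAndString(i: Int, s: String)

// Models a generic row: field values with no knowledge of the target class.
final case class RowLike(values: Seq[Any]) {
  override def toString: String = values.mkString("[", ",", "]")
}

object ConversionSketch {
  // Schema-driven conversion: a struct can only become a generic row.
  def genericConvert(struct: Seq[Any]): Any = RowLike(struct)

  // Deserializer-driven conversion: plays the role of the value lambda
  // function, which knows the struct should become an IntAndString.
  def valueLambda(struct: Seq[Any]): IntAndString =
    IntAndString(struct(0).asInstanceOf[Int], struct(1).asInstanceOf[String])

  // A map whose value is stored as a struct with fields (i, s).
  val stored: Map[Int, Seq[Any]] = Map(1 -> Seq(1, "a"))

  // Old behaviour in this model: values come back as generic rows.
  val viaGeneric: Map[Int, Any] =
    stored.map { case (k, v) => k -> genericConvert(v) }

  // Fixed behaviour in this model: values come back as case class instances.
  val viaLambda: Map[Int, IntAndString] =
    stored.map { case (k, v) => k -> valueLambda(v) }
}
```

In this model `viaGeneric` prints as `Map(1 -> [1,a])`, matching the `out:` line of the failing test, while `viaLambda` equals the expected `Map(1 -> IntAndString(1,a))`.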

Does this PR introduce any user-facing change?

Yes, it fixes the bug.

How was this patch tested?

Existing and new unit tests in the ExpressionEncoderSuite.

dongjoon-hyun · 2021-06-04T21:08:00Z

ok to test

dongjoon-hyun · 2021-06-04T21:08:48Z

Thank you for making a PR, @eejbyfeldt .

SparkQA · 2021-06-04T21:54:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43874/

SparkQA · 2021-06-04T22:29:01Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43874/

SparkQA · 2021-06-05T01:14:11Z

Test build #139352 has finished for PR 32783 at commit e95d7f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-06-08T04:58:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

-      val keyArray = result.keyArray()
-      val valueArray = result.valueArray()
-      var i = 0
-      while (i < result.numElements()) {


We tend to use while loops for perf-intensive code. Does the proposed code cause any perf overhead?

I am not sure whether it does. To be safe, I updated the PR to use a while loop instead. The style is now also closer to what was there before.

The latest one looks fine. Thanks ;)
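For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of an index-based while loop that walks two parallel arrays and applies a converter to each element. It is not the code from the PR: `WhileLoopSketch`, `toMap`, `keyFn`, and `valueFn` are hypothetical names, and plain Scala arrays stand in for the key and value arrays of the map data.

```scala
object WhileLoopSketch {
  // Builds an immutable Map from two parallel arrays, converting each key and
  // value with the supplied functions. An explicit index-based while loop
  // avoids allocating intermediate collections or iterators.
  def toMap[K, V, A, B](keys: Array[K], values: Array[V])(
      keyFn: K => A,
      valueFn: V => B): Map[A, B] = {
    require(keys.length == values.length, "keys and values must have equal length")
    val builder = Map.newBuilder[A, B]
    builder.sizeHint(keys.length)
    var i = 0
    while (i < keys.length) {
      val entry = keyFn(keys(i)) -> valueFn(values(i))
      builder += entry
      i += 1
    }
    builder.result()
  }
}
```

A call such as `WhileLoopSketch.toMap(Array(1, 2), Array("a", "b"))(_ + 1, _.toUpperCase)` converts both sides in a single pass. Whether this is measurably faster than a `zip`/`map` pipeline depends on the workload, so the thread's choice is best read as a conservative default.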

maropu · 2021-06-08T05:00:04Z

Nice catch, @eejbyfeldt and thank you for your contribution.

maropu · 2021-06-08T05:00:20Z

cc: @viirya

maropu

Looks fine if the tests pass.

SparkQA · 2021-06-08T07:27:29Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43979/

SparkQA · 2021-06-08T10:20:11Z

Test build #139456 has finished for PR 32783 at commit 9cc2484.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

eejbyfeldt · 2021-06-08T13:52:10Z

I would think the test failure was caused by the outage of python infrastructure (https://status.python.org/) and not my code. How can I retest this branch?

maropu · 2021-06-08T23:45:40Z

retest this please

maropu · 2021-06-08T23:46:19Z

> I would think the test failure was caused by the outage of python infrastructure (https://status.python.org/) and not my code. How can I retest this branch?

Yea, the PR looks fine cuz the GA tests passed.

SparkQA · 2021-06-09T01:15:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44038/

SparkQA · 2021-06-09T01:24:45Z

Test build #139513 has finished for PR 32783 at commit 9cc2484.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-06-09T01:51:17Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44038/

viirya

StructConverter converts from Catalyst struct to a Row. Will it be a behavior change? Although it is Catalyst expression.

eejbyfeldt · 2021-06-10T07:45:47Z

> StructConverter converts from Catalyst struct to a Row. Will it be a behavior change? Although it is Catalyst expression.

From my understanding this is exactly the bug being fixed, so there will be a behavior change. But the behavior changes such that the interpreted path and the codegen path now have the same behavior. The thinking is that the old behavior was undesired and incorrect.

My understanding of the failure in the test cases (if added without the patch):

[info] - encode/decode for map with case class as value: Map(1 -> IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds)
[info]   Encoded/Decoded data does not match input data
[info]   
[info]   in:  Map(1 -> IntAndString(1,a))
[info]   out: Map(1 -> [1,a])

is that the value of the Map was converted to [1,a], which I believe is an InternalRow, instead of the expected IntAndString. Is this the change you are referring to?

The reason I believe my change is correct is that using the key/value LambdaFunction is how it is done inside MapObjects:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L820-L826

Please correct me if I am wrong about anything or misunderstood the question as I am new to the code base.

viirya · 2021-06-10T08:02:03Z

I think the new behavior looks more correct. I am just wondering how the change affects developers or users. Usually, for a user-facing change, we document it in the migration guide. This is a change to a single Catalyst expression, so I am not sure whether the migration guide is suitable here.

cc @cloud-fan

cloud-fan

LGTM. I believe this is a bug fix instead of a breaking change.

viirya · 2021-06-10T16:36:33Z

Thanks @eejbyfeldt for your first contribution! I added you to the contributor list on JIRA and assigned this ticket to you. Welcome to the Spark community.

Thanks @maropu @cloud-fan for review. Merging to master, 3.1, 3.0.

…or Map with case classes as keys or values

Closes #32783 from eejbyfeldt/fix-interpreted-path-for-map-with-case-classes.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit e2e3fe7)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

maropu · 2021-06-10T23:30:37Z

Thank you, all~

Fix encoder (interpreted path) for Map with case classes

e95d7f4

Used the key/valueLambdaFunction to convert the elements instead of using CatalystTypeConverters.createToScalaConverter. This is how it is done in MapObjects and that correctly handles Arrays with case classes.

github-actions bot added the SQL label Jun 4, 2021

maropu reviewed Jun 8, 2021

Use while loop

9cc2484

maropu approved these changes Jun 8, 2021

eejbyfeldt mentioned this pull request Jun 8, 2021

[SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) #27153

Closed

viirya reviewed Jun 9, 2021

cloud-fan approved these changes Jun 10, 2021

viirya approved these changes Jun 10, 2021

viirya closed this in e2e3fe7 Jun 10, 2021
