-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values #32783
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Used the key/valueLambdaFunction to convert the elements instead of using CatalystTypeConverters.createToScalaConverter. This is how it is done in MapObjects and that correctly handles Arrays with case classes.
|
ok to test |
|
Thank you for making a PR, @eejbyfeldt . |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #139352 has finished for PR 32783 at commit
|
| val keyArray = result.keyArray() | ||
| val valueArray = result.valueArray() | ||
| var i = 0 | ||
| while (i < result.numElements()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We tend to use while for perf-intensive code. The proposed code does not cause any perf overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I am not sure whether it does or not. But I updated the PR to use a while instead, to be sure. Now the style is also more similar to what was there before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest one looks fine. Thanks ;)
|
Nice catch, @eejbyfeldt and thank you for your contribution. |
|
cc: @viirya |
maropu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine if the tests pass.
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #139456 has finished for PR 32783 at commit
|
|
I would think the test failure was caused by the outage of python infrastructure (https://status.python.org/) and not my code. How can I retest this branch? |
|
retest this please |
Yea, the PR looks fine cuz the GA tests passed. |
|
Kubernetes integration test starting |
|
Test build #139513 has finished for PR 32783 at commit
|
|
Kubernetes integration test status success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StructConverter converts from Catalyst struct to a Row. Will it be a behavior change? Although it is Catalyst expression.
From my understanding this is exactly the bug being fixed. So there will be behavior change. But the behavior is changed such that the interpreted path and the code gen path has the same behavior. The thinking being that the old behavior was undesired and incorrect. My understanding of the failure the test cases (if added without the patch): Is that the value of the The reason I believe my change is the correct is that using the Please correct me if I am wrong about anything or misunderstood the question as I am new to the code base. |
|
I think the new behavior looks more correct. Just wondering how the change affects developers or users. Usually for user-facing change, we can document in migration guide. This is per Catalyst expression change, not sure if migration guide is suitable too. cc @cloud-fan |
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I believe this a bug fix instead of a breaking change.
|
Thanks @eejbyfeldt for your first contribution! Added you into contributor list on JIRA and assigned this ticket to you. Welcome to Spark community. Thanks @maropu @cloud-fan for review. Merging to master, 3.1, 3.0. |
…or Map with case classes as keys or values ### What changes were proposed in this pull request? Use the key/value LambdaFunction to convert the elements instead of using CatalystTypeConverters.createToScalaConverter. This is how it is done in MapObjects and that correctly handles Arrays with case classes. ### Why are the changes needed? Before these changes the added test cases would fail with the following: ``` [info] - encode/decode for map with case class as value: Map(1 -> IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds) [info] Encoded/Decoded data does not match input data [info] [info] in: Map(1 -> IntAndString(1,a)) [info] out: Map(1 -> [1,a]) [info] types: scala.collection.immutable.Map$Map1 [info] [info] Encoded Data: [org.apache.spark.sql.catalyst.expressions.UnsafeMapData5ecf5d9e] [info] Schema: value#823 [info] root [info] -- value: map (nullable = true) [info] |-- key: integer [info] |-- value: struct (valueContainsNull = true) [info] | |-- i: integer (nullable = false) [info] | |-- s: string (nullable = true) [info] [info] [info] fromRow Expressions: [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, map<int,struct<i:int,s:string>>, true], interface scala.collection.immutable.Map [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : :- null [info] : +- newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s.toString [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] +- input[0, map<int,struct<i:int,s:string>>, true] (ExpressionEncoderSuite.scala:627) ``` So using a map with cases classes for keys or values and using the interpreted path would incorrect deserialize data from the catalyst representation. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the bug. ### How was this patch tested? Existing and new unit tests in the ExpressionEncoderSuite Closes #32783 from eejbyfeldt/fix-interpreted-path-for-map-with-case-classes. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit e2e3fe7) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…or Map with case classes as keys or values ### What changes were proposed in this pull request? Use the key/value LambdaFunction to convert the elements instead of using CatalystTypeConverters.createToScalaConverter. This is how it is done in MapObjects and that correctly handles Arrays with case classes. ### Why are the changes needed? Before these changes the added test cases would fail with the following: ``` [info] - encode/decode for map with case class as value: Map(1 -> IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds) [info] Encoded/Decoded data does not match input data [info] [info] in: Map(1 -> IntAndString(1,a)) [info] out: Map(1 -> [1,a]) [info] types: scala.collection.immutable.Map$Map1 [info] [info] Encoded Data: [org.apache.spark.sql.catalyst.expressions.UnsafeMapData5ecf5d9e] [info] Schema: value#823 [info] root [info] -- value: map (nullable = true) [info] |-- key: integer [info] |-- value: struct (valueContainsNull = true) [info] | |-- i: integer (nullable = false) [info] | |-- s: string (nullable = true) [info] [info] [info] fromRow Expressions: [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, map<int,struct<i:int,s:string>>, true], interface scala.collection.immutable.Map [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : :- null [info] : +- newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s.toString [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] +- input[0, map<int,struct<i:int,s:string>>, true] (ExpressionEncoderSuite.scala:627) ``` So using a map with cases classes for keys or values and using the interpreted path would incorrect deserialize data from the catalyst representation. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the bug. ### How was this patch tested? Existing and new unit tests in the ExpressionEncoderSuite Closes #32783 from eejbyfeldt/fix-interpreted-path-for-map-with-case-classes. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit e2e3fe7) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
|
Thank you, all~ |
…or Map with case classes as keys or values ### What changes were proposed in this pull request? Use the key/value LambdaFunction to convert the elements instead of using CatalystTypeConverters.createToScalaConverter. This is how it is done in MapObjects and that correctly handles Arrays with case classes. ### Why are the changes needed? Before these changes the added test cases would fail with the following: ``` [info] - encode/decode for map with case class as value: Map(1 -> IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds) [info] Encoded/Decoded data does not match input data [info] [info] in: Map(1 -> IntAndString(1,a)) [info] out: Map(1 -> [1,a]) [info] types: scala.collection.immutable.Map$Map1 [info] [info] Encoded Data: [org.apache.spark.sql.catalyst.expressions.UnsafeMapData5ecf5d9e] [info] Schema: value#823 [info] root [info] -- value: map (nullable = true) [info] |-- key: integer [info] |-- value: struct (valueContainsNull = true) [info] | |-- i: integer (nullable = false) [info] | |-- s: string (nullable = true) [info] [info] [info] fromRow Expressions: [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, map<int,struct<i:int,s:string>>, true], interface scala.collection.immutable.Map [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : :- null [info] : +- newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s.toString [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] +- input[0, map<int,struct<i:int,s:string>>, true] (ExpressionEncoderSuite.scala:627) ``` So using a map with cases classes for keys or values and using the interpreted path would incorrect deserialize data from the catalyst representation. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the bug. ### How was this patch tested? Existing and new unit tests in the ExpressionEncoderSuite Closes apache#32783 from eejbyfeldt/fix-interpreted-path-for-map-with-case-classes. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit e2e3fe7) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…or Map with case classes as keys or values ### What changes were proposed in this pull request? Use the key/value LambdaFunction to convert the elements instead of using CatalystTypeConverters.createToScalaConverter. This is how it is done in MapObjects and that correctly handles Arrays with case classes. ### Why are the changes needed? Before these changes the added test cases would fail with the following: ``` [info] - encode/decode for map with case class as value: Map(1 -> IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds) [info] Encoded/Decoded data does not match input data [info] [info] in: Map(1 -> IntAndString(1,a)) [info] out: Map(1 -> [1,a]) [info] types: scala.collection.immutable.Map$Map1 [info] [info] Encoded Data: [org.apache.spark.sql.catalyst.expressions.UnsafeMapData5ecf5d9e] [info] Schema: value#823 [info] root [info] -- value: map (nullable = true) [info] |-- key: integer [info] |-- value: struct (valueContainsNull = true) [info] | |-- i: integer (nullable = false) [info] | |-- s: string (nullable = true) [info] [info] [info] fromRow Expressions: [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, map<int,struct<i:int,s:string>>, true], interface scala.collection.immutable.Map [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : :- null [info] : +- newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s.toString [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] +- input[0, map<int,struct<i:int,s:string>>, true] (ExpressionEncoderSuite.scala:627) ``` So using a map with cases classes for keys or values and using the interpreted path would incorrect deserialize data from the catalyst representation. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the bug. ### How was this patch tested? Existing and new unit tests in the ExpressionEncoderSuite Closes apache#32783 from eejbyfeldt/fix-interpreted-path-for-map-with-case-classes. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit e2e3fe7) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
What changes were proposed in this pull request?
Use the key/value LambdaFunction to convert the elements instead of
using CatalystTypeConverters.createToScalaConverter. This is how it is
done in MapObjects and that correctly handles Arrays with case classes.
Why are the changes needed?
Before these changes the added test cases would fail with the following:
So using a map with cases classes for keys or values and using the interpreted path would incorrect deserialize data from the catalyst representation.
Does this PR introduce any user-facing change?
Yes, it fixes the bug.
How was this patch tested?
Existing and new unit tests in the ExpressionEncoderSuite