-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22826][SQL] findWiderTypeForTwo Fails over StructField of Array #20010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1ac4b0b to
fb48148
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this if t1.sameType(t2) needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make sure the later get won't be applied to None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If t1 and t2 are the sameType, why do we need recursively findTightestCommonType? Can we just make val dataType = pointType1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the reason is for nested structures, for instance an array of array or array of map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then @bdrillard can we add also some test cases for nested structures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use this example, if we call findTightestCommonType on 2 array of array of Integer types, then this dataType = findTightest(findTightest(IntegralType, IntegralType)) = IntegralType, so the final answer is Some(ArrayType(IntegralType))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 Thanks, I got it. But the t1.sameType(t2) judgement has already made sure that the 2 types are the same recursively. So it's not necessary to findTightestCommonType recursively again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gczsjdy I think it is needed, because IIUC in sameType the nullability is not checked to be the same. So it might vary and calling findTightestCommonType does the trick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gczsjdy sameType will ignore the nullability of the two types, only checking if the DataType is the same. We would expect then that the first case of findTightestCommonType would pass the types on through, but during its equality check, it also checks the type nullability, which for ArrayType and MapType, may differ between t1 and t2:
case (t1, t2) if t1 == t1 => Some(t1)
That means we can establish that two StructFields of ArrayType/MapType are the same DataType, but if they have different nullability, then the above case match won't match them, nor will any other case in the match set. So in order to find the tightest common type where nullabilities of the point-types may differ, we'll need to recurse. See a case like:
widenTest(
StructType(StructField("a", ArrayType(StringType, containsNull=true)) :: Nil),
StructType(StructField("a", ArrayType(StringType, containsNull=false)) :: Nil),
Some(StructType(StructField("a", ArrayType(StringType, containsNull=true)) :: Nil)))
At the moment, since we have no case to match ArrayType/MapType where the nullabilities may differ, this case would fail. I can add this test explicitly to the suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your illustration.
c5bb197 to
98709fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it makes sense, but I'd love that in your implementation in findTightestCommonType, you would replicate this logic, ie. removing the sameType guard and using findWiderTypeForTwo, in order to allow casting an array of int to an array of long. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I think that's possible. In order to handle instances with and without string promotion, I think it may be necessary to add a boolean parameter, and then to handle the instances where the pointType/keyType and valueType may result in None, see https://github.com/apache/spark/pull/20010/files#diff-383a8cdd0a9c58cae68e0a79295520a3R105
To support the minor change in function signature for findTightestCommonType, I have to do some refactoring in the tests. Let me know if you think there's a cleaner implementation, but this seems to help localize like concerns into findTightestCommonType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense, but I doubt the solution we call findWiderType.. in findTightest... This seems not reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion: we define a new function, like findWiderTypeForArray. This new function can provide 'findWider' functionality compared to the 'findTightestArray' part(which is basically the first commit in your PR). We won't break the original findTightest semantic in this way and the code is clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your idea @gczsjdy!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gczsjdy I've taken a shot at implementing your suggestion with findWiderTypeForTwoComplex, which takes as an argument a widerTypeFunc, describing which widening behavior to apply to point types (should they permit promotion to string or not).
Because ArrayType instances that would require widening the type could be nested in StructType and MapType, I think it's necessary to have more case matching than would be in findWiderTypeForArray, hence findWiderTypeForTwoComplex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your implementation, except for the concerns I raised in another thread. Have you tested locally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this PR should be ready for a Jenkins build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same sanity check here, as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a test for nested structures where an explicit match case against ArrayType/MapType is necessary due to the difference in nullability between the two structures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, may you please add also two tests having Map and Array as outer types?
5a304e6 to
da36de8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we break the findTightest semantic by withStringPromotion judgement and calling findWiderTypeForTwo, these are what we should add in Case 2 type widening -> findWiderTypeForTwo. I suggest we put these logic in another functionfindWiderTypeForDecimal. Will mention this in the other thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion: we define a new function, like findWiderTypeForArray. This new function can provide 'findWider' functionality compared to the 'findTightestArray' part(which is basically the first commit in your PR). We won't break the original findTightest semantic in this way and the code is clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this cause infinite loop? Calling findWiderTypeForTwo in findWiderTypeForTwo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not. findWiderTypeForTwoComplex will only be called as we operate over "complex" types (i.e arrays, maps, structs), and will only recurse (calling findWiderTypeForTwo) over the child point-types of a complex type, so we ensure the recursive computation gets narrower as it recurses, until eventually terminating at the leaf level of the schema.
… types in a structfield [SPARK-22826] adding nested struct tests [SPARK-22826] removing extraneous test case
[SPARK-22826] adding test for integral type promotion [SPARK-22826] supporting array and map string promotion
This reverts commit da36de8b4035e143efa388b277596dceb185130a.
538194b to
527099d
Compare
| case _ => None | ||
| }) | ||
| .orElse(findWiderTypeForTwoComplex(t1, t2, findWiderTypeForTwo)) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, see dfb4a3b47d88853be348aafd9802f799639693f6
| widenTest(ArrayType(IntegerType), StructType(Seq()), None) | ||
|
|
||
| widenTest( | ||
| ArrayType(StringType, containsNull=true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a whitespace before and after = and the same in the following lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, see dfb4a3b47d88853be348aafd9802f799639693f6
| widenTest( | ||
| MapType(StringType, StringType, valueContainsNull=true), | ||
| MapType(StringType, StringType, valueContainsNull=false), | ||
| Some(MapType(StringType, StringType, valueContainsNull=true))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a test also for struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are tests for nested complex types, https://github.com/apache/spark/pull/20010/files#diff-01ecdd038c5c2f53f38118912210fef8R405
| widenTestWithStringPromotion( | ||
| StructType(StructField("a", ArrayType(LongType)) :: Nil), | ||
| StructType(StructField("a", ArrayType(StringType)) :: Nil), | ||
| Some(StructType(StructField("a", ArrayType(StringType)) :: Nil))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add an analogous test also for map and array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've grouped string promotion tests for arrays and maps, with tests where those types are nested in structs or are fields in structs, https://github.com/apache/spark/pull/20010/files#diff-01ecdd038c5c2f53f38118912210fef8R504
| widenTestWithoutStringPromotion( | ||
| StructType(StructField("a", ArrayType(LongType)) :: Nil), | ||
| StructType(StructField("a", ArrayType(StringType)) :: Nil), | ||
| None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've grouped non-string promotion tests, and added tests for arrays, maps, and then structures nested in those types or having fields of those types, https://github.com/apache/spark/pull/20010/files#diff-01ecdd038c5c2f53f38118912210fef8R563
| StructType(StructField("a", ArrayType(LongType)) :: Nil), | ||
| StructType(StructField("a", ArrayType(StringType)) :: Nil), | ||
| Some(StructType(StructField("a", ArrayType(StringType)) :: Nil))) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this empty line is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, see dfb4a3b47d88853be348aafd9802f799639693f6
|
@bdrillard yes, after addressing my last comments I think we can build this. My only concern is that we are not covering the case in which there are nested complex structures in the wider case. But maybe we can address this later. I don't have permissions to trigger a build. You might check who are the last committers working on this and maybe ask them. |
b7de6dc to
8ce3b39
Compare
|
@mgaido91 Agreed with that concern. I think the last round of tests I've just added covers the permutation of cases well, where we have arrays and maps of structs, and structs of arrays and maps. cc: @viirya, @gatorsmile, Could perhaps one of you trigger a Jenkins build against this PR? |
|
@HyukjinKwon Can you help trigger Jenkins test for this? Thanks. |
|
ok to test |
|
Test build #85447 has finished for PR 20010 at commit
|
|
Thanks @HyukjinKwon |
|
Seems not a regular error? |
8ce3b39 to
86e1929
Compare
|
Test build #85497 has finished for PR 20010 at commit
|
|
@gczsjdy @bdrillard the test errors are valid. In some cases an exception is expected to be thrown, but it isn't, due to the fix. So they should be updated accordingly. |
|
Test build #85514 has finished for PR 20010 at commit
|
|
Test build #85520 has finished for PR 20010 at commit
|
|
Test build #85515 has finished for PR 20010 at commit
|
|
Test build #85521 has finished for PR 20010 at commit
|
|
Test build #85526 has finished for PR 20010 at commit
|
| } | ||
|
|
||
| case (StructType(fields1), StructType(fields2)) => | ||
| val fieldTypes = fields1.zip(fields2).map { case (f1, f2) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this requires that fields with the same name must also be in the same position. Is this assumption correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 That's seems to be the assumption already made in findTightestCommonType. The sameType function on DataType also requires structfields are ordered the same, else it returns false.
The difference here is that we don't require the structfields strictly have the same type, so we can support widening to LongType, StringType, etc. But we do require the fields 1. have the same order, and 2. have the same name (either with strict case, or ignoring case).
|
Will review it today. |
| case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => | ||
| Some(TimestampType) | ||
|
|
||
| case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pointType looks like a weird name, how about elementType or et?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, see 09e49fb
| ArrayType(StringType, containsNull = true), | ||
| ArrayType(ArrayType(IntegerType), containsNull = true), | ||
| ArrayType(ArrayType(LongType), containsNull = false), | ||
| Some(ArrayType(ArrayType(LongType), containsNull = true))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really wanna do this? Is there any other database have this behavior?
I think for complex type, we should ignore the nullable difference, but I'm not sure if we should do type coercion inside complex type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91, thoughts on this?
It's definitely possible for us to revert back to behavior where we don't do IntegerType-to-LongType, xType-to-StringType, etc. promotion inside complex types, which was how a previous form of this PR handled it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more coherent with the behavior in other parts and this is the behavior I would expect. But I think that we should follow @gatorsmile's suggestion and check Hive's behavior first.
|
Test build #85682 has finished for PR 20010 at commit
|
|
Overall, it is reasonable. What is the current behavior in Hive? |
|
@bdrillard I think that we need to know Hive's behavior as per #20010 (comment) to state which is the preferred/right behavior here. |
|
ok to test |
|
Test build #91613 has finished for PR 20010 at commit
|
|
I think this is already fixed. |
|
Can one of the admins verify this patch? |
What changes were proposed in this pull request?
SPARK-22826
Attempting to find the tightest common type over a struct holding fields of
ArrayTypeorMapTypewill fail if the types of those struct fields are not exactly equal. See the JIRA for an example.How was this patch tested?
This patch adds a test case for
ArrayTypeandMapTypefor finding tightest common types.