Skip to content

Conversation

@bdrillard
Copy link

What changes were proposed in this pull request?

SPARK-22826

Attempting to find the tightest common type over a struct holding fields of ArrayType or MapType will fail if the types of those struct fields are not exactly equal. See the JIRA for an example.

How was this patch tested?

This patch adds a test case for ArrayType and MapType for finding tightest common types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this if t1.sameType(t2) needed?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make sure the later get won't be applied to None

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If t1 and t2 are the sameType, why do we need recursively findTightestCommonType? Can we just make val dataType = pointType1?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the reason is for nested structures, for instance an array of array or array of map

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then @bdrillard can we add also some test cases for nested structures?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use this example, if we call findTightestCommonType on 2 array of array of Integer types, then this dataType = findTightest(findTightest(IntegralType, IntegralType)) = IntegralType, so the final answer is Some(ArrayType(IntegralType))?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91, @gczsjdy, sure, I'll add a pair of test cases for nested complex structures for ArrayType and MapType

Copy link

@gczsjdy gczsjdy Dec 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 Thanks, I got it. But the t1.sameType(t2) judgement has already made sure that the 2 types are the same recursively. So it's not necessary to findTightestCommonType recursively again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gczsjdy I think it is needed, because IIUC in sameType the nullability is not checked to be the same. So it might vary and calling findTightestCommonType does the trick.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gczsjdy sameType will ignore the nullability of the two types, only checking if the DataType is the same. We would expect then that the first case of findTightestCommonType would pass the types on through, but during its equality check, it also checks the type nullability, which for ArrayType and MapType, may differ between t1 and t2:

case (t1, t2) if t1 == t1 => Some(t1)

That means we can establish that two StructFields of ArrayType/MapType are the same DataType, but if they have different nullability, then the above case match won't match them, nor will any other case in the match set. So in order to find the tightest common type where nullabilities of the point-types may differ, we'll need to recurse. See a case like:

widenTest(
    StructType(StructField("a", ArrayType(StringType, containsNull=true)) :: Nil),
    StructType(StructField("a", ArrayType(StringType, containsNull=false)) :: Nil),
    Some(StructType(StructField("a", ArrayType(StringType, containsNull=true)) :: Nil)))

At the moment, since we have no case to match ArrayType/MapType where the nullabilities may differ, this case would fail. I can add this test explicitly to the suite.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your illustration.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we now check for ArrayTypes (including MapTypes) in findTightestCommonType, the match here becomes redundant. @mgaido91, @gczsjdy, does this thinking make sense to you both?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes sense, but I'd love that in your implementation in findTightestCommonType, you would replicate this logic, ie. removing the sameType guard and using findWiderTypeForTwo, in order to allow casting an array of int to an array of long. What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think that's possible. In order to handle instances with and without string promotion, I think it may be necessary to add a boolean parameter, and then to handle the instances where the pointType/keyType and valueType may result in None, see https://github.com/apache/spark/pull/20010/files#diff-383a8cdd0a9c58cae68e0a79295520a3R105

To support the minor change in function signature for findTightestCommonType, I have to do some refactoring in the tests. Let me know if you think there's a cleaner implementation, but this seems to help localize like concerns into findTightestCommonType.

Copy link

@gczsjdy gczsjdy Dec 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense, but I doubt the solution we call findWiderType.. in findTightest... This seems not reasonable.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion: we define a new function, like findWiderTypeForArray. This new function can provide 'findWider' functionality compared to the 'findTightestArray' part(which is basically the first commit in your PR). We won't break the original findTightest semantic in this way and the code is clean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your idea @gczsjdy!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gczsjdy I've taken a shot at implementing your suggestion with findWiderTypeForTwoComplex, which takes as an argument a widerTypeFunc, describing which widening behavior to apply to point types (should they permit promotion to string or not).

Because ArrayType instances that would require widening the type could be nested in StructType and MapType, I think it's necessary to have more case matching than would be in findWiderTypeForArray, hence findWiderTypeForTwoComplex.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your implementation, except for the concerns I raised in another thread. Have you tested locally?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this PR should be ready for a Jenkins build.

Copy link
Author

@bdrillard bdrillard Dec 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same sanity check here, as above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a test for nested structures where an explicit match case against ArrayType/MapType is necessary due to the difference in nullability between the two structures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, may you please add also two tests having Map and Array as outer types?

@bdrillard bdrillard force-pushed the SPARK-22826 branch 2 times, most recently from 5a304e6 to da36de8 Compare December 19, 2017 23:14
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we break the findTightest semantic by withStringPromotion judgement and calling findWiderTypeForTwo, these are what we should add in Case 2 type widening -> findWiderTypeForTwo. I suggest we put these logic in another functionfindWiderTypeForDecimal. Will mention this in the other thread.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion: we define a new function, like findWiderTypeForArray. This new function can provide 'findWider' functionality compared to the 'findTightestArray' part(which is basically the first commit in your PR). We won't break the original findTightest semantic in this way and the code is clean.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this cause infinite loop? Calling findWiderTypeForTwo in findWiderTypeForTwo?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not. findWiderTypeForTwoComplex will only be called as we operate over "complex" types (i.e arrays, maps, structs), and will only recurse (calling findWiderTypeForTwo) over the child point-types of a complex type, so we ensure the recursive computation gets narrower as it recurses, until eventually terminating at the leaf level of the schema.

ALeksander Eskilson added 4 commits December 26, 2017 10:54
… types in a structfield

[SPARK-22826] adding nested struct tests

[SPARK-22826] removing extraneous test case
[SPARK-22826] adding test for integral type promotion

[SPARK-22826] supporting array and map string promotion
This reverts commit da36de8b4035e143efa388b277596dceb185130a.
@bdrillard
Copy link
Author

@gczsjdy, @mgaido91 If you all are comfortable with it, I think this PR is in a state where we could trigger a build.

case _ => None
})
.orElse(findWiderTypeForTwoComplex(t1, t2, findWiderTypeForTwo))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line should be removed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, see dfb4a3b47d88853be348aafd9802f799639693f6

widenTest(ArrayType(IntegerType), StructType(Seq()), None)

widenTest(
ArrayType(StringType, containsNull=true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a whitespace before and after = and the same in the following lines

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, see dfb4a3b47d88853be348aafd9802f799639693f6

widenTest(
MapType(StringType, StringType, valueContainsNull=true),
MapType(StringType, StringType, valueContainsNull=false),
Some(MapType(StringType, StringType, valueContainsNull=true)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a test also for struct

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

widenTestWithStringPromotion(
StructType(StructField("a", ArrayType(LongType)) :: Nil),
StructType(StructField("a", ArrayType(StringType)) :: Nil),
Some(StructType(StructField("a", ArrayType(StringType)) :: Nil)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add an analogous test also for map and array

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've grouped string promotion tests for arrays and maps, with tests where those types are nested in structs or are fields in structs, https://github.com/apache/spark/pull/20010/files#diff-01ecdd038c5c2f53f38118912210fef8R504

widenTestWithoutStringPromotion(
StructType(StructField("a", ArrayType(LongType)) :: Nil),
StructType(StructField("a", ArrayType(StringType)) :: Nil),
None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've grouped non-string promotion tests, and added tests for arrays, maps, and then structures nested in those types or having fields of those types, https://github.com/apache/spark/pull/20010/files#diff-01ecdd038c5c2f53f38118912210fef8R563

StructType(StructField("a", ArrayType(LongType)) :: Nil),
StructType(StructField("a", ArrayType(StringType)) :: Nil),
Some(StructType(StructField("a", ArrayType(StringType)) :: Nil)))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this empty line is not needed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, see dfb4a3b47d88853be348aafd9802f799639693f6

@mgaido91
Copy link
Contributor

@bdrillard yes, after addressing my last comments I think we can build this. My only concern is that we are not covering the case in which there are nested complex structures in the wider case. But maybe we can address this later.

I don't have permissions to trigger a build. You might check who are the last committers working on this and maybe ask them.
Thanks.

@bdrillard
Copy link
Author

@mgaido91 Agreed with that concern. I think the last round of tests I've just added covers the permutation of cases well, where we have arrays and maps of structs, and structs of arrays and maps.

cc: @viirya, @gatorsmile, Could perhaps one of you trigger a Jenkins build against this PR?

@viirya
Copy link
Member

viirya commented Dec 27, 2017

@HyukjinKwon Can you help trigger Jenkins test for this? Thanks.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Dec 28, 2017

Test build #85447 has finished for PR 20010 at commit 8ce3b39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Dec 28, 2017

Thanks @HyukjinKwon

@gczsjdy
Copy link

gczsjdy commented Dec 29, 2017

Seems not a regular error?
@bdrillard Maybe you can push a commit and trigger the test again.

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85497 has finished for PR 20010 at commit 86e1929.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor

@gczsjdy @bdrillard the test errors are valid. In some cases an exception is expected to be thrown, but it isn't, due to the fix. So they should be updated accordingly.

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85514 has finished for PR 20010 at commit 583f674.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85520 has finished for PR 20010 at commit e034bbc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85515 has finished for PR 20010 at commit 23c83ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85521 has finished for PR 20010 at commit 7d146e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85526 has finished for PR 20010 at commit d42dfa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

case (StructType(fields1), StructType(fields2)) =>
val fieldTypes = fields1.zip(fields2).map { case (f1, f2) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this requires that fields with the same name must also be in the same position. Is this assumption correct?

Copy link
Author

@bdrillard bdrillard Jan 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 That's seems to be the assumption already made in findTightestCommonType. The sameType function on DataType also requires structfields are ordered the same, else it returns false.

The difference here is that we don't require the structfields strictly have the same type, so we can support widening to LongType, StringType, etc. But we do require the fields 1. have the same order, and 2. have the same name (either with strict case, or ignoring case).

@marmbrus
Copy link
Contributor

marmbrus commented Jan 3, 2018

/cc @cloud-fan @sameeragarwal

@gatorsmile
Copy link
Member

Will review it today.

case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
Some(TimestampType)

case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pointType looks like a weird name, how about elementType or et?

Copy link
Author

@bdrillard bdrillard Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, see 09e49fb

ArrayType(StringType, containsNull = true),
ArrayType(ArrayType(IntegerType), containsNull = true),
ArrayType(ArrayType(LongType), containsNull = false),
Some(ArrayType(ArrayType(LongType), containsNull = true)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really wanna do this? Is there any other database have this behavior?

I think for complex type, we should ignore the nullable difference, but I'm not sure if we should do type coercion inside complex type.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91, thoughts on this?

It's definitely possible for us to revert back to behavior where we don't do IntegerType-to-LongType, xType-to-StringType, etc. promotion inside complex types, which was how a previous form of this PR handled it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more coherent with the behavior in other parts and this is the behavior I would expect. But I think that we should follow @gatorsmile's suggestion and check Hive's behavior first.

@SparkQA
Copy link

SparkQA commented Jan 4, 2018

Test build #85682 has finished for PR 20010 at commit 09e49fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Overall, it is reasonable. What is the current behavior in Hive?

@bdrillard
Copy link
Author

Is there a consensus on the preferred behavior here? This issue would also be a blocker to encoders for Spark-Avro in Spark 2.3.0 that @marmbrus mentions in #20085.

@mgaido91
Copy link
Contributor

@bdrillard I think that we need to know Hive's behavior as per #20010 (comment) to state which is the preferred/right behavior here.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jun 9, 2018

Test build #91613 has finished for PR 20010 at commit 09e49fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

I think this is already fixed.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants