
Conversation

@mickjermsurawong-stripe (Contributor) commented Jan 9, 2020

What changes were proposed in this pull request?

  • This PR revisits #22309 ([SPARK-20384][SQL] Support value class in schema of Dataset), solving the original problem, but additionally prevents a backward-compatibility break in the schema of a top-level AnyVal value class.
  • Why did the previous PR break compatibility? We currently support top-level value classes just like any other case class: a field of the underlying type is present in the schema, so any dataframe SQL filter on it expects that field name to be present. The previous PR changed this schema and would break current usage. See test "schema for case class that is a value class". This PR keeps the schema.
  • Prior to this change we already support collections of value classes, but not a case class with a nested value class field. The schemas of those classes therefore shouldn't change either, to avoid breaking them. See the tests asserting the schemas before any changes in 0cdad3b
  • However, what we can change without breaking anything is the schema of a nested value class, which currently fails due to the compile problem described below, so its schema today isn't actually usable. After the change, the schema of this nested value class is flattened:
    c7aaae8
  • With this PR, there's flattening only for a nested value class (new), but not for top-level and collection classes (existing behavior); a schema sketch follows this list.
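
A minimal runnable sketch of those schema expectations (my own illustration, not part of the patch; class names mirror the PR's tests, and the local-mode session setup is an assumption):

```scala
import org.apache.spark.sql.SparkSession

object ValueClassSchemaSketch {
  // Hypothetical sketch classes, mirroring the PR's test classes.
  case class StringWrapper(s: String) extends AnyVal
  case class ContainerStringWrapper(wrapper: StringWrapper)
  case class SeqOfValueClass(s: Seq[StringWrapper])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Top-level value class: the underlying field `s` stays in the schema
    // (existing behavior, deliberately kept by this PR).
    println(spark.emptyDataset[StringWrapper].schema)

    // Nested value class: `wrapper` is flattened to its underlying
    // StringType (the new behavior in this PR).
    println(spark.emptyDataset[ContainerStringWrapper].schema)

    // Collection of value classes: schema unchanged by this PR.
    println(spark.emptyDataset[SeqOfValueClass].schema)

    spark.stop()
  }
}
```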

Why are the changes needed?

  • Currently, a nested value class isn't supported. This is because the generated code treats the AnyVal class in its unwrapped form, but we encode the type as the wrapped case class. This results in a compile error in the generated code.
    For example, given an AnyVal wrapper and its root-level container class:
```scala
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)
```

The problematic part of the generated code:

```
    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care about
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We specify its member to be the unwrapped case class extending `AnyVal`
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: `c()` actually returns `int`, hence the compile error below
                value_43 = value_46.c();
            }
        }
```

We get this error: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"

```
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```
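
The mismatch is visible at the language level even without Spark: the accessor for a value-class field erases to the underlying primitive, which is exactly what Janino sees. A small reflection sketch of my own (hypothetical file, not part of the patch):

```scala
// Defined top-level so the value class declaration is legal.
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)

object ErasureCheck extends App {
  // The accessor `c` has JVM signature ()I: it returns int, not IntWrapper,
  // so assigning its result to an IntWrapper local can never compile.
  val m = classOf[ComplexValueClassContainer].getMethod("c")
  println(m.getReturnType) // prints: int
}
```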

Does this PR introduce any user-facing change?

  • Yes, this will fix the bug.

How was this patch tested?

cc @mt40, please let me know if I'm missing something from your previous PR
cc @joshrosen-stripe

@JoshRosen (Contributor)

jenkins this is ok to test

@SparkQA commented Jan 10, 2020

Test build #116417 has finished for PR 27153 at commit 6cb83b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SeqOfValueClass(s: Seq[StringWrapper])
  • case class MapOfValueClassKey(m: Map[IntWrapper, String])
  • case class MapOfValueClassValue(m: Map[String, StringWrapper])
  • case class OptionOfValueClassValue(o: Option[StringWrapper])

```scala
   * @return unwrapped param
   */
  private def unwrapValueClassParam(param: (String, `Type`)): (String, `Type`) = {
    val (name, typ) = param
```


nit: maybe use `tpe` instead of `typ` to be consistent with the naming in this file

```scala
case class StrWrapper(s: String) extends AnyVal

case class ValueClassData(
intField: Int,
```


please fix indentation here

@prestonph

@mickjermsurawong-stripe thanks for continuing with this feature, I hope it will be merged this time

```scala
val df = spark.sparkContext.parallelize(Seq(a, b)).toDF
// flat value class, `s` field is not in schema
val filtered = df.where("wrapper = \"a\"")
checkAnswer(filtered, spark.sparkContext.parallelize(Seq(a)).toDF)
```
@mickjermsurawong-stripe (Contributor, Author)

Before this change we never supported nested value classes:

  • Filtering on `wrapper` would fail with:
org.apache.spark.sql.AnalysisException: cannot resolve '(`wrapper` = 'a')' due to data type mismatch: differing types in '(`wrapper` = 'a')' (struct<s:string> and string).; line 1 pos 0;
  • Filtering on `wrapper.s` would fail with:
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.test.SQLTestData$StringWrapper
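
For contrast, a compact sketch of the post-change behavior (my own illustration; the local SparkSession and class names are assumptions mirroring the tests):

```scala
import org.apache.spark.sql.SparkSession

object NestedFilterSketch {
  case class StringWrapper(s: String) extends AnyVal
  case class ContainerStringWrapper(wrapper: StringWrapper)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ContainerStringWrapper(StringWrapper("a")),
      ContainerStringWrapper(StringWrapper("b"))
    ).toDF()

    // With the flattening in this PR, `wrapper` is plain StringType,
    // so comparing it to a string literal now resolves.
    df.printSchema()
    df.where("wrapper = 'a'").show()

    spark.stop()
  }
}
```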

@mickjermsurawong-stripe (Contributor, Author)

Thanks @mt40!
Also added more tests on dataframe filtering in c23705d to illustrate the expected schemas more explicitly.

@SparkQA commented Jan 14, 2020

Test build #116688 has finished for PR 27153 at commit c23705d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • test(\"Value class filter\")
  • test(\"Array value class filter\")
  • test(\"Nested value class filter\")
  • case class StringWrapper(s: String) extends AnyVal
  • case class ArrayStringWrapper(wrappers: Seq[StringWrapper])
  • case class ContainerStringWrapper(wrapper: StringWrapper)

@HyukjinKwon (Member)

retest this please

@SparkQA commented Jan 20, 2020

Test build #117086 has finished for PR 27153 at commit c23705d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • test(\"Value class filter\")
  • test(\"Array value class filter\")
  • test(\"Nested value class filter\")
  • case class StringWrapper(s: String) extends AnyVal
  • case class ArrayStringWrapper(wrappers: Seq[StringWrapper])
  • case class ContainerStringWrapper(wrapper: StringWrapper)

@mickjermsurawong-stripe (Contributor, Author)

^ bumping this, please. I suspect the patch failure here is flaky; the change since the last build only adds more tests.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 18, 2020
@github-actions github-actions bot closed this May 19, 2020
@eejbyfeldt (Contributor) commented Jun 8, 2021

I am interested in the support for value classes that is added/fixed in this branch. To me the changes look like a still-valid approach to adding this support.

Rebasing this branch on master will cause some of the added test cases to fail. This is due to PR #31766, which fixed the tests so that the interpreted path is exercised properly, and to a bug in CatalystToExternalMap on the interpreted path. I created a PR to address that bug: #32783

If/after the bug fix gets merged, what would be the next step towards getting this merged? Should I open a new PR with the changes from @mickjermsurawong-stripe's branch, since it looks like the original author is no longer around and/or interested in these changes?

@prestonph

@eejbyfeldt Nice to see someone with the same interest in this. I think you should continue working on this branch to avoid duplication, but it's up to you.

In my opinion, the hardest part will be convincing the Spark team that this change is necessary, so that this can be merged once the failing tests are fixed.

@mickjermsurawong-stripe (Contributor, Author)

hi @eejbyfeldt! We've had an equivalent change in our fork running in prod for a while now. One more follow-up we had internally is handling a value class wrapped in a tuple, like Dataset[(MyAnyValKey, SomeOtherCaseClass)]. I could revive this PR and add that patch in the next few days.

@eejbyfeldt (Contributor)

Hi @mickjermsurawong-stripe! Great to hear, looking forward to the updated changes. Also, my PR #32783 got merged, so rebasing this branch on master should no longer fail tests.

@eejbyfeldt (Contributor)

@mickjermsurawong-stripe Have you had any time to look at updating the branch? If you don't have time to prepare a full patch, I could help out with unit tests and such if you can provide some information about approximately which change is needed for the value classes in tuples.

@eejbyfeldt (Contributor)

I looked some more at this and realised what the problem with tuples is; I guess it is all related to when a value class is represented by its underlying type and when it is not: https://docs.scala-lang.org/overviews/core/value-classes.html#allocation-details

Based on this, it seems something like the approach of PR #22309 by @mt40 would be needed to handle all cases correctly. To me it also makes sense that an AnyVal would give the same schema regardless of where it is used, as in that PR. So unless there is some other good suggestion on this, that is the approach I am going to pursue.

@mickjermsurawong-stripe (Contributor, Author) commented Jul 2, 2021

Hey @eejbyfeldt, sorry, I realized that our internal fork addressed this issue in a different class of our custom encoder, so I didn't get to integrate it with the standard encoder here.

The underlying issue we addressed is exactly the allocation behavior you mentioned. From the doc on value classes (https://docs.scala-lang.org/overviews/core/value-classes.html), given class Wrapper(val underlying: Int) extends AnyVal:

  • "The type at compile time is Wrapper, but at runtime, the representation is an Int"
    This implies that when our struct has a field of AnyVal case class, our generated code
    should support the underlying type during runtime execution.

  • The Wrapper "must be instantiated... when a value class is used as a type argument".
    This implies that scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper] will still contain Wrapper as-is in during runtime instead of Int.
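
A small runnable sketch of those two rules (my own illustration; all names are hypothetical):

```scala
object AllocationSketch extends App {
  case class Wrapper(underlying: Int) extends AnyVal
  case class Holder(w: Wrapper)

  // Rule 1: used directly, Wrapper erases to Int. The accessor `w` has
  // JVM return type int, which is what Spark's codegen observes.
  println(classOf[Holder].getMethod("w").getReturnType) // prints: int

  // Rule 2: used as a type argument, Wrapper must be instantiated, so
  // generic containers hold real Wrapper objects at runtime.
  val seq: Seq[Wrapper] = Seq(Wrapper(1))
  val tup: (Wrapper, String) = (Wrapper(2), "x")
  val opt: Option[Wrapper] = Some(Wrapper(3))
  println(seq.head.getClass) // class AllocationSketch$Wrapper
  println(tup._1.getClass)   // class AllocationSketch$Wrapper
  println(opt.get.getClass)  // class AllocationSketch$Wrapper
}
```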

Hope this could help. It's a long weekend here, so I'll make sure to get to work on this. But please feel free to pursue it independently as well.

@mickjermsurawong-stripe (Contributor, Author)

hi @eejbyfeldt, I made another PR, #33205, to address the tuple issue raised.
Would appreciate your review as well.

srowen pushed a commit that referenced this pull request Aug 9, 2021
### What changes were proposed in this pull request?

- This PR revisits #22309 and [SPARK-20384](https://issues.apache.org/jira/browse/SPARK-20384), solving the original problem, but additionally prevents a backward-compatibility break in the schema of a top-level `AnyVal` value class.
- Why did the previous PR break compatibility? We currently support top-level value classes just like any other case class: a field of the underlying type is present in the schema, so any dataframe SQL filter on it expects that field name to be present. The previous PR changed this schema and would break current usage. See test `"schema for case class that is a value class"`. This PR keeps the schema.
- Prior to this change we already support collections of value classes, but not a case class with a nested value class field. The schemas of those classes therefore shouldn't change either, to avoid breaking them.
- However, what we can change without breaking anything is the schema of a nested value class, which currently fails due to the compile problem, so its schema today isn't actually usable. After the change, the schema of this nested value class is flattened.
- With this PR, there's flattening only for a nested value class (new), but not for top-level and collection classes (existing behavior).
- This PR revisits #27153 by handling tuples such as `Tuple2[AnyVal, AnyVal]`, which are nested classes but generic types, so they should not be flattened, behaving similarly to `Seq[AnyVal]`.

### Why are the changes needed?

- Currently, a nested value class isn't supported. This is because the generated code treats the `AnyVal` class in its unwrapped form, but we encode the type as the wrapped case class. This results in a compile error in the generated code.
For example, given an `AnyVal` wrapper and its root-level container class:
```
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)
```
The problematic part of generated code:
```
    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We specify its member to be unwrapped case class extending `AnyVal`
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: `c()` compiled however is of type `int` and thus we see error
                value_43 = value_46.c();
            }
        }
```
We get this error: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```

From the [doc](https://docs.scala-lang.org/overviews/core/value-classes.html) on value classes, given `class Wrapper(val underlying: Int) extends AnyVal`:
1) "The type at compile time is `Wrapper`, but at runtime, the representation is an `Int`". This implies that when our struct has a field of value class, the generated code should support the underlying type during runtime execution.
2) `Wrapper` "must be instantiated... when a value class is used as a type argument". This implies that `scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper]` will still contain `Wrapper` as-is at runtime instead of `Int`.

### Does this PR introduce _any_ user-facing change?

- Yes, this will allow support for the nested value class.

### How was this patch tested?

- Added unit tests to illustrate
  - raw schema
  - projection
  - round-trip encode/decode

Closes #33205 from mickjermsurawong-stripe/SPARK-20384-2.

Lead-authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Aug 9, 2021