
Conversation

@mickjermsurawong-stripe (Contributor) commented Jan 9, 2020

What changes were proposed in this pull request?

  • This PR revisits #22309 ([SPARK-20384][SQL] Support value class in schema of Dataset), solving the original problem, but additionally prevents a backward-compatibility break in the schema of a top-level AnyVal value class.
  • Why did the previous PR break compatibility? We currently support top-level value classes just like any other case class: a field of the underlying type is present in the schema, so any dataframe SQL filter on it expects that field name to be present. The previous PR changed this schema and would break current usage. See test "schema for case class that is a value class". This PR keeps the schema.
  • Prior to this change we already support collections of value classes, but not a case class with a nested value class field. The schemas of those classes therefore shouldn't change either, to avoid breaking them. See the tests asserting the schemas before any changes in 0cdad3b
  • However, what we can change without breaking anything is the schema of a nested value class, which currently fails due to the compile problem described below, so its schema today isn't actually usable. After the change, the schema of this nested value class is flattened:
    c7aaae8
  • With this PR, there's flattening only for a nested value class (new), but not for top-level and collection classes (existing behavior); a schema sketch follows this list.
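
A minimal runnable sketch of those schema expectations (my own illustration, not part of the patch; class names mirror the PR's tests, and the local-mode session setup is an assumption):

```scala
import org.apache.spark.sql.SparkSession

object ValueClassSchemaSketch {
  // Hypothetical sketch classes, mirroring the PR's test classes.
  case class StringWrapper(s: String) extends AnyVal
  case class ContainerStringWrapper(wrapper: StringWrapper)
  case class SeqOfValueClass(s: Seq[StringWrapper])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Top-level value class: the underlying field `s` stays in the schema
    // (existing behavior, deliberately kept by this PR).
    println(spark.emptyDataset[StringWrapper].schema)

    // Nested value class: `wrapper` is flattened to its underlying
    // StringType (the new behavior in this PR).
    println(spark.emptyDataset[ContainerStringWrapper].schema)

    // Collection of value classes: schema unchanged by this PR.
    println(spark.emptyDataset[SeqOfValueClass].schema)

    spark.stop()
  }
}
```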

Why are the changes needed?

  • Currently, a nested value class isn't supported. This is because the generated code treats the AnyVal class in its unwrapped form, but we encode the type as the wrapped case class. This results in a compile error in the generated code.
    For example, given an AnyVal wrapper and its root-level container class:
```scala
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)
```

The problematic part of the generated code:

```
    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care about
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We specify its member to be the unwrapped case class extending `AnyVal`
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: `c()` actually returns `int`, hence the compile error below
                value_43 = value_46.c();
            }
        }
```

We get this error: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"

```
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```
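
The mismatch is visible at the language level even without Spark: the accessor for a value-class field erases to the underlying primitive, which is exactly what Janino sees. A small reflection sketch of my own (hypothetical file, not part of the patch):

```scala
// Defined top-level so the value class declaration is legal.
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)

object ErasureCheck extends App {
  // The accessor `c` has JVM signature ()I: it returns int, not IntWrapper,
  // so assigning its result to an IntWrapper local can never compile.
  val m = classOf[ComplexValueClassContainer].getMethod("c")
  println(m.getReturnType) // prints: int
}
```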

Does this PR introduce any user-facing change?

  • Yes, this will fix the bug.

How was this patch tested?

cc @mt40, please let me know if I'm missing something from your previous PR
cc @joshrosen-stripe

@JoshRosen (Contributor)

jenkins this is ok to test

@SparkQA commented Jan 10, 2020

Test build #116417 has finished for PR 27153 at commit 6cb83b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SeqOfValueClass(s: Seq[StringWrapper])
  • case class MapOfValueClassKey(m: Map[IntWrapper, String])
  • case class MapOfValueClassValue(m: Map[String, StringWrapper])
  • case class OptionOfValueClassValue(o: Option[StringWrapper])

```scala
   * @return unwrapped param
   */
  private def unwrapValueClassParam(param: (String, `Type`)): (String, `Type`) = {
    val (name, typ) = param
```


nit: maybe use `tpe` instead of `typ` to be consistent with the naming in this file

```scala
case class StrWrapper(s: String) extends AnyVal

case class ValueClassData(
intField: Int,
```


please fix indentation here

@prestonph

@mickjermsurawong-stripe thanks for continuing with this feature, I hope it will be merged this time

```scala
val df = spark.sparkContext.parallelize(Seq(a, b)).toDF
// flat value class, `s` field is not in schema
val filtered = df.where("wrapper = \"a\"")
checkAnswer(filtered, spark.sparkContext.parallelize(Seq(a)).toDF)
```
@mickjermsurawong-stripe (Contributor, Author)

Before this change we never supported nested value classes:

  • Filtering on `wrapper` would fail with:
org.apache.spark.sql.AnalysisException: cannot resolve '(`wrapper` = 'a')' due to data type mismatch: differing types in '(`wrapper` = 'a')' (struct<s:string> and string).; line 1 pos 0;
  • Filtering on `wrapper.s` would fail with:
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.test.SQLTestData$StringWrapper
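
For contrast, a compact sketch of the post-change behavior (my own illustration; the local SparkSession and class names are assumptions mirroring the tests):

```scala
import org.apache.spark.sql.SparkSession

object NestedFilterSketch {
  case class StringWrapper(s: String) extends AnyVal
  case class ContainerStringWrapper(wrapper: StringWrapper)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ContainerStringWrapper(StringWrapper("a")),
      ContainerStringWrapper(StringWrapper("b"))
    ).toDF()

    // With the flattening in this PR, `wrapper` is plain StringType,
    // so comparing it to a string literal now resolves.
    df.printSchema()
    df.where("wrapper = 'a'").show()

    spark.stop()
  }
}
```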

@mickjermsurawong-stripe (Contributor, Author)

Thanks @mt40!
Also added more tests on dataframe filtering in c23705d to illustrate the expected schemas more explicitly.

@SparkQA commented Jan 14, 2020

Test build #116688 has finished for PR 27153 at commit c23705d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • test(\"Value class filter\")
  • test(\"Array value class filter\")
  • test(\"Nested value class filter\")
  • case class StringWrapper(s: String) extends AnyVal
  • case class ArrayStringWrapper(wrappers: Seq[StringWrapper])
  • case class ContainerStringWrapper(wrapper: StringWrapper)

@HyukjinKwon (Member)

retest this please

@SparkQA commented Jan 20, 2020

Test build #117086 has finished for PR 27153 at commit c23705d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • test(\"Value class filter\")
  • test(\"Array value class filter\")
  • test(\"Nested value class filter\")
  • case class StringWrapper(s: String) extends AnyVal
  • case class ArrayStringWrapper(wrappers: Seq[StringWrapper])
  • case class ContainerStringWrapper(wrapper: StringWrapper)

@mickjermsurawong-stripe (Contributor, Author)

^ bumping this, please. I suspect the patch failure here is flaky; the change since the last build only adds more tests.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 18, 2020
@github-actions github-actions bot closed this May 19, 2020
@eejbyfeldt (Contributor) commented Jun 8, 2021

I am interested in the support for value classes that is added/fixed in this branch. To me the changes look like a still-valid approach to adding this support.

Rebasing this branch on master will cause some of the added test cases to fail. This is due to PR #31766, which fixed the tests so that the interpreted path is exercised properly, and to a bug in CatalystToExternalMap on the interpreted path. I created a PR to address that bug: #32783

If/after the bug fix gets merged, what would be the next step towards getting this merged? Should I open a new PR with the changes from @mickjermsurawong-stripe's branch, since it looks like the original author is no longer around and/or interested in these changes?

@prestonph

@eejbyfeldt Nice to see someone with the same interest in this. I think you should continue working on this branch to avoid duplication, but it's up to you.

In my opinion, the hardest part will be convincing the Spark team that this change is necessary, so that this can be merged once the failing tests are fixed.

@mickjermsurawong-stripe (Contributor, Author)

hi @eejbyfeldt! We've had an equivalent change in our fork running in prod for a while now. One more follow-up we had internally is handling a value class wrapped in a tuple, like Dataset[(MyAnyValKey, SomeOtherCaseClass)]. I could revive this PR and add that patch in the next few days.

@eejbyfeldt (Contributor)

Hi @mickjermsurawong-stripe! Great to hear, looking forward to the updated changes. Also, my PR #32783 got merged, so rebasing this branch on master should no longer fail tests.

@eejbyfeldt (Contributor)

@mickjermsurawong-stripe Have you had any time to look at updating the branch? If you don't have time to prepare a full patch, I could help out with unit tests and such if you can provide some information about approximately which change is needed for the value classes in tuples.

@eejbyfeldt (Contributor)

I looked some more at this and realised what the problem with tuples is; I guess it is all related to when a value class is represented by its underlying type and when it is not: https://docs.scala-lang.org/overviews/core/value-classes.html#allocation-details

Based on this, it seems something like the approach of PR #22309 by @mt40 would be needed to handle all cases correctly. To me it also makes sense that an AnyVal would give the same schema regardless of where it is used, as in that PR. So unless there is some other good suggestion on this, that is the approach I am going to pursue.

@mickjermsurawong-stripe (Contributor, Author) commented Jul 2, 2021

Hey @eejbyfeldt, sorry, I realized that our internal fork addressed this issue in a different class of our custom encoder, so I didn't get to integrate it with the standard encoder here.

The underlying issue we addressed is exactly the allocation behavior you mentioned. From the doc on value classes (https://docs.scala-lang.org/overviews/core/value-classes.html), given class Wrapper(val underlying: Int) extends AnyVal:

  • "The type at compile time is Wrapper, but at runtime, the representation is an Int"
    This implies that when our struct has a field of AnyVal case class, our generated code
    should support the underlying type during runtime execution.

  • The Wrapper "must be instantiated... when a value class is used as a type argument".
    This implies that scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper] will still contain Wrapper as-is in during runtime instead of Int.
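
A small runnable sketch of those two rules (my own illustration; all names are hypothetical):

```scala
object AllocationSketch extends App {
  case class Wrapper(underlying: Int) extends AnyVal
  case class Holder(w: Wrapper)

  // Rule 1: used directly, Wrapper erases to Int. The accessor `w` has
  // JVM return type int, which is what Spark's codegen observes.
  println(classOf[Holder].getMethod("w").getReturnType) // prints: int

  // Rule 2: used as a type argument, Wrapper must be instantiated, so
  // generic containers hold real Wrapper objects at runtime.
  val seq: Seq[Wrapper] = Seq(Wrapper(1))
  val tup: (Wrapper, String) = (Wrapper(2), "x")
  val opt: Option[Wrapper] = Some(Wrapper(3))
  println(seq.head.getClass) // class AllocationSketch$Wrapper
  println(tup._1.getClass)   // class AllocationSketch$Wrapper
  println(opt.get.getClass)  // class AllocationSketch$Wrapper
}
```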

Hope this could help. It's a long weekend here, so I'll make sure to get to work on this. But please feel free to pursue it independently as well.

@mickjermsurawong-stripe (Contributor, Author)

hi @eejbyfeldt, I made another PR, #33205, to address the tuple issue raised.
Would appreciate your review as well.

srowen pushed a commit that referenced this pull request Aug 9, 2021
### What changes were proposed in this pull request?

- This PR revisits #22309 and [SPARK-20384](https://issues.apache.org/jira/browse/SPARK-20384), solving the original problem, but additionally prevents a backward-compatibility break in the schema of a top-level `AnyVal` value class.
- Why did the previous PR break compatibility? We currently support top-level value classes just like any other case class: a field of the underlying type is present in the schema, so any dataframe SQL filter on it expects that field name to be present. The previous PR changed this schema and would break current usage. See test `"schema for case class that is a value class"`. This PR keeps the schema.
- Prior to this change we already support collections of value classes, but not a case class with a nested value class field. The schemas of those classes therefore shouldn't change either, to avoid breaking them.
- However, what we can change without breaking anything is the schema of a nested value class, which currently fails due to the compile problem, so its schema today isn't actually usable. After the change, the schema of this nested value class is flattened.
- With this PR, there's flattening only for a nested value class (new), but not for top-level and collection classes (existing behavior).
- This PR revisits #27153 by handling tuples such as `Tuple2[AnyVal, AnyVal]`, which are nested classes but generic types, so they should not be flattened, behaving similarly to `Seq[AnyVal]`.

### Why are the changes needed?

- Currently, a nested value class isn't supported. This is because the generated code treats the `AnyVal` class in its unwrapped form, but we encode the type as the wrapped case class. This results in a compile error in the generated code.
For example, given an `AnyVal` wrapper and its root-level container class:
```
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)
```
The problematic part of generated code:
```
    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We specify its member to be unwrapped case class extending `AnyVal`
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: `c()` compiled however is of type `int` and thus we see error
                value_43 = value_46.c();
            }
        }
```
We get this error: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```

From the [doc](https://docs.scala-lang.org/overviews/core/value-classes.html) on value classes, given `class Wrapper(val underlying: Int) extends AnyVal`:
1) "The type at compile time is `Wrapper`, but at runtime, the representation is an `Int`". This implies that when our struct has a field of value class, the generated code should support the underlying type during runtime execution.
2) `Wrapper` "must be instantiated... when a value class is used as a type argument". This implies that `scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper]` will still contain `Wrapper` as-is at runtime instead of `Int`.

### Does this PR introduce _any_ user-facing change?

- Yes, this will allow support for the nested value class.

### How was this patch tested?

- Added unit tests to illustrate
  - raw schema
  - projection
  - round-trip encode/decode

Closes #33205 from mickjermsurawong-stripe/SPARK-20384-2.

Lead-authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Aug 9, 2021