Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AnyVal blows up TypedEncoder / Injection #161

Closed
Voltir opened this issue Jul 21, 2017 · 4 comments
Closed

AnyVal blows up TypedEncoder / Injection #161

Voltir opened this issue Jul 21, 2017 · 4 comments
Labels

Comments

@Voltir
Copy link

Voltir commented Jul 21, 2017

import frameless._
import frameless.syntax._

case class CustomerId(value: Long) //extends AnyVal

object CustomerId {
  implicit val _injection: Injection[CustomerId, Long] =
    Injection[CustomerId, Long](_.value,CustomerId.apply)

  implicit val _encoder: TypedEncoder[CustomerId] =
    frameless.TypedEncoder.usingInjection[CustomerId, Long]
}

Im using CustomerId in a larger Product type, if I decide to uncomment the //extends AnyVal I get a lovely explosion in a catalyst.codegen step:

(P.S - It seems odd that i have to explicitly add the field _encoder -- for some reason, the implicit def for usingInjection isnt being implicitly found - but im not ruling out me forgetting something basic about implicits)

[info] - can load the past travel sample *** FAILED ***
[info]   java.lang.RuntimeException: Error while decoding: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 805, Column 89: No applicable constructor/method found for actual parameters "ai.takt.data.valkyrie.domain.CustomerId, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option"; candidates are: "ai.takt.data.valkyrie.csv.schemas.PastTravel$Row(long, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option)"
[info] /* 001 */ public java.lang.Object generate(Object[] references) {
[info] /* 002 */   return new SpecificSafeProjection(references);
[info] /* 003 */ }
[info] /* 004 */
[info] /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
[info] /* 006 */
[info] /* 007 */   private Object[] references;
[info] /* 008 */   private InternalRow mutableRow;
[info] /* 009 */   private boolean resultIsNull;
[info] /* 010 */   private ai.takt.data.valkyrie.domain.CustomerId argValue;
[info] /* 011 */   private scala.Option argValue1;
[info] /* 012 */   private scala.Option argValue2;
[info] /* 013 */   private scala.Option argValue3;
[info] /* 014 */   private scala.Option argValue4;
[info] /* 015 */   private scala.Option argValue5;
[info] /* 016 */   private scala.Option argValue6;
@imarios
Copy link
Contributor

imarios commented Jul 21, 2017

Good catch! Ok, can you remove the injection? Any case class should be straightforwardly encoded without Injection.

Also, what version are you on?

@imarios
Copy link
Contributor

imarios commented Jul 21, 2017

The following works just fine for me:

import frameless._
import frameless.syntax._

case class CustomerId(value: Long) //extends AnyVal

object CustomerId {
  implicit val _injection: Injection[CustomerId, Long] =
    Injection[CustomerId, Long](_.value,CustomerId.apply)

  implicit val _encoder: TypedEncoder[CustomerId] =
    frameless.TypedEncoder.usingInjection[CustomerId, Long]
}

TypedDataset.create(Seq(CustomerId(2))).show().run

Can you provide a minimal example that showcases this issue?

@kanterov
Copy link
Contributor

kanterov commented Jul 22, 2017

I've reproduced this issue with following code:

case class CustomerId(value: Long) extends AnyVal

object CustomerId {
  implicit def injection: Injection[CustomerId, Long] =
    Injection(_.value, CustomerId(_))

  implicit def encoder: TypedEncoder[CustomerId] =
    TypedEncoder.usingInjection[CustomerId, Long]

  implicit def arbitrary: Arbitrary[CustomerId] = Arbitrary {
    Arbitrary.arbLong.arbitrary.map(CustomerId(_))
  }
}

case class Customer(id: CustomerId, name: String)

object Customer {
  implicit def arbitrary: Arbitrary[Customer] = Arbitrary {
    for {
      id <- Arbitrary.arbitrary[CustomerId]
      name <- Gen.alphaNumStr
    } yield Customer(id, name)
  }
}

There are 3 related issues:

  1. TypedEncoder derivation isn't friendly if you have both injection and case class, or regular AnyVal class with a single field.
  2. Invoke in Spark always assumes that output is nullable unless it's a primitive type, therefore codegen isn't optimal, it was fixed in Spark 2.2.0, we should start using returnNullable once we migrate, see apache/spark@38b9e69
  3. The way Scala encoders AnyVal if it's a field in case class is just creating a method to return an underlying object:
case class Customer(id: CustomerId)

becomes

class Customer {
  long id() { ... }
}

But we generate code assuming id() would return CustomerId. it's possible to fix this using TypeTag instead of ClassTag. It would involve using a lot of reflection and could be very dirty.

The other way is to provide specialize encoder for AnyVal automatically doing "fixed" injection that would work, that makes a lot of sense to be a default behavior. I think we should go with the second approach.

One of workarounds is to manually declare encoders:

  implicit def encoder: TypedEncoder[CustomerId] = new TypedEncoder[CustomerId] {
    def nullable: Boolean = false
    def sourceDataType: DataType = LongType
    def targetDataType: DataType = LongType

    def constructorFor(path: Expression): Expression = path
    def extractorFor(path: Expression): Expression = path
  }

But it would work only if CustomerId is a field of a case class, it wouldn't work to collect DataFrame of CustomerId, or call UDF that accepts CustomerId.

@cchantep
Copy link
Collaborator

cchantep commented Sep 7, 2021

Closed by #546

@cchantep cchantep closed this as completed Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants