
Conversation

@OlivierBlanvillain (Contributor) commented Oct 27, 2016

This is a proposal for a new syntax for column expressions, which I think could be used as the only way to construct select / groupBy / joins in frameless; it would be an improvement for both #39 and #46. It's at a very early stage, but I would like to get your feedback before going forward. Here is the syntax I managed to get (modulo type inference, which I would have to look into further):

dataset.select[A, B](s => (s / 'a, s / 'b))
// Here we see the choice of / paying off: its precedence is higher than + or ===
// Could of course be changed to .apply/.col
dataset1.join(dataset2) { (s1, s2) =>
  s1 / 'a === s2 / 'b + 1
}

The first commit is the interesting one; the second commit comments out tests to show that the trivial select examples are actually working.

@codecov-io

Current coverage is 62.40% (diff: 22.85%)

Merging #60 into master will decrease coverage by 27.45%

@@             master        #60   diff @@
==========================================
  Files            23         22     -1   
  Lines           454        407    -47   
  Methods         448        402    -46   
  Messages          0          0          
  Branches          6          5     -1   
==========================================
- Hits            408        254   -154   
- Misses           46        153   +107   
  Partials          0          0          

Powered by Codecov. Last update 3753741...310d65b

@kanterov (Contributor)

What is the reasoning behind dropping T from TypedColumn? Is it to support joins?

@OlivierBlanvillain (Contributor, Author) commented Oct 28, 2016

Yes. The idea is to remove Dataset#col as a way to create TypedColumn, and replace it with the ColumnSyntax construct. select(c: TypedColumn[T, A]): Dataset[A] becomes select(f: ColumnSyntax[T] => TypedColumn[A]): Dataset[A], and join becomes join(other: Dataset[U])(f: (ColumnSyntax[T], ColumnSyntax[U]) => TypedColumn[A]).
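
A minimal self-contained sketch of the resulting shape (stubbed with ???; the field-lookup evidence is elided, and this is my illustration of the idea, not the PR's actual code):

class TypedColumn[A]
class ColumnSyntax[T] {
  // Resolves a field of schema T; the evidence that the field exists
  // (and determines A) is elided in this sketch.
  def /[A](field: Symbol): TypedColumn[A] = ???
}
class TypedDataset[T] {
  def select[A](f: ColumnSyntax[T] => TypedColumn[A]): TypedDataset[A] = ???
  def join[U, A](other: TypedDataset[U])(
      f: (ColumnSyntax[T], ColumnSyntax[U]) => TypedColumn[A]): TypedDataset[A] = ???
}

The schema parameter now lives only on ColumnSyntax[T] and TypedDataset[T]; TypedColumn carries just its output type.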

@imarios (Contributor) commented Oct 28, 2016

Essentially you suggest we move from TypedColumn[T, A] to a function ColumnSyntax[T] => TypedColumn[A] (maybe not exactly, but this sort of captures the underlying idea). I think the idea is pretty cool!

However, it does make the syntax a bit awkward:
d.select[A, B](s => (s / 'a, s / 'b))

Compared to something like this:
d.select(d('a), d('b))

The second syntax would look really familiar to someone using DataFrames in Spark (no learning curve there).

I would love to see something more powerful in this syntax, but from this example I can't see benefits that outweigh the change in syntax...

Can you show some more examples? Maybe contrast before and after, or discuss what we can do better with option "a" compared to option "b"?

@kanterov (Contributor)

I see three perspectives on this problem:

  1. Having type safety, being able to reason about programs with types, and to compose columns and expressions
  2. Having nice syntax to build expressions in (1)
  3. Making (1) and (2) work within the limitations of the Scala compiler and Spark

In my opinion, having (2) is important, but we shouldn't trade off any of (1). Only a solid and complete core will allow us to easily implement more features on top.

A huge limiting factor for (2) is (3). For instance, approaching (2) with an implicit conversion from type-level literals to typed columns doesn't work.

ColumnSyntax

From what I understand, this change would drop a bit of (1): it would become harder to work with columns as values and to compose them, because they become functions from ColumnSyntax to TypedColumn. I don't think this is a necessary trade-off. We should keep as much type information as possible.

I see ColumnSyntax[T] as similar to working with:

type ColumnSyntax[A] = TypedColumn[A, A]
def root[A]: TypedColumn[A, A]

Then equivalent for

dataset.select[A, B](s => (s / 'a, s / 'b))

is

// Witness not used here, to keep the examples clear
implicit class Syntax[A, B](self: TypedColumn[A, B]) {
  def /[C: TypedEncoder](s: Symbol)(implicit e: Exists[B, s.type, C]): TypedColumn[A, C] = ???
}

def col[A, B](s: Symbol)(implicit e: Exists[A, s.type, B]): TypedColumn[A, B] = ???

dataset.select(dataset.root / 'a, dataset.root / 'b)
// (1) if type inference works
dataset.select(root / 'a, root / 'b)
// (2) 
dataset.select(col('a), col('b))

A join can be described as

def join[A, B](da: TypedDataset[A], db: TypedDataset[B], expr: TypedColumn[(A, B), Boolean])

And then we can define combinators that give better syntax:

def joinIfEq[A, B, C](left: TypedColumn[A, C], right: TypedColumn[B, C]): TypedColumn[(A, B), Boolean]

I think it should be possible to implement everything we need right now using Spark primitives, without running into type inference issues.

Having own ADT

In my opinion, the problem we have right now is that we are very limited by what Spark allows us to do. Instead of working in a language that is well defined and fully typesafe, we represent columns and datasets directly with the Catalyst AST. It would be a huge improvement if we made our own algebra describing datasets and columns, and then translated it to Spark separately, only when we need to execute a real Spark job. Instead of fighting Spark and type safety at the same time, this would let us separate the work into two phases: building expressions, and translating them to Spark at the very end.

I have no doubt that it's possible to build a type safe core in Scala that we can later translate to Spark, and I don't think it is a problem, given a typesafe kernel, to make pretty syntax for it. But to make it happen we have to solve each problem separately.
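
To make the idea concrete, here is a rough sketch of what such an algebra could look like (all names here, Expr, Field, Plus, EqualTo, compile, are made up for illustration; this is not a design proposal):

sealed trait Expr[S, A]
case class Field[S, A](name: String) extends Expr[S, A]
case class Plus[S](l: Expr[S, Int], r: Expr[S, Int]) extends Expr[S, Int]
case class EqualTo[S, A](l: Expr[S, A], r: Expr[S, A]) extends Expr[S, Boolean]

import org.apache.spark.sql.{Column, functions => F}

// Translation to untyped Spark columns happens in one place, at the very end:
def compile[S, A](e: Expr[S, A]): Column = e match {
  case Field(name)   => F.col(name)
  case Plus(l, r)    => compile(l) + compile(r)
  case EqualTo(l, r) => compile(l) === compile(r)
}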

@OlivierBlanvillain (Contributor, Author)

@kanterov Many thanks for the detailed feedback!

From what I understand, this change would drop a bit of (1): it would become harder to work with columns as values and to compose them, because they become functions from ColumnSyntax to TypedColumn.

I actually think the two approaches are equivalent in terms of safety and composability. TypedColumns as currently implemented do not really compose either: if you build a complex column expression TypedColumn[S, A] over some schema S, there is no way to reuse it on another schema T, even if "string wise" that would make sense. Furthermore, the safety obtained with this first type parameter comes precisely from the fact that a TypedColumn[S, A] can only be used with schema S. If the only way to create TypedColumn instances is via ColumnSyntax (I don't think I removed the other ways to create TypedColumn in the PR, but that's the intent), I think you get the exact same property, and it also scales nicely to joins & co.

A huge limiting factor for (2) is (3). For instance, approaching (2) with an implicit conversion from type-level literals to typed columns doesn't work.

On that point, I have this gist which shows that a ColumnSyntax[S] => TypedColumn[A] function could lead to a very nice syntax with singleton types. We could probably support both compilers, with the "nicest" syntax only available for Typelevel Scala. It should be noted that Typelevel Scala is not a completely new compiler; it's literally Lightbend Scala with 3 additional PRs merged. Because of the full binary compatibility, it makes little difference compared to using a compiler plugin.
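
I won't reproduce the gist here, but as a rough illustration of the kind of surface syntax singleton types could enable (the Dynamic encoding and the Exists evidence below are my assumptions, not necessarily the gist's actual content):

import scala.language.dynamics

class TypedColumn[A]
trait Exists[S, K, A] // witnesses that schema S has a field named K of type A

class ColumnSyntax[S] extends Dynamic {
  // Under literal types (SIP-23 / Typelevel Scala), the field name is a
  // String singleton type, so implicit resolution can see it:
  def selectDynamic[K <: String with Singleton, A](field: K)(
      implicit ev: Exists[S, K, A]): TypedColumn[A] = ???
}

// which would hypothetically allow:
// dataset.select(s => (s.a, s.b)) instead of dataset.select(s => (s / 'a, s / 'b))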

Having own ADT
It would be a huge improvement if we made our own algebra describing datasets and columns, and then translated it to Spark separately, only when we need to execute a real Spark job. Instead of fighting Spark and type safety at the same time, this would let us separate the work into two phases: building expressions, and translating them to Spark at the very end.

I think we should keep this as a last resort, as it would increase the cost of moving back and forth between the Spark API and the frameless API (migration, learning curve). There is a lot of value in being a "TypesafeDataset" and not something completely new with another level of indirection/interpretation.

@imarios (Contributor) commented Nov 19, 2016

I've been experimenting with @OlivierBlanvillain's syntax. I think it has really good potential. We all acknowledge that today, if you have a typedDataset.select(...).groupBy(...), the syntax is not that good. You don't really have a good way of referencing the dataset that comes out of the select. You might need to do something like this to get around it:

val ds: TypedDataset[?] = ...
val a = ds.select(ds('a), ds('b), ...)
val f = a.groupBy(a('_1), a('_3))...

Which is OK, but not that great.

Now, the main obstacle I found with @OlivierBlanvillain's syntax is type inference when we have overloaded selects (and groupBys as well).

Here is a REPL session that shows it (I tried to replicate this with a toy example outside the project):

@ class Data[A] { df =>
     def sel[B](a: Data[A] => B): B = a(df)
  }
defined class Data
@ val d = new Data[Int]
d: Data[Int] = $sess.cmd3$Data@381f05a9
@ d.sel( a => 0.1 )
res5: Double = 0.1

// Great! This works fine for the non-overloaded sel (short for select)
@ class Data[A] { df =>
     def sel[B](a: Data[A] => B): B = a(df)
     def sel[B,C](a: Data[A] => (B,C)): (B,C) = a(df)
  }
defined class Data
@ val d = new Data[Int]
d: Data[Int] = $sess.cmd6$Data@35f79e6d

// However, when we have overloaded alternatives, type inference doesn't work.
@ d.sel( a => 0.1 )
cmd8.sc:1: missing parameter type
val res8 = d.sel( a => 0.1 )
                  ^
Compilation Failed

Any thoughts?

@OlivierBlanvillain (Contributor, Author)

We could try to use a single method (instead of overloading), something like:

def sel[Out](a: Data[A] => Out)(implicit e: ColumnTuple[Out]): Dataset[Out]

Where ColumnTuple[Out] checks that things have the expected shape.
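
To see why this restores inference, here is a toy version building on the Data example above (ColumnTuple and its instances are made up for the illustration, and it returns Out instead of a Dataset to keep the toy runnable):

trait ColumnTuple[Out]
trait LowPriorityColumnTuple {
  implicit def single[A]: ColumnTuple[A] = new ColumnTuple[A] {}
}
object ColumnTuple extends LowPriorityColumnTuple {
  implicit def pair[A, B]: ColumnTuple[(A, B)] = new ColumnTuple[(A, B)] {}
}

class Data[A] { df =>
  // A single, non-overloaded sel: the lambda's parameter type is fixed by
  // Data[A] before Out is inferred, and the implicit evidence then restricts
  // which shapes of Out are accepted.
  def sel[Out](a: Data[A] => Out)(implicit e: ColumnTuple[Out]): Out = a(df)
}

val d = new Data[Int]
d.sel(a => 0.1)        // compiles: Out = Double
d.sel(a => (0.1, "x")) // compiles: Out = (Double, String)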

@imarios (Contributor) commented Nov 19, 2016

(side note from previous comment):
The syntax I had in mind (a slight variation of what @OlivierBlanvillain suggested) would work like this:

df.select(d => ( d('a) + 1, d('b) )).groupBy(g => g('_1))

@OlivierBlanvillain: I like the suggestion of using an implicit for ad-hoc polymorphism instead of straight-up overloading. Let me experiment with it a bit.

@imarios (Contributor) commented Nov 21, 2016

@OlivierBlanvillain I made good progress with your suggestion! Let me organize it a bit and create a PR for this (I think all these new syntax directions can be the focus of a 0.3 release).

@imarios (Contributor) commented Nov 25, 2016

OK, I sort of changed direction a bit and reused a lot of what we already have. IDE type inference doesn't work that well, but the REPL and compiler handle it pretty smoothly.

Here is how it looks:

case class Foo(i: Int, b: Long, c: String)
val f = TypedDataset.create( Foo(1,2,"a") :: Foo(2,2,"b") :: Foo(3,2,"b") :: Nil )

// Ignore function names for now :)
f.choose(x => x('c) :: x('i) :: HNil ).
  group(g => g('_1) :: HNil).
  aggr(a => sum(a('_2)) :: HNil).show().run
//+---+---+
//| _1| _2|
//+---+---+
//|  b|  5|
//|  a|  1|
//+---+---+

Currently you pass functions of type TypedDataset[T] => Out where Out <: HList.
It should be easy to accept a Tuple instead of an HList, in case someone is bothered by having to import shapeless.HList.

Here is how choose works:

def choose[U <: HList, Out0 <: HList, Out](t: TypedDataset[T] => U)(implicit
     ct: ColumnTypes.Aux[T, U, Out0],
     toTraversable: ToTraversable.Aux[U, List, UntypedExpression[T]],
     tupler: Tupler.Aux[Out0, Out],
     encoder: TypedEncoder[Out]): TypedDataset[Out] = selectMany.applyProduct( t(self) )

I did something very similar for group and aggr. Essentially, I am providing a different wrapper around selectMany, groupByMany, and agg, and calling their applyProduct.

This is what I have for now ...
Let me know what you think ...

Keep in mind that with this syntax we don't have to change anything from our default types (TypedColumn, TypedAggregate, etc.).

@OlivierBlanvillain (Contributor, Author)

@imarios What's the motivation for exposing an HList instead of a tuple here?

Keep in mind that with this syntax we don't have to change anything from our default types (TypedColumn, TypedAggregate, etc.).

One motivation for this syntax was to simplify these types. Indeed, if TypedColumn can only be created in the body of selects/joins (using whatever those functions get as arguments), the schema type parameter on TypedColumn & co becomes optional. This would mean less and simpler code in frameless (without compromising type safety), and probably better compilation speed (fewer things to check at the type level: the "source" of TypedColumns is correct by construction!).

@OlivierBlanvillain (Contributor, Author)

I need to find time to finish what I had in mind, so that we have something more concrete to argue about :P

@imarios (Contributor) commented Nov 30, 2016

It doesn't need to be an HList; the current code I have takes a Tuple. I see what you mean: Column[T, A] can be expressed as T => Column[A], so we are spared from propagating T in Column.
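
In code form, the equivalence reads roughly like this (a sketch; the "T" above standing for some handle on the schema, e.g. ColumnSyntax[T]):

class Column[A]
class ColumnSyntax[T]
// A schema-indexed column is the same information as a function from
// the schema's syntax object to a plain, unindexed column:
type IndexedColumn[T, A] = ColumnSyntax[T] => Column[A]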

@imarios (Contributor) commented Nov 30, 2016

This also means that we cannot have a Column[A] created "in the wild". It always has to be inside a TypedDataset context in some way.

@OlivierBlanvillain (Contributor, Author)

This also means that we cannot have a Column[A] created "in the wild". It always has to be inside a TypedDataset context in some way.

T => Column[A] functions could still be defined in the wild.

@jeremyrsmith (Contributor)

@OlivierBlanvillain @kanterov @imarios:

What do you think of #110 as a "baby" step? When functions are used everywhere (if symbols were to be eliminated from the syntax), type inference gets a lot better.

I'll post another PR that builds on #110 with another small-ish macro for selectExpr, for example:

val ds2 = ds1.selectExpr(f => (f.a * 10, f.b + f.c / 20))

The implementation will just expand it to:

val ds2 = ds1.select(ds1(_.a) * 10, ds1(_.b) + ds1(_.c) / 20)

and everything proceeds from there in the same fashion as it currently does. Of course, it's possible to put in expressions that can't actually be converted into projections, but the macro will give a compile error in that case.

I think this will end up being easier, because the syntax is expanded by just one macro and everything else stays implemented as it currently is (rather than struggling with new syntax-based typeclasses that IntelliJ will hate).

@OlivierBlanvillain (Contributor, Author)

Closing in favor of #162.
