
Conversation

@OlivierBlanvillain (Contributor) commented Oct 27, 2016

This is a proposal for a new syntax for column expressions, which I think could be used as the only way to construct select / groupBy / joins in frameless; it would be an improvement for both #39 and #46. It's at a very early stage, but I would like to get your feedback before going forward. Here is the syntax I managed to get (modulo type inference, which I would have to look into further):

dataset.select[A, B](s => (s / 'a, s / 'b))
// Here we see the choice of / paying off: its precedence is higher than + or ===
// Could of course be changed to .apply/.col
dataset1.join(dataset2) { (s1, s2) =>
  s1 / 'a === s2 / 'b + 1
}

The first commit is the interesting one; the second commit comments out tests to show that the trivial select examples are actually working.

@codecov-io

Current coverage is 62.40% (diff: 22.85%)

Merging #60 into master will decrease coverage by 27.45%

@@             master        #60   diff @@
==========================================
  Files            23         22     -1   
  Lines           454        407    -47   
  Methods         448        402    -46   
  Messages          0          0          
  Branches          6          5     -1   
==========================================
- Hits            408        254   -154   
- Misses           46        153   +107   
  Partials          0          0          

Powered by Codecov. Last update 3753741...310d65b

@kanterov (Contributor)

What is the reasoning behind dropping T from TypedColumn? Is it to support joins?

@OlivierBlanvillain (Contributor, Author) commented Oct 28, 2016

Yes. The idea is to remove Dataset#col as a way to create TypedColumn, and replace it with the ColumnSyntax construct. select(c: TypedColumn[T, A]): Dataset[A] becomes select(f: ColumnSyntax[T] => TypedColumn[A]): Dataset[A], and join becomes join(other: Dataset[U])(f: (ColumnSyntax[T], ColumnSyntax[U]) => TypedColumn[A]).
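
A minimal self-contained sketch of the resulting shape (stubbed with ???; the field-lookup evidence is elided, and this is my illustration of the idea, not the PR's actual code):

class TypedColumn[A]
class ColumnSyntax[T] {
  // Resolves a field of schema T; the evidence that the field exists
  // (and determines A) is elided in this sketch.
  def /[A](field: Symbol): TypedColumn[A] = ???
}
class TypedDataset[T] {
  def select[A](f: ColumnSyntax[T] => TypedColumn[A]): TypedDataset[A] = ???
  def join[U, A](other: TypedDataset[U])(
      f: (ColumnSyntax[T], ColumnSyntax[U]) => TypedColumn[A]): TypedDataset[A] = ???
}

The schema parameter now lives only on ColumnSyntax[T] and TypedDataset[T]; TypedColumn carries just its output type.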

@imarios (Contributor) commented Oct 28, 2016

Essentially you suggest we move from TypedColumn[T, A] to a function ColumnSyntax[T] => TypedColumn[A] (maybe not exactly, but this sort of captures the underlying idea). I think the idea is pretty cool!

However, it does make the syntax a bit awkward:
d.select[A, B](s => (s / 'a, s / 'b))

Compared to something like this:
d.select(d('a), d('b))

The second syntax would look really familiar to someone using DataFrames in Spark (no learning curve there).

I would love to see something more powerful in this syntax, but from this example I can't see benefits that outweigh the change in syntax...

Can you show some more examples? Maybe contrast before and after, or discuss what we can do better with option "a" compared to option "b"?

@kanterov (Contributor)

I see three perspectives on this problem:

  1. Having type safety, being able to reason about programs with types, and to compose columns and expressions
  2. Having nice syntax to build expressions in (1)
  3. Making (1) and (2) work within the limitations of the Scala compiler and Spark

In my opinion, having (2) is important, but we shouldn't trade off any of (1). Only a solid and complete core will allow us to easily implement more features on top.

A huge limiting factor for (2) is (3). For instance, approaching (2) with an implicit conversion from type-level literals to typed columns doesn't work.

ColumnSyntax

From what I understand, this change would drop a bit of (1): it would become harder to work with columns as values and to compose them, because they become functions from ColumnSyntax to TypedColumn. I don't think this is a necessary trade-off. We should keep as much type information as possible.

I see ColumnSyntax[T] as similar to working with:

type ColumnSyntax[A] = TypedColumn[A, A]
def root[A]: TypedColumn[A, A]

Then equivalent for

dataset.select[A, B](s => (s / 'a, s / 'b))

is

// Witness not used here, to keep the examples clear
implicit class Syntax[A, B](self: TypedColumn[A, B]) {
  def /[C: TypedEncoder](s: Symbol)(implicit e: Exists[B, s.type, C]): TypedColumn[A, C] = ???
}

def col[A, B](s: Symbol)(implicit e: Exists[A, s.type, B]): TypedColumn[A, B] = ???

dataset.select(dataset.root / 'a, dataset.root / 'b)
// (1) if type inference works
dataset.select(root / 'a, root / 'b)
// (2) 
dataset.select(col('a), col('b))

A join can be described as

def join[A, B](da: TypedDataset[A], db: TypedDataset[B], expr: TypedColumn[(A, B), Boolean])

And then we can define combinators that give better syntax:

def joinIfEq[A, B, C](left: TypedColumn[A, C], right: TypedColumn[B, C]): TypedColumn[(A, B), Boolean]

I think it should be possible to implement everything we need right now using Spark primitives, without running into type inference issues.

Having own ADT

In my opinion, the problem we have right now is that we are very limited by what Spark allows us to do. Instead of working in a language that is well defined and fully typesafe, we represent columns and datasets directly with the Catalyst AST. It would be a huge improvement if we made our own algebra describing datasets and columns, and then translated it to Spark separately, only when we need to execute a real Spark job. Instead of fighting Spark and type safety at the same time, this would let us separate the work into two phases: building expressions, and translating them to Spark at the very end.

I have no doubt that it's possible to build a type safe core in Scala that we can later translate to Spark, and I don't think it is a problem, given a typesafe kernel, to make pretty syntax for it. But to make it happen we have to solve each problem separately.
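
To make the idea concrete, here is a rough sketch of what such an algebra could look like (all names here, Expr, Field, Plus, EqualTo, compile, are made up for illustration; this is not a design proposal):

sealed trait Expr[S, A]
case class Field[S, A](name: String) extends Expr[S, A]
case class Plus[S](l: Expr[S, Int], r: Expr[S, Int]) extends Expr[S, Int]
case class EqualTo[S, A](l: Expr[S, A], r: Expr[S, A]) extends Expr[S, Boolean]

import org.apache.spark.sql.{Column, functions => F}

// Translation to untyped Spark columns happens in one place, at the very end:
def compile[S, A](e: Expr[S, A]): Column = e match {
  case Field(name)   => F.col(name)
  case Plus(l, r)    => compile(l) + compile(r)
  case EqualTo(l, r) => compile(l) === compile(r)
}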

@OlivierBlanvillain (Contributor, Author)

@kanterov Many thanks for the detailed feedback!

From what I understand, this change would drop a bit of (1): it would become harder to work with columns as values and to compose them, because they become functions from ColumnSyntax to TypedColumn.

I actually think the two approaches are equivalent in terms of safety and composability. TypedColumns as currently implemented do not really compose either: if you build a complex column expression TypedColumn[S, A] over some schema S, there is no way to reuse it on another schema T, even if "string wise" that would make sense. Furthermore, the safety obtained with this first type parameter comes precisely from the fact that a TypedColumn[S, A] can only be used with schema S. If the only way to create TypedColumn instances is via ColumnSyntax (I don't think I removed the other ways to create TypedColumn in the PR, but that's the intent), I think you get the exact same property, and it also scales nicely to joins & co.

A huge limiting factor for (2) is (3). For instance, approaching (2) with an implicit conversion from type-level literals to typed columns doesn't work.

On that point, I have this gist which shows that a ColumnSyntax[S] => TypedColumn[A] function could lead to a very nice syntax with singleton types. We could probably support both compilers, with the "nicest" syntax only available for Typelevel Scala. It should be noted that Typelevel Scala is not a completely new compiler; it's literally Lightbend Scala with 3 additional PRs merged. Because of the full binary compatibility, it makes little difference compared to using a compiler plugin.
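
I won't reproduce the gist here, but as a rough illustration of the kind of surface syntax singleton types could enable (the Dynamic encoding and the Exists evidence below are my assumptions, not necessarily the gist's actual content):

import scala.language.dynamics

class TypedColumn[A]
trait Exists[S, K, A] // witnesses that schema S has a field named K of type A

class ColumnSyntax[S] extends Dynamic {
  // Under literal types (SIP-23 / Typelevel Scala), the field name is a
  // String singleton type, so implicit resolution can see it:
  def selectDynamic[K <: String with Singleton, A](field: K)(
      implicit ev: Exists[S, K, A]): TypedColumn[A] = ???
}

// which would hypothetically allow:
// dataset.select(s => (s.a, s.b)) instead of dataset.select(s => (s / 'a, s / 'b))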

Having own ADT
It would be a huge improvement if we made our own algebra describing datasets and columns, and then translated it to Spark separately, only when we need to execute a real Spark job. Instead of fighting Spark and type safety at the same time, this would let us separate the work into two phases: building expressions, and translating them to Spark at the very end.

I think we should keep this as a last resort, as it would increase the cost of moving back and forth between the Spark API and the frameless API (migration, learning curve). There is a lot of value in being a "TypesafeDataset" and not something completely new with another level of indirection/interpretation.

@imarios (Contributor) commented Nov 19, 2016

I've been experimenting with @OlivierBlanvillain's syntax. I think it has really good potential. We all acknowledge that today, if you have a typedDataset.select(...).groupBy(...), the syntax is not that good. You don't really have a good way of referencing the dataset that comes out of the select. You might need to do something like this to get around it:

val ds: TypedDataset[?] = ...
val a = ds.select(ds('a), ds('b), ...)
val f = a.groupBy(a('_1), a('_3))...

Which is OK, but not that great.

Now, the main obstacle I found with @OlivierBlanvillain's syntax is type inference when we have overloaded selects (and groupBys as well).

Here is a REPL session that shows it (I tried to replicate this with a toy example outside the project):

@ class Data[A] { df =>
     def sel[B](a: Data[A] => B): B = a(df)
  }
defined class Data
@ val d = new Data[Int]
d: Data[Int] = $sess.cmd3$Data@381f05a9
@ d.sel( a => 0.1 )
res5: Double = 0.1

// Great! This works fine for the non-overloaded sel (short for select)
@ class Data[A] { df =>
     def sel[B](a: Data[A] => B): B = a(df)
     def sel[B,C](a: Data[A] => (B,C)): (B,C) = a(df)
  }
defined class Data
@ val d = new Data[Int]
d: Data[Int] = $sess.cmd6$Data@35f79e6d

// However, when we have overloaded alternatives, type inference doesn't work.
@ d.sel( a => 0.1 )
cmd8.sc:1: missing parameter type
val res8 = d.sel( a => 0.1 )
                  ^
Compilation Failed

Any thoughts?

@OlivierBlanvillain (Contributor, Author)

We could try to use a single method (instead of overloading), something like:

def sel[Out](a: Data[A] => Out)(implicit e: ColumnTuple[Out]): Dataset[Out]

Where ColumnTuple[Out] checks that things have the expected shape.
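
To see why this restores inference, here is a toy version building on the Data example above (ColumnTuple and its instances are made up for the illustration, and it returns Out instead of a Dataset to keep the toy runnable):

trait ColumnTuple[Out]
trait LowPriorityColumnTuple {
  implicit def single[A]: ColumnTuple[A] = new ColumnTuple[A] {}
}
object ColumnTuple extends LowPriorityColumnTuple {
  implicit def pair[A, B]: ColumnTuple[(A, B)] = new ColumnTuple[(A, B)] {}
}

class Data[A] { df =>
  // A single, non-overloaded sel: the lambda's parameter type is fixed by
  // Data[A] before Out is inferred, and the implicit evidence then restricts
  // which shapes of Out are accepted.
  def sel[Out](a: Data[A] => Out)(implicit e: ColumnTuple[Out]): Out = a(df)
}

val d = new Data[Int]
d.sel(a => 0.1)        // compiles: Out = Double
d.sel(a => (0.1, "x")) // compiles: Out = (Double, String)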

@imarios (Contributor) commented Nov 19, 2016

(side note from previous comment):
The syntax I had in mind (a slight variation of what @OlivierBlanvillain suggested) would work like this:

df.select(d => ( d('a) + 1, d('b) )).groupBy(g => g('_1))

@OlivierBlanvillain: I like the suggestion of using an implicit for ad-hoc polymorphism instead of straight-up overloading. Let me experiment with it a bit.

@imarios (Contributor) commented Nov 21, 2016

@OlivierBlanvillain I made good progress with your suggestion! Let me organize it a bit and create a PR for this (I think all these new syntax directions can be the focus of a 0.3 release).

@imarios (Contributor) commented Nov 25, 2016

OK, I sort of changed direction a bit and reused a lot of what we already have. IDE type inference doesn't work that well, but the REPL and compiler handle it pretty smoothly.

Here is how it looks:

case class Foo(i: Int, b: Long, c: String)
val f = TypedDataset.create( Foo(1,2,"a") :: Foo(2,2,"b") :: Foo(3,2,"b") :: Nil )

// Ignore function names for now :)
f.choose(x => x('c) :: x('i) :: HNil ).
  group(g => g('_1) :: HNil).
  aggr(a => sum(a('_2)) :: HNil).show().run
//+---+---+
//| _1| _2|
//+---+---+
//|  b|  5|
//|  a|  1|
//+---+---+

Currently you pass functions of type TypedDataset[T] => Out where Out <: HList.
It should be easy to accept a Tuple instead of an HList, in case someone is bothered by having to import shapeless.HList.

Here is how choose works:

def choose[U <: HList, Out0 <: HList, Out](t: TypedDataset[T] => U)(implicit
     ct: ColumnTypes.Aux[T, U, Out0],
     toTraversable: ToTraversable.Aux[U, List, UntypedExpression[T]],
     tupler: Tupler.Aux[Out0, Out],
     encoder: TypedEncoder[Out]): TypedDataset[Out] = selectMany.applyProduct( t(self) )

I did something very similar for group and aggr. Essentially, I am providing a different wrapper around selectMany, groupByMany, and agg, and calling their applyProduct.

This is what I have for now ...
Let me know what you think ...

Keep in mind that with this syntax we don't have to change anything from our default types (TypedColumn, TypedAggregate, etc.).

@OlivierBlanvillain (Contributor, Author)

@imarios What's the motivation for exposing an HList instead of a tuple here?

Keep in mind that with this syntax we don't have to change anything from our default types (TypedColumn, TypedAggregate, etc.).

One motivation for this syntax was to simplify these types. Indeed, if TypedColumn can only be created in the body of selects/joins (using whatever those functions get as arguments), the schema type parameter on TypedColumn & co becomes optional. This would mean less and simpler code in frameless (without compromising type safety), and probably better compilation speed (fewer things to check at the type level: the "source" of TypedColumns is correct by construction!).

@OlivierBlanvillain (Contributor, Author)

I need to find time to finish what I had in mind, so that we have something more concrete to argue about :P

@imarios (Contributor) commented Nov 30, 2016

It doesn't need to be an HList; the current code I have takes a Tuple. I see what you mean: Column[T, A] can be expressed as T => Column[A], so we are spared from propagating T in Column.
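
In code form, the equivalence reads roughly like this (a sketch; the "T" above standing for some handle on the schema, e.g. ColumnSyntax[T]):

class Column[A]
class ColumnSyntax[T]
// A schema-indexed column is the same information as a function from
// the schema's syntax object to a plain, unindexed column:
type IndexedColumn[T, A] = ColumnSyntax[T] => Column[A]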

@imarios (Contributor) commented Nov 30, 2016

This also means that we cannot have a Column[A] created "in the wild". It always has to be inside a TypedDataset context in some way.

@OlivierBlanvillain (Contributor, Author)

This also means that we cannot have a Column[A] created "in the wild". It always has to be inside a TypedDataset context in some way.

T => Column[A] functions could still be defined in the wild.

@jeremyrsmith (Contributor)

@OlivierBlanvillain @kanterov @imarios:

What do you think of #110 as a "baby" step? When functions are used everywhere (if symbols were to be eliminated from the syntax), type inference gets a lot better.

I'll post another PR that builds on #110 with another small-ish macro for selectExpr, for example:

val ds2 = ds1.selectExpr(f => (f.a * 10, f.b + f.c / 20))

The implementation will just expand it to:

val ds2 = ds1.select(ds1(_.a) * 10, ds1(_.b) + ds1(_.c) / 20)

and everything proceeds from there in the same fashion as it currently does. Of course, it's possible to put in expressions that can't actually be converted into projections, but the macro will give a compile error in that case.

I think this will end up being easier, because the syntax is expanded by just one macro and everything else stays implemented as it currently is (rather than struggling with new syntax-based typeclasses that IntelliJ will hate).

@OlivierBlanvillain (Contributor, Author)

Closing in favor of #162.
