-
-
Notifications
You must be signed in to change notification settings - Fork 138
Experiment with new function syntax #60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Experiment with new function syntax #60
Conversation
Current coverage is 62.40% (diff: 22.85%)@@ master #60 diff @@
==========================================
Files 23 22 -1
Lines 454 407 -47
Methods 448 402 -46
Messages 0 0
Branches 6 5 -1
==========================================
- Hits 408 254 -154
- Misses 46 153 +107
Partials 0 0
|
|
What is reasoning behind dropping |
|
Yes. The idea is to remove |
|
Essentially you suggest we move from However it does make the syntax a bit awkward: Compared to something like this: The second syntax would look really familiar to someone using Dataframes in Spark (no learning curve here). I would love to see something more powerful for this syntax, but from this example, I can't see the benefits to outweigh the change in the syntax ... Can you show some more examples? Maybe contrast before and after or discuss what we can do better with option "a" compare to option "b"? |
|
I see there are three perspectives of the problem:
In my opinion, having (2) is important, but we shouldn't trade off any of (1). Only having a solid and complete core would allow us to easily implement more features on top. Huge limiting factor for (2) is (3). For instance, approaching (2) with implicit conversion from type-level literals to typed columns doesn't work. ColumnSyntaxFrom what I understand, this change would drop a bit of (1), because it would be harder to work with columns as with values and compose them, because they become functions from I see having type ColumnSyntax[A] = TypedColumn[A, A]
def root[A]: TypedColumn[A, A]Then equivalent for dataset.select[A, B](s => (s / 'a, s / 'b))is // didn't use Witness to have clear examples
implicit class Syntax[A, B](self: TypedColumn[A, B]) {
def /[C: TypedEncoder](s: Symbol)(implicit e: Exists[B, s.type, C]): TypedColumn[A, C] = ...
}
def col[A, B](s: Symbol)(implicit e: Exists[A, s.type, B]): TypedColumn[A, B] = ...
dataset.select(dataset.root / 'a, dataset.root / 'b)
// (1) if type inference works
dataset.select(root / 'a, root / 'b)
// (2)
dataset.select(col('a), col('b))Join can be described as def join[A, B](da: TypedDataset[A], db: TypedDataset[B], expr: TypedColumn[(A, B), Boolean])And then we can describe combinators that can help to have better syntax: def joinIfEq(left: TypedColumn[A, C], right: TypedColumn[B, C]): TypedColumn[(A, B), C]I think it should be possible to implement everything we need right now using Spark primitives and not having issues with type inference. Having own ADTIn my opinion, the problem we have right now, that we are very limited on what Spark allows us to do. Instead of working in a language that is well defined and super typesafe we represent columns and datasets directly with Catalyst AST. It would be a huge improvement if we make our algebra that would describe datasets and columns, and then translate it to Spark separately, only when we need to execute a real Spark job. It would allow us instead of fighting Spark and type safety at the same time separate this into two different phases: building expressions and translating them to Spark in the very end. I have no doubts that it's possible to build type safe core in Scala, that we can translate latter to Spark, and I don't think it is a problem given typesafe kernel to make pretty syntax for it. But to make it happen we have to solve each problem separately. |
|
@kanterov Many thanks for the detailed feedback!
I actually think the two approaches are equivalent in therm of safety and composability.
On that point, I have this gist which shows that a
I think we should keep this as a last resort as it would increase the cost of moving back and forth between Spark API and Frameless API (migration, learning curve). There is a lot of value in being "TypesafeDataset" and not something completely new with another level of indirection/interpretation. |
|
I've been experimenting with @OlivierBlanvillain syntax. I think it has really good potentials. We all acknowledge that today if you have a val ds: TypedDataset[?] = ...
val a = ds.select(ds('a), ds('b), ...)
val f = a.groupBy(a('_1), a('_3))...Which OK, but not that great. Now, the main obstacle I found with @OlivierBlanvillain syntax, is type inference when we have overloaded selects (and groupBy's as well). Here is a repl session that shows it (I tried to replicate this using a toy example outside of the project): @ class Data[A] { df =>
def sel[B](a: Data[A] => B): B = a(df)
}
defined class Data
@ val d = new Data[Int]
d: Data[Int] = $sess.cmd3$Data@381f05a9
@ d.sel( a => 0.1 )
res5: Double = 0.1
// Great! This works great for the non overloaded sel (short for select)
@ class Data[A] { df =>
def sel[B](a: Data[A] => B): B = a(df)
def sel[B,C](a: Data[A] => (B,C)): (B,C) = a(df)
}
defined class Data
@ val d = new Data[Int]
d: Data[Int] = $sess.cmd6$Data@35f79e6d
// However, when we have alternatives type inference doesn't work.
@ d.sel( a => 0.1 )
cmd8.sc:1: missing parameter type
val res8 = d.sel( a => 0.1 )
^
Compilation FailedAny thoughts? |
|
We could try to use a single method (instead of overloading), something like def sel[Out](a: Data[A] => Out)(implicit e: ColumnTuple[Out]): Dataset[Out]Where |
|
(side note from previous comment): df.select(d => ( d('a) + 1, d('b) )).groupBy(g => g('_1))@OlivierBlanvillain: I like the suggestion of using implicit for ad-hoc polymorphism instead of straight up overloading. Let me experiment with it a bit. |
|
@OlivierBlanvillain I made good progress with your suggestion! Let me organize it a bit and create a PR for this (I think all these new syntax directions can be the focus of a 0.3 release). |
|
Ok, I sort of changed the direction a bit and re-used a lot of the stuff we have. IDE type inference doesn't work that great, but REPL and compiler work pretty smooth: Here is how it looks like: case class Foo(i: Int, b: Long, c: String)
val f = TypedDataset.create( Foo(1,2,"a") :: Foo(2,2,"b") :: Foo(3,2,"b") :: Nil )
// Ignore function names for now :)
f.choose(x => x('c) :: x('i) :: HNil ).
group(g => g('_1) :: HNil).
aggr(a => sum(a('_2)) :: HNil).show().run
//+---+---+
//| _1| _2|
//+---+---+
//| b| 5|
//| a| 1|
//+---+---+Currently you pass functions of Here is how def choose[U <: HList, Out0 <: HList, Out](t: TypedDataset[T] => U)(implicit
ct: ColumnTypes.Aux[T, U, Out0],
toTraversable: ToTraversable.Aux[U, List, UntypedExpression[T]],
tupler: Tupler.Aux[Out0, Out],
encoder: TypedEncoder[Out]): TypedDataset[Out] = selectMany.applyProduct( t(self) )I did something very similar for group and aggr. Essentially, I am providing a different wrapper around This is what I have for now ... Keep in mind that with this syntax we don't have to change anything from our default types (TypedColumn, TypedAggregate, etc.). |
|
@imarios What's the motivation to expose an
One motivation with this syntax was to simplify these types. Indeed, if |
|
I need to find time to finish what I had in mind, so that we have something more concrete to argue about :P |
|
No need to be an Hlist, the current code I have takes a Tuple. I see what you mean. |
|
This also means that we cannot have Column[A] created "in the wild". It always has to be inside a TypedDataset context in some way. |
|
|
@OlivierBlanvillain @kanterov @imarios: What do you think of #110 as a "baby" step? When functions are used everywhere (if symbols were to be eliminated from syntax) then type inference gets a lot better. I'll post another PR that builds on #110 with another small-ish macro for val ds2 = ds1.selectExpr(f => (f.a * 10, f.b + f.c / 20))The implementation will just expand it to: val ds2 = ds1.select(ds1(_.a) * 10, ds1(_.b) + ds1(_.c) / 20)and everything proceeds from there in the same fashion as it currently does. Of course, it's possible to put in expressions that can't actually be converted into projections, but the macro will give a compile error in that case. I think this will end up being easier because the syntax is expanded by just one macro, and everything else is implemented as it currently is (rather than struggling with new syntax-based typeclasses that IntelliJ will hate) |
|
Closing in favor of #162. |
This is a proposition for a new syntax for column expression, which I think could be used as the only way to construct
select/groupBy/joins in frameless, it would be an improvement for both #39 and #46. It's at super early stage, but I would like to get your feedback before going forward. Here the syntax I'm managed to get: (modulo type inference, but I would have to look into this further)First commit is the interesting one, second commit is commenting every test to show that the trivial select examples are actually working.