[Final] Unifying typed column and aggregate column. #153
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##           master     #153     +/-  ##
=========================================
+ Coverage   96.62%   96.83%    +0.2%
=========================================
  Files          52       52
  Lines         860      852       -8
  Branches       11       11
=========================================
- Hits          831      825       -6
+ Misses         29       27       -2
```
Continue to review full report at Codecov.
Force-pushed: ee4d28e to 7496a83
Hi @kanterov and @OlivierBlanvillain. I think this is a step towards the right approach of unifying the typed and aggregate columns.
I'm not sure if it's the optimal approach. What if we do it as it was done before (18e1cdd)? Was there any problem with it except confusion?
In this pull request … Or we can even use phantom types, something like:

```scala
private sealed abstract class UntypedColumn[T, U](expr: Expression)(
  implicit encoder: TypedEncoder[U])

sealed trait TypedColumnTag[T, U]
sealed trait TypedAggregateTag[T, U]

type TypedAggregate[T, U] = UntypedColumn[T, U] with TypedAggregateTag[T, U]
type TypedColumn[T, U] = UntypedColumn[T, U] with TypedColumnTag[T, U]
type TypedColumnAndAggregate[T, U] =
  UntypedColumn[T, U] with TypedColumnTag[T, U] with TypedAggregateTag[T, U]
```
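For reference, a minimal self-contained sketch of this phantom-tag idea (the names and class bodies below are illustrative placeholders, not the frameless API): the same underlying column is refined with marker traits, so an aggregate-only method can reject plain columns at compile time.

```scala
object PhantomTagSketch {
  // Toy stand-in for the untyped column (a String instead of a Spark Expression).
  class UntypedColumn[T, U](val expr: String)
  trait TypedColumnTag[T, U]
  trait TypedAggregateTag[T, U]

  type TypedColumn[T, U]    = UntypedColumn[T, U] with TypedColumnTag[T, U]
  type TypedAggregate[T, U] = UntypedColumn[T, U] with TypedAggregateTag[T, U]

  // Only columns carrying the aggregate tag are accepted here.
  def agg[T, U](c: TypedAggregate[T, U]): String = c.expr

  val sumCol: TypedAggregate[Int, Long] =
    new UntypedColumn[Int, Long]("sum(a)") with TypedAggregateTag[Int, Long]
  val plainCol: TypedColumn[Int, Long] =
    new UntypedColumn[Int, Long]("a") with TypedColumnTag[Int, Long]

  agg(sumCol)      // compiles: carries the aggregate tag
  // agg(plainCol) // rejected: plainCol lacks TypedAggregateTag
}
```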
Force-pushed: 7496a83 to f2ef3f8
@OlivierBlanvillain working with Frameless is really not fun without solving this. Each time you need to do something to a column you aggregate, well, you cannot. You need to create a new dataset and then do something on that column (like cast it, multiply it, etc.). This PR lets you do things like:

```scala
c.groupBy(c('a)).agg(count()/2, sum(c('b)) * 2, sum(c('d)) > 10)
```

Whereas now the only way to do this is:

```scala
val tmp = c.groupBy(c('a)).agg(count(), sum(c('b)), sum(c('d)))
tmp.select(tmp('_1)/2, tmp('_2) * 2, tmp('_3) > 10)
```

Not cool.
@OlivierBlanvillain I am all done with what I wanted to do with this PR. Will you have any time to look at this?

I will, hopefully by the end of the week 😄
OlivierBlanvillain left a comment:
@imarios sorry for leaving this unreviewed for so long. I have to say I don't really find the encoding... But I don't really see a better way, and this is already minimal in terms of duplicated code, so I guess my feeling alone is not a good reason to hold back on these changes. I would be interested to hear if someone has ideas for another way to achieve the same thing (ping @kanterov) 😄
```scala
/** Creates a typed column of either TypedColumn or TypedAggregate.
  */
protected def mkLit[U1: TypedEncoder](c: U1): TC[U1]
```
Not a fan of the mk prefix here, it sounds like this is doing more than it actually does. What about def lit and def typed?
```scala
def this(column: Column)(implicit uencoder: TypedEncoder[U]) {
  this(FramelessInternals.expr(column))
}
```

```scala
type TC[A] <: AbstractTypedColumn[T, A]
```
hmm, that's F-bounded polymorphism, the heaviest hammer available... But I don't really see a better way here. Still, I have a couple of questions:
- Why a type member instead of a type parameter?
- Why not expose `T` in `TC` as well? This is definitely something I'm going to need for my work on joins (given that `T` tracks the "source" of a column).
Both of these are things I considered. No particular reason for taking this approach other than that it simplifies the type signatures throughout. It does expose less, so given that you already have a use case that needs `T`, I can make these changes.
@OlivierBlanvillain I didn't find a nice way to do the same thing I do with the type member. Any suggestions? It gets into this recursive type definition and I find myself having to use a type lambda. This is what I have:

```scala
AbstractTypedColumn[T, U, TC[T, _] <: AbstractTypedColumn[T, U, _]]
```

But I would want to write something like:

```scala
AbstractTypedColumn[T, U, TC[T, ?] <: AbstractTypedColumn[T, U, ?]]
```

Idk, unless you have any other ideas, I think a type member looks much better for doing this than a type parameter.
I guess you could just write `TC[_, _]` (without the `<:` part), no need to constrain it further given that it's for internal use only.
I take that back, you obviously need the `<: AbstractTypedColumn` bound to be able to call methods on `TC`... Here is the type parameter version:

```scala
def AbstractTypedColumn
  [T, U, ThisType[x, y] <: AbstractTypedColumn[x, y, ThisType]] ...
```

I see no benefit in using that instead of the type member version, so let's stick to what's already here.
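For reference, here is a hedged, self-contained sketch of that type-parameter encoding (simplified names, not the real frameless signatures): classic F-bounded polymorphism where a higher-kinded `ThisType` parameter is threaded through every signature, which is the cost the type-member version avoids.

```scala
object FBoundedParamSketch {
  // The higher-kinded parameter ThisType is bounded by the class itself.
  abstract class AbstractTypedColumn[T, U, ThisType[x, y] <: AbstractTypedColumn[x, y, ThisType]] {
    def typed[A](expr: String): ThisType[T, A]
    // Written once, returns the concrete column kind of the subclass.
    def cast[A]: ThisType[T, A] = typed[A]("cast")
  }

  final class TypedColumn[T, U]
      extends AbstractTypedColumn[T, U, TypedColumn] {
    def typed[A](expr: String): TypedColumn[T, A] = new TypedColumn[T, A]
  }

  final class TypedAggregate[T, U]
      extends AbstractTypedColumn[T, U, TypedAggregate] {
    def typed[A](expr: String): TypedAggregate[T, A] = new TypedAggregate[T, A]
  }

  val a: TypedAggregate[Unit, Long] = new TypedAggregate[Unit, Int].cast[Long]
}
```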
```scala
object aggregate extends AggregateFunctions
object nonAggregate extends NonAggregateFunctions
```

```scala
private def typedColumnToAggregate[A: TypedEncoder, T](a: TypedColumn[T, A]): TypedAggregate[T, A] =
```
I think both of these methods are used exactly once. Any reason not to inline them?
Probably thought initially that they would be used more often and then forgot to revise. Yes, I can inline these.
```scala
package frameless

import frameless.functions.aggregate._
import frameless.functions._
```
Unused import?
I can check
it is needed for lit.
```scala
  * @param u another column of the same type
  * apache/spark
  */
def <(u: TypedColumn[T, U])(implicit canOrder: CatalystOrdered[U]): TC[Boolean] =
```
Why is `<` from `TypedColumn` to `TC`, whereas `or` is from `TC` to `TC`?
Good catch, missed this one.
I think the next 3 functions also have the same typo
```scala
def prop(xs: List[Long]): Prop = {
  val dataset = TypedDataset.create(xs.map(X1(_)))
  val A = dataset.col[Long]('a)
  val datasetMax = dataset.agg(max(A) * 2).collect().run().headOption
```
How confident are you that all operations are supported on aggregated columns? Given what you've found with orderBy I wouldn't be surprised if only a subset was, so maybe we should be a bit more exhaustive. If there was a way to do that without duplicating all the column tests, that would be ideal.
Pretty confident. All the columns inside an `agg()` (after you apply an aggregation method) are treated as regular columns and they support all ops. This is a bit different from `orderBy`, where the selected columns are not really projections, just columns you use to order the data. On the other hand, with `select()` and `agg()`, the columns included there are the columns of the resulting dataframe (its schema), and I found those to be consistent.
@imarios I played with it a bit and I'm getting more and more convinced that there is no better way to do that :) I just pushed a commit that exposes the second type parameter and renames …

Looks great! Thanks @OlivierBlanvillain :) I didn't even know that this pattern was called F-bounded polymorphism, so thank you for the lesson! I will do another quick pass, rebase, squash and merge.
Force-pushed: 3d72e2f to 46d7d58
@OlivierBlanvillain, ok so during my "quick pass" I realized that a large collection of methods under …

```scala
val t = TypedDataset.create(("a","b")::("a","c")::Nil)
t.groupBy(t('_1)).agg(concatWs(":", first(t('_2)), last(t('_2)))).show().run
+---+---+
| _1| _2|
+---+---+
|  a|b:c|
+---+---+
```
```scala
def atan2[A, T](l: AbstractTypedColumn[T, A], r: Double)
  (implicit
    evCanBeDoubleL: CatalystCast[A, Double]): l.ThisType[T, Double] =
  atan2(l, l.lit(r)).asInstanceOf[l.ThisType[T, Double]]
```
Why is this cast needed?
Good question, idk, but it doesn't compile otherwise. Probably some type inference quirk. It works for the other overload, where the right parameter is an `AbstractTypedColumn[T, A]`, but it doesn't work for this one here. I am kind of "helping" the compiler with the `asInstanceOf`. If you already have this branch locally, can you give it a shot?
It's because it calls into `atan2`, which doesn't check that both `l` and `r` have the same type. I'll fix it.
```scala
def levenshtein[T](l: TypedColumn[T, String], r: TypedColumn[T, String]): TypedColumn[T, Int] = {
  new TypedColumn[T, Int](untyped.levenshtein(l.untyped, r.untyped))
}
def levenshtein[T](l: AbstractTypedColumn[T, String], r: AbstractTypedColumn[T, String]): l.ThisType[T, Int] =
```
Why not r.ThisType? Does this even make sense when l.ThisType != r.ThisType?
Same question for concat and concatWs
Meaning that one is an aggregate column and the other one is not? Let me try what would happen there.
Yea, good catch! This one fails in case you use one that is aggregate and one that is not ... so all of them need to be of the same kind (either all aggregates or all projections).
Thought I found an easy solution ... was kind of surprised this didn't work ...

```scala
def concatWs[T, G[_,_] <: AbstractTypedColumn[T, String]](sep: String,
    c1: G[T, String],
    rest: G[T, String]*): c1.ThisType[T, String] =
  c1.typed(untyped.concat_ws(sep, (c1 +: rest).map(_.untyped): _*))
```
Even with -Ypartial-unification scalac can't infer these types :(
Never seen this done this way. Nice :).

Yeah, even doing this, it doesn't really force all `G`s to be of the same type. It would still compile when you have one being an aggregate and one being projected.
Well in theory it should do the right thing, but type inference can't handle the `[_, _]` here. This is the signature I have for levenshtein:

```scala
def levenshtein[T, ColumnType[a, b] <: AbstractTypedColumn[a, b]](
  l: AbstractTypedColumn[T, String] { type ThisType[a, b] = ColumnType[a, b] },
  r: AbstractTypedColumn[T, String] { type ThisType[a, b] = ColumnType[a, b] }
): ColumnType[T, Int] =
  l.typed(untyped.levenshtein(l.untyped, r.untyped))
```

I think we can just go for the dirty solution: duplicate all the methods involving two AbstractTypedColumn.
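For reference, a hedged, self-contained illustration of this refinement-type trick (toy names, not the frameless types): pinning both parameters to the same `ThisType` member rejects mixing column kinds, at least when the type constructor is supplied explicitly.

```scala
object RefinementSketch {
  abstract class Col[T, U] {
    type ThisType[a, b] <: Col[a, b]
    def typed[A]: ThisType[T, A]
  }
  final class Plain[T, U] extends Col[T, U] {
    type ThisType[a, b] = Plain[a, b]
    def typed[A]: Plain[T, A] = new Plain[T, A]
  }
  final class Agg[T, U] extends Col[T, U] {
    type ThisType[a, b] = Agg[a, b]
    def typed[A]: Agg[T, A] = new Agg[T, A]
  }

  // Both arguments must expose ThisType = C, i.e. be the same kind of column.
  def combine[T, C[a, b] <: Col[a, b]](
    l: Col[T, String] { type ThisType[a, b] = C[a, b] },
    r: Col[T, String] { type ThisType[a, b] = C[a, b] }
  ): C[T, Int] = l.typed[Int]

  val ok = combine[Unit, Plain](new Plain[Unit, String], new Plain[Unit, String])
  // combine[Unit, Plain](new Plain[Unit, String], new Agg[Unit, String])
  //   does not compile: Agg's ThisType is not Plain
}
```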
Ohh, I think I got it(?)

Usage:

```scala
scala> val t1 = TypedDataset.create((1,"b")::(2,"c")::Nil)
t1: frameless.TypedDataset[(Int, String)] = [_1: int, _2: string]

scala> import frameless.functions.nonAggregate._
import frameless.functions.nonAggregate._

scala> t1.agg(concatWs(",", sum(t1('_1)).cast[String], t1('_2))).show().run
<console>:29: error: Cannot prove that frameless.TypedAggregate[_, _] =:= frameless.TypedColumn[_, _].
       t1.agg(concatWs(",", sum(t1('_1)).cast[String], t1('_2))).show().run
                       ^

scala> t1.agg(concatWs(",", sum(t1('_1)).cast[String], sum(t1('_1)).cast[String])).show().run
+---+
| _1|
+---+
|3,3|
+---+
```

Here is what seemed to work:

```scala
def concatWs[T, G1[a,b] <: AbstractTypedColumn[a, b], G2[a,b] <: AbstractTypedColumn[a, b]](sep: String,
    c1: G1[T, String],
    rest: G2[T, String]*)(implicit eq: G1[_,_] =:= G2[_,_]): c1.ThisType[T, String] =
  c1.typed(untyped.concat_ws(sep, (c1 +: rest).map(_.untyped): _*))
```
Nice 😄. But I'm not sure it's worth the generality, see my latest commit where I duplicate these methods...
```scala
  })
}

def stringFuncProp[A : Encoder](strFunc: TypedColumn[X1[String], String] => TypedColumn[X1[String], A], sparkFunc: Column => Column) = {
```
Why can't we keep this one? So much boilerplate...
I learned this one the hard way (actually spent quite some time trying to make this work). Everything I tried failed in one way or the other with the same compiler error:

```
NonAggregateFunctionsTests.scala:611: method with dependent type (str: frameless.AbstractTypedColumn[T,String])str.ThisType[T,String] cannot be converted to function value
```

I think the error is pretty clear. Unless you have an ace up your sleeve, I'm not sure we can do much more here than bear the extra code. Note that by writing the extra code I actually had the time to make the tests a bit more realistic. The random Strings from scalacheck always result in non-English characters, so upper-casing and replacing numbers never really do anything. I constrained the generators a bit more and actually injected numbers and whitespaces to make sure that we are testing with data that triggers the logic in question.
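For context, here is a tiny self-contained reproduction of that class of error (toy types, not the actual test code): as soon as a method's result type depends on its argument, it can no longer be eta-expanded into a plain function value, which is what the old `stringFuncProp` helper needed.

```scala
object DependentEtaSketch {
  trait Col { type Out }

  // Dependent method type: the result type mentions the argument `c`.
  def widen(c: Col): c.Out = ???

  // val f = widen _
  // ^ fails with: "method with dependent type (c: Col)c.Out
  //   cannot be converted to function value"
}
```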
```scala
package functions

import frameless.functions.nonAggregate._
import org.apache.spark.sql.{ Column, Encoder }
import org.apache.spark.sql.{Encoder, functions => untyped}
```
It's renamed this way, `{functions => sparkFunctions}`, in two other places, so this should probably be consistent (I know you just moved it around, but since we are at it...).
```scala
def concat[T](c1: AbstractTypedColumn[T, String],
    rest: AbstractTypedColumn[T, String]*): c1.ThisType[T, String] =
  c1.typed(untyped.concat((c1 +: rest).map(_.untyped): _*))
def concat[T](c1: TypedColumn[T, String], xs: TypedColumn[T, String]*): TypedColumn[T, String] =
```
Looks much cleaner :D yeah, not worth the generality for sure. Now that we don't need the dependent type, we can actually go back to having one vararg parameter.
@imarios do you mind adding a few more tests to please codecov?
Force-pushed: bf3bb15 to 703eb5d
@OlivierBlanvillain added quite a few more tests. Codecov coverage is in fact getting higher by merging this PR.
```scala
  */
def concat[T](c1: TypedColumn[T, String], xs: TypedColumn[T, String]*): TypedColumn[T, String] =
  c1.typed(untyped.concat((c1 +: xs).map(_.untyped): _*))
def concat[T](columns: TypedColumn[T, String]*): TypedColumn[T, String] =
```
Does this break when called with no arguments?
This is what we had before this PR.

```scala
scala> t.select(concat())
res1: frameless.TypedDataset[String] = [_1: string]

scala> t.select(concat()).show().run
+---+
| _1|
+---+
|   |
|   |
+---+

scala> t.agg(concatWs(",")).show().run
+---+
| _1|
+---+
|   |
+---+
```
works ok for both agg and select and for all variations.
Force-pushed: 703eb5d to 5cd3979
@OlivierBlanvillain any last comments here? If it looks good I will rebase and merge.

LGTM!
Force-pushed: 5cd3979 to bbe975d
@imarios I think master fails because of this PR, maybe something went wrong in the rebase?
This fixes #148 as well. It essentially uses a base class for both TypedColumn and TypedAggregate that implements the vast majority of the methods once. Now, whatever you can do with a TypedColumn you can also do with a TypedAggregate (compare, multiply, cast, etc.).

At its core, this implementation uses a type member to help with type inference. This ensures that aggregated types cannot be used where a simple column is expected and vice versa.
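To make that description concrete, here is a hedged, self-contained sketch of the pattern (simplified placeholders, not the actual frameless code): the base class implements operations once in terms of the type member, each subclass fixes the member to itself, and a method that expects aggregates still rejects plain columns.

```scala
object UnifiedColumnSketch {
  // Shared base: operations are written once against the abstract member TC.
  abstract class AbstractTypedColumn[T, U] {
    type TC[A] <: AbstractTypedColumn[T, A]
    protected def typed[A](expr: String): TC[A]

    def multiply(other: TC[U]): TC[U] = typed[U]("multiply")
    def cast[A]: TC[A]                = typed[A]("cast")
  }

  final class TypedColumn[T, U](val expr: String) extends AbstractTypedColumn[T, U] {
    type TC[A] = TypedColumn[T, A]
    protected def typed[A](e: String): TypedColumn[T, A] = new TypedColumn[T, A](e)
  }

  final class TypedAggregate[T, U](val expr: String) extends AbstractTypedColumn[T, U] {
    type TC[A] = TypedAggregate[T, A]
    protected def typed[A](e: String): TypedAggregate[T, A] = new TypedAggregate[T, A](e)
  }

  // A method that only accepts aggregates still rejects plain columns.
  def agg[T, U](a: TypedAggregate[T, U]): String = a.expr

  val s = new TypedAggregate[Unit, Long]("sum(a)")
  agg(s.multiply(s).cast[Long])           // compiles: result is still a TypedAggregate
  // agg(new TypedColumn[Unit, Long]("a")) // rejected at compile time
}
```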