@imarios
Contributor

@imarios imarios commented Jun 14, 2017

This fixes #148 as well. It essentially uses a base class for both TypedColumn and TypedAggregate that implements the vast majority of the methods once. Now, whatever you can do with a TypedColumn you can also do with TypedAggregate (like compare, multiply, cast, etc.).

At its core, this implementation uses a type member to help with type inference. This ensures that aggregated types cannot be used where a simple column is expected, and vice versa.
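
A minimal sketch of the type-member idea described above, with Spark stripped out. The names AbstractCol, Col, Agg and Queries are hypothetical stand-ins (the real Frameless classes wrap a Spark expression and a TypedEncoder), so this only illustrates the shape of the encoding, not the PR's code:

```scala
// Hypothetical stand-ins for illustration; not the Frameless classes.
// A column is modelled as nothing more than a string expression.
abstract class AbstractCol[T, U](val expr: String) {
  // Each subclass pins ThisType to its own kind.
  type ThisType[A] <: AbstractCol[T, A]

  // Wraps a new expression in a column of the same kind as `this`.
  protected def typed[A](e: String): ThisType[A]

  // Written once, shared by both kinds.
  def *(n: Int): ThisType[U] = typed[U](s"($expr * $n)")
  def >(n: Int): ThisType[Boolean] = typed[Boolean](s"($expr > $n)")
}

final class Col[T, U](e0: String) extends AbstractCol[T, U](e0) {
  type ThisType[A] = Col[T, A]
  protected def typed[A](e: String): Col[T, A] = new Col[T, A](e)
}

final class Agg[T, U](e0: String) extends AbstractCol[T, U](e0) {
  type ThisType[A] = Agg[T, A]
  protected def typed[A](e: String): Agg[T, A] = new Agg[T, A](e)
}

object Queries {
  // Each entry point accepts exactly one kind of column.
  def select[T, A](c: Col[T, A]): String = s"select ${c.expr}"
  def agg[T, A](c: Agg[T, A]): String = s"agg ${c.expr}"
}

object TypeMemberDemo {
  val a = new Col[Any, Int]("a")
  val sumB = new Agg[Any, Int]("sum(b)")

  // The shared operators keep the kind of the column they are called on.
  val gt: Col[Any, Boolean] = a > 10
  val doubled: Agg[Any, Int] = sumB * 2

  val selected: String = Queries.select(gt)
  val aggregated: String = Queries.agg(doubled)
  // Queries.select(doubled) is rejected: an Agg is not a Col.
}
```

In this model the operators live only in the base class, yet `sumB * 2` is still statically an Agg, which is what keeps an aggregate out of `select`.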

@codecov-io

codecov-io commented Jun 14, 2017

Codecov Report

Merging #153 into master will increase coverage by 0.2%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master     #153     +/-   ##
=========================================
+ Coverage   96.62%   96.83%   +0.2%     
=========================================
  Files          52       52             
  Lines         860      852      -8     
  Branches       11       11             
=========================================
- Hits          831      825      -6     
+ Misses         29       27      -2
Impacted Files Coverage Δ
...set/src/main/scala/frameless/FramelessSyntax.scala 100% <100%> (ø) ⬆️
...la/frameless/functions/NonAggregateFunctions.scala 100% <100%> (ø) ⬆️
...scala/frameless/functions/AggregateFunctions.scala 100% <100%> (ø) ⬆️
...t/src/main/scala/frameless/functions/package.scala 100% <100%> (ø) ⬆️
dataset/src/main/scala/frameless/TypedColumn.scala 100% <100%> (ø) ⬆️
...c/main/scala/frameless/TypedDatasetForwarded.scala 72.22% <0%> (-2.07%) ⬇️
...ataset/src/main/scala/frameless/TypedDataset.scala 100% <0%> (+8.82%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 075424e...bbe975d. Read the comment docs.

@imarios imarios force-pushed the unifying_typed_column_and_aggregate_column branch 2 times, most recently from ee4d28e to 7496a83 Compare June 15, 2017 13:05
@imarios imarios closed this Jun 15, 2017
@imarios imarios reopened this Jun 15, 2017
@imarios imarios closed this Jun 15, 2017
@imarios imarios reopened this Jun 15, 2017
@imarios imarios closed this Jun 15, 2017
@imarios imarios reopened this Jun 15, 2017
@imarios imarios closed this Jun 16, 2017
@imarios imarios reopened this Jun 16, 2017
@imarios
Contributor Author

imarios commented Jun 16, 2017

Hi @kanterov and @OlivierBlanvillain. I think this is a step towards the right approach of unifying TypedColumn with TypedAggregate. It contains almost no duplication between the two. The base class named GenericTypedColumn contains all the shared code.

@imarios imarios closed this Jun 18, 2017
@imarios imarios reopened this Jun 18, 2017
@kanterov
Contributor

I'm not sure this is the optimal approach. What if we do it the way it was done before, in 18e1cdd? Was there any problem with that approach other than confusion?

@kanterov
Contributor

In this pull request TypedAggregate and TypedColumn must have the same U, whereas before we allowed the types to differ, for instance because select produces Option[U] while agg always produces U. Do we still need to support such things? The hierarchy could probably be formed as

UntypedColumn[T]<-------+
      ^                 ^ 
      |                 |
 TypedColumn[T, U]  TypedAggregate[T, U]
      ^                 ^
      |                 |
      TypedAggregateAndColumn[T, U]

Or we can even use phantom types, something like

private sealed abstract class UntypedColumn[T, U](expr: Expression)(
  implicit encoder: TypedEncoder[U])

sealed trait TypedColumnTag[T, U]
sealed trait TypedAggregateTag[T, U]

type TypedAggregate[T, U] = UntypedColumn[T, U] with TypedAggregateTag[T, U]
type TypedColumn[T, U] = UntypedColumn[T, U] with TypedColumnTag[T, U]
type TypedColumnAndAggregate[T, U] = UntypedColumn[T, U] with TypedColumnTag[T, U] with TypedAggregateTag[T, U]
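
A self-contained sketch of the phantom-tag idea above, under simplifying assumptions: the tags carry no type parameters, a column is just a string, and the names UntypedCol, Tagged, col, agg and both are hypothetical. It covers only the tagging; how shared operators would return the right tag is left open, as in the comment:

```scala
// Hypothetical, simplified model of the phantom-tag proposal.
sealed trait ColumnTag
sealed trait AggregateTag

// The untagged carrier; the tags add no members, only type information.
class UntypedCol[T, U](val expr: String)

object Tagged {
  type Col[T, U]  = UntypedCol[T, U] with ColumnTag
  type Agg[T, U]  = UntypedCol[T, U] with AggregateTag
  type Both[T, U] = UntypedCol[T, U] with ColumnTag with AggregateTag

  // Constructors attach the tag(s) at creation time.
  def col[T, U](e: String): Col[T, U] = new UntypedCol[T, U](e) with ColumnTag
  def agg[T, U](e: String): Agg[T, U] = new UntypedCol[T, U](e) with AggregateTag
  def both[T, U](e: String): Both[T, U] =
    new UntypedCol[T, U](e) with ColumnTag with AggregateTag

  // Each entry point demands the tag it needs.
  def select[T, U](c: Col[T, U]): String = s"select ${c.expr}"
  def aggregate[T, U](c: Agg[T, U]): String = s"agg ${c.expr}"
}

object TaggedDemo {
  val a = Tagged.col[Any, Int]("a")
  val one = Tagged.both[Any, Int]("1")

  val s1: String = Tagged.select[Any, Int](a)
  val s2: String = Tagged.select[Any, Int](one)    // Both is accepted as a Col
  val s3: String = Tagged.aggregate[Any, Int](one) // and as an Agg
  // Tagged.aggregate[Any, Int](a) is rejected: `a` lacks AggregateTag.
}
```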

@imarios imarios force-pushed the unifying_typed_column_and_aggregate_column branch from 7496a83 to f2ef3f8 Compare January 16, 2018 05:40
@imarios
Contributor Author

imarios commented Jan 16, 2018

@OlivierBlanvillain working with Frameless is really not fun without solving this. Each time you need to do something to a column you aggregate, well, you cannot. You need to create a new dataset and only then do something with that column (like cast it, multiply it, etc.).

This PR lets you do things like:

c.groupBy(c('a)).agg(count()/2, sum(c('b)) * 2, sum(c('d)) > 10) 

Whereas now the only way to do this is by:

val tmp = c.groupBy(c('a)).agg(count(), sum(c('b)), sum(c('d))) 
tmp.select(tmp('_1)/2, tmp('_2) * 2, tmp('_3)>10)

not cool.

@imarios imarios changed the title Unifying typed column and aggregate column. [WIP] Unifying typed column and aggregate column. Jan 16, 2018
@imarios imarios changed the title [WIP] Unifying typed column and aggregate column. [Final] Unifying typed column and aggregate column. Jan 17, 2018
@imarios
Contributor Author

imarios commented Jan 17, 2018

@OlivierBlanvillain I am all done with what I wanted to do with this PR. Will you have any time to look at this?

@OlivierBlanvillain
Contributor

I will, hopefully by the end of the week 😄

Contributor

@OlivierBlanvillain OlivierBlanvillain left a comment

@imarios sorry for leaving this unreviewed for so long. I have to say I don't really find the encoding... But I don't really see a better way, and this is already minimal in terms of duplicated code, so I guess my feeling alone is not a good reason to hold back on these changes. I would be interested to hear if someone has ideas for another way to achieve the same thing (ping @kanterov) 😄


/** Creates a typed column of either TypedColumn or TypedAggregate.
*/
protected def mkLit[U1: TypedEncoder](c: U1): TC[U1]
Contributor

Not a fan of the mk prefix here, it sounds like this is doing more than it actually does. What about def lit and def typed?

def this(column: Column)(implicit uencoder: TypedEncoder[U]) {
this(FramelessInternals.expr(column))
}
type TC[A] <: AbstractTypedColumn[T, A]
Contributor

hmm, that's F-bounded polymorphism, the heaviest hammer available... But I don't really see a better way here. Still, I have a couple of questions:

  • Why a type member instead of a type parameter?
  • Why not expose T in TC as well? This is definitely something I'm going to need for my work on joins (given that T tracks the "source" of a column)

Contributor Author

Both of these are things I considered. There was no particular reason for taking this approach other than that it simplifies the type signatures throughout. It does expose less, so given that you already have a use case that needs T, I can make these changes.

Contributor Author

@OlivierBlanvillain I didn't find a nice way to do with a type parameter what I do with a type member. Any suggestions? It gets into a recursive type definition, and I find myself needing a type lambda. This is what I have:

AbstractTypedColumn[T, U, TC[T, _] <: AbstractTypedColumn[T, U, _]]

But I would want to write something like:

AbstractTypedColumn[T, U, TC[T,?] <: AbstractTypedColumn[T, U, ?]]

Idk, unless you have any other ideas, I think type member looks much better for doing this than a type parameter.

Contributor

@OlivierBlanvillain OlivierBlanvillain Jan 20, 2018

I guess you could just write TC[_, _] (without the <: part); no need to constrain it further given that it's for internal use only.

Contributor

I take that back, you obviously need the <: AbstractTypedColumn to be able to call methods on TC... Here is the type parameter version:

def AbstractTypedColumn
  [T, U, ThisType[x, y] <: AbstractTypedColumn[x, y, ThisType]] ...

I see no benefits of using that instead of the type member version, so let's stick to what's already here.

object aggregate extends AggregateFunctions
object nonAggregate extends NonAggregateFunctions

private def typedColumnToAggregate[A: TypedEncoder, T](a: TypedColumn[T, A]): TypedAggregate[T, A] =
Contributor

I think both of these methods are used exactly once. Any reason not to inline them?

Contributor Author

I probably thought initially that they would be used more often and then forgot to revisit them. Yes, I can inline these.

package frameless

import frameless.functions.aggregate._
import frameless.functions._
Contributor

Unused import?

Contributor Author

I can check

Contributor Author

it is needed for lit.

* @param u another column of the same type
* apache/spark
*/
def <(u: TypedColumn[T, U])(implicit canOrder: CatalystOrdered[U]): TC[Boolean] =
Contributor

Why is < from TypedColumn to TC, whereas or is from TC to TC?

Contributor Author

Good catch, missed this one.

Contributor

I think the next 3 functions also have the same typo

def prop(xs: List[Long]): Prop = {
val dataset = TypedDataset.create(xs.map(X1(_)))
val A = dataset.col[Long]('a)
val datasetMax = dataset.agg(max(A) * 2).collect().run().headOption
Contributor

How confident are you that all operations are supported on aggregated columns? Given what you've found with orderBy I wouldn't be surprised if only a subset was, so maybe we should be a bit more exhaustive. If there was a way to do that without duplicating all the column tests, that would be ideal.

Contributor Author

@imarios imarios Jan 20, 2018

Pretty confident. All the columns inside an agg() (after you have applied an aggregation method) are treated as regular columns, and they support all ops. This is a bit different from orderBy, where the selected columns are not really projections, just columns you use to order the data. With select() and agg(), on the other hand, the columns included there are the columns of the resulting dataframe (its schema), and I found those to be consistent.

@OlivierBlanvillain
Contributor

@imarios I played with it a bit and I'm getting more and more convinced that there is no better way to do that :)

I just pushed a commit that exposes the second type parameter and renames TC to ThisType (it's the standard name in this pattern; TC sounds more like an acronym for typeclass to me). If that's OK with you, then it all LGTM!

@imarios
Contributor Author

imarios commented Jan 21, 2018

Looks great! Thanks @OlivierBlanvillain :) I didn't even know that this pattern was called F-bounded polymorphism, so thank you for the lesson! I will do another quick pass, rebase, squash and merge.

@imarios imarios force-pushed the unifying_typed_column_and_aggregate_column branch from 3d72e2f to 46d7d58 Compare January 21, 2018 16:44
@imarios
Contributor Author

imarios commented Jan 22, 2018

@OlivierBlanvillain , ok so during my "quick pass" I realized that a large collection of methods under NonAggregateFunctions are not included. As with the previous changes, these methods can operate on both kinds (projected+aggregated). This new commit changes the NonAggregateFunctions to work for both. Now you can do things like:

val t = TypedDataset.create(("a","b")::("a","c")::Nil)
t.groupBy(t('_1)).agg(concatWs(":",first(t('_2)), last(t('_2)))).show().run
+---+---+
| _1| _2|
+---+---+
|  a|b:c|
+---+---+

def atan2[A, T](l: AbstractTypedColumn[T, A], r: Double)
(implicit
evCanBeDoubleL: CatalystCast[A, Double]): l.ThisType[T, Double] =
atan2(l, l.lit(r)).asInstanceOf[l.ThisType[T, Double]]
Contributor

Why is this cast needed?

Contributor Author

@imarios imarios Jan 22, 2018

Good question, idk, but it doesn't compile otherwise. Probably some type inference quirk. It works for the other one, whose right parameter is an AbstractTypedColumn[T, A], but it doesn't work for this one here. I am kind of "helping" the compiler with the asInstanceOf. If you already have this branch locally, can you give it a shot?

Contributor

It's because it calls into atan2, which doesn't check that both l & r have the same type. I'll fix it.

def levenshtein[T](l: TypedColumn[T, String], r: TypedColumn[T, String]): TypedColumn[T, Int] = {
new TypedColumn[T, Int](untyped.levenshtein(l.untyped, r.untyped))
}
def levenshtein[T](l: AbstractTypedColumn[T, String], r: AbstractTypedColumn[T, String]): l.ThisType[T, Int] =
Contributor

Why not r.ThisType? Does this even make sense when l.ThisType != r.ThisType?

Contributor

Same question for concat and concatWs

Contributor Author

Meaning that one is an aggregate column and the other one is not? Let me try what would happen there.

Contributor Author

Yea, good catch! This one fails in case you use one that is aggregate and one that is not ... so all of them need to be of the same kind (either all aggregates or all projections).

Contributor Author

@imarios imarios Jan 22, 2018

thought I found an easy solution ... was kind of surprised this didn't work ...

def concatWs[T, G[_,_] <: AbstractTypedColumn[T, String]](sep: String,
                  c1: G[T, String],
                  rest: G[T, String]*): c1.ThisType[T, String] =
    c1.typed(untyped.concat_ws(sep, (c1 +: rest).map(_.untyped): _*))

Contributor

Even with -Ypartial-unification scalac can't infer these types :(

Contributor Author

Never seen this done this way. Nice :).

Yeah, even doing this, it doesn't really force all Gs to be of the same type. It would still compile when you have one being aggr and one being projected..

Contributor

Well in theory it should do the right thing, but type inference can't handle the [_, _] here. This is the signature I have for levenshtein:

  def levenshtein[T, ColumnType[a, b] <: AbstractTypedColumn[a, b]](
    l: AbstractTypedColumn[T, String] { type ThisType[a, b] = ColumnType[a, b] },
    r: AbstractTypedColumn[T, String] { type ThisType[a, b] = ColumnType[a, b] }
  ): ColumnType[T, Int] =
    l.typed(untyped.levenshtein(l.untyped, r.untyped))

I think we can just go for the dirty solution: duplicate all the methods involving two AbstractTypedColumn.

Contributor Author

Ohh, I think I got it(?)

Usage:

scala> val t1 = TypedDataset.create((1,"b")::(2,"c")::Nil)
t1: frameless.TypedDataset[(Int, String)] = [_1: int, _2: string]

scala> import frameless.functions.nonAggregate._
import frameless.functions.nonAggregate._

scala> t1.agg(concatWs(",", sum(t1('_1)).cast[String], t1('_2))).show().run
<console>:29: error: Cannot prove that frameless.TypedAggregate[_, _] =:= frameless.TypedColumn[_, _].
       t1.agg(concatWs(",", sum(t1('_1)).cast[String], t1('_2))).show().run
                      ^

scala> t1.agg(concatWs(",", sum(t1('_1)).cast[String], sum(t1('_1)).cast[String])).show().run
+---+
| _1|
+---+
|3,3|
+---+

Here is what seemed to work:

 def concatWs[T, G1[a,b] <: AbstractTypedColumn[a, b], G2[a,b] <: AbstractTypedColumn[a, b]](sep: String,
                  c1: G1[T, String],
                  rest: G2[T, String]*)(implicit eq: G1[_,_] =:= G2[_,_]): c1.ThisType[T, String] =
    c1.typed(untyped.concat_ws(sep, (c1 +: rest).map(_.untyped): _*))

Contributor

Nice 😄. But I'm not sure it's worth the generality, see my latest commit where I duplicate these methods...
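
To make the trade-off in this thread concrete, here is a simplified sketch, assuming string-backed stand-in classes (AbstractCol, Col, Agg, Funcs and levenshteinAny are hypothetical names, not Frameless code). The generic signature lets the result kind follow `l` only, so a mixed call type-checks; duplicating the method per kind turns the mixed call into a compile error:

```scala
// Hypothetical stand-ins; a column is modelled as a string expression.
abstract class AbstractCol[T, U](val expr: String) {
  type ThisType[A, B] <: AbstractCol[A, B]
  def typed[A, B](e: String): ThisType[A, B]
}

final class Col[T, U](e0: String) extends AbstractCol[T, U](e0) {
  type ThisType[A, B] = Col[A, B]
  def typed[A, B](e: String): Col[A, B] = new Col[A, B](e)
}

final class Agg[T, U](e0: String) extends AbstractCol[T, U](e0) {
  type ThisType[A, B] = Agg[A, B]
  def typed[A, B](e: String): Agg[A, B] = new Agg[A, B](e)
}

object Funcs {
  // Generic version: the result kind follows `l`; `r` may be of either kind.
  def levenshteinAny[T](l: AbstractCol[T, String],
                        r: AbstractCol[T, String]): l.ThisType[T, Int] =
    l.typed[T, Int](s"levenshtein(${l.expr}, ${r.expr})")

  // Duplicated per kind: both arguments must be of the same kind.
  def levenshtein[T](l: Col[T, String], r: Col[T, String]): Col[T, Int] =
    l.typed[T, Int](s"levenshtein(${l.expr}, ${r.expr})")

  def levenshtein[T](l: Agg[T, String], r: Agg[T, String]): Agg[T, Int] =
    l.typed[T, Int](s"levenshtein(${l.expr}, ${r.expr})")
}

object MixDemo {
  val a = new Col[Any, String]("a")
  val firstB = new Agg[Any, String]("first(b)")

  // Type-checks, although a projected and an aggregated column are mixed.
  val mixed: Col[Any, Int] = Funcs.levenshteinAny(a, firstB)

  // Only same-kind calls are accepted by the duplicated overloads.
  val same: Agg[Any, Int] = Funcs.levenshtein(firstB, firstB)
  // Funcs.levenshtein(a, firstB) is rejected: no overload takes (Col, Agg).
}
```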

})
}

def stringFuncProp[A : Encoder](strFunc: TypedColumn[X1[String], String] => TypedColumn[X1[String], A], sparkFunc: Column => Column) = {
Contributor

Why can't we keep this one? So much boilerplate...

Contributor Author

@imarios imarios Jan 22, 2018

I learned this one the hard way (I actually spent quite some time trying to make this work). Everything I tried failed in one way or another with the same compiler error:

NonAggregateFunctionsTests.scala:611: method with dependent type (str: frameless.AbstractTypedColumn[T,String])str.ThisType[T,String] cannot be converted to function value

I think the error is pretty clear. Unless you have an ace up your sleeve, I'm not sure we can do much here other than bear the extra code. Note that by writing the extra code I actually had the time to make the tests a bit more realistic. The random Strings from scalacheck always result in non-English characters, so upper-casing and replacing numbers never really do anything. I constrained the generators a bit more and actually injected numbers and whitespace to make sure that we are testing with data that triggers the logic in question.
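
For readers unfamiliar with the error quoted above, a small sketch of where it comes from, using hypothetical string-backed stand-ins (AbstractCol, Col, StringFuncs, Helper). It also shows one possible way around it in this simplified model, an explicit lambda at a concrete column kind; this is an illustration only and not something the PR is known to have used:

```scala
// Hypothetical stand-ins; a column is modelled as a string expression.
abstract class AbstractCol[T, U](val expr: String) {
  type ThisType[A, B] <: AbstractCol[A, B]
  def typed[A, B](e: String): ThisType[A, B]
}

final class Col[T, U](e0: String) extends AbstractCol[T, U](e0) {
  type ThisType[A, B] = Col[A, B]
  def typed[A, B](e: String): Col[A, B] = new Col[A, B](e)
}

object StringFuncs {
  // The result type depends on the argument `str`: a dependent method type.
  def upper[T](str: AbstractCol[T, String]): str.ThisType[T, String] =
    str.typed[T, String](s"upper(${str.expr})")
}

object Helper {
  // A test helper that expects an ordinary function value.
  def render[T](f: Col[T, String] => Col[T, String], in: Col[T, String]): String =
    f(in).expr

  // Passing the method directly, render(StringFuncs.upper, in), is the form
  // that produced "method with dependent type ... cannot be converted to
  // function value" in the thread above.
  // With an explicit lambda the parameter has the concrete type Col, so
  // c.ThisType[T, String] is just Col[T, String] and no dependency remains.
  def renderUpper[T](in: Col[T, String]): String =
    render[T]((c: Col[T, String]) => StringFuncs.upper(c), in)
}

object EtaDemo {
  val out: String = Helper.renderUpper(new Col[Any, String]("a"))
}
```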

package functions
import frameless.functions.nonAggregate._
import org.apache.spark.sql.{ Column, Encoder }
import org.apache.spark.sql.{Encoder, functions => untyped}
Contributor

It's renamed this way {functions => sparkFunctions} in two other places, this should probably be consistent (I know you just moved it around, but since we are at it...)

def concat[T](c1: AbstractTypedColumn[T, String],
rest: AbstractTypedColumn[T, String]*): c1.ThisType[T, String] =
c1.typed(untyped.concat((c1 +: rest).map(_.untyped): _*))
def concat[T](c1: TypedColumn[T, String], xs: TypedColumn[T, String]*): TypedColumn[T, String] =
Contributor Author

Looks much cleaner :D Yeah, not worth the generality for sure. Now that we don't need the dependent type, we can actually go back to having one vararg parameter.

@OlivierBlanvillain
Contributor

@imarios do you mind adding a few more tests to please codecov?

@imarios imarios force-pushed the unifying_typed_column_and_aggregate_column branch 2 times, most recently from bf3bb15 to 703eb5d Compare January 27, 2018 23:02
@imarios
Contributor Author

imarios commented Jan 28, 2018

@OlivierBlanvillain added quite a few more tests. Coverage in fact goes up by merging this PR.

@imarios imarios self-assigned this Jan 28, 2018
*/
def concat[T](c1: TypedColumn[T, String], xs: TypedColumn[T, String]*): TypedColumn[T, String] =
c1.typed(untyped.concat((c1 +: xs).map(_.untyped): _*))
def concat[T](columns: TypedColumn[T, String]*): TypedColumn[T, String] =
Contributor

Does this break when called with no arguments?

Contributor Author

@imarios imarios Jan 28, 2018

this is what we had before this PR.

scala> t.select(concat())
res1: frameless.TypedDataset[String] = [_1: string]

scala> t.select(concat()).show().run
+---+
| _1|
+---+
|   |
|   |
+---+

scala> t.agg(concatWs(",")).show().run
+---+
| _1|
+---+
|   |
+---+

Contributor Author

works ok for both agg and select and for all variations.

@imarios imarios force-pushed the unifying_typed_column_and_aggregate_column branch from 703eb5d to 5cd3979 Compare January 28, 2018 21:50
@imarios
Contributor Author

imarios commented Jan 29, 2018

@OlivierBlanvillain any last comments here? If it looks good I will rebase and merge.

@imarios imarios added the ready label Jan 29, 2018
@OlivierBlanvillain
Contributor

LGTM!

@imarios imarios force-pushed the unifying_typed_column_and_aggregate_column branch from 5cd3979 to bbe975d Compare January 30, 2018 05:17
@imarios imarios merged commit dc42bfb into typelevel:master Jan 30, 2018
@OlivierBlanvillain
Contributor

@imarios I think master fails because of this PR, maybe something went wrong in the rebase?

Development

Successfully merging this pull request may close these issues.

Add .cast support for TypedAggregate

4 participants