Add missing Dataset.cube and rollup methods #246
val dataset = TypedDataset.create(data)
val A = dataset.col[A]('a)

val received = dataset.cubeMany(A).agg(count()).collect().run().toVector.sortBy(_._2)
It returns TypedDataset[Tuple1[Option[A]]] instead of TypedDataset[(Option[A], Long)] for some reason.
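For context on the expected `(Option[A], Long)` shape: cube and rollup aggregate over several grouping sets, and a key column that is left out of a grouping set comes back empty, which is why every key is wrapped in `Option`. The following is a minimal plain-Scala sketch of those semantics on an in-memory list. It does not use frameless or Spark, and `Row`, `Key`, `countBy` and the sample data are hypothetical names made up for illustration.

```scala
object CubeRollupSketch {
  // Hypothetical row type with two grouping columns.
  final case class Row(a: String, b: Int)

  // A key column is None when it is not part of the current grouping set.
  type Key = (Option[String], Option[Int])

  val data: List[Row] = List(Row("x", 1), Row("x", 2), Row("y", 1))

  // Counts rows per key for a single grouping set.
  private def countBy(rows: List[Row])(key: Row => Key): Map[Key, Long] =
    rows.groupBy(key).map { case (k, group) => k -> group.size.toLong }

  // cube(a, b): every subset of the grouping columns: (a, b), (a), (b), ().
  def cube(rows: List[Row]): Map[Key, Long] =
    countBy(rows)(r => (Some(r.a), Some(r.b))) ++
      countBy(rows)(r => (Some(r.a), None)) ++
      countBy(rows)(r => (None, Some(r.b))) ++
      countBy(rows)(_ => (None, None))

  // rollup(a, b): only the prefixes of the column list: (a, b), (a), ().
  def rollup(rows: List[Row]): Map[Key, Long] =
    countBy(rows)(r => (Some(r.a), Some(r.b))) ++
      countBy(rows)(r => (Some(r.a), None)) ++
      countBy(rows)(_ => (None, None))
}
```

In this sketch `cube` contains a `(None, Some(1))` entry while `rollup` does not, because rollup never groups by the second column alone.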
I looked at the diff very quickly and it looks fine so far. Is there anything in particular where you want feedback?
# Conflicts: # dataset/src/main/scala/frameless/TypedDataset.scala
@OlivierBlanvillain Right now I'm looking for ways to include methods like
ds.groupByMany(ds('a)).count()
instead of
// .agg(count()) won't compile
ds.groupByMany(ds('a)).agg(count[X1[A]]())
But it might be tricky, so maybe if/after everything is fine with this PR we could open a separate issue and PR.
@@ -0,0 +1,20 @@
package frameless.ops
package frameless
package ops
@@ -0,0 +1,212 @@
package frameless.ops
same
Codecov Report
@@ Coverage Diff @@
## master #246 +/- ##
==========================================
- Coverage 96.57% 96.03% -0.55%
==========================================
Files 51 52 +1
Lines 876 908 +32
Branches 11 12 +1
==========================================
+ Hits 846 872 +26
- Misses 30 36 +6
Any idea why it doesn't catch some lines in code coverage? I'm pretty sure it goes to applyProduct.
Looks like a bug in the coverage tool, don't worry about it. (I'll try to review this PR this weekend, sorry for taking so long.)
def cube[K1, K2](
  c1: TypedColumn[T, K1],
  c2: TypedColumn[T, K2]
): Cube2Ops[K1, K2, T] = new Cube2Ops[K1, K2, T](this, c1, c2)
Any reason to not use CubeManyOps for the arity 1 and 2 methods? Same question for the rollup methods.
It helps the compiler with type inference:
...
val A = dataset.col[A]('a)
// Rollup1Ops, compiles, infers to TypedDataset[(Option[A], Long)]
dataset.rollup(A).agg(count())
// RollupMany, doesn't compile, infers to TypedDataset[Tuple1[Option[A]]]
dataset.rollupMany(A).agg(count())
// RollupMany, compiles if you specify type
dataset.rollupMany(A).agg(count[X1[A]])
I've followed groupBy and select, where there are a few extra methods to help with simpler use cases.
I guess the difference is that with RelationalGroups1Ops I can specify the agg return type like:
def agg[U1](c1: TypedAggregate[V, U1]): TypedDataset[(Option[K1], U1)]
as opposed to leaving it to macros.
Maybe it is possible to add aggregation methods and allow syntax like in Spark:
dataset.rollup(A).count()
I wasn't able to crack it yet; I'm running into similar problems.
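The inference difference described above can be reproduced without frameless. The sketch below is plain Scala and assumes hypothetical stand-ins (`Column`, `Aggregate`, `Result`, `Rollup1Sketch`, `Sale`) in place of the real `TypedColumn`, `TypedAggregate`, `TypedDataset` and `Rollup1Ops`. It shows that a `count()` whose type parameter appears only in its result type is resolved from the expected parameter type of an explicitly typed `agg`, and falls back to `Nothing` when no expected type is available, which is possibly what happens when the arguments go through the macro-based path.

```scala
object InferenceSketch {
  // Hypothetical stand-ins, not the frameless classes.
  final case class Column[T, K](name: String)
  final case class Aggregate[T, U](name: String)
  final case class Result[Out](columns: List[String])

  // Hypothetical record type.
  final case class Sale(a: String, amount: Double)

  // T appears only in the result type, so it must be inferred from context.
  def count[T](): Aggregate[T, Long] = Aggregate[T, Long]("count")

  // Arity-1 variant: the signature spells out both the parameter and return types.
  final class Rollup1Sketch[K1, T](c1: Column[T, K1]) {
    def agg[U1](a1: Aggregate[T, U1]): Result[(Option[K1], U1)] =
      Result[(Option[K1], U1)](List(c1.name, a1.name))
  }

  val a: Column[Sale, String] = Column[Sale, String]("a")

  // T = Sale is taken from the expected parameter type, so count() needs no annotation.
  val inferred: Result[(Option[String], Long)] = new Rollup1Sketch(a).agg(count())

  // With no expected type there is nothing to unify T with, and it becomes Nothing.
  val bare = count()
  val bareIsNothing: Aggregate[Nothing, Long] = bare
}
```

Because `Aggregate` is invariant, the last line only type-checks if `bare` really was inferred as `Aggregate[Nothing, Long]`.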
(implicit
  i0: ColumnTypes.Aux[T, TK, K],
  i1: ToTraversable.Aux[TK, List, UntypedExpression[T]],
  i3: Tupler.Aux[K, KT]
i2 :)
Same typo on RollupManyOps
import shapeless.ops.hlist.{ToTraversable, Tupler}
import shapeless.{HList, HNil}

class CubeManyOps[T, TK <: HList, K <: HList, KT](self: TypedDataset[T], groupedBy: TK)
Maybe you could append these classes and the rollup ones to the RelationalGroupsOps file as they don't make much sense in isolation.
) {
  object agg extends ProductArgs {
    /**
      * @param i3 shares individual columns' types
I feel that comments on these kinds of implicits do not belong in the scaladoc but should go alongside the code instead. Something like this:
def applyProduct[TC <: HList, C <: HList, OptK <: HList, Out0 <: HList, Out1]
  (columns: TC)
  (implicit
    i3: AggregateTypes.Aux[T, TC, C], // shares individual columns' types
    i4: Mapped.Aux[K, Option, OptK], // maps all types in HList to Option
    i5: Prepend.Aux[OptK, C, Out0], // concatenates two HLists
    i6: Tupler.Aux[Out0, Out1], // converts HList to Tuple
    i7: TypedEncoder[Out1], // proof that there is `TypedEncoder` for the output type
    i8: ToTraversable.Aux[TC, List, UntypedExpression[T]] // allows converting this HList to ordinary List
  ): TypedDataset[Out1] = {
I think you're right. I wanted to be consistent with SmartProject, but in the case of RelationalGroupsOps, with two implicit lists, it's hard to read like that.
  * @tparam K individual columns' types as HList
  * @tparam KT individual columns' types as Tuple
  */
private[ops] abstract class RelationalGroupsOps[T, TK <: HList, K <: HList, KT]
Do you think it's possible to remove GroupedByManyOps and use this class instead? The implementations look very similar.
The only difference is the return type, but the code can be the same, so it should be doable to some extent.
What are the rules about deprecation? If it's:
First release -> @deprecated annotation
Second release -> removed
then I could remove the deprecated methods from GroupByOps along the way (since we've had a new release recently).
Sounds good to me!
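The first step of the two-release scheme proposed above could look roughly like this in plain Scala. This is only a sketch: `GroupOps`, `sumAll`, `total` and the version string are hypothetical and are not taken from frameless.

```scala
object DeprecationSketch {
  // Hypothetical class, used only to illustrate the annotation.
  final class GroupOps(values: List[Int]) {
    // Replacement API.
    def total: Int = values.sum

    // First release: keep the old entry point, forward it to the new one,
    // and let the compiler warn callers. The version string is made up.
    @deprecated("use total instead", "0.0.0")
    def sumAll: Int = total

    // Second release: delete sumAll entirely.
  }
}
```

Callers of `sumAll` keep compiling with a deprecation warning for one release, which gives them time to migrate before the method is removed.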
Okay, I have moved quite a bit of repeated code to AggregatingOps, which is extended by both GroupedByManyOps and RelationalGroupsOps, but maybe there is a better way.
LGTM! @imarios do you want to give this a second look before merging?
@OlivierBlanvillain thanks, let me give it a quick look.
Connects to #163
todo:
cube, rollup and groupBy.mapGroups.
BigDecimal. (they were failing because the vanilla Spark variant was using java.math.BigDecimal, I just went for Doubles)
cubeMany and rollupMany tests.
It's not finished yet but I'd love your opinion in the meantime. :)