Add missing Dataset.cube and rollup methods #246

Avasil · 2018-02-07T15:00:05Z

Connects to #163

todo:

Scaladocs for cube, rollup and groupBy.
Decide whether to keep mapGroups.
Fix tests not passing for BigDecimal. (they were failing because vanilla Spark variant was using java.math.BigDecimal, I just went for Doubles)
Fix compilation error in cubeMany and rollupMany tests.
Try to avoid duplicating test code. (I came to the conclusion that it might be better to leave it as it is because generalizing it just for those two methods might not be worth it because it makes it confusing for someone looking for more examples in code)

It's not finished yet but I'd love your opinion in the meantime. :)

Avasil · 2018-02-07T15:02:54Z

dataset/src/test/scala/frameless/forward/CubeTests.scala

+      val dataset = TypedDataset.create(data)
+      val A = dataset.col[A]('a)
+
+      val received = dataset.cubeMany(A).agg(count()).collect().run().toVector.sortBy(_._2)


It returns TypedDataset[Tuple1[Option[A]]] instead of TypedDataset[(Option[A], Long)] for some reason.

OlivierBlanvillain · 2018-02-11T15:10:17Z

I looks at the diff very quickly and it looks fine so far, is there anything in particular where you want feedback?

# Conflicts: # dataset/src/main/scala/frameless/TypedDataset.scala

Avasil · 2018-02-12T12:22:02Z

@OlivierBlanvillain
I think I'm done (unfortunately Travis CI doesn't pass because job exceeded maximum time limit).
I wonder about mapGroups, I've included it only because it is available in groupBy.

Right now I'm looking for ways to include methods like count() in RelationalGroupsOps and GroupByOps to help with type inference and to enable calls like:

ds.groupByMany(ds('a)).count()

instead of

// .agg(count()) won't compile
ds.groupByMany(ds('a)).agg(count[X1[A]]())

But it might be tricky so maybe if/after everything is fine with this PR we could do separate issue and PR

imarios · 2018-02-13T05:05:33Z

dataset/src/main/scala/frameless/ops/CubeOps.scala

@@ -0,0 +1,20 @@
+package frameless.ops


package frameless
package ops

imarios · 2018-02-13T05:05:51Z

dataset/src/main/scala/frameless/ops/RelationalGroupsOps.scala

@@ -0,0 +1,212 @@
+package frameless.ops


codecov-io · 2018-02-13T07:37:38Z

Codecov Report

Merging #246 into master will decrease coverage by 0.54%.
The diff coverage is 88.46%.

@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
- Coverage   96.57%   96.03%   -0.55%     
==========================================
  Files          51       52       +1     
  Lines         876      908      +32     
  Branches       11       12       +1     
==========================================
+ Hits          846      872      +26     
- Misses         30       36       +6

Impacted Files	Coverage Δ
...ataset/src/main/scala/frameless/TypedDataset.scala	`100% <ø> (ø)`	⬆️
...main/scala/frameless/ops/RelationalGroupsOps.scala	`79.16% <79.16%> (ø)`
...aset/src/main/scala/frameless/ops/GroupByOps.scala	`98.36% <96.42%> (+0.21%)`	⬆️
...c/main/scala/frameless/TypedDatasetForwarded.scala	`71.42% <0%> (-2.86%)`	⬇️
dataset/src/main/scala/frameless/TypedColumn.scala	`100% <0%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9251c60...9946112. Read the comment docs.

Avasil · 2018-02-22T07:38:23Z

Any idea why it doesn't catch some lines in code coverage?

I'm pretty sure it goes to applyProduct

OlivierBlanvillain · 2018-02-22T07:44:55Z

Looks like a bug in the coverage tool, don't worry about it. (I'll try to review this PR this week-end, sorry for taking too long)

OlivierBlanvillain · 2018-02-25T06:38:17Z

dataset/src/main/scala/frameless/TypedDataset.scala

+  def cube[K1, K2](
+    c1: TypedColumn[T, K1],
+    c2: TypedColumn[T, K2]
+  ): Cube2Ops[K1, K2, T] = new Cube2Ops[K1, K2, T](this, c1, c2)


Any reason to not use CubeManyOps for the arity 1 and 2 methods? Same question for the rollup methods.

It helps compiler and with type inference:

... val A = dataset.col[A]('a) // Rollup1Ops, compiles, infers to TypedDataset[(Option[A], Long)] dataset.rollup(A).agg(count()) // RollupMany, doesn't compile, infers to TypedDataset[Tuple1[Option[A]]] dataset.rollup(A).agg(count()) // RollupMany, compiles if you specify type dataset.rollup(A).agg(count[X1[A]])

I've followed groupBy and select where there are few extra methods to help with simpler use cases.

I guess the difference is that with RelationalGroups1Ops I can specify the agg return type like:

def agg[U1](c1: TypedAggregate[V, U1]): TypedDataset[(Option[K1], U1)]

as opposed to leaving it to macros.

Maybe it is possible to add aggregation methods and allow syntax like in Spark:

dataset.rollup(A).count()

I wasn't able to crack it yet, I'm running into similar problems.

OlivierBlanvillain · 2018-02-25T06:41:16Z

dataset/src/main/scala/frameless/ops/CubeOps.scala

+  (implicit
+    i0: ColumnTypes.Aux[T, TK, K],
+    i1: ToTraversable.Aux[TK, List, UntypedExpression[T]],
+    i3: Tupler.Aux[K, KT]


i2 :)
Same typo on RollupManyOps

OlivierBlanvillain · 2018-02-25T06:43:32Z

dataset/src/main/scala/frameless/ops/CubeOps.scala

+import shapeless.ops.hlist.{ToTraversable, Tupler}
+import shapeless.{HList, HNil}
+
+class CubeManyOps[T, TK <: HList, K <: HList, KT](self: TypedDataset[T], groupedBy: TK)


Maybe you could append these classes and the rollup ones to the RelationalGroupsOps file as they don't make much sense in isolation.

OlivierBlanvillain · 2018-02-25T06:46:31Z

dataset/src/main/scala/frameless/ops/RelationalGroupsOps.scala

+  ) {
+  object agg extends ProductArgs {
+    /**
+      * @param i3 shares individual columns' types


I feel that comments on these kind of implicits do not belong to the scaladoc but should go alongside the code instead. Something like this:

def applyProduct[TC <: HList, C <: HList, OptK <: HList, Out0 <: HList, Out1] (columns: TC) (implicit i3: AggregateTypes.Aux[T, TC, C], // shares individual columns' types i4: Mapped.Aux[K, Option, OptK], // maps all types in HList to Option i5: Prepend.Aux[OptK, C, Out0], // concatenates two HLists i6: Tupler.Aux[Out0, Out1], // converts HList to Tuple i7: TypedEncoder[Out1], // proof that there is `TypedEncoder` for the output type i8: ToTraversable.Aux[TC, List, UntypedExpression[T]] // allows converting thi HList to ordinary List ): TypedDataset[Out1] = {

I think you're right, I wanted to be consistent with SmartProject but in case of RelationalGroupsOps with two implicit lists it's hard to read it like that

OlivierBlanvillain · 2018-02-25T06:51:10Z

dataset/src/main/scala/frameless/ops/RelationalGroupsOps.scala

+  * @tparam K individual columns' types as HList
+  * @tparam KT individual columns' types as Tuple
+  */
+private[ops] abstract class RelationalGroupsOps[T, TK <: HList, K <: HList, KT]


Do you think it's possible to remove GroupedByManyOps and use this class instead? The implementations look vers similar.

The only difference is return type but code can be the same so it should be doable to some extent.

What are rules about deprecation? If:
First release -> @deprecated annotation
Second release -> removed
I could remove deprecated methods from GroupByOps along the way (since we've had new release recently)

Sounds good to me!

Okay, I have moved quite a bit of repeating code to AggregatingOps which is extended by both GroupedByManyOps and RelationalGroupsOps but maybe there is a better way

OlivierBlanvillain · 2018-02-26T17:06:36Z

LGTM! @imarios do you want to give this a second look before merging?

imarios · 2018-02-26T17:11:25Z

@OlivierBlanvillain thanks, let me give it a quick look.

Add missing Dataset.cube and rollup methods

3dc909b

Avasil commented Feb 7, 2018

View reviewed changes

moved tests to ops package

c5afd1c

OlivierBlanvillain mentioned this pull request Feb 11, 2018

Missing Dataset methods #163

Open

76 tasks

Avasil added 3 commits February 12, 2018 11:40

scaladocs, fix compilation error

837c325

Merge branch 'master' into typeddataset#cube

db6f7c6

# Conflicts: # dataset/src/main/scala/frameless/TypedDataset.scala

cleanup after resolving merge conflicts

894e3b1

Avasil changed the title ~~(WIP) Add missing Dataset.cube and rollup methods~~ Add missing Dataset.cube and rollup methods Feb 12, 2018

imarios reviewed Feb 13, 2018

View reviewed changes

separated packages

5ff5306

OlivierBlanvillain and others added 3 commits February 20, 2018 14:57

Merge branch 'master' into typeddataset#cube

7e9c496

Merge branch 'master' into typeddataset#cube

29d22b9

added few missing tests

72a5942

OlivierBlanvillain approved these changes Feb 25, 2018

View reviewed changes

Avasil added 3 commits February 26, 2018 09:14

moved CubeOps and RollupOps to RelationalGroupsOps, refactored comments

6c242ae

formatting

c740802

added AggregatingOps to contain shared code

9946112

imarios merged commit c631142 into typelevel:master Mar 1, 2018

Avasil mentioned this pull request Mar 15, 2018

Rollup/cube support for TypedDataset #41

Closed

Uh oh!

Add missing Dataset.cube and rollup methods #246

Add missing Dataset.cube and rollup methods #246

Uh oh!

Conversation

Avasil commented Feb 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

OlivierBlanvillain commented Feb 11, 2018

Uh oh!

Avasil commented Feb 12, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-io commented Feb 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Avasil commented Feb 22, 2018

Uh oh!

OlivierBlanvillain commented Feb 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

OlivierBlanvillain commented Feb 26, 2018

Uh oh!

imarios commented Feb 26, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Avasil commented Feb 7, 2018 •

edited

Loading

codecov-io commented Feb 13, 2018 •

edited

Loading

OlivierBlanvillain commented Feb 22, 2018 •

edited

Loading