Skip to content

Conversation

@Avasil
Copy link
Contributor

@Avasil Avasil commented Feb 7, 2018

Connects to #163

todo:

  • Scaladocs for cube, rollup and groupBy.
  • Decide whether to keep mapGroups.
  • Fix tests not passing for BigDecimal. (they were failing because vanilla Spark variant was using java.math.BigDecimal, I just went for Doubles)
  • Fix compilation error in cubeMany and rollupMany tests.
  • Try to avoid duplicating test code. (I came to the conclusion that it might be better to leave it as it is because generalizing it just for those two methods might not be worth it because it makes it confusing for someone looking for more examples in code)

It's not finished yet but I'd love your opinion in the meantime. :)

val dataset = TypedDataset.create(data)
val A = dataset.col[A]('a)

val received = dataset.cubeMany(A).agg(count()).collect().run().toVector.sortBy(_._2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns TypedDataset[Tuple1[Option[A]]] instead of TypedDataset[(Option[A], Long)] for some reason.

@OlivierBlanvillain OlivierBlanvillain mentioned this pull request Feb 11, 2018
76 tasks
@OlivierBlanvillain
Copy link
Contributor

I looks at the diff very quickly and it looks fine so far, is there anything in particular where you want feedback?

@Avasil Avasil changed the title (WIP) Add missing Dataset.cube and rollup methods Add missing Dataset.cube and rollup methods Feb 12, 2018
@Avasil
Copy link
Contributor Author

Avasil commented Feb 12, 2018

@OlivierBlanvillain
I think I'm done (unfortunately Travis CI doesn't pass because job exceeded maximum time limit).
I wonder about mapGroups, I've included it only because it is available in groupBy.

Right now I'm looking for ways to include methods like count() in RelationalGroupsOps and GroupByOps to help with type inference and to enable calls like:

ds.groupByMany(ds('a)).count()

instead of

// .agg(count()) won't compile
ds.groupByMany(ds('a)).agg(count[X1[A]]())

But it might be tricky so maybe if/after everything is fine with this PR we could do separate issue and PR

@@ -0,0 +1,20 @@
package frameless.ops
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package frameless
package ops

@@ -0,0 +1,212 @@
package frameless.ops
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@codecov-io
Copy link

codecov-io commented Feb 13, 2018

Codecov Report

Merging #246 into master will decrease coverage by 0.54%.
The diff coverage is 88.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
- Coverage   96.57%   96.03%   -0.55%     
==========================================
  Files          51       52       +1     
  Lines         876      908      +32     
  Branches       11       12       +1     
==========================================
+ Hits          846      872      +26     
- Misses         30       36       +6
Impacted Files Coverage Δ
...ataset/src/main/scala/frameless/TypedDataset.scala 100% <ø> (ø) ⬆️
...main/scala/frameless/ops/RelationalGroupsOps.scala 79.16% <79.16%> (ø)
...aset/src/main/scala/frameless/ops/GroupByOps.scala 98.36% <96.42%> (+0.21%) ⬆️
...c/main/scala/frameless/TypedDatasetForwarded.scala 71.42% <0%> (-2.86%) ⬇️
dataset/src/main/scala/frameless/TypedColumn.scala 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9251c60...9946112. Read the comment docs.

@Avasil
Copy link
Contributor Author

Avasil commented Feb 22, 2018

Any idea why it doesn't catch some lines in code coverage?

I'm pretty sure it goes to applyProduct

@OlivierBlanvillain
Copy link
Contributor

OlivierBlanvillain commented Feb 22, 2018

Looks like a bug in the coverage tool, don't worry about it. (I'll try to review this PR this week-end, sorry for taking too long)

def cube[K1, K2](
c1: TypedColumn[T, K1],
c2: TypedColumn[T, K2]
): Cube2Ops[K1, K2, T] = new Cube2Ops[K1, K2, T](this, c1, c2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not use CubeManyOps for the arity 1 and 2 methods? Same question for the rollup methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It helps compiler and with type inference:

...
val A = dataset.col[A]('a)

// Rollup1Ops, compiles, infers to TypedDataset[(Option[A], Long)]
dataset.rollup(A).agg(count())
// RollupMany, doesn't compile, infers to TypedDataset[Tuple1[Option[A]]]
dataset.rollup(A).agg(count())
// RollupMany, compiles if you specify type
dataset.rollup(A).agg(count[X1[A]])

I've followed groupBy and select where there are few extra methods to help with simpler use cases.

I guess the difference is that with RelationalGroups1Ops I can specify the agg return type like:

def agg[U1](c1: TypedAggregate[V, U1]): TypedDataset[(Option[K1], U1)]

as opposed to leaving it to macros.

Maybe it is possible to add aggregation methods and allow syntax like in Spark:

dataset.rollup(A).count()

I wasn't able to crack it yet, I'm running into similar problems.

(implicit
i0: ColumnTypes.Aux[T, TK, K],
i1: ToTraversable.Aux[TK, List, UntypedExpression[T]],
i3: Tupler.Aux[K, KT]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i2 :)
Same typo on RollupManyOps

import shapeless.ops.hlist.{ToTraversable, Tupler}
import shapeless.{HList, HNil}

class CubeManyOps[T, TK <: HList, K <: HList, KT](self: TypedDataset[T], groupedBy: TK)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could append these classes and the rollup ones to the RelationalGroupsOps file as they don't make much sense in isolation.

) {
object agg extends ProductArgs {
/**
* @param i3 shares individual columns' types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that comments on these kind of implicits do not belong to the scaladoc but should go alongside the code instead. Something like this:

def applyProduct[TC <: HList, C <: HList, OptK <: HList, Out0 <: HList, Out1]
(columns: TC)
(implicit
  i3: AggregateTypes.Aux[T, TC, C], // shares individual columns' types
  i4: Mapped.Aux[K, Option, OptK],  // maps all types in HList to Option
  i5: Prepend.Aux[OptK, C, Out0],   // concatenates two HLists
  i6: Tupler.Aux[Out0, Out1],       // converts HList to Tuple
  i7: TypedEncoder[Out1],           // proof that there is `TypedEncoder` for the output type
  i8: ToTraversable.Aux[TC, List, UntypedExpression[T]] // allows converting thi HList to ordinary List
): TypedDataset[Out1] = {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right, I wanted to be consistent with SmartProject but in case of RelationalGroupsOps with two implicit lists it's hard to read it like that

* @tparam K individual columns' types as HList
* @tparam KT individual columns' types as Tuple
*/
private[ops] abstract class RelationalGroupsOps[T, TK <: HList, K <: HList, KT]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's possible to remove GroupedByManyOps and use this class instead? The implementations look vers similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only difference is return type but code can be the same so it should be doable to some extent.

What are rules about deprecation? If:
First release -> @deprecated annotation
Second release -> removed
I could remove deprecated methods from GroupByOps along the way (since we've had new release recently)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I have moved quite a bit of repeating code to AggregatingOps which is extended by both GroupedByManyOps and RelationalGroupsOps but maybe there is a better way

@OlivierBlanvillain
Copy link
Contributor

LGTM! @imarios do you want to give this a second look before merging?

@imarios
Copy link
Contributor

imarios commented Feb 26, 2018

@OlivierBlanvillain thanks, let me give it a quick look.

@imarios imarios merged commit c631142 into typelevel:master Mar 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants