-
Notifications
You must be signed in to change notification settings - Fork 346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-345: Parameterize CMS to CMS[K] and decouple counting/querying from heavy hitters #354
Conversation
This change parameterizes the type K that is used to identify the elements to be counted. Previously, the CMS implementation hardcoded `Long` as the type. In other words, this change turns the hardcoded CMS[K=Long] into a parameterized CMS[K].
"If a static checker finds a violation of assert it considers it an error in the code, while when require is violated it assumes the caller is at fault." This commit changes assert to require in those places where it is used to validate input parameters. See stackoverflow.com/questions/8002835
Providing default values leads to scalac errors, example:: method aggregator$default$4:[K]=> Double and method aggregator$default$4:[K]=> Double have same type
Hashes must be >= 0 to prevent IndexOutOfBoundsException when accessing rows in the counting table.
/** | ||
* Zero element. Used for initialization. | ||
*/ | ||
case class CMSZero[K: Ordering](override val params: CMSParams[K]) extends CMS[K](params) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is the Ordering used here? Is it needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh. On creating CMSInstance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this seemingly innocent change of going from CMS
to CMS[K]
resulted in more context bound fiddling than I initially expected. ;-)
This is thorough, excellent work. I really appreciate it. I want to merge it, just a few issues. I am concerned that we have now SketchMap and CMS that are almost identical except that SketchMap parameterizes the value. Given that making the key generic didn't hurt performance, I tend to think we must just have some small bottleneck in SketchMap that could be fixed. It would be great to give this a going over with yourkit and see if we can identify a few hotspots, but clearly this is an improvement. This is API breaking, but I think it represents enough of a win and a straight forward upgrade path, so it's not a big pain. One question: is there a type definition we could make to get all the old methods? Like: object CMS {
// This was what CMS was in prior versions of algebird. Change import CMS to CMS.LongKey
type LongKey = TopPctCms[Long]
} |
One more note: would |
Many thanks for the review, Oscar. I really do appreciate it -- this has been a lot of lines to go through!
I see your point. Shall I create a separate ticket for that discussion? Also, IIRC @avibryant was mentioning that SketchMap might be superseded by SpaceSaver. So the discussion may become "SketchMap vs. CMS and/or SpaceSaver".
I have thought a bit about the very same question but haven't come up with a really good approach. :-( The following would be working examples of your object CMS {
type LegacyCMSMonoid = TopPctCMSMonoid[Long]
type LegacyCMS = TopCMS[Long]
} Another idea: Instead of putting it into the import com.twitter.algebird.{LegacyCMS => CMS, LegacyCMSMonoid => CountMinSketchMonoid}
// Now legacy code can still refer to `CMS` and `CountMinSketchMonoid`, if I'm not mistaken. I haven't really tried the latter approach in detail though. Note: The new code hides how the heavy hitters are tracked -- the user only sees a
Ah, nice idea. (some time passes) My quick comparison using the new I'll send an updated PR with the latest changes. |
PS: Regardless of what we will come up with regarding backwards compatibility / upgrade instructions, do you have any plans to add such information to |
For clarity: I don't think SpaceSaver replaces SketchMap, but I do think it might be a good alternative to |
|
Thanks for the clarification on SpaceSaver, Avi!
I created GH-360: Decide on the future of SketchMap, CMS, SpaceSaver.
Looks good to me. I'll experiment with that and will either update the PR or write a comment why it didn't work. ;-)
I, too, am fine with keeping the code as is. But just to clarify, because I am not sure whether I read your comment ("keeping the code as is") correctly: When I talked about |
I added In most cases, users should only need to remove the "new" when creating legacy CMS monoids: // previously
val CMS_MONOID: CountMinSketchMonoid = new CountMinSketchMonoid(EPS, DELTA, SEED)
// with code in this PR and its legacy package
import com.twitter.algebird.legacy.CountMinSketchMonoid
val CMS_MONOID: CountMinSketchMonoid = CountMinSketchMonoid(EPS, DELTA, SEED) Note that because of name collisions for import com.twitter.algebird._
import com.twitter.algebird.legacy.CountMinSketchMonoid
val CMS_MONOID: CountMinSketchMonoid = CountMinSketchMonoid(EPS, DELTA, SEED)
// Example 1: without type annotation
// Here, the written code is the same pre and post merge.
// Behind the scenes, however, scalac will infer type `TopCMS[Long]` for `cms`,
// whereas previously the inferred type was `CMS`.
val cms = CMS_MONOID.create(0L)
// Example 2: with type annotation
// This code won't work because of name collisions for CMS.
val cms: CMS = CMS_MONOID.create(0L)
// The following works:
val cms: legacy.CMS = CMS_MONOID.create(0L) If you have a better idea, please let me know. |
I added new commits as well as a comment/question regarding testing delta/depth and width/eps via roundtripping, see #354 (comment). GitHub hides this comment by default because the updated PR modified the code on which the first code review comment was made. |
GH-345: Parameterize CMS to CMS[K] and decouple counting/querying from heavy hitters
Thanks Oscar for merging, and @jnievelt and @avibryant for their help getting there! |
PS: Let me know if you need a text snippet for CMS upgrade instructions in |
See the discussion in GH-345 for details on this pull request.
Note: This PR contains a "clean" fix plus regression tests for GH-353, i.e. the bug where heavy hitters are not correctly computed in CMS.
High-level summary of this pull request
CMS
with a typeK
to decouple the item identifier type from hardcodedLong
.CMSCounting[K, C[_]]
andCMSHeavyHitters[K]
to decouple counting/querying from heavy hitters tracking.CMS[K]
, which can only counting and estimate.TopCMS[K]
, which can count, estimate, and track heavy hitters. The logic to compute heavy hitters is pluggable, see below.HeavyHittersLogic
(for the lack of a better name) that controls how a CMS that implementsCMSHeavyHitters
tracks heavy hitters. This decouples the "core" Top CMS implementation from the heavy hitters sub-functionality.TopPctLogic
,TopNLogic
.CMS
,TopPctCMS
,TopNCMS
.K in {Short, Int, Long, BigInt}
.++
order bias of top-N CMS.Please take a look at the code and let me know which further changes you'd like me to do so that this PR and thus the improved CMS implementation can be merged.
And many thanks for all the help and support so far!
Michael
PS: For the sake of this PR I decided not to squash individual commits into a single, big commit so it's easier to back out of changes you may not like etc. I can of course squash the commits if needed once the PR code is approved.