-
Notifications
You must be signed in to change notification settings - Fork 5
add groupByOrdered #346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
add groupByOrdered #346
Conversation
… more of a drop-in replacement for groupBy
def groupBy[K](f: A => K): Map[K, List[A]] = l.groupBy(f) | ||
/** Execute the traversal and group elements by a given transformation function, ignoring the iterator order. Use is discouraged. */ | ||
@Doc(info = | ||
"Execute the traversal and group elements by a given transformation function, ignoring the iterator order. Use is discouraged." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Execute the traversal and group elements by a given transformation function, ignoring the iterator order. Use is discouraged." | |
"Execute the traversal and group elements by a given transformation function, ignoring the iterator order. If you need reproducable results, please use `groupByOrdered` instead." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an explanation of the issue. But use is really discouraged: This is a giant footgun that has blown us up multiple times. (I still shudder at the bug with iteration order in the legacy occurenceHash...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bike-shedding on the name: what do you think about groupByStable
?
while (iterator.hasNext) { | ||
val item = iterator.next | ||
val key = f(item) | ||
res.getOrElseUpdate(key, List.newBuilder[A]).addOne(item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to have our own variant, we might as well change it to not use the slowest of all the data structures (linked lists). I.e. return a Vector
or an ArraySeq
instead.
That makes it slightly less of a drop-in replacement (pattern matching on the result requires different operators), but IMO that's an acceptable cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻 for Vector
or ArraySeq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector or ArraySeq requires us to copy the data once more. I prepared a version with LinkedHashMap[K, ArrayBuffer[A]]
; the java.util.LinkedHashMap implementation is basically the same as scala.collection.mutable.LinkedHashMap, and writing our own to support even faster groupBy is way overkill.
Immutable data structures are way overrated compared to "just don't mutate the datastructure".
Vector is an amazing feat of engineering. The cool thing is not "immutable", the cool thing is "O(1) snapshots", plus niche applications like good write performance for ZFS on tape drives and hard-drives (writes are sequential!) and SSDs (no write amplification because non-overwriting).
That being said, we only very rarely make active use of O(1) snapshots, and we are running on SRAM/DRAM that supports overwriting, as opposed to flash (which cannot be overwritten).
Yeah, this could be faster if we didn't need it to be a drop-in replacement. That was the version in the first commit. The VectorMap is probably even worse than the List part for the drop-in version. Indeed, the fastest variant would be I tend to agree with you malte. |
I would vote for |
groupBy is a really useful operation.
Unfortunately, we cannot really use the old groupBy in the commercial product: We go to a lot of effort to have reproducible results, and the iterator order of unordered hashmaps is generally not reproducible (because the identity-hash of objects is not reproducible).
One would be tempted to just change groupBy to return an ordered immutable Map. This won'tt work in practice: We cannot reliably overwrite the scala stdlib groupBy (people can write
List(a,b,c).groupBy
which doesn't hit our implicit code at all).Hence, a new name is needed, in order to make it obvious from our source code whether the groupBy operation respects order or not. However, the new groupByOrdered is a drop-in replacement for the old groupBy (it has exactly the same signature).
Once we have replaced most uses of groupBy with groupByOrdered, we should then deprecate groupBy.
PS. cc fabs and malte because git blame shows that you use groupBy a lot, to inform you about the issue and possible solution.