Skip to content

Conversation

bbrehm
Copy link
Contributor

@bbrehm bbrehm commented Sep 17, 2025

groupBy is a really useful operation.

Unfortunately, we cannot really use the old groupBy in the commercial product: We go to a lot of effort to have reproducible results, and the iterator order of unordered hashmaps is generally not reproducible (because the identity-hash of objects is not reproducible).

One would be tempted to just change groupBy to return an ordered immutable Map. This won'tt work in practice: We cannot reliably overwrite the scala stdlib groupBy (people can write List(a,b,c).groupBy which doesn't hit our implicit code at all).

Hence, a new name is needed, in order to make it obvious from our source code whether the groupBy operation respects order or not. However, the new groupByOrdered is a drop-in replacement for the old groupBy (it has exactly the same signature).

Once we have replaced most uses of groupBy with groupByOrdered, we should then deprecate groupBy.

PS. cc fabs and malte because git blame shows that you use groupBy a lot, to inform you about the issue and possible solution.

def groupBy[K](f: A => K): Map[K, List[A]] = l.groupBy(f)
/** Execute the traversal and group elements by a given transformation function, ignoring the iterator order. Use is discouraged. */
@Doc(info =
"Execute the traversal and group elements by a given transformation function, ignoring the iterator order. Use is discouraged."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Execute the traversal and group elements by a given transformation function, ignoring the iterator order. Use is discouraged."
"Execute the traversal and group elements by a given transformation function, ignoring the iterator order. If you need reproducable results, please use `groupByOrdered` instead."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an explanation of the issue. But use is really discouraged: This is a giant footgun that has blown us up multiple times. (I still shudder at the bug with iteration order in the legacy occurenceHash...)

Copy link
Contributor

@maltek maltek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bike-shedding on the name: what do you think about groupByStable?

while (iterator.hasNext) {
val item = iterator.next
val key = f(item)
res.getOrElseUpdate(key, List.newBuilder[A]).addOne(item)
Copy link
Contributor

@maltek maltek Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to have our own variant, we might as well change it to not use the slowest of all the data structures (linked lists). I.e. return a Vector or an ArraySeq instead.

That makes it slightly less of a drop-in replacement (pattern matching on the result requires different operators), but IMO that's an acceptable cost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 for Vector or ArraySeq

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vector or ArraySeq requires us to copy the data once more. I prepared a version with LinkedHashMap[K, ArrayBuffer[A]]; the java.util.LinkedHashMap implementation is basically the same as scala.collection.mutable.LinkedHashMap, and writing our own to support even faster groupBy is way overkill.

Immutable data structures are way overrated compared to "just don't mutate the datastructure".

Vector is an amazing feat of engineering. The cool thing is not "immutable", the cool thing is "O(1) snapshots", plus niche applications like good write performance for ZFS on tape drives and hard-drives (writes are sequential!) and SSDs (no write amplification because non-overwriting).

That being said, we only very rarely make active use of O(1) snapshots, and we are running on SRAM/DRAM that supports overwriting, as opposed to flash (which cannot be overwritten).

@bbrehm
Copy link
Contributor Author

bbrehm commented Sep 17, 2025

Yeah, this could be faster if we didn't need it to be a drop-in replacement. That was the version in the first commit. The VectorMap is probably even worse than the List part for the drop-in version.

Indeed, the fastest variant would be groupByStable: Iterable[K, mutable.ArrayBuffer[V]] (because LinkedHashMap is also a bit of a performance horror show). A quick review of uses shows that most uses would agree with that (i.e. almost all of the time, we end up iterating over the thing and nothing else).

I tend to agree with you malte.

@maltek
Copy link
Contributor

maltek commented Sep 22, 2025

I would vote for groupByStable: scala.collection.Map[K, scala.collection.Seq[V]] then: the result can only be used as an immutable value, but the actually returned implementation can be a mutable implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants