Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cats.kernel.Hash port for Scala CHAMP HashMap #4193

Closed
wants to merge 29 commits into from
Closed

cats.kernel.Hash port for Scala CHAMP HashMap #4193

wants to merge 29 commits into from

Conversation

DavidGregory084
Copy link
Member

@DavidGregory084 DavidGregory084 commented Apr 21, 2022

This PR implements an immutable hash map using the CHAMP encoding and cats.kernel.Hash for hashing of elements, as per #4147.

This accompanies PR #4185 which implements a hash set.

Contributed on behalf of the @opencastsoftware open source team. 👋

core/src/main/scala/cats/Foldable.scala Outdated Show resolved Hide resolved
core/src/main/scala/cats/data/HashMap.scala Outdated Show resolved Hide resolved
core/src/main/scala/cats/data/HashMap.scala Show resolved Hide resolved
core/src/main/scala/cats/data/HashMap.scala Outdated Show resolved Hide resolved
tests/shared/src/test/scala/cats/tests/HashMapSuite.scala Outdated Show resolved Hide resolved
tests/shared/src/test/scala/cats/tests/HashMapSuite.scala Outdated Show resolved Hide resolved
tests/shared/src/test/scala/cats/tests/HashMapSuite.scala Outdated Show resolved Hide resolved
tests/shared/src/test/scala/cats/tests/HashMapSuite.scala Outdated Show resolved Hide resolved
tests/shared/src/test/scala/cats/tests/HashMapSuite.scala Outdated Show resolved Hide resolved
@johnynek
Copy link
Contributor

johnynek commented May 6, 2022

I think if we beef up the tests as I sketched, I am a champion of merging this.

I think Map APIs are pretty much settled in scala and we can use the existing ones with the natural generalization of SortedMap, Ordering to HashMap, Hash (which is mostly what was done here modulo some add renaming to updated).

I think having a HashMap to work with Hash would be a really nice addition to make Hash actually safe to use, and without it, I Hash is pretty uninteresting (except that other users might internally implement their own hashmaps based on it).

@johnynek
Copy link
Contributor

johnynek commented May 6, 2022

Can you post the benchmark results as a comment and discuss a comparison to the universal hash equivalents (which I assume is very hard to beat, but hopefully we could nearly match).

@armanbilge
Copy link
Member

I think having a HashMap to work with Hash would be a really nice addition to make Hash actually safe to use, and without it, I Hash is pretty uninteresting

I think we will also need to revoke this law:

def sameAsUniversalHash(x: A, y: A): IsEq[Boolean] =
((E.hash(x) == x.hashCode) && (Hash.fromUniversalHashCode[A].hash(x) == x.hashCode()) &&
(E.eqv(x, y) == Hash.fromUniversalHashCode[A].eqv(x, y))) <-> true

@johnynek
Copy link
Contributor

johnynek commented May 6, 2022

I tend to agree. We can possibly make two sets of laws: UniversalLikeHash and GeneralHash or something.... Sometimes we want to test that we have just done the same as universal hash would do (e.g. on primitives or built in collections).

But something is nagging in the back of my mind: Hash and Show are kind of special since the JVM does basically promise Any can implement them. So, one can argue the existence of them as typeclasses in cats isn't the same utility as in a language without that.

Moreover, you could argue that what Hash[K] is really doing is promising that the type K is a "data" value recursively, and it never is hitting System.identityHashCode. In that way, every implementation of Hash[K] is just .hashCode BUT we promise NOT to implement Hash[K] for a type without a reproducible hash. In other words, there is no Hash[Int => Int]. So, we can only hash value types all the way down.

If we take that view, we can keep the law, but still Hash is useful because it proves you have a reproducible hashcode all the way down, no hiding of general new AnyRef or function, or typeclass values, etc...

@armanbilge
Copy link
Member

armanbilge commented May 6, 2022

If we take that view, we can keep the law, but still Hash is useful because it proves you have a reproducible hashcode all the way down, no hiding of general new AnyRef or function, or typeclass values, etc...

Putting aside the fact this wouldn't allow for alternative Hash implementations for opaque newtypes, it's a reasonable perspective. But also then, doesn't it obviate the need for our own hashmap implementation? Hash would provide the evidence that you can create a hashmap, but we could continue to use the scala collections implementation.

@johnynek
Copy link
Contributor

johnynek commented May 6, 2022

I think opaque type is maybe the best counter example to this, but still, I think it has hashCode since it extends Any right?

The utility of Hash[K] in this case is simply the promise that System.identityHashCode will not be called. You can imagine a name like: NotIdentityHashCode[K] as the typeclass. So, you could use Map[K, V] then, but you can't prove to someone else that inside aren't things with identity hashCode. By using HashMap you know you have hashmap that is pure (and not a function of memory location on the process you are running in).

Like, put another way: when would you not want to match .hashCode and why wouldn't you?

I think we may have gotten tripped up and imagined the law means it is 1:1, but it doesn't mean that: all Hash[K] is == .hashCode but not all things with .hashCode has a Hash[K].

E.g. there would not be a Hash[Any] because that may wind up calling identityHashCode.

@armanbilge
Copy link
Member

armanbilge commented May 6, 2022

I think opaque type is maybe the best counter example to this, but still, I think it has hashCode since it extends Any right?

Yes, it certainly does have hashCode. The problem is suppose you want to implement a different Eq and Hash than that of the underlying type e.g. opaque type CIString = String for case-insensitive Strings.

So, you could use Map[K, V] then, but you can't prove to someone else that inside aren't things with identity hashCode.

Sorry, I'm not entirely sure I follow 😕 I guess my point is, if we know that Hash[A].hash(a) == a.hashCode then we can delegate the implementation to a Scala Map. For sure, we should wrap this up in our own safe newtype (such as we do for NonEmptyList and friends) that requires evidence in the form of Hash[K], but there's no need to actually use that for the implementation.

@johnynek
Copy link
Contributor

johnynek commented May 6, 2022

You seem to be arguing two contradictory points:

  1. CIString is an excellent example, if you have that and want to not have to normalize the case, then you can't have Hash[CIString] since the Eq[CIString] needs to not care about case. So, the law is broken here, or you simply cannot implement Hash[CIString]. This tends to argue to remove the law.
  2. If we do not remove the law, we don't need to copy Map[K, V] we could just make a newtype on (Map[K, V], Hash[K]) if we have access to Hash[K] and it would be identical to the current implementation.

So, going down route 2. seems to limit Hash[K] especially in the world of newtypes (not just opaque types, also path dependent types). Thus, I would tend to agree we need to remove the law for matching universal equality (although in practice it will probably only impact corner cases, since when possible it is probably a good idea to simply make them match).

@joroKr21
Copy link
Member

joroKr21 commented May 7, 2022

Note that Scala has .## and the sameAsUniversalHash is already broken when Long is involved: #2878
See also the commend from @ctongfei about the original purpose of that law.
On another hand sometimes I do want to use System.identityHashCode so it seems a weird thing to guarantee that it wouldn't be called. I see it more as a guide - the typeclass nudges you to think about what hash code you want to use.

@DavidGregory084
Copy link
Member Author

Thanks for the detailed review @johnynek, this is awesome! 🙇 I'm on holiday this week but will get back to this as soon as I can!


package cats.data

private[data] trait HashMapCompat[K, +V] extends IterableOnce[(K, V)] { self: HashMap[K, V] =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I should highlight this as it's hard to spot - I've extended IterableOnce[(K, V)] here as it gives us nice interop opportunities with the standard library. Unfortunately TraversableOnce[(K, V)] on 2.12 requires much more from us in terms of method implementations so I decided not to attempt that.

* Use primitive HashMap operations when generating random HashMaps
* Use a better approach to test HashMap key collisions
* Remove reverseIterator
* Foldable.iterateRight signature changed
* HashMap#add renamed to HashMap#updated
* HashMap#remove renamed to HashMap#removed
* Added HashMap.fromFoldable
* Renamed CHAMP node element count fields
* Use NonEmptyVector in CHAMP collision node
* Fixed various nits
@DavidGregory084 DavidGregory084 marked this pull request as ready for review May 16, 2022 16:15
@DavidGregory084 DavidGregory084 changed the title WIP port of Scala CHAMP HashMap cats.Hash port for Scala CHAMP HashMap May 16, 2022
@armanbilge
Copy link
Member

I wonder if the performance of the scala collections Map is better for updated because the collections can check for equality of the value and decide to do nothing if the new value is equal to the existing one.

Dumb question, why can't we do this?

@DavidGregory084
Copy link
Member Author

Dumb question, why can't we do this?

Not a dumb question for sure - we can definitely try adding an Eq[VV] constraint to updated and see how that affects things

@DavidGregory084
Copy link
Member Author

DavidGregory084 commented May 24, 2022

Oh actually I remember why I didn't do that originally - there's no easy way to thread the Eq constraint for the new value type into UnorderedTraverse (which is HashMap[K, U] => (U => G[V]) => G[HashMap[K, V]]), so we would have to give that up and supply only UnorderedFoldable.

@DavidGregory084
Copy link
Member Author

Haha OK, e7343e8 is a total bust unfortunately 😆

2.12 results:
hashmap-benchmark-2 12-concat
hashmap-benchmark-2 12-contains
hashmap-benchmark-2 12-eqEquals
hashmap-benchmark-2 12-foreach
hashmap-benchmark-2 12-fromSeq
hashmap-benchmark-2 12-get
hashmap-benchmark-2 12-iterator
hashmap-benchmark-2 12-removed
hashmap-benchmark-2 12-universalEquals
hashmap-benchmark-2 12-updated

@DavidGregory084
Copy link
Member Author

2.13 results:
hashmap-benchmark-2 13-concat
hashmap-benchmark-2 13-contains
hashmap-benchmark-2 13-eqEquals
hashmap-benchmark-2 13-foreach
hashmap-benchmark-2 13-fromSeq
hashmap-benchmark-2 13-get
hashmap-benchmark-2 13-iterator
hashmap-benchmark-2 13-removed
hashmap-benchmark-2 13-universalEquals
hashmap-benchmark-2 13-updated

@DavidGregory084
Copy link
Member Author

DavidGregory084 commented May 27, 2022

I think in order for that approach to be viable we would have to figure out how to skip a bunch more work.
At the moment the benchmark is a pathological scenario for that concat implementation as it is concatenating two hashmaps with exactly the same elements which results in exactly the same structure.
This means that every node must be merged recursively and every element copied over into this hashmap.
By contrast iterating CHAMP elements seems to be pretty fast!

@armanbilge
Copy link
Member

So, status check on this one :) firstly, round of applause to David for this wonderful piece of work and Oscar for thorough reviews.

I optimistically added this one to the v2.8.0 milestone :) besides this PR, I think we're approaching a good place to wrap up 2.8.0 and ship it in the next week or two. Besides the accumulated changes there's been requests to release with the Scala 3 bincompat fix and Scala 3 Native.

Release planning: can this PR realistically land in 2.8.0, and should it?

Even after this lands, I think there's some follow-up work that could do well to accompany it. Off the top of my head:

For the record, none of these are blockers, but they do go nicely together.

Also, some folks were less enthusiastic about adding this to cats-core versus cats-collections. Now that it's taken shape, I'd like to hear if there are (strong) objections (personally I feel 👍 ).

Thoughts? Thanks again everyone!

@DavidGregory084
Copy link
Member Author

DavidGregory084 commented Jun 6, 2022

Release planning: can this PR realistically land in 2.8.0, and should it?

Yes I think it can - the most recent commit was an attempt to improve the performance of concat but I think we can revert that commit and perhaps I can revive it as a follow-up PR.

I think these are good suggestions but as they are additions perhaps there is no rush to complete them for 2.8.0.

the HashSet in #4185

That PR needs some work to add a bunch more of the basic operations - at the moment it's very sparse with only incl / + and excl / -. However it would really be nice to land that as then we could add operations like keySet for this HashMap.

@armanbilge
Copy link
Member

Apologies, I might let this slip to 2.9.0 so we can land 2.8.0 😓

@armanbilge armanbilge modified the milestones: 2.8.0, 2.9.0 Jun 13, 2022
@DavidGregory084
Copy link
Member Author

@johnynek do you have any thoughts on to proceed with this one? Would you be happy to revert the most recent commit experimenting with the concat implementation and merge this?

@armanbilge I guess the main blocker to getting this into release is the Hash consistency law? What does deprecating a law look like?

@armanbilge
Copy link
Member

I guess the main blocker to getting this into release is the Hash consistency law? What does deprecating a law look like?

Actually I'd say the main blocker is reviews: I am very very grateful to all the time and attention to detail you and Oscar put into this PR, since it is an immense, non-trivial contribution. It would be good to hear from a couple other Cats maintainers, especially those who were hesitant about introducing this in cats-core vs cats-collections. I don't think I'll have time to do an in-depth review, but I would like to make a pass-through the public APIs :)

Everything else I mentioned including the Hash laws are non-blocking nice-to-haves that can definitely be follow-ups. Although personally I believe they are stronger when shipped together, and I think a 2.9 release dedicated exactly to this would be very cool.

I am very excited about this contribution :) and I do think the sooner we can release it the better, because it will need additional ecosystem support (e.g. Circe encoders/decoders) to help with adoption.

@armanbilge
Copy link
Member

Would you be happy to revert the most recent commit experimenting with the concat implementation and merge this?

I trust your judgement, it seems like we should revert that commit for now and move forward with the PR? We can make an issue to follow-up about this.

@DavidGregory084
Copy link
Member Author

OK, I have reverted that commit now @armanbilge and it seems like CI is happy 🤞

@armanbilge
Copy link
Member

Thanks! I will try and carve out time for a review and follow-up with Daniel about his concerns for including this in cats-core instead of cats-collections. Would appreciate a final review(s) from Oscar. And if there's anyone else that wants chance to review this, please say so, even if you won't be able to do it right away :)

@armanbilge
Copy link
Member

So following up on this: I did ask on Discord and Daniel and Michael chimed in that this is a much better candidate for cats-collections. My own thoughts/summary from that dialog follow. See also #4185 (comment).

The main two issues seem to be:

  1. The high bar for maintaining compatibility in Cats.
  2. Whether collections like this fundamentally belong in cats-core or not.

I actually feel okay about (1) since I think this is a straightforward API and I am generally optimistic about the possibility for binary-compatible evolution.

(2) is more subjective of course. Maps are useful for various Cats' APIs, such as "groupBy" methods and there are already a handful of collections in cats.data.*.

I was also hoping this will land in cats-core to facilitate downstream adoption e.g. Cats Effect MapRef, Circe Encoder/Decoder, etc. But that in of itself is not a good reason to put something in cats-core.

Another strategy could be to work towards stable 1.0 of cats-collections with these additions, that downstreams can depend on without fear of breakage. See typelevel/cats-collections#277. /cc @typelevel/cats-collections

Thoughts?

@DavidGregory084
Copy link
Member Author

Thoughts?

It sounds like the general consensus is that this belongs in cats-collections then? I confess that is a bit disappointing, since it makes these data structures generally less usable / discoverable. It makes me question the utility of Eq and Hash as a part of cats-core, since it's hard to make use of them consistently.

@armanbilge
Copy link
Member

since it makes these data structures generally less usable / discoverable

I think so too. Unfortunately I know it's a lot more work but I think the best strategy is to make cats-collections an appealing enough dependency that can get on the classpath of e.g. circe and cats-effect-std. This will help make it more available without being put into cats core.

@johnynek
Copy link
Contributor

I still think it is better here...

I think Map/Set are core, like List and NonEmptyList... these are things we want (e.g. the methods we already have that do groupBy and resort to SortedMap).

I think we only need to worry about binary compatibility on the API surface error, and given that Map and Set are very old concepts, I am fairly confident we can make something that works like standard scala ones that we don't need to worry that things will change.

@johnynek
Copy link
Contributor

so... if @armanbilge and I (who actually have spent the time on the code) feel like it should be here, and folks who haven't been reviewing the code and citing specific examples of their concerns in the PR think it shouldn't I don't see why they get the veto.

@gemelen
Copy link

gemelen commented Jul 15, 2022

I think the best strategy is to make cats-collections an appealing enough dependency that can get on the classpath of e.g. circe and cats-effect-std

I'm here just to support this as much as I can, since cats-collection is far from being widespread (last time I researched, it has only one public project that uses it).

@DavidGregory084
Copy link
Member Author

Superseded by typelevel/cats-collections#534

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants