typeclass-based collections implementations #4147

armanbilge · 2022-03-09T18:04:11Z

This is a recurring idea, most recently by @johnynek in #4118 (comment).

i.e. we should work towards having a typeclass based implementation of HashSet, SortedSet, HashMap, SortedMap that use only typeclasses, not the instance methods.

Although the issue comes up in Cats since this is where we house instances for Scala stdlib, this might be a better PR for cats-collections.

djspiewak · 2022-03-11T14:06:43Z

Definitely better for cats-collections. 🙂 This has been a dream for a while, but it's meaningfully hard to do correctly, and it's not clear to me that people would actually use them over the standard library.

armanbilge · 2022-03-11T14:10:18Z

it's meaningfully hard to do correctly

I thought somewhere someone suggested we could basically copy the implementations out of Scala itself. But I am very afraid of the collections hierarchy 😆

it's not clear to me that people would actually use them over the standard library.

yeah, fair ...

rossabaker · 2022-03-13T03:34:06Z

Precedent: scalaz.Maybe and scalaz.IList got used successfully where people didn't care about interop with stdlib, and were cursed mightily wherever people did.

johnynek · 2022-03-18T18:07:48Z

People definitely won't use what we don't have.

I have definitely sunk more time into code with zero users. I think it is worth it, even if it is niche. Someone could also learn a lot about HAMT by doing it, so it is a great learning experience.

One last note: cats-collections has two immutable heaps. The standard library has zero. No one generally uses them, but I have and I bet a few others. Safe immutable types are a big part of our mission. Let's not give into "no one will use this" defeatism. They may not, but that is not how we measure success.

armanbilge · 2022-03-22T19:37:17Z

A comment by @durban in typelevel/cats-effect#2464 reminded me of this issue:

What I find very strange is that it uses universal equality/hash for the keys, while there is a Hash typeclass in Cats.

Incorporate MapRef into Std cats-effect#2464 (comment)

Understandably, it is difficult to provide even a Ref-based MapRef implementation using Hash when we are lacking a Map based on Hash!

So even if a Cats' Map is something "nobody" would use, MapRef certainly is.

johnynek · 2022-03-22T21:01:32Z

well, if you don't worry to much about performance, here is how:

class Hashed[K](implicit hash: Hash[K]) {

  class Key(val unwrap: K) {
    override val hashCode: Int = hash.hashOf(unwrap)
      that match {
        case hk: HashKey => hash.eqv(unwrap, hk.unwrap)
        case _ => false 
     }
  }

  def toMap[V](kvs: Iterable[K, V]): Map[Key, V] =
    kvs.iterator.map { case (k, v) => (new Key(k), v) }.toMap
}

val keySpace = new Hashed[Foo]

val myMap: Map[keySpace.Key, String] = keySpace.toMap(List(Foo(1) -> "1"))

etc....

so, in other words, you box with something that delegates hashCode and equals to Hash.

johnynek · 2022-03-22T21:02:46Z

I mean, we could add these methods and inner types to Hash[K] now... and have toMap and toSet and it might be pretty usable.

DavidGregory084 · 2022-04-07T14:38:47Z

I think I am going to have a crack at this one unless someone has already taken it?

My plan is to do a pretty conservative port of the HashMap and HashSet from the Scala 2.13 collections initially.

e.g. for HashSet. something like this:

abstract class HashSet[A] {
  final def contains(value: A)(implicit hash: Hash[A]): Boolean

  final def add(value: A)(implicit hash: Hash[A]): HashSet[A] 

  final def remove(value: A)(implicit hash: Hash[A]): HashSet[A] 

  // etc.
}

Essentially, the interface of the existing Scala collections but with the Hash constraint . :)

I have not really looked into any kind of caching of hash code values or the many optimizations that have been added to the 2.13.x collections since the initial CHAMP port as I think it's probably best to do the simplest possible implementation to begin with.

armanbilge · 2022-04-07T14:55:09Z

Hmm, shouldn't the Hash be part of the constructor, instead of passed to each method? That also creates potential for inconsistency between calls.

DavidGregory084 · 2022-04-07T15:20:57Z

I am not really sure - I have seen both approaches and I guess it's just as surprising both ways if the Hash that is used locally and the one that was used when constructing the Set are not consistent - both will produce strange results.
In general though it seems that we use method-level constraints in cats, see e.g. Chain's == and contains.

djspiewak · 2022-04-07T15:25:15Z

In general, passing it to each method is superior because it allows call sites to more correctly reflect their constraints. For example, imagine the constructor of MapRef[F, K, V]. You shouldn't need a Hash[K] to construct a MapRef, only to write key/value pairs. This makes the type signatures of both the functions using the constructor and the functions using the accessors much more parametric in terms of the constraints they declare, which in turn makes it easier to infer their semantics.

armanbilge · 2022-04-07T15:37:34Z

Ah, ok, thanks both. That makes sense, and I see that cats-collections also uses method-level constraints. 👍

durban · 2022-04-08T16:28:08Z

On the other hand, if MapRef looks like this:

trait MapRef[F[_], K, V] {
  def lookup(k: K)(implicit K: Hash[K]): Ref[F, V]
}

then we may also need OrderMapRef (which may store them in a tree):

trait OrderMapRef[F[_], K, V] {
  def lookup(k: K)(implicit K: Order[K]): Ref[F, V]
}

If the 2 constructors receive the Hash/Order instead, then a single MapRef trait can be used by both.

satorg · 2022-04-08T20:31:43Z

~~I have got a quite crazy idea on how we could approach that. What if we considered generalizing it:~~

trait GenMap[C[_], K, V] {
  def get(k: K)(implicit C: C[K]): Option[V]
  def updated(k: K, v: V)(implicit C: C[K]): GenMap[C, K, V]

  def toMap(implicit C: C[K]): Map[K, V] // latches the constraint `C` inside a regular `Map` interface implementation
                                         // can be implemented efficiently as a view of the original `GenMap`
                                         // thus avoiding copying of any data
  
  def toGenMap[CC[_]]: GenMap[CC, K, V] // changes the constraint type
                                        // also should be possible to avoid copying of data
}

type HashMap[K, V] = GenMap[Hash, K, V]
type OrderMap[K, V] = GenMap[Order, K, V]

UPD. Disregard please. Written due to a brain fluctuation. Sorry for the disturbance :)

johnynek · 2022-04-09T00:32:20Z

how can you convert between HashMap[K, V] and OrderMap[K, V] without any copying?

Not only does the internal structure depend on the kind of map it is (sorted or hash) it depends on the value of the sorting and hashing (of which there can be many due to no coherent typeclasses in scala).

I continue to think this approach (each callsite takes the dependency) is actually a haskell cargo-cult and not a great design in scala.

The internal structure only makes sense when you use the right value for the typeclass. By not keeping it together you are just hoping that users don't pass the wrong ones.

I think binding the typeclass instance to the value like scala does with SortedMap and SortedSet is actually the right way to go even though it isn't how haskell works (in part because haskell does not capture typeclasses reified into values very often, although it is possible).

satorg · 2022-04-09T00:32:35Z

Oh no, disregard please. It won't work, I didn't think through enough)

satorg · 2022-04-09T00:50:35Z

The internal structure only makes sense when you use the right value for the typeclass. By not keeping it together you are just hoping that users don't pass the wrong ones.

Yes, I agree it is a good point. It is just safer to bind internal structure and a constraint together.

In fact, we cannot even construct a new instance (besides an empty one) without passing the constraint to the constructor.

satorg · 2022-04-12T16:29:50Z

Would it make sense to convert this issue to a "discussion" (the next tab to "Pull requests" in the Github tabs panel)? Because it looks like a discussion rather than an issue :)

DavidGregory084 · 2022-04-12T21:59:16Z

I continue to think this approach (each callsite takes the dependency) is actually a haskell cargo-cult and not a great design in scala.

It took me a while to recall where I had seen a proper explanation for this Haskell convention - the best I can find is from Learn You A Haskell:

However, it's a very strong convention in Haskell to never add typeclass constraints in data declarations. Why? Well, because we don't benefit a lot, but we end up writing more class constraints, even when we don't need them. If we put or don't put the Ord k constraint in the data declaration for Map k v, we're going to have to put the constraint into functions that assume the keys in a map can be ordered. But if we don't put the constraint in the data declaration, we don't have to put (Ord k) => in the type declarations of functions that don't care whether the keys can be ordered or not. An example of such a function is toList, that just takes a mapping and converts it to an associative list. Its type signature is toList :: Map k a -> [(k, a)]. If Map k v had a type constraint in its data declaration, the type for toList would have to be toList :: (Ord k) => Map k a -> [(k, a)], even though the function doesn't do any comparing of keys by order.

So it seems that the reasoning in Haskell is not just about enabling appropriately specific constraints to be used for different functions, but that when using data type constraints in Haskell you still have to write out the constraint in every signature that uses that data type, so there is simply no benefit from doing so.

The idea that using function-level constraints allows better documentation of the requirements for a particular function makes a lot of sense to me, but it seems much less beneficial in this case since we are talking about adding the same Hash constraint to almost every method of an immutable HashSet or HashMap.

I think binding the typeclass instance to the value like scala does with SortedMap and SortedSet is actually the right way to go even though it isn't how haskell works (in part because haskell does not capture typeclasses reified into values very often, although it is possible).

I think your reasoning is pretty convincing overall @johnynek. The main counter-argument I can think of is code like contains_ from cats:

def contains_(v: A)(implicit ev: Eq[A], F: Foldable[F]): Boolean =
  F.exists(fa)(ev.eqv(_, v))

hashSet.contains_(1) // The type class extension method uses the `Eq[Int]` in scope at this callsite
hashSet.contains(1) // The instance method uses the `Eq[Int]` bound in the `hashSet`

I don't mean to focus too much on this specific example - just that binding type class instances at class level increases the likelihood that code which uses HashSet via type class methods might be inconsistent with code using its own instance methods, and it poses questions about how instances for data types like HashSet should implement those type class methods - simply ignore the provided Eq or Hash parameter?

DavidGregory084 · 2022-04-13T18:47:42Z

I've pushed a WIP PR for the HashSet implementation under #4185, interested in all of your thoughts about it!

I'd like to write a bunch more tests and add some benchmarks for what is there now, but we also need to decide on a few things:

Does this belong in cats at all or in cats-collections? I have no opinion about this other than I'm not aware that cats-collections has many users.
What is the minimal set of operations that we should support? There are a bunch of major things (subSetOf, diff, filter) that I haven't yet implemented.
Should the Hash constraint appear at the method level or at the constructor level in HashSet? FWIW, one nuance of this is that without using a constructor level constraint, you cannot implement hashCode() using cats.kernel.Hash.

armanbilge · 2022-04-15T20:26:00Z

Btw, we also have to do something about this law:

cats/kernel-laws/shared/src/main/scala/cats/kernel/laws/HashLaws.scala

Lines 33 to 35 in fe40bc2

    
           def sameAsUniversalHash(x: A, y: A): IsEq[Boolean] = 
        
             ((E.hash(x) == x.hashCode) && (Hash.fromUniversalHashCode[A].hash(x) == x.hashCode()) && 
        
               (E.eqv(x, y) == Hash.fromUniversalHashCode[A].eqv(x, y))) <-> true

See discussion in #4118 (comment).

DavidGregory084 · 2022-04-21T20:33:16Z

I've pushed #4193 containing a hash map implementation along the same lines.

Some things I still need to do:

address Ensure attribution of code adapted from Scala standard library #4189 in both PRs
add benchmarks for cats.kernel.Hash port for Scala CHAMP HashMap #4193
re-run benchmarks for cats.kernel.Hash port for Scala CHAMP HashSet #4185 as I just fixed a bug in remove (unfortunately this takes ~30hrs)

johnynek · 2022-08-17T17:30:37Z

should we close this since it was ultimately rejected and pushed to cats-collections?

johnynek · 2022-08-17T17:40:30Z

Also note since this was opened

cats/core/src/main/scala/cats/TraverseFilter.scala

Line 127 in 83ad02b

def hashDistinct[A](fa: F[A])(implicit H: Hash[A]): F[A] =

hashDistinct was merged. That conflates Hash[A] with a.hashCode which is not true for many interesting opaque type values that we might want to use. We could solve this example by making a wrapper that delegates to Hash for equality and hashing at the cost of an extra allocation. Given that we are using State I don't think that will be a hit on perf.

DavidGregory084 · 2022-08-22T13:10:02Z

The HashMap / HashSet portion of this has been contributed to cats-collections now - see typelevel/cats-collections#534 and typelevel/cats-collections#533. I am not sure if I will have an opportunity to work on a SortedSet or SortedMap soon so I think this should stay open but I have unassigned myself for the moment.

armanbilge · 2022-08-22T13:13:11Z

Do we actually need a specialized SortedMap or SortedSet? Since it's possible to override the Ordering in the existing Scala implementations, unlike their hash-counterparts.

DavidGregory084 · 2022-08-22T13:17:37Z

Yes that's a good point - I suppose the problem is that the hashCode calculation cannot be overridden to use a cats.kernel.Hash.

armanbilge · 2022-08-22T13:20:20Z

Eh, I wouldn't really consider that a problem. If you use the cats Hash instance (instead of the universal hashCode method) it will do the right thing.

armanbilge added enhancement help wanted labels Mar 9, 2022

armanbilge assigned DavidGregory084 Apr 7, 2022

armanbilge mentioned this issue Apr 12, 2022

Move contain_ to UnorderedFoldable #4183

Merged

DavidGregory084 mentioned this issue Apr 13, 2022

cats.kernel.Hash port for Scala CHAMP HashSet #4185

Closed

satorg mentioned this issue Apr 16, 2022

Add method distinct to TraverseFilter #3603

Merged

DavidGregory084 mentioned this issue Apr 21, 2022

cats.kernel.Hash port for Scala CHAMP HashMap #4193

Closed

johnynek mentioned this issue Aug 17, 2022

remove Hash law that it equals .hashCode #4284

Closed

DavidGregory084 removed their assignment Aug 22, 2022

armanbilge mentioned this issue Oct 26, 2022

1.0 typelevel/cats-collections#277

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

typeclass-based collections implementations #4147

typeclass-based collections implementations #4147

armanbilge commented Mar 9, 2022 •

edited

Loading

djspiewak commented Mar 11, 2022

armanbilge commented Mar 11, 2022

rossabaker commented Mar 13, 2022

johnynek commented Mar 18, 2022

armanbilge commented Mar 22, 2022 •

edited

Loading

johnynek commented Mar 22, 2022

johnynek commented Mar 22, 2022

DavidGregory084 commented Apr 7, 2022 •

edited

Loading

armanbilge commented Apr 7, 2022

DavidGregory084 commented Apr 7, 2022

djspiewak commented Apr 7, 2022

armanbilge commented Apr 7, 2022

durban commented Apr 8, 2022

satorg commented Apr 8, 2022 •

edited

Loading

johnynek commented Apr 9, 2022

satorg commented Apr 9, 2022

satorg commented Apr 9, 2022

satorg commented Apr 12, 2022

DavidGregory084 commented Apr 12, 2022 •

edited

Loading

DavidGregory084 commented Apr 13, 2022 •

edited

Loading

armanbilge commented Apr 15, 2022

DavidGregory084 commented Apr 21, 2022 •

edited

Loading

johnynek commented Aug 17, 2022

johnynek commented Aug 17, 2022

DavidGregory084 commented Aug 22, 2022

armanbilge commented Aug 22, 2022

DavidGregory084 commented Aug 22, 2022

armanbilge commented Aug 22, 2022

typeclass-based collections implementations #4147

typeclass-based collections implementations #4147

Comments

armanbilge commented Mar 9, 2022 • edited Loading

djspiewak commented Mar 11, 2022

armanbilge commented Mar 11, 2022

rossabaker commented Mar 13, 2022

johnynek commented Mar 18, 2022

armanbilge commented Mar 22, 2022 • edited Loading

johnynek commented Mar 22, 2022

johnynek commented Mar 22, 2022

DavidGregory084 commented Apr 7, 2022 • edited Loading

armanbilge commented Apr 7, 2022

DavidGregory084 commented Apr 7, 2022

djspiewak commented Apr 7, 2022

armanbilge commented Apr 7, 2022

durban commented Apr 8, 2022

satorg commented Apr 8, 2022 • edited Loading

johnynek commented Apr 9, 2022

satorg commented Apr 9, 2022

satorg commented Apr 9, 2022

satorg commented Apr 12, 2022

DavidGregory084 commented Apr 12, 2022 • edited Loading

DavidGregory084 commented Apr 13, 2022 • edited Loading

armanbilge commented Apr 15, 2022

DavidGregory084 commented Apr 21, 2022 • edited Loading

johnynek commented Aug 17, 2022

johnynek commented Aug 17, 2022

DavidGregory084 commented Aug 22, 2022

armanbilge commented Aug 22, 2022

DavidGregory084 commented Aug 22, 2022

armanbilge commented Aug 22, 2022

armanbilge commented Mar 9, 2022 •

edited

Loading

armanbilge commented Mar 22, 2022 •

edited

Loading

DavidGregory084 commented Apr 7, 2022 •

edited

Loading

satorg commented Apr 8, 2022 •

edited

Loading

DavidGregory084 commented Apr 12, 2022 •

edited

Loading

DavidGregory084 commented Apr 13, 2022 •

edited

Loading

DavidGregory084 commented Apr 21, 2022 •

edited

Loading