Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

typeclass-based collections implementations #4147

Open
armanbilge opened this issue Mar 9, 2022 · 28 comments
Open

typeclass-based collections implementations #4147

armanbilge opened this issue Mar 9, 2022 · 28 comments

Comments

@armanbilge
Copy link
Member

armanbilge commented Mar 9, 2022

This is a recurring idea, most recently by @johnynek in #4118 (comment).

i.e. we should work towards having a typeclass based implementation of HashSet, SortedSet, HashMap, SortedMap that use only typeclasses, not the instance methods.

Although the issue comes up in Cats since this is where we house instances for Scala stdlib, this might be a better PR for cats-collections.

@djspiewak
Copy link
Member

Definitely better for cats-collections. 🙂 This has been a dream for a while, but it's meaningfully hard to do correctly, and it's not clear to me that people would actually use them over the standard library.

@armanbilge
Copy link
Member Author

it's meaningfully hard to do correctly

I thought somewhere someone suggested we could basically copy the implementations out of Scala itself. But I am very afraid of the collections hierarchy 😆

it's not clear to me that people would actually use them over the standard library.

yeah, fair ...

@rossabaker
Copy link
Member

Precedent: scalaz.Maybe and scalaz.IList got used successfully where people didn't care about interop with stdlib, and were cursed mightily wherever people did.

@johnynek
Copy link
Contributor

People definitely won't use what we don't have.

I have definitely sunk more time into code with zero users. I think it is worth it, even if it is niche. Someone could also learn a lot about HAMT by doing it, so it is a great learning experience.

One last note: cats-collections has two immutable heaps. The standard library has zero. No one generally uses them, but I have and I bet a few others. Safe immutable types are a big part of our mission. Let's not give into "no one will use this" defeatism. They may not, but that is not how we measure success.

@armanbilge
Copy link
Member Author

armanbilge commented Mar 22, 2022

A comment by @durban in typelevel/cats-effect#2464 reminded me of this issue:

What I find very strange is that it uses universal equality/hash for the keys, while there is a Hash typeclass in Cats.

Understandably, it is difficult to provide even a Ref-based MapRef implementation using Hash when we are lacking a Map based on Hash!

So even if a Cats' Map is something "nobody" would use, MapRef certainly is.

@johnynek
Copy link
Contributor

well, if you don't worry to much about performance, here is how:

class Hashed[K](implicit hash: Hash[K]) {

  class Key(val unwrap: K) {
    override val hashCode: Int = hash.hashOf(unwrap)
      that match {
        case hk: HashKey => hash.eqv(unwrap, hk.unwrap)
        case _ => false 
     }
  }

  def toMap[V](kvs: Iterable[K, V]): Map[Key, V] =
    kvs.iterator.map { case (k, v) => (new Key(k), v) }.toMap
}

val keySpace = new Hashed[Foo]

val myMap: Map[keySpace.Key, String] = keySpace.toMap(List(Foo(1) -> "1"))

etc....

so, in other words, you box with something that delegates hashCode and equals to Hash.

@johnynek
Copy link
Contributor

I mean, we could add these methods and inner types to Hash[K] now... and have toMap and toSet and it might be pretty usable.

@DavidGregory084
Copy link
Member

DavidGregory084 commented Apr 7, 2022

I think I am going to have a crack at this one unless someone has already taken it?

My plan is to do a pretty conservative port of the HashMap and HashSet from the Scala 2.13 collections initially.

e.g. for HashSet. something like this:

abstract class HashSet[A] {
  final def contains(value: A)(implicit hash: Hash[A]): Boolean

  final def add(value: A)(implicit hash: Hash[A]): HashSet[A] 

  final def remove(value: A)(implicit hash: Hash[A]): HashSet[A] 

  // etc.
}

Essentially, the interface of the existing Scala collections but with the Hash constraint . :)

I have not really looked into any kind of caching of hash code values or the many optimizations that have been added to the 2.13.x collections since the initial CHAMP port as I think it's probably best to do the simplest possible implementation to begin with.

@armanbilge
Copy link
Member Author

Hmm, shouldn't the Hash be part of the constructor, instead of passed to each method? That also creates potential for inconsistency between calls.

@DavidGregory084
Copy link
Member

I am not really sure - I have seen both approaches and I guess it's just as surprising both ways if the Hash that is used locally and the one that was used when constructing the Set are not consistent - both will produce strange results.
In general though it seems that we use method-level constraints in cats, see e.g. Chain's == and contains.

@djspiewak
Copy link
Member

In general, passing it to each method is superior because it allows call sites to more correctly reflect their constraints. For example, imagine the constructor of MapRef[F, K, V]. You shouldn't need a Hash[K] to construct a MapRef, only to write key/value pairs. This makes the type signatures of both the functions using the constructor and the functions using the accessors much more parametric in terms of the constraints they declare, which in turn makes it easier to infer their semantics.

@armanbilge
Copy link
Member Author

Ah, ok, thanks both. That makes sense, and I see that cats-collections also uses method-level constraints. 👍

@durban
Copy link
Contributor

durban commented Apr 8, 2022

On the other hand, if MapRef looks like this:

trait MapRef[F[_], K, V] {
  def lookup(k: K)(implicit K: Hash[K]): Ref[F, V]
}

then we may also need OrderMapRef (which may store them in a tree):

trait OrderMapRef[F[_], K, V] {
  def lookup(k: K)(implicit K: Order[K]): Ref[F, V]
}

If the 2 constructors receive the Hash/Order instead, then a single MapRef trait can be used by both.

@satorg
Copy link
Contributor

satorg commented Apr 8, 2022

I have got a quite crazy idea on how we could approach that. What if we considered generalizing it:

trait GenMap[C[_], K, V] {
  def get(k: K)(implicit C: C[K]): Option[V]
  def updated(k: K, v: V)(implicit C: C[K]): GenMap[C, K, V]

  def toMap(implicit C: C[K]): Map[K, V] // latches the constraint `C` inside a regular `Map` interface implementation
                                         // can be implemented efficiently as a view of the original `GenMap`
                                         // thus avoiding copying of any data
  
  def toGenMap[CC[_]]: GenMap[CC, K, V] // changes the constraint type
                                        // also should be possible to avoid copying of data
}

type HashMap[K, V] = GenMap[Hash, K, V]
type OrderMap[K, V] = GenMap[Order, K, V]

UPD. Disregard please. Written due to a brain fluctuation. Sorry for the disturbance :)

@johnynek
Copy link
Contributor

johnynek commented Apr 9, 2022

how can you convert between HashMap[K, V] and OrderMap[K, V] without any copying?

Not only does the internal structure depend on the kind of map it is (sorted or hash) it depends on the value of the sorting and hashing (of which there can be many due to no coherent typeclasses in scala).

I continue to think this approach (each callsite takes the dependency) is actually a haskell cargo-cult and not a great design in scala.

The internal structure only makes sense when you use the right value for the typeclass. By not keeping it together you are just hoping that users don't pass the wrong ones.

I think binding the typeclass instance to the value like scala does with SortedMap and SortedSet is actually the right way to go even though it isn't how haskell works (in part because haskell does not capture typeclasses reified into values very often, although it is possible).

@satorg
Copy link
Contributor

satorg commented Apr 9, 2022

Oh no, disregard please. It won't work, I didn't think through enough)

@satorg
Copy link
Contributor

satorg commented Apr 9, 2022

The internal structure only makes sense when you use the right value for the typeclass. By not keeping it together you are just hoping that users don't pass the wrong ones.

Yes, I agree it is a good point. It is just safer to bind internal structure and a constraint together.

In fact, we cannot even construct a new instance (besides an empty one) without passing the constraint to the constructor.

@satorg
Copy link
Contributor

satorg commented Apr 12, 2022

Would it make sense to convert this issue to a "discussion" (the next tab to "Pull requests" in the Github tabs panel)? Because it looks like a discussion rather than an issue :)

@DavidGregory084
Copy link
Member

DavidGregory084 commented Apr 12, 2022

I continue to think this approach (each callsite takes the dependency) is actually a haskell cargo-cult and not a great design in scala.

It took me a while to recall where I had seen a proper explanation for this Haskell convention - the best I can find is from Learn You A Haskell:

However, it's a very strong convention in Haskell to never add typeclass constraints in data declarations. Why? Well, because we don't benefit a lot, but we end up writing more class constraints, even when we don't need them. If we put or don't put the Ord k constraint in the data declaration for Map k v, we're going to have to put the constraint into functions that assume the keys in a map can be ordered. But if we don't put the constraint in the data declaration, we don't have to put (Ord k) => in the type declarations of functions that don't care whether the keys can be ordered or not. An example of such a function is toList, that just takes a mapping and converts it to an associative list. Its type signature is toList :: Map k a -> [(k, a)]. If Map k v had a type constraint in its data declaration, the type for toList would have to be toList :: (Ord k) => Map k a -> [(k, a)], even though the function doesn't do any comparing of keys by order.

So it seems that the reasoning in Haskell is not just about enabling appropriately specific constraints to be used for different functions, but that when using data type constraints in Haskell you still have to write out the constraint in every signature that uses that data type, so there is simply no benefit from doing so.

The idea that using function-level constraints allows better documentation of the requirements for a particular function makes a lot of sense to me, but it seems much less beneficial in this case since we are talking about adding the same Hash constraint to almost every method of an immutable HashSet or HashMap.

I think binding the typeclass instance to the value like scala does with SortedMap and SortedSet is actually the right way to go even though it isn't how haskell works (in part because haskell does not capture typeclasses reified into values very often, although it is possible).

I think your reasoning is pretty convincing overall @johnynek. The main counter-argument I can think of is code like contains_ from cats:

def contains_(v: A)(implicit ev: Eq[A], F: Foldable[F]): Boolean =
  F.exists(fa)(ev.eqv(_, v))

hashSet.contains_(1) // The type class extension method uses the `Eq[Int]` in scope at this callsite
hashSet.contains(1) // The instance method uses the `Eq[Int]` bound in the `hashSet`

I don't mean to focus too much on this specific example - just that binding type class instances at class level increases the likelihood that code which uses HashSet via type class methods might be inconsistent with code using its own instance methods, and it poses questions about how instances for data types like HashSet should implement those type class methods - simply ignore the provided Eq or Hash parameter?

@DavidGregory084
Copy link
Member

DavidGregory084 commented Apr 13, 2022

I've pushed a WIP PR for the HashSet implementation under #4185, interested in all of your thoughts about it!

I'd like to write a bunch more tests and add some benchmarks for what is there now, but we also need to decide on a few things:

  • Does this belong in cats at all or in cats-collections? I have no opinion about this other than I'm not aware that cats-collections has many users.
  • What is the minimal set of operations that we should support? There are a bunch of major things (subSetOf, diff, filter) that I haven't yet implemented.
  • Should the Hash constraint appear at the method level or at the constructor level in HashSet? FWIW, one nuance of this is that without using a constructor level constraint, you cannot implement hashCode() using cats.kernel.Hash.

@armanbilge
Copy link
Member Author

Btw, we also have to do something about this law:

def sameAsUniversalHash(x: A, y: A): IsEq[Boolean] =
((E.hash(x) == x.hashCode) && (Hash.fromUniversalHashCode[A].hash(x) == x.hashCode()) &&
(E.eqv(x, y) == Hash.fromUniversalHashCode[A].eqv(x, y))) <-> true

See discussion in #4118 (comment).

@DavidGregory084
Copy link
Member

DavidGregory084 commented Apr 21, 2022

I've pushed #4193 containing a hash map implementation along the same lines.

Some things I still need to do:

@johnynek
Copy link
Contributor

should we close this since it was ultimately rejected and pushed to cats-collections?

@johnynek
Copy link
Contributor

Also note since this was opened

def hashDistinct[A](fa: F[A])(implicit H: Hash[A]): F[A] =

hashDistinct was merged. That conflates Hash[A] with a.hashCode which is not true for many interesting opaque type values that we might want to use. We could solve this example by making a wrapper that delegates to Hash for equality and hashing at the cost of an extra allocation. Given that we are using State I don't think that will be a hit on perf.

@DavidGregory084
Copy link
Member

The HashMap / HashSet portion of this has been contributed to cats-collections now - see typelevel/cats-collections#534 and typelevel/cats-collections#533. I am not sure if I will have an opportunity to work on a SortedSet or SortedMap soon so I think this should stay open but I have unassigned myself for the moment.

@armanbilge
Copy link
Member Author

Do we actually need a specialized SortedMap or SortedSet? Since it's possible to override the Ordering in the existing Scala implementations, unlike their hash-counterparts.

@DavidGregory084
Copy link
Member

Yes that's a good point - I suppose the problem is that the hashCode calculation cannot be overridden to use a cats.kernel.Hash.

@armanbilge
Copy link
Member Author

Eh, I wouldn't really consider that a problem. If you use the cats Hash instance (instead of the universal hashCode method) it will do the right thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants