Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cats.kernel.Hash port for Scala CHAMP HashSet #4185

Closed
wants to merge 24 commits into from
Closed

cats.kernel.Hash port for Scala CHAMP HashSet #4185

wants to merge 24 commits into from

Conversation

DavidGregory084
Copy link
Member

@DavidGregory084 DavidGregory084 commented Apr 13, 2022

This PR implements an immutable hash set using the CHAMP encoding and cats.kernel.Hash for hashing of elements, as per #4147.

I've written up some basics about CHAMP here in case anyone wants to take a look.

Big questions remain:

  • Does this belong in cats at all or in cats-collections?
  • What is the minimal interface that we should support?
  • Should the Hash constraint appear at the method level or at the constructor level in HashSet?

Contributed on behalf of the @opencastsoftware open source team. 👋

@DavidGregory084 DavidGregory084 changed the title WIP Initial port of Scala CHAMP HashSet WIP port of Scala CHAMP HashSet Apr 13, 2022
Copy link
Contributor

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very exciting!

I personally don't see a strong reason for cats-collections to exist in a separate repo, but it does. I imagine this will see much more use if added here, but then again, use will probably be small no matter what.

It is very cool. I will give a more detailed review.

core/src/main/scala/cats/data/HashSet.scala Outdated Show resolved Hide resolved
@djspiewak
Copy link
Member

djspiewak commented Apr 14, 2022

I'm strongly of the opinion that cats-collections should remain separate at least until stabilized. The bar is very high for breaking Cats. Too high for a library where things are just getting bootstrapped. I'm also generally in favor of it remaining separate on a permanent basis, but the reasoning is less objectively strong there.

@smarter
Copy link
Contributor

smarter commented Apr 14, 2022

Since this is a port, does this contain code adapted from the Scala standard library? Cats is MIT licensed whereas the Scala standard library is Apache 2 licensed, if you're adapting code from the standard library you probably need to change your license to Apache 2 (or separately license some files as Apache 2? But that seems confusing).

@DavidGregory084
Copy link
Member Author

DavidGregory084 commented Apr 14, 2022

@smarter Good question. I actually referred mostly to @msteindorfer's initial PR to the collections-strawman, but that is also Apache licensed.

  • You must give any other recipients of the Work or Derivative Works a copy of this License; and
  • You must cause any modified files to carry prominent notices stating that You changed the files; and
  • You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
  • If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

I did not start with any existing source file from the standard library and modify it but rather used Michael's work and the Scala standard library code as inspiration and reference. In many places that essentially means writing exactly the same code, so I don't know if that falls under the "modified files" case but I would certainly say it is a derivative work.

Would it be sufficient to include the Apache license at the root of the cats repo, and add attribution notices to those files based on the scala/scala NOTICE file in addition to the (auto-generated) cats headers in those files?

To be honest it has come as a surprise to me that cats is MIT licensed as the bulk of the ecosystem is on Apache 2.0!

@smarter
Copy link
Contributor

smarter commented Apr 14, 2022

You'd also have to include it in

cats/build.sbt

Line 321 in fe40bc2

ThisBuild / licenses := List(License.MIT)

@satorg
Copy link
Contributor

satorg commented Apr 14, 2022

Just following up this discussion and realized that Cats has some compatibility-related code in the tests which is derived from Scala library.
This code is not a part of any Cats distribution though, but I wonder if we still have to draw it up in some special way?

One of such piece of code which I am personally aware of:

// Scala v2.12.x does not have `distinctBy` implemented.
// Therefore this implementation is copied (and adapted) from Scala Library v2.13.8 sources:
// https://github.com/scala/scala/blob/v2.13.8/src/library/scala/collection/immutable/StrictOptimizedSeqOps.scala#L26-L39
def distinctBy[B](f: A => B)(implicit cbf: CanBuildFrom[C[A], A, C[A]]): C[A] = {
if (self.lengthCompare(1) <= 0) self
else {
val builder = cbf()
val seen = mutable.HashSet.empty[B]
val it = self.iterator
var different = false
while (it.hasNext) {
val next = it.next()
if (seen.add(f(next))) builder += next else different = true
}
if (different) builder.result() else self
}
}

@smarter @DavidGregory084 – wdyt?

@DavidGregory084
Copy link
Member Author

DavidGregory084 commented Apr 15, 2022

Yes you are right @satorg I don't think this is a new problem in cats - see e.g. the file HashCompat in the main branch, which already contains code from the Scala standard library.

EDIT: btw I have raised a new issue for this #4189

core/src/main/scala/cats/data/HashSet.scala Outdated Show resolved Hide resolved
core/src/main/scala/cats/data/HashSet.scala Outdated Show resolved Hide resolved
core/src/main/scala/cats/data/HashSet.scala Outdated Show resolved Hide resolved
core/src/main/scala/cats/data/HashSet.scala Outdated Show resolved Hide resolved
@DavidGregory084
Copy link
Member Author

DavidGregory084 commented Apr 19, 2022

I have made many things private and final. I'm not exactly sure how such modifiers are typically used in cats - is usage of final preferred or does it make things difficult later?

@DavidGregory084
Copy link
Member Author

DavidGregory084 commented Apr 25, 2022

@smarter I have added copyright notices to both this PR and #4193, and opened #4196 to address other instances in cats. In the end we decided not to update the licences key in sbt for the reasons discussed in #4189.

@@ -415,6 +415,12 @@ object arbitrary extends ArbitraryInstances0 with ScalaVersionSpecific.Arbitrary

implicit val catsLawsArbitraryForMiniInt: Arbitrary[MiniInt] =
Arbitrary(Gen.oneOf(MiniInt.allValues))

implicit def catsLawsArbitraryForHashSet[A](implicit A: Arbitrary[A], hash: Hash[A]): Arbitrary[HashSet[A]] =
Arbitrary(getArbitrary[List[A]].map(HashSet.fromSeq(_)(hash)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather see a different way to build a hashset. My opinion is basically this should have all the fundamental operations:

  1. empty
  2. fromSeq
  3. recurse + a
  4. recurse - a
  5. recurse0 | recurse1
  6. recurse0 & recurse1
  7. recurse0 -- recurse1

something like this (but you will need to use Gen.frequency to make sure the probability we branch into two is low enough that this doesn't go on forever.

The idea here is to simulate all the paths of creating a HashSet to make sure there aren't some buggy paths.

@DavidGregory084 DavidGregory084 changed the title WIP port of Scala CHAMP HashSet cats.kernel.Hash port for Scala CHAMP HashSet May 17, 2022
@armanbilge armanbilge added this to the 2.9.0 milestone Jun 11, 2022
@DavidGregory084 DavidGregory084 marked this pull request as ready for review June 16, 2022 15:37
@DavidGregory084
Copy link
Member Author

OK, as per discussion on #4193 I have reopened this as typelevel/cats-collections#533

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants