Improve Chain Documentation #4386
Changes from all commits
1e463e0
6b56868
58de3cf
8e0a924
8925f2b
b63d539
432306a
356c95d
41d4a3f
fae4e93
db9a396
d1e8f62
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,65 +1,80 @@ | ||
# Chain | ||
|
||
`Chain` is a data structure that allows constant time prepending and appending. | ||
This makes it especially efficient when used as a `Monoid`, e.g. with `Validated` or `Writer`. | ||
As such it aims to be used where `List` and `Vector` incur a performance penalty. | ||
API Documentation: @:api(cats.data.Chain) | ||
|
||
`Chain` is an immutable sequence data structure that allows constant time prepending, appending and concatenation. | ||
This makes it especially efficient when used as a [Monoid], e.g. with [Validated] or [Writer]. | ||
As such it aims to be used where @:api(scala.collection.immutable.List) and @:api(scala.collection.immutable.Vector) incur a performance penalty. | ||
Cats also includes type class implementations to support using `Chain` as a general-purpose collection type, including [Traverse], [Monad], and [Alternative]. | ||
|
||
## Motivation | ||
|
||
`List` is a great data type, it is very simple and easy to understand. | ||
It has very low overhead for the most important functions such as `fold` and `map` and also supports prepending a single element in constant time. | ||
It has very low overhead for the most important functions such as [fold][Foldable] and [map][Functor] and also supports prepending a single element in constant time. | ||
|
||
Traversing a data structure with something like `Writer[List[Log], A]` or `ValidatedNel[Error, A]` is powerful and allows us to precisely specify what kind of iteration we want to do while remaining succint. | ||
Traversing a data structure with something like [Writer\[List\[Log\], A\]][Writer] or [ValidatedNel\[Error, A\]][Validated] is powerful and allows us to precisely specify what kind of iteration we want to do while remaining succinct. | ||
However, in terms of efficiency it's a whole different story unfortunately. | ||
That is because both of these traversals make use of the `List` monoid (or the `NonEmptyList` semigroup), which by the nature of `List` is very inefficient. | ||
If you use `traverse` with a data structure with `n` elements and `Writer` or `Validated` as the `Applicative` type, you will end up with a runtime of `O(n^2)`. | ||
That is because both of these traversals make use of the `List` monoid (or the [NonEmptyList] semigroup), which by the nature of `List` is very inefficient. | ||
If you use [traverse][Traverse] with a data structure with `n` elements and [Writer] or [Validated] as the [Applicative] type, you will end up with a runtime of `O(n^2)`. | ||
This is because, with `List`, appending a single element requires iterating over the entire data structure and therefore takes linear time. | ||
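To make the quadratic cost concrete, here is a rough sketch that uses only the standard library. It assumes that appending one element to an immutable `List` of length `k` rebuilds all `k` existing cells, and counts those copies while accumulating `n` elements one at a time. The helper name `cellsCopied` is hypothetical and exists only for this illustration.

```scala
// Accumulate by appending, the way a List-based monoid combines results,
// and count how many existing cells each append has to copy.
def cellsCopied(n: Int): Long = {
  var acc = List.empty[Int]
  var copied = 0L
  for (i <- 0 until n) {
    copied += acc.length // `acc :+ i` rebuilds every existing cell of `acc`
    acc = acc :+ i
  }
  copied
}

cellsCopied(1000) // 499500, i.e. n * (n - 1) / 2
```

The count grows as `n * (n - 1) / 2`, which is where the `O(n^2)` runtime comes from.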
|
||
So `List` isn't all that great for this use case, so let's use `Vector` or `NonEmptyVector` instead, right? | ||
So @:api(scala.collection.immutable.List) isn't all that great for this use case, so let's use @:api(scala.collection.immutable.Vector) or @:api(cats.data.NonEmptyVector) instead, right? | ||
|
||
Well, `Vector` has its own problems and in this case it's unfortunately not that much faster than `List` at all. You can check [this blog post](http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html#vectors-are-ok) by Li Haoyi for some deeper insight into `Vector`'s issues. | ||
|
||
|
||
`Chain` evolved from what used to be `fs2.Catenable` and Erik Osheim's [Chain](https://github.com/non/chain ) library. | ||
Similar to `List`, it is also a very simple data structure, but unlike `List` it supports both constant O(1) time `append` and `prepend`. | ||
This makes its `Monoid` instance super performant and a much better fit for usage with `Validated`,`Writer`, `Ior` or `Const`. | ||
`Chain` evolved from what used to be `fs2.Catenable` and Erik Osheim's [Chain](https://github.com/non/chain) library. | ||
Similar to `List`, it is also a very simple data structure, but unlike `List` it supports constant O(1) time `append`, `prepend` and `concat`. | ||
This makes its [Monoid] instance [super performant][Benchmarks] and a much better fit for usage with [Validated], [Writer], [Ior] or [Const]. | ||
|
||
To utilize this Cats includes type aliases like `ValidatedNec` or `IorNec` as well as helper functions like `groupByNec` or `Validated.invalidNec`. | ||
reardonj marked this conversation as resolved.
|
||
|
||
To get a good idea of the performance improvements, here are some benchmarks that test monoidal append (higher score is better): | ||
## NonEmptyChain | ||
|
||
``` | ||
[info] Benchmark Mode Cnt Score Error Units | ||
[info] CollectionMonoidBench.accumulateChain thrpt 20 51.911 ± 7.453 ops/s | ||
[info] CollectionMonoidBench.accumulateList thrpt 20 6.973 ± 0.781 ops/s | ||
[info] CollectionMonoidBench.accumulateVector thrpt 20 6.304 ± 0.129 ops/s | ||
``` | ||
[NonEmptyChain][nec] is the non-empty version of `Chain`. | ||
It does not have a [Monoid] instance since it cannot be empty, but it does have a [Semigroup] instance. | ||
Likewise, it defines a [NonEmptyTraverse] instance, but no @:api(cats.TraverseFilter) instance. | ||
|
||
As you can see accumulating things with `Chain` is more than 7 times faster than `List` and over 8 times faster than `Vector`. | ||
So appending is a lot more performant than the standard library collections, but what about operations like `map` or `fold`? | ||
Fortunately we've also benchmarked these (again, higher score is better): | ||
To simplify the usage of `NonEmptyChain`, Cats includes type aliases like [ValidatedNec](validated.md#meeting-applicative) and [IorNec](ior.md#using-with-nonemptychain), as well as helper functions like `groupByNec` and `Validated.invalidNec`. | ||
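As a small illustration of how these aliases and helpers fit together, the sketch below accumulates errors in a `ValidatedNec`. It assumes Cats is on the classpath; the `positive` function and its error messages are made up for this example.

```scala
import cats.data.ValidatedNec
import cats.syntax.all._

// Hypothetical validation rule: accept only positive numbers.
def positive(n: Int): ValidatedNec[String, Int] =
  if (n > 0) n.validNec[String]
  else s"$n is not positive".invalidNec[Int]

// mapN combines the three results; failures are concatenated through the
// NonEmptyChain semigroup instead of stopping at the first error.
val result: ValidatedNec[String, Int] =
  (positive(1), positive(-2), positive(-3)).mapN(_ + _ + _)
```

Here `result` is invalid and keeps both error messages, in the order the failing inputs appear.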
|
||
There are numerous ways to construct a `NonEmptyChain`, e.g. you can create one from a single element, a `NonEmptyList` or a `NonEmptyVector`: | ||
|
||
```scala mdoc | ||
import cats.data._ | ||
|
||
NonEmptyChain(1, 2, 3, 4) | ||
|
||
NonEmptyChain.fromNonEmptyList(NonEmptyList(1, List(2, 3))) | ||
NonEmptyChain.fromNonEmptyVector(NonEmptyVector(1, Vector(2, 3))) | ||
|
||
NonEmptyChain.one(1) | ||
``` | ||
[info] Benchmark Mode Cnt Score Error Units | ||
[info] ChainBench.foldLeftLargeChain thrpt 20 117.267 ± 1.815 ops/s | ||
[info] ChainBench.foldLeftLargeList thrpt 20 135.954 ± 3.340 ops/s | ||
[info] ChainBench.foldLeftLargeVector thrpt 20 61.613 ± 1.326 ops/s | ||
[info] | ||
[info] ChainBench.mapLargeChain thrpt 20 59.379 ± 0.866 ops/s | ||
[info] ChainBench.mapLargeList thrpt 20 66.729 ± 7.165 ops/s | ||
[info] ChainBench.mapLargeVector thrpt 20 61.374 ± 2.004 ops/s | ||
|
||
|
||
|
||
You can also create an @:api(scala.Option) of `NonEmptyChain` from a `Chain` or any other collection type: | ||
|
||
```scala mdoc | ||
import cats.data._ | ||
|
||
NonEmptyChain.fromChain(Chain(1, 2, 3)) | ||
NonEmptyChain.fromSeq(List.empty[Int]) | ||
NonEmptyChain.fromSeq(Vector(1, 2, 3)) | ||
``` | ||
|
||
While not as dominant, `Chain` holds its ground fairly well. | ||
It won't have the random access performance of something like `Vector`, but in a lot of other cases, `Chain` seems to outperform it quite handily. | ||
So if you don't perform a lot of random access on your data structure, then you should be fine using `Chain` extensively instead. | ||
Sometimes, you'll want to prepend or append a single element to a chain and return the result as a `NonEmptyChain`: | ||
|
||
So next time you write any code that uses `List` or `Vector` as a `Monoid`, be sure to use `Chain` instead! | ||
You can also check out the benchmarks [here](https://github.com/typelevel/cats/blob/v1.3.0/bench/src/main/scala/cats/bench). | ||
```scala mdoc | ||
import cats.data._ | ||
|
||
NonEmptyChain.fromChainAppend(Chain(1, 2, 3), 4) | ||
NonEmptyChain.fromChainAppend(Chain.empty[Int], 1) | ||
NonEmptyChain.fromChainPrepend(1, Chain(2, 3)) | ||
``` | ||
## How it works | ||
|
||
`Chain` is a fairly simple data structure compared to something like `Vector`. | ||
It's a simple ADT that has only 4 cases. | ||
It is either an empty `Chain` with no elements, a singleton `Chain` with exactly one element, a concatenation of two chains or a wrapper for another collection. | ||
`Chain` is implemented as a simple unbalanced binary tree ADT with four cases: | ||
an empty `Chain` with no elements, a singleton `Chain` with exactly one element, a concatenation of two chains, or a wrapper for a @:api(scala.collection.immutable.Seq). | ||
|
||
In code it looks like this: | ||
|
||
```scala mdoc | ||
|
@@ -72,7 +87,7 @@ case class Wrap[A](seq: Seq[A]) extends Chain[A] | |
``` | ||
|
||
The `Append` constructor is what gives us the fast concatenation ability. | ||
Concatenating two existing `Chain`s, is just a call to the `Append` constructor, which is always constant time `O(1)`. | ||
Concatenating two existing `Chain`s is just a call to the `Append` constructor, which is always constant time `O(1)`. | ||
|
||
In case we want to append or prepend a single element, | ||
all we have to do is wrap the element with the `Singleton` constructor and then use the `Append` constructor to append or prepend the `Singleton` `Chain`. | ||
|
@@ -100,48 +115,68 @@ def fromSeq[A](s: Seq[A]): Chain[A] = | |
else Wrap(s) | ||
``` | ||
|
||
|
||
|
||
In conclusion `Chain` supports constant time appending and prepending, because it builds an unbalance tree of `Append`s. | ||
In conclusion, `Chain` supports constant time concatenation because it builds an unbalanced tree of `Append`s. | ||
`append` and `prepend` are treated as concatenation with a single-element collection, which keeps the same performance characteristics. | ||
This unbalanced tree will always allow iteration in linear time. | ||
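The idea can be sketched in a few lines of plain Scala. This is a simplified model for illustration, not the Cats implementation: the name `MiniChain` and its methods are hypothetical, and only the four-case shape follows the description above. Concatenation allocates a single `Append` node, and `toList` walks the tree with an explicit stack so that each node is visited once.

```scala
sealed trait MiniChain[+A] {
  import MiniChain._

  // Concatenation allocates at most one Append node: O(1).
  def ++[B >: A](that: MiniChain[B]): MiniChain[B] = (this, that) match {
    case (Empty, r) => r
    case (l, Empty) => l
    case (l, r)     => Append(l, r)
  }

  // Appending or prepending an element is concatenation with a Singleton.
  def :+[B >: A](b: B): MiniChain[B] = this ++ Singleton(b)
  def +:[B >: A](b: B): MiniChain[B] = Singleton(b) ++ this

  // Linear-time traversal: visit the tree right to left with an explicit
  // stack, prepending to the result so elements end up in order.
  def toList: List[A] = {
    var stack: List[MiniChain[A]] = List(this)
    var out = List.empty[A]
    while (stack.nonEmpty) {
      val node = stack.head
      stack = stack.tail
      node match {
        case Empty        => ()
        case Singleton(a) => out = a :: out
        case Wrap(seq)    => out = seq.toList ::: out
        case Append(l, r) => stack = r :: l :: stack // right side is popped first
      }
    }
    out
  }
}

object MiniChain {
  case object Empty extends MiniChain[Nothing]
  final case class Singleton[A](a: A) extends MiniChain[A]
  final case class Append[A](left: MiniChain[A], right: MiniChain[A]) extends MiniChain[A]
  final case class Wrap[A](seq: Seq[A]) extends MiniChain[A]
}

val c = (0 +: MiniChain.Wrap(Vector(1, 2))) ++ (MiniChain.Singleton(3) :+ 4)
c.toList // List(0, 1, 2, 3, 4)
```

However lopsided the tree of `Append` nodes becomes, the traversal does a constant amount of work per node, which is why iteration stays linear in this model.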
|
||
## Benchmarks | ||
|
||
## NonEmptyChain | ||
|
||
`NonEmptyChain` is the non empty version of `Chain` it does not have a `Monoid` instance since it cannot be empty, but it does have a `Semigroup` instance. | ||
Likewise, it defines a `NonEmptyTraverse` instance, but no `TraverseFilter` instance. | ||
|
||
There are numerous ways to construct a `NonEmptyChain`, e.g. you can create one from a single element, a `NonEmptyList` or a `NonEmptyVector`: | ||
|
||
```scala mdoc | ||
import cats.data._ | ||
|
||
NonEmptyChain(1, 2, 3, 4) | ||
To get a good idea of performance of `Chain`, here are some benchmarks that test monoidal append (higher score is better): | ||
|
||
NonEmptyChain.fromNonEmptyList(NonEmptyList(1, List(2, 3))) | ||
NonEmptyChain.fromNonEmptyVector(NonEmptyVector(1, Vector(2, 3))) | ||
|
||
NonEmptyChain.one(1) | ||
``` | ||
Benchmark Mode Cnt Score Error Units | ||
CollectionMonoidBench.accumulateChain thrpt 25 81.973 ± 3.921 ops/s | ||
CollectionMonoidBench.accumulateList thrpt 25 21.150 ± 1.756 ops/s | ||
CollectionMonoidBench.accumulateVector thrpt 25 11.725 ± 0.306 ops/s | ||
``` | ||
|
||
As you can see, accumulating things with `Chain` is almost 4 times faster than `List` and roughly 7 times faster than `Vector`. | ||
So appending is a lot more performant than the standard library collections, but what about operations like `map` or `fold`? | ||
Fortunately we've also benchmarked these (again, higher score is better): | ||
|
||
``` | ||
Benchmark Mode Cnt Score Error Units | ||
ChainBench.consLargeChain thrpt 25 143759156.264 ± 5611584.788 ops/s | ||
ChainBench.consLargeList thrpt 25 148512687.273 ± 5992793.489 ops/s | ||
ChainBench.consLargeVector thrpt 25 7249505.257 ± 202436.549 ops/s | ||
ChainBench.consSmallChain thrpt 25 119925876.637 ± 1663011.363 ops/s | ||
ChainBench.consSmallList thrpt 25 152664330.695 ± 1828399.646 ops/s | ||
ChainBench.consSmallVector thrpt 25 57686442.030 ± 533768.670 ops/s | ||
ChainBench.createChainOption thrpt 25 167191685.222 ± 1474976.197 ops/s | ||
ChainBench.createChainSeqOption thrpt 25 21264365.364 ± 372757.348 ops/s | ||
ChainBench.createSmallChain thrpt 25 87260308.052 ± 960407.889 ops/s | ||
ChainBench.createSmallList thrpt 25 20000981.857 ± 396001.340 ops/s | ||
ChainBench.createSmallVector thrpt 25 26311376.712 ± 288871.258 ops/s | ||
ChainBench.createTinyChain thrpt 25 75311482.869 ± 1066466.694 ops/s | ||
ChainBench.createTinyList thrpt 25 67502351.990 ± 1071560.419 ops/s | ||
ChainBench.createTinyVector thrpt 25 39676430.380 ± 405717.649 ops/s | ||
ChainBench.foldLeftLargeChain thrpt 25 117.866 ± 3.343 ops/s | ||
ChainBench.foldLeftLargeList thrpt 25 193.640 ± 2.298 ops/s | ||
ChainBench.foldLeftLargeVector thrpt 25 178.370 ± 0.830 ops/s | ||
ChainBench.foldLeftSmallChain thrpt 25 43732934.777 ± 362285.965 ops/s | ||
ChainBench.foldLeftSmallList thrpt 25 51155941.055 ± 882005.961 ops/s | ||
ChainBench.foldLeftSmallVector thrpt 25 41902918.940 ± 53030.742 ops/s | ||
ChainBench.lengthLargeChain thrpt 25 131831.918 ± 1613.341 ops/s | ||
ChainBench.lengthLargeList thrpt 25 271.015 ± 0.962 ops/s | ||
ChainBench.mapLargeChain thrpt 25 78.162 ± 2.620 ops/s | ||
ChainBench.mapLargeList thrpt 25 73.676 ± 8.999 ops/s | ||
ChainBench.mapLargeVector thrpt 25 132.443 ± 2.360 ops/s | ||
ChainBench.mapSmallChain thrpt 25 24047623.583 ± 1834073.508 ops/s | ||
ChainBench.mapSmallList thrpt 25 21482014.328 ± 387854.819 ops/s | ||
ChainBench.mapSmallVector thrpt 25 34707281.383 ± 382477.558 ops/s | ||
ChainBench.reverseLargeChain thrpt 25 37700.549 ± 154.942 ops/s | ||
ChainBench.reverseLargeList thrpt 25 142.832 ± 3.626 ops/s | ||
``` | ||
|
||
You can also create an `Option` of `NonEmptyChain` from a `Chain` or any other collection type: | ||
|
||
```scala mdoc | ||
import cats.data._ | ||
While not dominant, `Chain` performance is in the middle of the pack for most operations benchmarked. | ||
`Chain` does have poor random access performance, and should be avoided in favor of `Vector` for random access heavy use cases. | ||
|
||
NonEmptyChain.fromChain(Chain(1, 2, 3)) | ||
NonEmptyChain.fromSeq(List.empty[Int]) | ||
NonEmptyChain.fromSeq(Vector(1, 2, 3)) | ||
``` | ||
`Chain` excels with concatenation-heavy workloads and has comparable performance to `List` and `Vector` for most other operations. | ||
So next time you write any code that uses `List` or `Vector` as a [Monoid], be sure to use `Chain` instead! | ||
|
||
Sometimes, you'll want to prepend or append a single element to a chain and return the result as a `NonEmptyChain`: | ||
> Note: All benchmarks above were run using JMH 1.32 with Scala 2.13.8 on JDK 11. | ||
For full details, see [here](https://github.com/typelevel/cats/pull/4264). | ||
You can also check out the [benchmark source code](https://github.com/typelevel/cats/blob/v@VERSION@/bench/src/main/scala/cats/bench). | ||
|
||
```scala mdoc | ||
import cats.data._ | ||
|
||
NonEmptyChain.fromChainAppend(Chain(1, 2, 3), 4) | ||
NonEmptyChain.fromChainAppend(Chain.empty[Int], 1) | ||
NonEmptyChain.fromChainPrepend(1, Chain(2, 3)) | ||
``` | ||
[nec]: @API_LINK_BASE@/cats/data/index.html#NonEmptyChain:cats.data.NonEmptyChainImpl.type | ||
This is what the mdoc var is for. Laika can't link to this, but this seemed like a reasonable approximation of linking without hardcoding too much.

Laika's own ${vars} also don't work in links, I tried that first 🤷‍♂️

If you have a moment, might be good to open an issue for this in Laika.

I'm not sure how relevant that blog post is for the 2.13 collections `Vector`. Maybe to 2.12, not sure.

It will be interesting to see how relevant Chain remains after 2.13.11 Vector updates.