Optimize traverse #3283

travisbrown · 2020-02-03T10:08:07Z

tl;dr: Traversing a List or Vector is probably the most common operation people do with this library, and the current implementations for these types have some room for optimization. The changes in this PR give up to roughly 20% more throughput for List and more than 200% for Vector at the larger sizes.

I've put together a benchmark that compares the current traverse for List with two new implementations:

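import cats.{Always, Applicative, Eval}

// Right fold from the standard library, with the Eval accumulator threaded by hand.
def traverseFoldRight[G[_], A, B](fa: List[A])(f: A => G[B])(implicit G: Applicative[G]): G[List[B]] =
  fa.foldRight[Eval[G[List[B]]]](Always(G.pure(Nil))) {
      case (h, t) => G.map2Eval(f(h), Eval.defer(t))(_ :: _)
    }
    .value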

def traverseRec[G[_], A, B](fa: List[A])(f: A => G[B])(implicit G: Applicative[G]): G[List[B]] = {
  def loop(fa: List[A]): Eval[G[List[B]]] = fa match {
    case h :: t => G.map2Eval(f(h), Eval.defer(loop(t)))(_ :: _)
    case Nil    => Eval.now(G.pure(Nil))
  }
  loop(fa).value
}

The first uses the standard library's foldRight with an explicit Eval accumulator, instead of the Eval-based foldRight on Foldable. The second effectively just inlines the call to Cats's foldRight in the current implementation.
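One behavioral difference between the two candidates is worth making concrete. The sketch below re-declares both functions (with `List.empty[B]` spelled out to keep type inference simple) and counts calls to a hypothetical `CountingParser`; the object name, the `ErrorOr` alias, and the parser are illustrative assumptions, not part of the PR. It relies on the Either instance's `map2Eval` stopping at the first `Left`, and on the standard library's `foldRight` being strict.

```scala
import cats.{Always, Applicative, Eval}
import cats.implicits._
import scala.util.Try

object ShortCircuitDemo {
  type ErrorOr[A] = Either[String, A]

  // Same shape as traverseFoldRight in the description.
  def traverseFoldRight[G[_], A, B](fa: List[A])(f: A => G[B])(implicit G: Applicative[G]): G[List[B]] =
    fa.foldRight[Eval[G[List[B]]]](Always(G.pure(List.empty[B]))) {
      case (h, t) => G.map2Eval(f(h), Eval.defer(t))(_ :: _)
    }.value

  // Same shape as traverseRec in the description.
  def traverseRec[G[_], A, B](fa: List[A])(f: A => G[B])(implicit G: Applicative[G]): G[List[B]] = {
    def loop(rest: List[A]): Eval[G[List[B]]] = rest match {
      case h :: t => G.map2Eval(f(h), Eval.defer(loop(t)))(_ :: _)
      case Nil    => Eval.now(G.pure(List.empty[B]))
    }
    loop(fa).value
  }

  // Hypothetical helper: records how many times the parser runs.
  final class CountingParser {
    var calls = 0
    def parse(s: String): ErrorOr[Int] = {
      calls += 1
      Try(s.toInt).toOption.toRight(s"not a number: $s")
    }
  }

  val input = List("1", "x", "3")

  // The recursive version only reaches the tail if map2Eval forces it.
  val recParser = new CountingParser
  val recResult = traverseRec[ErrorOr, String, Int](input)(recParser.parse)

  // The standard-library fold applies f to every element, right to left.
  val foldParser = new CountingParser
  val foldResult = traverseFoldRight[ErrorOr, String, Int](input)(foldParser.parse)
}
```

Both calls return the same `Left`, but the recursive version stops invoking the parser after the failing element, while the fold-based version visits all three elements before the result is forced.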

The recursive version is faster than the current implementation at every size when traversing with Right(_), while the foldRight version is faster for the two smaller sizes and roughly on par for the larger ones (results shown for list sizes 10¹, 10², 10³, and 10⁴; higher numbers are better; all results are shown for Scala 2.13, but 2.12 is similar):

Benchmark                                       Mode  Cnt        Score       Error  Units
TraverseListBench.traverseCats1                thrpt   20  2522518.924 ±  4088.634  ops/s
TraverseListBench.traverseCats2                thrpt   20   284154.249 ±  1386.556  ops/s
TraverseListBench.traverseCats3                thrpt   20    26490.162 ±   764.213  ops/s
TraverseListBench.traverseCats4                thrpt   20     2645.683 ±     2.779  ops/s
TraverseListBench.traverseFoldRight1           thrpt   20  3109083.857 ± 10545.898  ops/s
TraverseListBench.traverseFoldRight2           thrpt   20   325352.357 ±   482.879  ops/s
TraverseListBench.traverseFoldRight3           thrpt   20    26009.438 ±    89.164  ops/s
TraverseListBench.traverseFoldRight4           thrpt   20     2609.019 ±    15.222  ops/s
TraverseListBench.traverseRec1                 thrpt   20  3053589.800 ±  8292.173  ops/s
TraverseListBench.traverseRec2                 thrpt   20   340495.016 ±   803.026  ops/s
TraverseListBench.traverseRec3                 thrpt   20    30449.658 ±    58.876  ops/s
TraverseListBench.traverseRec4                 thrpt   20     2945.153 ±     3.059  ops/s
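
To read the table as relative gains, the throughput of each candidate can be divided by the current implementation's throughput at the same size. The sketch below does that with the scores copied from the table; the object and method names are illustrative assumptions. With these numbers, traverseRec comes out roughly 11% to 21% ahead, and traverseFoldRight is ahead only for the two smaller sizes.

```scala
object ListSpeedup {
  // Scores (ops/s) from the table above, for list sizes 10, 100, 1000, 10000.
  val current: Vector[Double]   = Vector(2522518.924, 284154.249, 26490.162, 2645.683)
  val foldRight: Vector[Double] = Vector(3109083.857, 325352.357, 26009.438, 2609.019)
  val rec: Vector[Double]       = Vector(3053589.800, 340495.016, 30449.658, 2945.153)

  // Relative change in percent; positive means faster than the baseline.
  def gainPercent(candidate: Double, baseline: Double): Double =
    (candidate / baseline - 1.0) * 100.0

  val foldRightGains: Vector[Double] =
    foldRight.zip(current).map { case (c, b) => gainPercent(c, b) }

  val recGains: Vector[Double] =
    rec.zip(current).map { case (c, b) => gainPercent(c, b) }

  def main(args: Array[String]): Unit = {
    println(foldRightGains.map(g => "%.1f%%".format(g)).mkString("foldRight: ", ", ", ""))
    println(recGains.map(g => "%.1f%%".format(g)).mkString("rec:       ", ", ", ""))
  }
}
```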

The recursive loop implementation (traverseRec) also allocates less:

Benchmark                                                Mode  Cnt        Score        Error   Units
TraverseListBench.traverseCats1:gc.alloc.rate.norm      thrpt    5     2056.000 ±      0.001    B/op
TraverseListBench.traverseCats2:gc.alloc.rate.norm      thrpt    5    18616.000 ±      0.001    B/op
TraverseListBench.traverseCats3:gc.alloc.rate.norm      thrpt    5   198168.002 ±      0.001    B/op
TraverseListBench.traverseCats4:gc.alloc.rate.norm      thrpt    5  1998168.018 ±      0.012    B/op
TraverseListBench.traverseRec1:gc.alloc.rate.norm       thrpt    5     1728.000 ±      0.001    B/op
TraverseListBench.traverseRec2:gc.alloc.rate.norm       thrpt    5    16848.000 ±      0.001    B/op
TraverseListBench.traverseRec3:gc.alloc.rate.norm       thrpt    5   182016.001 ±      0.001    B/op
TraverseListBench.traverseRec4:gc.alloc.rate.norm       thrpt    5  1838016.016 ±      0.009    B/op
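
Dividing the difference in bytes per operation by the list size gives a rough per-element saving. The sketch below uses the figures from the allocation table; the names are illustrative assumptions. For the larger lists the saving settles at about 16 bytes per element, with the smallest list showing a larger per-element figure because fixed overhead dominates there.

```scala
object AllocationDelta {
  val sizes: Vector[Int] = Vector(10, 100, 1000, 10000)

  // gc.alloc.rate.norm (B/op) from the table above.
  val currentBytes: Vector[Double] = Vector(2056.000, 18616.000, 198168.002, 1998168.018)
  val recBytes: Vector[Double]     = Vector(1728.000, 16848.000, 182016.001, 1838016.016)

  // Bytes saved per traversed element at each list size.
  val savedPerElement: Vector[Double] =
    sizes.zip(currentBytes.zip(recBytes)).map { case (n, (cur, rec)) => (cur - rec) / n }
}
```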

The results for a more complex parsing operation in ValidatedNel are similar.

I've done a similar comparison for Vector, but with an additional new candidate:

def traverseIter[G[_], A, B](fa: Vector[A])(f: A => G[B])(implicit G: Applicative[G]): G[Vector[B]] = {
  var i = fa.length - 1
  var current: Eval[G[Vector[B]]] = Eval.now(G.pure(Vector.empty))

  while (i >= 0) {
    current = G.map2Eval(f(fa(i)), current)(_ +: _)
    i -= 1
  }

  current.value
}

I've also included implementations of all three new approaches for Vector that accumulate the result in a List and then convert at the end.
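
The via-List variants are not shown in this description. As a minimal sketch of what the recursive one might look like, assuming it indexes into the Vector, prepends to a List, and converts once at the end (the object name and the example call are illustrative assumptions, and the benchmark's actual code may differ):

```scala
import cats.{Applicative, Eval}
import cats.implicits._

object ViaListSketch {
  // Hypothetical reconstruction of a "ViaList" traverse for Vector.
  def traverseRecViaList[G[_], A, B](fa: Vector[A])(f: A => G[B])(implicit G: Applicative[G]): G[Vector[B]] = {
    // Walk the indices left to right, deferring the rest of the traversal.
    def loop(i: Int): Eval[G[List[B]]] =
      if (i < fa.length) G.map2Eval(f(fa(i)), Eval.defer(loop(i + 1)))(_ :: _)
      else Eval.now(G.pure(List.empty[B]))

    // A single conversion at the end instead of one Vector prepend per element.
    G.map(loop(0).value)(_.toVector)
  }

  val doubled: Option[Vector[Int]] =
    traverseRecViaList[Option, Int, Int](Vector(1, 2, 3))(n => Some(n * 2))
}
```

The idea is that prepending to a List is a single cons cell per element, whereas `+:` on a Vector does more work per call, so paying for one `toVector` at the end can come out ahead.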


Benchmark                                       Mode  Cnt        Score       Error  Units
TraverseVectorBench.traverseCats1              thrpt   20  1716229.387 ± 12903.229  ops/s
TraverseVectorBench.traverseCats2              thrpt   20    98885.248 ±   187.467  ops/s
TraverseVectorBench.traverseCats3              thrpt   20     8095.486 ±    86.914  ops/s
TraverseVectorBench.traverseCats4              thrpt   20      786.319 ±     9.118  ops/s
TraverseVectorBench.traverseFoldRight1         thrpt   20  1940918.153 ±  4653.590  ops/s
TraverseVectorBench.traverseFoldRight2         thrpt   20    99467.268 ±   151.832  ops/s
TraverseVectorBench.traverseFoldRight3         thrpt   20     7906.035 ±    25.461  ops/s
TraverseVectorBench.traverseFoldRight4         thrpt   20      768.825 ±     4.546  ops/s
TraverseVectorBench.traverseFoldRightViaList1  thrpt   20  2444454.783 ± 27744.679  ops/s
TraverseVectorBench.traverseFoldRightViaList2  thrpt   20   250501.174 ±  1286.555  ops/s
TraverseVectorBench.traverseFoldRightViaList3  thrpt   20    22235.074 ±    55.709  ops/s
TraverseVectorBench.traverseFoldRightViaList4  thrpt   20     2195.451 ±     3.826  ops/s
TraverseVectorBench.traverseIter1              thrpt   20  1845529.178 ±  1799.628  ops/s
TraverseVectorBench.traverseIter2              thrpt   20    98067.794 ±   408.574  ops/s
TraverseVectorBench.traverseIter3              thrpt   20     8032.515 ±    49.259  ops/s
TraverseVectorBench.traverseIter4              thrpt   20      765.116 ±     3.384  ops/s
TraverseVectorBench.traverseIterViaList1       thrpt   20  2409083.473 ±  2141.445  ops/s
TraverseVectorBench.traverseIterViaList2       thrpt   20   255852.261 ±   488.992  ops/s
TraverseVectorBench.traverseIterViaList3       thrpt   20    22926.371 ±   134.168  ops/s
TraverseVectorBench.traverseIterViaList4       thrpt   20     2160.138 ±     2.741  ops/s
TraverseVectorBench.traverseRec1               thrpt   20  1994461.861 ± 12887.488  ops/s
TraverseVectorBench.traverseRec2               thrpt   20   101952.832 ±   233.014  ops/s
TraverseVectorBench.traverseRec3               thrpt   20     8120.346 ±    82.347  ops/s
TraverseVectorBench.traverseRec4               thrpt   20      792.459 ±    28.741  ops/s
TraverseVectorBench.traverseRecViaList1        thrpt   20  2628040.643 ± 35243.584  ops/s
TraverseVectorBench.traverseRecViaList2        thrpt   20   279305.281 ±   381.740  ops/s
TraverseVectorBench.traverseRecViaList3        thrpt   20    24388.417 ±    40.075  ops/s
TraverseVectorBench.traverseRecViaList4        thrpt   20     2728.899 ±     2.560  ops/s

Again the recursive loop implementation is fastest, but only the variant that accumulates in a List (traverseRecViaList), not the one that builds a Vector directly.

I've made these changes for List, Vector, and 2.13's ArraySeq, but not for Chain, Stream, or LazyList.

codecov-io · 2020-02-03T11:00:15Z

Codecov Report

Merging #3283 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3283      +/-   ##
==========================================
- Coverage   93.14%   93.14%   -0.01%     
==========================================
  Files         378      378              
  Lines        7576     7575       -1     
  Branches      203      194       -9     
==========================================
- Hits         7057     7056       -1     
  Misses        519      519

Flag	Coverage Δ
#scala_version_212	`93.39% <100%> (-0.01%)`	⬇️
#scala_version_213	`92.91% <100%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
core/src/main/scala/cats/instances/list.scala	`100% <100%> (ø)`	⬆️
...src/main/scala-2.13+/cats/instances/arraySeq.scala	`100% <100%> (ø)`	⬆️
core/src/main/scala/cats/instances/vector.scala	`100% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bb7a180...e5e2968.

LukaJCB

Nice, thank you!

kailuowang

Thanks. I am now curious, are there other opportunities for optimization by inlining foldRight?

djspiewak

At first glance, seems like we could potentially shave even more off of this with some dirtier internal tricks. This is a great start though.

Performance improvements on generic typeclass operations generally don't particularly interest me, since they should never be in the hot path anyway, but faster is always better than slower, regardless of the context.

travisbrown added 2 commits February 3, 2020 03:45

Add benchmark for traverse implementations

2096d7c

Optimize traverse for List, Vector, and ArraySeq

e5e2968

LukaJCB approved these changes Feb 3, 2020

kailuowang approved these changes Feb 3, 2020

djspiewak approved these changes Feb 3, 2020

djspiewak merged commit 56c1527 into typelevel:master Feb 3, 2020

travisbrown added this to the 2.2.0-M1 milestone Feb 18, 2020

travisbrown added the enhancement label Feb 18, 2020
