Improve traverse performance through increased specialization #4408

Comments
What about the case where the applicative is secretly a […]?
@armanbilge I haven't benchmarked that option, but I would imagine it's not that much faster than the tree + […]
I guess we'll have to measure it :) It avoids deconstructing the collection into a tree, which is already a win, and then on top of that it is eagerly evaluated instead of leaning on […]
Sorry for asking if it has been discussed somewhere else already, but I'm just wondering: does it have to be secretly a `StackSafeMonad`? Can it be expressed as a compile-time constraint? I think that is what the whole idea of type classes is for, after all.
The problem is that this would require additional overloads of the […]. It's essentially the same thought process we went through with the async […]. The only alternative I can think of in this case would be Oscar's `Apply.Strategy` approach. This is kind of the same as typecasing, but more explicit for this scenario. I'm definitely open to it if we think it's preferable.
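For illustration, here is a rough sketch of what the compile-time-constraint route would imply; the trait and the `traverseStackSafe` method are hypothetical (not the cats API) and only show why a second, more strongly constrained signature would have to sit alongside the existing `Applicative`-based one:

```scala
import cats.{Applicative, StackSafeMonad}

// Hypothetical sketch: a compile-time StackSafeMonad constraint cannot reuse the
// existing traverse signature, so it would have to be exposed as an additional
// method (or overload) that call sites and syntax must explicitly choose.
trait TraverseWithFastPath[F[_]] {
  // same shape as the existing Traverse#traverse in cats
  def traverse[G[_]: Applicative, A, B](fa: F[A])(f: A => G[B]): G[F[B]]

  // hypothetical extra entry point gated by the stronger constraint
  def traverseStackSafe[G[_]: StackSafeMonad, A, B](fa: F[A])(f: A => G[B]): G[F[B]]
}
```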
This is a follow-up on #3790, but breaking it out as a separate issue since we've realized that we can do better than what was originally suggested and discussed. It leverages the findings in #4403 and also relates to #3993.
Altogether, `traverse` has three distinct performance scenarios, depending on the nature of the underlying applicative `G` and the underlying traverse `F`:

- If `G` forms a `StackSafeMonad`, the fastest (by about 3x) strategy is to use the direct left-to-right approach with `flatMap`. There are no stack-safety issues, and any underlying short-circuiting is handled by the monad itself.
- If `G` forms a `Defer` (but not a `StackSafeMonad`), then we still need to do the tree branching strategy in order to avoid possible stack issues (just because an applicative forms a `Defer` does not mean its `ap` is stack safe!), but we no longer need to introduce the extra layer of `Eval` wrapping in order to guarantee short-circuiting.
- Otherwise, we need both the pessimistic tree branching and the `Eval` wrapping.
The first scenario corresponds to doing a `traverse` with `IO` or `Eval`, which are both `StackSafeMonad`s. In this case, the direct `flatMap` is the fastest thing you can do (verified by checking a hand-coded version). The second scenario corresponds to doing a `parTraverse` with `IO`, where the `G` applicative forms a `Defer` but does not form a `Monad`. In this case, we obviously can't use `flatMap`, but the tree branching strategy is optimal anyway because we want to balance the parallel scheduling. We can rely on the `G` itself to handle short-circuiting (to the extent that we get it within `par`) because of `Defer`. The third scenario corresponds to doing a `traverse` with something like `Option`, where we aren't stack safe and we can't form a `Defer`, so we just need to do the pessimistic tree branching (for stack safety) and `Eval` wrapping (for short-circuiting).
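For concreteness, here is a minimal sketch of that hand-coded first scenario. The `traverseDirect` name is made up for illustration, and the `StackSafeMonad` instance is passed explicitly because, per the above, it would normally be discovered at runtime rather than by implicit resolution:

```scala
import cats.StackSafeMonad

// Sketch only, not the actual cats implementation: with a stack-safe flatMap,
// a plain left-to-right fold needs no Chain/Eval machinery at all.
def traverseDirect[G[_], A, B](as: Vector[A])(f: A => G[B])(G: StackSafeMonad[G]): G[Vector[B]] =
  as.foldLeft(G.pure(Vector.empty[B])) { (acc, a) =>
    // sequence the effects left to right, appending each result
    G.flatMap(acc)(bs => G.map(f(a))(b => bs :+ b))
  }
```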
Right now, all `Seq`-based traversals delegate to `traverseViaChain`, which strictly implements the third (most pessimistic) scenario. What we want here are two additional implementations corresponding to the first and second scenarios (the second is very close to the third, just without all the `Eval.later`s), and the swapping between them will be based on runtime type casing on the `Applicative[G]` instance.
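As a rough illustration of that dispatch (the function name and branch details are assumptions, not the eventual cats internals), the idea is to inspect the `Applicative[G]` instance once and only pay for the pessimistic strategy when nothing cheaper applies:

```scala
import cats.{Applicative, StackSafeMonad, Traverse}

// Sketch only: runtime typecasing on the Applicative[G] instance. The fallback
// reuses the existing Eval-based strategy via Traverse[Vector]; a further
// Defer[G] check (scenario 2) could additionally drop the Eval wrapping.
def traverseSpecialized[G[_], A, B](as: Vector[A])(f: A => G[B])(implicit G: Applicative[G]): G[Vector[B]] =
  G match {
    case ssm: StackSafeMonad[G @unchecked] =>
      // scenario 1: direct left-to-right flatMap, no tree, no Eval
      as.foldLeft(ssm.pure(Vector.empty[B])) { (acc, a) =>
        ssm.flatMap(acc)(bs => ssm.map(f(a))(b => bs :+ b))
      }
    case _ =>
      // scenarios 2/3: delegate to the existing tree-branching + Eval path
      Traverse[Vector].traverse(as)(f)
  }
```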