
IOLocal propagation for unsafe access #3636

Merged
merged 39 commits into typelevel:series/3.x on Nov 22, 2024

Conversation

armanbilge
Member

Still needs quite a bit of work, but wanted to sketch the basic idea.

Goal: to expose a fiber's IOLocals as ThreadLocals within a side-effecting block, so that they can be accessed and modified in unsafe land.

Motivation: basically telemetry Java interop.

Constraints: to do this as safely as possible 😅
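
As a rough sketch of the idea from user land (the `unsafeThreadLocal` accessor name here is illustrative, not confirmed by this thread, and propagation would also need the opt-in flag discussed below):

```scala
import cats.effect.{IO, IOLocal}

// Sketch: a Java telemetry library reads and writes the fiber's local
// through a ThreadLocal view while we are inside a side-effecting block.
def traced(traceId: IOLocal[String])(javaSideEffect: Runnable): IO[Unit] =
  traceId.set("abc-123") *>
    IO.delay {
      val tl: ThreadLocal[String] = traceId.unsafeThreadLocal() // illustrative accessor
      val before = tl.get()     // sees "abc-123", the fiber's current value
      tl.set(before + "/child") // mutation flows back into the fiber's IOLocal
      javaSideEffect.run()
    }
```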

@armanbilge armanbilge marked this pull request as draft May 16, 2023 19:35
@@ -43,4 +43,6 @@ final class IOFiberConstants {
   static final byte CedeR = 6;
   static final byte AutoCedeR = 7;
   static final byte DoneR = 8;
+
+  static final boolean dumpLocals = Boolean.getBoolean("cats.effect.tracing.dumpLocals");
armanbilge (Member Author)

Bikesheddable configuration for opting-in. So the rest of us don't have to pay the penalty 😇

rossabaker (Member)

Is this specifically "tracing", even if that's the most obvious use case?

armanbilge (Member Author)

Oh woops, this was a very lazy copy-pasta. I copied it from the system properties we use to configure fiber tracing. We should rename it anyway; dumpLocals is not quite right, I think 😅


What about cats.effect.localContextPropagation, similar to Monix's monix.environment.localContextPropagation?

armanbilge (Member Author)

Thanks, I liked that! I went with cats.effect.ioLocalPropagation.
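
(For anyone trying a snapshot: a flag like this is just a JVM system property, so `java -Dcats.effect.ioLocalPropagation=true ...` enables it. The sbt wiring below is one illustrative way to do the same for `run`.)

```scala
// build.sbt (illustrative): fork and pass the opt-in flag to the forked JVM
Compile / run / fork := true
Compile / run / javaOptions += "-Dcats.effect.ioLocalPropagation=true"
```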

Comment on lines 253 to 257
var locals: IOLocals = null
if (dumpLocals) {
  locals = new IOLocals(localState)
  IOLocals.threadLocal.set(locals)
}
armanbilge (Member Author)

I did this just for delay, but I guess blocking and interruptible would want it too.
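
(For context, those are the three CE3 entry points that hand a side-effecting block to the runtime, so each would want the same ThreadLocal bracketing:)

```scala
import cats.effect.IO

// Each of these runs user code where a ThreadLocal view of the
// fiber's locals would need to be installed and torn down.
val onCompute: IO[Unit] = IO.delay(println("runs on the compute pool"))
val onBlocker: IO[Unit] = IO.blocking(Thread.sleep(10))          // shifted to the blocking pool
val interruptibly: IO[Unit] = IO.interruptible(Thread.sleep(10)) // blocking, but can be interrupted
```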

Comment on lines 4 to 5
// TODO handle defaults and lenses. all do-able, just needs refactoring ...
final class IOLocals private[effect] (private[this] var state: IOLocalState) {
armanbilge (Member Author)

Ok, so I was extremely lazy with implementing this thing. But the main idea of this wrapper API is that it should only give the user access to IOLocals that they know about, i.e. they should not be able to clear out other locals that happen to be present.
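
A minimal sketch of that access-by-key idea, with illustrative names and none of the PR's defaults/lenses handling:

```scala
import cats.effect.IOLocal

// The fiber's local state is conceptually a Map keyed by IOLocal instances.
// A wrapper that only exposes get/set by key cannot enumerate or clear
// locals you don't already hold a reference to.
final class IOLocalsSketch(private[this] var state: Map[IOLocal[_], Any]) {

  def get[A](key: IOLocal[A]): Option[A] =
    state.get(key).map(_.asInstanceOf[A])

  def set[A](key: IOLocal[A], value: A): Unit =
    state = state.updated(key, value)

  // Deliberately absent: keys, clearAll, or anything exposing other locals.
}
```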

Comment on lines 1011 to 1012
 val fiber = new IOFiber[A](
-  Map.empty,
+  if (IOFiberConstants.dumpLocals) unsafe.IOLocals.getState else Map.empty,
armanbilge (Member Author)

We can even go in the opposite direction for IO#unsafeRun* 😁

It's less clear if/how to do this for fibers started in a Dispatcher, since they should be inheriting locals from the fiber backing the Dispatcher.

rossabaker (Member) left a comment

Interesting. I will try giving a snapshot of this a shot over on the otel4s side.

rossabaker added a commit to typelevel/otel4s that referenced this pull request May 18, 2023
Comment on lines 19 to 20
 // defined in Java since Scala doesn't let us define static fields
-final class IOFiberConstants {
+public final class IOFiberConstants {
armanbilge (Member Author) May 21, 2023

This is not good. To avoid this we'll either have to replicate it at both the cats.effect and cats.effect.unsafe levels, or move the thread-local IOLocals accessors into cats.effect.

@kevin-lee

I apologise for jumping in and asking this question, but can it be used as an alternative to Local from Monix?
I need it specifically for logging, much like what's explained in the post "Better logging with Monix 3, part 1: MDC", but with cats-effect 3. This has been a blocker for my company's transition from cats-effect 2 with Monix to cats-effect 3. The issue has become increasingly critical as more and more libraries cease to support cats-effect 2, choosing to support only cats-effect 3 instead. So it would be great to see if there's any upcoming solution.

@armanbilge
Member Author

@kevin-lee no problem! At a glance, yes, this does look like an alternative/replacement to that Monix feature. Perhaps @alexandru can confirm :)

@armanbilge armanbilge changed the base branch from series/3.5.x to series/3.x June 27, 2023 00:28
@armanbilge armanbilge marked this pull request as ready for review June 27, 2023 00:29
@armanbilge armanbilge changed the title from "Proof-of-concept thread-local IOLocals" to "thread-local IOLocals" Jun 27, 2023
@kevin-lee

> @kevin-lee no problem! At a glance, yes, this does look like an alternative/replacement to that Monix feature. Perhaps @alexandru can confirm :)

@armanbilge Thank you. That's great!
It looks quite similar and looks like it can be used for the same purpose, but I can see the methods in IOLocals taking IOLocal[A] whereas the methods in Local from Monix don't. So I'm wondering if it can be used for the same purpose. I need to use it in a logging library like Logback, which doesn't have IO or any effect.
Yeah, it would be nice if @alexandru can confirm this.

@armanbilge
Member Author

armanbilge commented Jun 28, 2023

> but I can see the methods in IOLocals taking IOLocal[A] whereas the methods in Local from Monix don't. So I'm wondering if it can be used for the same purpose. I need to use it in a logging library like Logback, which doesn't have IO or any effect.

@kevin-lee the IOLocal[A] is just the "key". Once you have that key (and you can build one unsafely outside of the effect) then you don't need any more effects to read and write it.

For an example integration, see this PR which implements a Java SPI using this IOLocals API.
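
To make the "key" point concrete, here is a hedged sketch of an MDC-style setup: the key is created once, unsafely, outside any effect, and later read without effects through the thread-local view (accessor name illustrative, as above):

```scala
import cats.effect.{IO, IOLocal}
import cats.effect.unsafe.implicits.global

object Mdc {
  // The "key": constructed by unsafely running IOLocal's builder effect
  // once, e.g. at application startup, outside any IO program.
  val local: IOLocal[Map[String, String]] =
    IOLocal(Map.empty[String, String]).unsafeRunSync()
}

// A Logback-style appender running inside a fiber's side-effecting block
// could then read the current fiber's MDC without IO:
def currentMdc(): Map[String, String] =
  Mdc.local.unsafeThreadLocal().get() // illustrative accessor
```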

@kevin-lee

> @kevin-lee the IOLocal[A] is just the "key". Once you have that key (and you can build one unsafely outside of the effect) then you don't need any more effects to read and write it.
>
> For an example integration, see this PR which implements a Java SPI using this IOLocals API.

@armanbilge Oh... got it. It looks very promising then. Thank you!

@armanbilge armanbilge marked this pull request as draft June 4, 2024 20:56
@armanbilge armanbilge marked this pull request as ready for review June 5, 2024 00:34
@djspiewak
Member

Finished benchmarking. @armanbilge let's see your armchair performance reasoning explain this result. I'm at a loss.

Before

[info] Benchmark                                             (cpuTokens)   (size)   Mode  Cnt      Score     Error    Units
[info] DeepBindBenchmark.async                                       N/A    10000  thrpt   10   2786.655 ±   3.536    ops/s
[info] DeepBindBenchmark.delay                                       N/A    10000  thrpt   10   9613.496 ± 234.106    ops/s
[info] DeepBindBenchmark.pure                                        N/A    10000  thrpt   10  11168.222 ± 465.787    ops/s
[info] MapCallsBenchmark.batch120                                    N/A      N/A  thrpt   10    338.430 ±   3.116    ops/s
[info] MapCallsBenchmark.batch30                                     N/A      N/A  thrpt   10     86.709 ±   0.563    ops/s
[info] MapCallsBenchmark.one                                         N/A      N/A  thrpt   10      2.944 ±   0.017    ops/s
[info] MapStreamBenchmark.batch120                                   N/A      N/A  thrpt   10   5561.435 ±  14.217    ops/s
[info] MapStreamBenchmark.batch30                                    N/A      N/A  thrpt   10   2524.134 ±   2.286    ops/s
[info] MapStreamBenchmark.one                                        N/A      N/A  thrpt   10   3233.048 ±   4.027    ops/s
[info] ParallelBenchmark.parTraverse                               10000     1000  thrpt   10    886.610 ±   0.865    ops/s
[info] ParallelBenchmark.traverse                                  10000     1000  thrpt   10     70.460 ±   0.067    ops/s
[info] ShallowBindBenchmark.async                                    N/A    10000  thrpt   10   2031.729 ±   3.593    ops/s
[info] ShallowBindBenchmark.delay                                    N/A    10000  thrpt   10   9778.303 ±  62.426    ops/s
[info] ShallowBindBenchmark.pure                                     N/A    10000  thrpt   10  11973.770 ±  32.027    ops/s
[info] WorkStealingBenchmark.alloc                                   N/A  1000000  thrpt   10     14.152 ±   0.090  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark          N/A  1000000  thrpt   10     31.868 ±   5.072  ops/min
[info] WorkStealingBenchmark.runnableScheduling                      N/A  1000000  thrpt   10    870.238 ±   3.475  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal           N/A  1000000  thrpt   10   2245.542 ±   7.645  ops/min
[info] WorkStealingBenchmark.scheduling                              N/A  1000000  thrpt   10     29.631 ±   2.092  ops/min

After

[info] Benchmark                                             (cpuTokens)   (size)   Mode  Cnt      Score     Error    Units
[info] DeepBindBenchmark.async                                       N/A    10000  thrpt   10   2814.793 ±   6.982    ops/s
[info] DeepBindBenchmark.delay                                       N/A    10000  thrpt   10   9589.880 ±  27.294    ops/s
[info] DeepBindBenchmark.pure                                        N/A    10000  thrpt   10  11377.434 ±  37.080    ops/s
[info] MapCallsBenchmark.batch120                                    N/A      N/A  thrpt   10    336.483 ±   2.160    ops/s
[info] MapCallsBenchmark.batch30                                     N/A      N/A  thrpt   10     85.833 ±   0.404    ops/s
[info] MapCallsBenchmark.one                                         N/A      N/A  thrpt   10      2.925 ±   0.007    ops/s
[info] MapStreamBenchmark.batch120                                   N/A      N/A  thrpt   10   5684.510 ±   8.954    ops/s
[info] MapStreamBenchmark.batch30                                    N/A      N/A  thrpt   10   2522.568 ±   7.291    ops/s
[info] MapStreamBenchmark.one                                        N/A      N/A  thrpt   10   3284.312 ±   2.895    ops/s
[info] ParallelBenchmark.parTraverse                               10000     1000  thrpt   10    891.486 ±   0.846    ops/s
[info] ParallelBenchmark.traverse                                  10000     1000  thrpt   10     70.303 ±   0.060    ops/s
[info] ShallowBindBenchmark.async                                    N/A    10000  thrpt   10   1985.424 ±   2.771    ops/s
[info] ShallowBindBenchmark.delay                                    N/A    10000  thrpt   10   9655.691 ± 182.608    ops/s
[info] ShallowBindBenchmark.pure                                     N/A    10000  thrpt   10  10075.531 ±  22.884    ops/s
[info] WorkStealingBenchmark.alloc                                   N/A  1000000  thrpt   10     14.044 ±   0.078  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark          N/A  1000000  thrpt   10     48.611 ±   1.642  ops/min
[info] WorkStealingBenchmark.runnableScheduling                      N/A  1000000  thrpt   10   2987.595 ±   5.299  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal           N/A  1000000  thrpt   10   2249.480 ±  48.568  ops/min
[info] WorkStealingBenchmark.scheduling                              N/A  1000000  thrpt   10     52.240 ±   3.133  ops/min

@djspiewak
Member

Finally scraped together the time needed to shave a few dev environment yaks and test this PR with propagation enabled. The performance hit is about 25% on microbenchmarks involving delay or map, and (as expected) ~0% (within margin of error) on everything else. The bizarre acceleration of WSTP scheduling on non-parTraverse benchmarks remains consistent (notably, parTraverse itself was around 0%).

So I think the summary of this PR, from a performance standpoint, is the following:

  • When disabled, no meaningful impact.
    • Does seem to spookily affect the WSTP microbenchmarks in a way which resets the baseline
  • When enabled, 25% worse on delay/map microbenchmarks
    • No effect within margin of error on scheduling or traverse-y benchmarks
    • WSTP microbenchmark baseline effect remains and is of the same magnitude

So enabling this is analogous to enabling cached tracing, but of course would be cumulative with that flag as well, so technically turning both of them on would bop your microbenchmark scores by something like 55%. It's very unclear whether this type of performance impact matters in practice. We would probably need to run some end-to-end scale tests with propagation enabled in order to see.

Given that this is default-off and the impact when disabled is zero (outside of the still bizarre WSTP microbenchmark mirages), I'm comfortable merging this as-is and we can do some further testing. We should also probably document (or at least, write a ticket to document) the performance effects so it doesn't surprise anyone.

@armanbilge
Member Author

I added some documentation.

> The performance hit is about 25% on microbenchmarks involving delay or map, and (as expected) ~0% (within margin of error) on everything else.

Discussed on Discord, but to clarify: there is no reason to expect that delay or map should be impacted differently, because the propagation logic runs when the fiber is scheduled/descheduled, not for specific ops. (Although note that my initial implementation ran the propagation logic before/after delay blocks, in which case the performance discrepancy would be expected.)
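
A hedged pseudocode picture of where that work sits (conceptual only, not the actual IOFiber internals):

```scala
// Conceptual sketch: propagation brackets a fiber's time on a carrier
// thread, so delay, map, flatMap, etc. within one scheduling quantum all
// share a single copy-in/copy-out rather than paying per-op.
final class FiberSketch(var localState: Map[Any, Any], val runLoop: () => Unit)

object PropagationSketch {
  private val carrier = new ThreadLocal[Map[Any, Any]]

  def runOnThread(fiber: FiberSketch): Unit = {
    carrier.set(fiber.localState)             // copy in when the fiber is scheduled
    try fiber.runLoop()                       // every op in this quantum runs here
    finally fiber.localState = carrier.get()  // copy back when it is descheduled
  }
}
```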

@henricook

@djspiewak Please forgive my impetuousness - do you have a guesstimate of when this might be going in please?

djspiewak previously approved these changes Nov 21, 2024
@henricook

Oh-em-gee. Exciting.

@djspiewak
Member

@armanbilge Conflict resolution time!

@armanbilge armanbilge merged commit 8091026 into typelevel:series/3.x Nov 22, 2024
29 of 33 checks passed