tracing: ensmallerate assembly generated by macro expansions #943
Conversation
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Are you piping the assembly through sed or something like that? It looks... weird in some places (e.g. the stray commas). But this looks great. Let me know if you want to get this run by perf.rlo; I am happy to help out with that.
This is from
Yeah, there's some "good" reason
Clearly I should have just used objdump. This saved me from having to recompile in a separate command, and I didn't notice the commas until you pointed them out. Now it's driving me crazy. :)
A perf run would be great. Our microbenchmarks were not terribly helpful here, and I imagine the impact is a lot more noticeable when there's actual non-
```rust
///
/// [dispatcher]: ../dispatcher/struct.Dispatch.html
#[cfg(feature = "std")]
#[doc(hidden)]
```
Adding this was necessary since constructing the `ValueSet` in a closure means we needed a `FnOnce`, so that values can be moved into the closure (which happens primarily when users wrap values in `field::display`/`field::debug` without borrowing). Not supporting this would break at least some code. I opted to add a new function that takes a `FnOnce`, rather than capturing an `Option<FnOnce>` into the `FnMut` and `take`-ing it, because that seemed significantly less efficient.

We could probably commit to making this a public API (and even deprecate `get_default`, the version that takes a `FnMut`). This is probably a more useful API than `get_default` taking a `FnMut`... I didn't make it public here, though, because introducing a new public API seemed like a job for another PR.

Also, in 0.2, the `Value` changes will make `Value::debug` and `Value::display` also borrow rather than move. So, the `FnOnce` will no longer be necessary, and we could, alternatively, remove this function in 0.2, rather than deprecating the existing `get_default`.

So, there should probably be an actual design discussion... :)
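For illustration, here is a minimal sketch of the trade-off described above. The function names are hypothetical stand-ins, not the actual `dispatcher` API:

```rust
// A minimal sketch (hypothetical names, not the real `dispatcher` API) of the
// two options discussed above: a dedicated `FnOnce`-taking entry point versus
// smuggling a `FnOnce` through an `FnMut`-taking one via `Option::take`.

// Option A: a new function that accepts an `FnOnce` and calls it exactly once.
fn with_current_once<T>(f: impl FnOnce(&str) -> T) -> T {
    let dispatch = "current dispatcher"; // stand-in for the thread-local `Dispatch`
    f(dispatch)
}

// Option B: reuse an existing `FnMut`-taking function...
fn with_current_mut<T>(mut f: impl FnMut(&str) -> T) -> T {
    let dispatch = "current dispatcher";
    f(dispatch)
}

// ...by wrapping the `FnOnce` in an `Option` and `take`-ing it. This needs an
// extra `Option` for the closure, an extra `Option` for the result, and a
// branch on every call, which is the inefficiency mentioned above.
fn call_once_via_fnmut<T>(f: impl FnOnce(&str) -> T) -> T {
    let mut f = Some(f);
    let mut result = None;
    with_current_mut(|dispatch| {
        if let Some(f) = f.take() {
            result = Some(f(dispatch));
        }
    });
    result.expect("the closure is invoked exactly once")
}
```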
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
I replaced the
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
LGTM! This was a really neat PR to read :) Left a nit and a question, but nothing blocking.
```rust
pub fn register(&'static self) -> Interest {
    self.registration
        .call_once(|| crate::callsite::register(self));
    match self.interest.load(Ordering::Relaxed) {
```
Is it actually worse to call the inlined `interest` fn here?
It would be fine for it to be inlined into this function, since this function shouldn't be inlined. The reason it doesn't call `interest` is that the `interest` function calls `register` (this function) if the cached value is not 0, 1, or 2. That means that if we register the callsite and the cached interest value is still not one of those, we could recurse again, potentially infinitely.

This shouldn't happen ever, but if there's a bug elsewhere, I'd rather just assume the cached value is invalid and return `Interest::sometimes()` (which is what this method does), rather than overflowing the stack.
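To make the recursion concern concrete, here is a hedged sketch of the caching scheme being described, using simplified stand-in types rather than the exact `tracing-core` source:

```rust
// Simplified sketch of the interest cache; stand-in types, not the real crate.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

// Stand-in for `tracing_core::subscriber::Interest`.
#[derive(Debug)]
struct Interest(u8);

impl Interest {
    fn never() -> Self { Interest(0) }
    fn sometimes() -> Self { Interest(1) }
    fn always() -> Self { Interest(2) }
}

struct Callsite {
    // 0, 1, 2 = cached interest; anything else = "not registered yet".
    interest: AtomicUsize,
    registration: Once,
}

impl Callsite {
    // Hot path: small enough to stay inline at the call site.
    #[inline(always)]
    fn interest(&'static self) -> Interest {
        match self.interest.load(Ordering::Relaxed) {
            0 => Interest::never(),
            1 => Interest::sometimes(),
            2 => Interest::always(),
            // Anything else: fall through to the slow path.
            _ => self.register(),
        }
    }

    // Cold path: registers the callsite once, then re-reads the cache.
    // Crucially, it never calls back into `interest()`, so an invalid cached
    // value yields `Interest::sometimes()` instead of infinite recursion.
    #[inline(never)]
    fn register(&'static self) -> Interest {
        self.registration.call_once(|| {
            // In the real crate this calls `crate::callsite::register(self)`,
            // which asks subscribers and stores the result in `self.interest`.
        });
        match self.interest.load(Ordering::Relaxed) {
            0 => Interest::never(),
            2 => Interest::always(),
            _ => Interest::sometimes(),
        }
    }
}
```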
Co-authored-by: Lucio Franco <luciofranco14@gmail.com>
### Fixed

- When combining `Interest` from multiple subscribers, if the interests differ, the current subscriber is now always asked if a callsite should be enabled (#927)

### Added

- Internal API changes to support optimizations in the `tracing` crate (#943)
- **docs**: Multiple fixes and improvements (#913, #941)
## Motivation

In PR #943, the construction of `ValueSet`s for events and spans was moved out of the code expanded at the callsite and into a closure, in order to reduce the amount of assembly generated in functions containing tracing macros. However, this introduced an accidental breaking change for some dependent crates. Borrowing values inside a closure meant that when a field of a struct, array, or tuple was used as a field value, the closure had to borrow the _entire_ struct, array, or tuple, rather than the individual field. This broke code where another, unrelated field of that struct, array, or tuple was then mutably borrowed or moved elsewhere.

## Solution

This branch fixes the breaking change by moving `ValueSet` construction back out of a closure and into the code expanded at the callsite. This _does_ regress the amount of assembly generated a bit: a function containing a single `event!` macro generates 32 instructions in release mode on master, and after this change, it generates 83 instructions. However, this is better than reverting PR #943 entirely, which generates 103 instructions for the same function. This change allows us to continue to benefit from *some* of the changes made in #943, although we can no longer benefit from the most significant one.

Rather than trying to further optimize the macro-expanded code now, I think we should wait for the `ValueSet` rework that will land in `tracing` 0.2, where we could potentially generate significantly less code by virtue of constructing a `ValueSet` with a much simpler array literal (no `FieldSet` iteration required).

Fixes #954
Closes #986

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
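A minimal illustration of the breakage described above, with no tracing involved. Under the closure-capture rules in effect at the time (pre-2021 editions), a closure that uses one field captures the whole struct:

```rust
// Simplified sketch of the breaking change: a closure that touches `req.id`
// borrows all of `req`, not just that field (pre-2021-edition capture rules).
struct Request {
    id: u64,
    body: String,
}

fn broken(req: Request) {
    // Building the values inside a closure: the closure borrows all of `req`.
    let record = || println!("id = {}", req.id);
    // error[E0505]: cannot move out of `req.body` because `req` is borrowed
    // let body = req.body;
    record();
}

fn fixed(req: Request) {
    // Building the values at the callsite (no closure) borrows only `req.id`,
    // so other fields stay free to be moved or mutably borrowed.
    let id = &req.id;
    let body = req.body; // fine
    println!("id = {}, body length = {}", id, body.len());
}
```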
Motivation
Currently, tracing's macros generate a lot of code at the expansion site, most of which is not hidden behind function calls. Additionally, several functions called in the generated code are inlined when they probably shouldn't be. This results in a really gross disassembly for code that calls a `tracing` macro.

Inlining code can be faster, by avoiding function calls. However, inlining this much code has some negative impacts. Obviously, the output binary is much larger. Perhaps more importantly, though, this code may optimize much less well; for example, functions that contain tracing macros are unlikely to be inlined themselves, since they are quite large.
Solution
This branch "outlines" most of the macro code, with the exception of the two hottest paths for skipping a disabled callsite (the global max level and the per-callsite cache). Actually recording the span or event, performing runtime filtering via `Subscriber::enabled`, and registering the callsite for the first time are behind a function call. This means that the fast paths still don't require a stack frame, but the amount of code they jump past is significantly shorter.
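As a rough, hypothetical sketch of that shape (stand-in names, not the actual macro expansion or `tracing` APIs):

```rust
// Rough sketch of the "outlined" shape: the cheap checks stay inline at the
// callsite, while everything that actually records the event sits behind a
// cold, never-inlined function.

#[inline(always)]
fn enabled_fast_path() -> bool {
    // Stand-in for the global max-level check and the per-callsite interest
    // cache: in the real code these are a couple of loads and compares.
    true
}

#[inline(never)]
#[cold]
fn record_event_slow_path(message: &str) {
    // Runtime filtering (`Subscriber::enabled`), callsite registration, and
    // dispatching the event would all happen here, out of line.
    println!("event: {}", message);
}

fn expanded_callsite(message: &str) {
    // Roughly what the macro leaves behind: a short fast path that can skip a
    // disabled callsite without setting up a stack frame for the slow path.
    if enabled_fast_path() {
        record_event_slow_path(message);
    }
}
```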
Also, while working on this, I noticed that the event macro expansions were hitting the dispatcher thread-local two separate times for enabled events. That's not super great. I folded these together into one `LocalKey::with` call, so that we don't access the thread-local twice.
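A minimal sketch of that folding, using a stand-in thread-local rather than the real dispatcher state:

```rust
// Folding two thread-local accesses into a single `LocalKey::with` call.
use std::cell::Cell;

thread_local! {
    static COUNTER: Cell<u64> = Cell::new(0);
}

// Before: the expansion touched the thread-local twice per enabled event.
fn two_accesses() -> (u64, u64) {
    let before = COUNTER.with(|c| c.get());
    let after = COUNTER.with(|c| {
        c.set(c.get() + 1);
        c.get()
    });
    (before, after)
}

// After: both operations happen inside one `LocalKey::with` call.
fn one_access() -> (u64, u64) {
    COUNTER.with(|c| {
        let before = c.get();
        c.set(before + 1);
        (before, c.get())
    })
}
```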
Building `ValueSet`s to actually record a span or event is the path that generates the most code at the callsite currently. In order to put this behind a function call, it was necessary to capture the local variables that comprise the `ValueSet` using a closure. While closures have impacted build times in the past, my guess is that this may be offset a bit by generating way less code overall.

It was also necessary to add new `dispatcher`-accessing functions that take a `FnOnce` rather than a `FnMut`. This is because fields are sometimes moved into the `ValueSet` by value rather than by ref, when they are inside of a call to `tracing::field::debug` or `field::display` that doesn't itself borrow the value. Not supporting this would be a breaking change, so we had to make it work. Capturing a `FnOnce` into the `FnMut` in an `&mut Option` seemed inefficient, so I did this instead.
Before & After
Here's the disassembly of the following function:
On master:
And on this branch:
So, uh...that seems better.