Access to thread locals isn't inlined across crates #25088

veddan · 2015-05-04T05:32:28Z

Proposed solution

Right now access to thread locals defined by thread_local! aren't inlined across crates, causing performance problems that wouldn't otherwise be seen within one crate. This can probably be solved with a few new minor language features:

First, the #[inline] annotation could be processed on static variables. If the variable does not have any internal mutability, then the definition can be inlined into other LLVM modules and tagged with available_externally. That means that the contents are available for optimization, but if you're taking the address it's available elsewhere.
Second, when we inline the contents of a static across modules, function pointers should also be chased and inlined if they're tagged with #[inline].

Those two pieces I believe should provide enough inlining opportunities to ensure that accesses are as fast when done from external crates as they are done with internal crates.

Original description

This hurts performance for the locks in std::sync, as they call std::rt::unwind::panicking() (which just reads a thread-local). For uncontended locks the cost is quite significant.

There are two problems:

std::rt::unwind::panicking() isn't marked inline. This is trivial to solve.
Accessing a thread_local! goes through function pointers, which LLVM fails to see through. These are the __getit functions in libstd/thread/local.rs. Consider these two files:

// tls.rs
use std::cell::Cell;
thread_local! { static FOO: Cell<bool> = Cell::new(false) }
#[inline]
pub fn get_foo() -> bool {
    FOO.with(|s| s.get())
}

// other.rs
extern crate tls;
#[inline(never)]
fn call_foo() -> bool {
    tls::get_foo()
}

call_foo gets the following IR with everything compiled with full optimization. Note the call through a function pointer:

define internal fastcc zeroext i1 @_ZN8call_foo20h55fb2c2ba5981f51PaaE() unnamed_addr #3 {
entry-block:
  %0 = load %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 0), align 8
  %1 = tail call dereferenceable(4) %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %0()
  %2 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 0
  %3 = load i8* %2, align 1, !range !13
  %cond.i.i = icmp eq i8 %3, 1
  br i1 %cond.i.i, label %match_case.i.i, label %match_else.i.i

match_else.i.i:                                   ; preds = %entry-block
  %4 = load i8 ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 1), align 8
  %5 = tail call i8 %4()
  %6 = zext i8 %5 to i16
  %7 = shl nuw i16 %6, 8
  %8 = or i16 %7, 1
  %9 = bitcast %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1 to i16*
  store i16 %8, i16* %9, align 1
  %10 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2, i64 0
  %11 = bitcast i8* %10 to %"2.core::cell::Cell<bool>"*
  br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit

match_case.i.i:                                   ; preds = %entry-block
  %12 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2
  %13 = bitcast [1 x i8]* %12 to %"2.core::cell::Cell<bool>"*
  br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit

_ZN7get_foo20h73631102cae2c5b4lbaE.exit:          ; preds = %match_else.i.i, %match_case.i.i
  %.0.i.i = phi %"2.core::cell::Cell<bool>"* [ %13, %match_case.i.i ], [ %11, %match_else.i.i ]
  %.0.idx.i.i = getelementptr %"2.core::cell::Cell<bool>"* %.0.i.i, i64 0, i32 0, i32 0
  %.0.idx.val.i.i = load i8* %.0.idx.i.i, align 1
  %14 = icmp ne i8 %.0.idx.val.i.i, 0
  ret i1 %14
}

The text was updated successfully, but these errors were encountered:

alexcrichton · 2015-05-28T18:40:34Z

@veddan can you quantify the cost that you are seeing? For example this is what I get:

#![feature(test)]
#![allow(warnings)]
extern crate test;

use std::sync::Mutex;

#[bench]
fn pthreads(b: &mut test::Bencher) {
    let mut buf = [0u8; 1024];
    enum pthread_mutex_t {}
    enum pthread_attr_t {}
    extern {
        fn pthread_mutex_init(lock: *mut pthread_mutex_t,
                              attr: *mut pthread_attr_t) -> i32;
        fn pthread_mutex_lock(lock: *mut pthread_mutex_t) -> i32;
        fn pthread_mutex_unlock(lock: *mut pthread_mutex_t) -> i32;
    }

    unsafe {
        let ptr = buf.as_ptr() as *mut pthread_mutex_t;
        assert_eq!(pthread_mutex_init(ptr, 0 as *mut _), 0);
        b.iter(|| {
            assert_eq!(pthread_mutex_lock(ptr), 0);
            assert_eq!(pthread_mutex_unlock(ptr), 0);
        });
    }
}

#[bench]
fn libstd(b: &mut test::Bencher) {
    let m = Mutex::new(());
    b.iter(|| drop(m.lock()));
}

$ rustc foo.rs --bench -Copt-level=3
$ ./bench --bench
running 2 tests                                              
test libstd   ... bench:        21 ns/iter (+/- 1)           
test pthreads ... bench:        18 ns/iter (+/- 4)           

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured   
$ rustc foo.rs --bench -Copt-level=3 -Clto
$ ./bench --bench
running 2 tests                                              
test libstd   ... bench:        16 ns/iter (+/- 1)           
test pthreads ... bench:        15 ns/iter (+/- 4)           

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

When inlining everything with LTO the costs basically entirely go away, and when calling across the library boundary (note that there's no virtual call, and this is intentional!) the cost is quite negligible.

I definitely agree that the current design requires a virtual call when inlined across crates, and I've experimented with tweaking it such that this is not necessary, but I don't think that it's easily possible due to various safety concerns. I just want to get an idea of what perf hit you're seeing.

eddyb · 2015-06-01T17:39:14Z

@alexcrichton wouldn't this be solved by marking the static containing the accessors, and the accessors themselves as #[inline]?
AFAIK, we should still be generating only one global for the inner static.

alexcrichton · 2015-06-01T18:20:56Z

Functionally, yes, but I don't think that #[inline] does anything on globals today, so that's something that would have to be implemented. This would not be fixed by just adding #[inline] to the __getit function because it's only accessed through the static itself, whose value isn't currently inlined.

eddyb · 2015-06-01T18:51:44Z

@alexcrichton oh, I could've sworn I've seen #[inline] static ... before, is it silently ignored?
Another option would be to switch LocalKey to const and turn on rvalue promotion.

alexcrichton · 2015-06-01T18:54:55Z

I could've sworn I've seen #[inline] static ... before, is it silently ignored?

I believe it is being silently ignored, yes.

Another option would be to switch LocalKey to const and turn on rvalue promotion.

This may be possible, but the semantics of ELF-like TLS and OS-based TLS should be the same, and OS-based TLS requires a static somewhere in some form.

eddyb · 2015-06-01T22:23:31Z

@alexcrichton the OS-based TLS static is a hidden implementation detail, it would still be there and a reference to it would be returned by the accessor function.

alexcrichton · 2015-06-01T23:25:07Z

It's somewhat of an implementation detail, but there still needs to be one true address of the TLS static. Right now a const which is just the address of something is still re-translated in downstream crates, which could create duplicate keys.

eddyb · 2015-06-01T23:45:57Z

@alexcrichton just the pair of accessors would be const, and that's not going to affect anything.
It is a compatibility issue, though, because it allows let key = SOME_KEY; which you can't do with a static.
So then, should #[inline] be implemented for statics? IIRC I saw it being used in format_args! output.

alexcrichton · 2015-06-01T23:50:37Z

At some point in the past I believe #[inline] did something on statics, but that has since been removed. And also, yes, I want to make sure the semantics are the same regardless of platform, so if one requires a static they should both use a static.

eddyb · 2015-06-02T01:15:22Z

@alexcrichton I believe we are talking past each other: I'm not suggesting we touch the platform-dependent hidden static at all.
There is a truly immutable "static" (without a significant address, either) being used as a facade, and that is what needs to be inlined across crates.

As it stands, we allow taking &'static LocalKey borrows and sending them across threads (because it's just a safe getter).
That makes it backwards incompatible to remove the indirection of the getter, so that is out of the question (maybe not for ScopedKey, though).

I would personally prefer using const, methods taking self (by-value), and storing a &'static KeyInner in the non-#[thread_local] case (instead of a getter), but not all of that is doable now.

alexcrichton · 2015-06-02T16:37:46Z

There is a truly immutable "static" (without a significant address, either) being used as a facade

Could you elaborate on why you think this doesn't have a significant address? For OS-based TLS it needs a significant address as it mutates the memory and everyone needs to see the update. For ELF-based TLS it's also significant because of the mutations which need to be visible (and everyone needs to reference the same memory).

That makes it backwards incompatible to remove the indirection of the getter,

Right, but we just need to inline the getter into destination crates, probably via an available_externally static which is marked constant in LLVM. This getter is trivially inlineable, so it's just a codegen problem to overcome.

I would personally prefer using const

Unfortunately due to the current usage of static, I think this is out of the question.

alexcrichton · 2015-06-02T16:38:36Z

and storing a &'static KeyInner in the non-#[thread_local] case (instead of a getter

Oh another thing is that a const is not allowed to reference the address of a static:

static FOO: u32 = 4;
const BAR: &'static u32 = &FOO;

foo.rs:2:28: 2:31 error: constants cannot refer to other statics, insert an intermediate constant instead [E0013]
foo.rs:2 const BAR: &'static u32 = &FOO;
                                    ^~~
error: aborting due to previous error

eddyb · 2015-06-02T16:53:01Z

Okay, const is really out of the way, too many issues.

As for the other thing, I'm talking about the outer static holding LocalKey which is a pair of accessors (the "initializer" and "getter" function pointers).

For both OS-based and #[thread_local] TLS, there is a hidden static holding the __KeyInner, which does indeed have a significant address and interior mutability. I was never talking about that one in this thread.
That static doesn't need to be changed, it's perfect for all I care.

alexcrichton · 2016-07-12T17:44:39Z

I've updated the description of this issue with a more detailed explanation about what I believe the solution here is (basically just a summary of the discussion @eddyb and I had a year ago I believe)

DemiMarie · 2016-07-14T01:42:29Z

Just how fast can TLS get? I thought that on x86-64 it could indirect off of a segment register or something like that, and be really fast.

cyplo · 2017-03-04T08:19:01Z

Hi, any consensus on this one ?

alexcrichton · 2017-03-06T15:23:41Z

@cyplo no change to the updated description, which I believe is still the best solution

eddyb · 2017-08-10T16:47:33Z

Rvalue promotion is getting close to stabilization (#38865 (comment)), so we'll use that + const.

The other concerns raised in the past seem inconsequential now - allowing references to statics in consts will be trivial once we start using miri, for example - there's nothing wrong with it, we can even allow reading a (non-#[thread_local]!) static from a const - it'd just observe the initializer.

rasky · 2019-01-04T17:32:44Z

I'm writing an application whose performance is impacted by the fact that TLS accesses are not generated using a segment register, but go through a function call and are then not subject to usual optimizations like CSE. I found this issue that seems the culprit. It looks like #50252 fixed the problem for Linux and Mac but not Windows (as far as I can tell?). Given that 9 months are passed, would it be acceptable to refresh #50252 and activate it only for Linux/Mac?

…ss-crate, r=Mark-Simulacrum std: Attempt again to inline thread-local-init across crates Issue rust-lang#25088 has been part of `thread_local!` for quite some time now. Historical attempts have been made to add `#[inline]` to `__getit` in rust-lang#43931, rust-lang#50252, and rust-lang#59720, but these attempts ended up not landing at the time due to segfaults on Windows. In the interim though with `const`-initialized thread locals AFAIK this is the only remaining bug which is why you might want to use `#[thread_local]` over `thread_local!`. As a result I figured it was time to resubmit this and see how it fares on CI and if I can help debugging any issues that crop up. Closes rust-lang#25088

steveklabnik added the A-codegen Area: Code generation label May 7, 2015

alexcrichton mentioned this issue Jun 1, 2015

unsafe: ScopedKey allows for Sync-ification of non-Sync data #25894

Closed

veddan mentioned this issue Jul 16, 2015

Improve locking performance veddan/rust-concurrent-hashmap#4

Closed

alexcrichton mentioned this issue Jul 9, 2016

Catching panics is eating into Servo's DOM performance #34727

Closed

3 tasks

Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jul 22, 2017

eddyb mentioned this issue Aug 17, 2017

Make the LocalKey facade of thread_local! inlineable cross-crate. #43931

Merged

alexcrichton mentioned this issue Apr 28, 2018

Add #[inline] back on __getit (fixes #25088) #50252

Closed

This was referenced Apr 5, 2019

TLS lookups in libsyntax_pos are expensive #59718

Closed

Inline __getit #59720

Closed

jonas-schievink added A-thread-locals Area: Thread local storage (TLS) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 31, 2020

alexcrichton mentioned this issue May 3, 2021

"const" initialized thread locals in rustc #84833

Merged

alexcrichton mentioned this issue May 3, 2021

std: Attempt again to inline thread-local-init across crates #84876

Merged

bors closed this as completed in 641d3b0 May 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Access to thread locals isn't inlined across crates #25088

Access to thread locals isn't inlined across crates #25088

veddan commented May 4, 2015 •

edited by alexcrichton

Loading

alexcrichton commented May 28, 2015

eddyb commented Jun 1, 2015

alexcrichton commented Jun 1, 2015

eddyb commented Jun 1, 2015

alexcrichton commented Jun 1, 2015

eddyb commented Jun 1, 2015

alexcrichton commented Jun 1, 2015

eddyb commented Jun 1, 2015

alexcrichton commented Jun 1, 2015

eddyb commented Jun 2, 2015

alexcrichton commented Jun 2, 2015

alexcrichton commented Jun 2, 2015

eddyb commented Jun 2, 2015

alexcrichton commented Jul 12, 2016

DemiMarie commented Jul 14, 2016

cyplo commented Mar 4, 2017

alexcrichton commented Mar 6, 2017

eddyb commented Aug 10, 2017

rasky commented Jan 4, 2019

Access to thread locals isn't inlined across crates #25088

Access to thread locals isn't inlined across crates #25088

Comments

veddan commented May 4, 2015 • edited by alexcrichton Loading

Proposed solution

Original description

alexcrichton commented May 28, 2015

eddyb commented Jun 1, 2015

alexcrichton commented Jun 1, 2015

eddyb commented Jun 1, 2015

alexcrichton commented Jun 1, 2015

eddyb commented Jun 1, 2015

alexcrichton commented Jun 1, 2015

eddyb commented Jun 1, 2015

alexcrichton commented Jun 1, 2015

eddyb commented Jun 2, 2015

alexcrichton commented Jun 2, 2015

alexcrichton commented Jun 2, 2015

eddyb commented Jun 2, 2015

alexcrichton commented Jul 12, 2016

DemiMarie commented Jul 14, 2016

cyplo commented Mar 4, 2017

alexcrichton commented Mar 6, 2017

eddyb commented Aug 10, 2017

rasky commented Jan 4, 2019

veddan commented May 4, 2015 •

edited by alexcrichton

Loading