Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Access to thread locals isn't inlined across crates #25088

Closed
veddan opened this issue May 4, 2015 · 21 comments · Fixed by #84876
Closed

Access to thread locals isn't inlined across crates #25088

veddan opened this issue May 4, 2015 · 21 comments · Fixed by #84876
Labels
A-codegen Area: Code generation A-thread-locals Area: Thread local storage (TLS) C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@veddan
Copy link
Contributor

veddan commented May 4, 2015

Proposed solution

Right now access to thread locals defined by thread_local! aren't inlined across crates, causing performance problems that wouldn't otherwise be seen within one crate. This can probably be solved with a few new minor language features:

  • First, the #[inline] annotation could be processed on static variables. If the variable does not have any internal mutability, then the definition can be inlined into other LLVM modules and tagged with available_externally. That means that the contents are available for optimization, but if you're taking the address it's available elsewhere.
  • Second, when we inline the contents of a static across modules, function pointers should also be chased and inlined if they're tagged with #[inline].

Those two pieces I believe should provide enough inlining opportunities to ensure that accesses are as fast when done from external crates as they are done with internal crates.

Original description

This hurts performance for the locks in std::sync, as they call std::rt::unwind::panicking() (which just reads a thread-local). For uncontended locks the cost is quite significant.

There are two problems:

  1. std::rt::unwind::panicking() isn't marked inline. This is trivial to solve.
  2. Accessing a thread_local! goes through function pointers, which LLVM fails to see through. These are the __getit functions in libstd/thread/local.rs. Consider these two files:
// tls.rs
use std::cell::Cell;
thread_local! { static FOO: Cell<bool> = Cell::new(false) }
#[inline]
pub fn get_foo() -> bool {
    FOO.with(|s| s.get())
}
// other.rs
extern crate tls;
#[inline(never)]
fn call_foo() -> bool {
    tls::get_foo()
}

call_foo gets the following IR with everything compiled with full optimization. Note the call through a function pointer:

define internal fastcc zeroext i1 @_ZN8call_foo20h55fb2c2ba5981f51PaaE() unnamed_addr #3 {
entry-block:
  %0 = load %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 0), align 8
  %1 = tail call dereferenceable(4) %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %0()
  %2 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 0
  %3 = load i8* %2, align 1, !range !13
  %cond.i.i = icmp eq i8 %3, 1
  br i1 %cond.i.i, label %match_case.i.i, label %match_else.i.i

match_else.i.i:                                   ; preds = %entry-block
  %4 = load i8 ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 1), align 8
  %5 = tail call i8 %4()
  %6 = zext i8 %5 to i16
  %7 = shl nuw i16 %6, 8
  %8 = or i16 %7, 1
  %9 = bitcast %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1 to i16*
  store i16 %8, i16* %9, align 1
  %10 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2, i64 0
  %11 = bitcast i8* %10 to %"2.core::cell::Cell<bool>"*
  br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit

match_case.i.i:                                   ; preds = %entry-block
  %12 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2
  %13 = bitcast [1 x i8]* %12 to %"2.core::cell::Cell<bool>"*
  br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit

_ZN7get_foo20h73631102cae2c5b4lbaE.exit:          ; preds = %match_else.i.i, %match_case.i.i
  %.0.i.i = phi %"2.core::cell::Cell<bool>"* [ %13, %match_case.i.i ], [ %11, %match_else.i.i ]
  %.0.idx.i.i = getelementptr %"2.core::cell::Cell<bool>"* %.0.i.i, i64 0, i32 0, i32 0
  %.0.idx.val.i.i = load i8* %.0.idx.i.i, align 1
  %14 = icmp ne i8 %.0.idx.val.i.i, 0
  ret i1 %14
}
@steveklabnik steveklabnik added the A-codegen Area: Code generation label May 7, 2015
@alexcrichton
Copy link
Member

@veddan can you quantify the cost that you are seeing? For example this is what I get:

#![feature(test)]
#![allow(warnings)]
extern crate test;

use std::sync::Mutex;

#[bench]
fn pthreads(b: &mut test::Bencher) {
    let mut buf = [0u8; 1024];
    enum pthread_mutex_t {}
    enum pthread_attr_t {}
    extern {
        fn pthread_mutex_init(lock: *mut pthread_mutex_t,
                              attr: *mut pthread_attr_t) -> i32;
        fn pthread_mutex_lock(lock: *mut pthread_mutex_t) -> i32;
        fn pthread_mutex_unlock(lock: *mut pthread_mutex_t) -> i32;
    }

    unsafe {
        let ptr = buf.as_ptr() as *mut pthread_mutex_t;
        assert_eq!(pthread_mutex_init(ptr, 0 as *mut _), 0);
        b.iter(|| {
            assert_eq!(pthread_mutex_lock(ptr), 0);
            assert_eq!(pthread_mutex_unlock(ptr), 0);
        });
    }
}

#[bench]
fn libstd(b: &mut test::Bencher) {
    let m = Mutex::new(());
    b.iter(|| drop(m.lock()));
}
$ rustc foo.rs --bench -Copt-level=3
$ ./bench --bench
running 2 tests                                              
test libstd   ... bench:        21 ns/iter (+/- 1)           
test pthreads ... bench:        18 ns/iter (+/- 4)           

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured   
$ rustc foo.rs --bench -Copt-level=3 -Clto
$ ./bench --bench
running 2 tests                                              
test libstd   ... bench:        16 ns/iter (+/- 1)           
test pthreads ... bench:        15 ns/iter (+/- 4)           

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured   

When inlining everything with LTO the costs basically entirely go away, and when calling across the library boundary (note that there's no virtual call, and this is intentional!) the cost is quite negligible.

I definitely agree that the current design requires a virtual call when inlined across crates, and I've experimented with tweaking it such that this is not necessary, but I don't think that it's easily possible due to various safety concerns. I just want to get an idea of what perf hit you're seeing.

@eddyb
Copy link
Member

eddyb commented Jun 1, 2015

@alexcrichton wouldn't this be solved by marking the static containing the accessors, and the accessors themselves as #[inline]?
AFAIK, we should still be generating only one global for the inner static.

@alexcrichton
Copy link
Member

Functionally, yes, but I don't think that #[inline] does anything on globals today, so that's something that would have to be implemented. This would not be fixed by just adding #[inline] to the __getit function because it's only accessed through the static itself, whose value isn't currently inlined.

@eddyb
Copy link
Member

eddyb commented Jun 1, 2015

@alexcrichton oh, I could've sworn I've seen #[inline] static ... before, is it silently ignored?
Another option would be to switch LocalKey to const and turn on rvalue promotion.

@alexcrichton
Copy link
Member

I could've sworn I've seen #[inline] static ... before, is it silently ignored?

I believe it is being silently ignored, yes.

Another option would be to switch LocalKey to const and turn on rvalue promotion.

This may be possible, but the semantics of ELF-like TLS and OS-based TLS should be the same, and OS-based TLS requires a static somewhere in some form.

@eddyb
Copy link
Member

eddyb commented Jun 1, 2015

@alexcrichton the OS-based TLS static is a hidden implementation detail, it would still be there and a reference to it would be returned by the accessor function.

@alexcrichton
Copy link
Member

It's somewhat of an implementation detail, but there still needs to be one true address of the TLS static. Right now a const which is just the address of something is still re-translated in downstream crates, which could create duplicate keys.

@eddyb
Copy link
Member

eddyb commented Jun 1, 2015

@alexcrichton just the pair of accessors would be const, and that's not going to affect anything.
It is a compatibility issue, though, because it allows let key = SOME_KEY; which you can't do with a static.
So then, should #[inline] be implemented for statics? IIRC I saw it being used in format_args! output.

@alexcrichton
Copy link
Member

At some point in the past I believe #[inline] did something on statics, but that has since been removed. And also, yes, I want to make sure the semantics are the same regardless of platform, so if one requires a static they should both use a static.

@eddyb
Copy link
Member

eddyb commented Jun 2, 2015

@alexcrichton I believe we are talking past each other: I'm not suggesting we touch the platform-dependent hidden static at all.
There is a truly immutable "static" (without a significant address, either) being used as a facade, and that is what needs to be inlined across crates.

As it stands, we allow taking &'static LocalKey borrows and sending them across threads (because it's just a safe getter).
That makes it backwards incompatible to remove the indirection of the getter, so that is out of the question (maybe not for ScopedKey, though).

I would personally prefer using const, methods taking self (by-value), and storing a &'static KeyInner in the non-#[thread_local] case (instead of a getter), but not all of that is doable now.

@alexcrichton
Copy link
Member

There is a truly immutable "static" (without a significant address, either) being used as a facade

Could you elaborate on why you think this doesn't have a significant address? For OS-based TLS it needs a significant address as it mutates the memory and everyone needs to see the update. For ELF-based TLS it's also significant because of the mutations which need to be visible (and everyone needs to reference the same memory).

That makes it backwards incompatible to remove the indirection of the getter,

Right, but we just need to inline the getter into destination crates, probably via an available_externally static which is marked constant in LLVM. This getter is trivially inlineable, so it's just a codegen problem to overcome.

I would personally prefer using const

Unfortunately due to the current usage of static, I think this is out of the question.

@alexcrichton
Copy link
Member

and storing a &'static KeyInner in the non-#[thread_local] case (instead of a getter

Oh another thing is that a const is not allowed to reference the address of a static:

static FOO: u32 = 4;
const BAR: &'static u32 = &FOO;
foo.rs:2:28: 2:31 error: constants cannot refer to other statics, insert an intermediate constant instead [E0013]
foo.rs:2 const BAR: &'static u32 = &FOO;
                                    ^~~
error: aborting due to previous error

@eddyb
Copy link
Member

eddyb commented Jun 2, 2015

Okay, const is really out of the way, too many issues.

As for the other thing, I'm talking about the outer static holding LocalKey which is a pair of accessors (the "initializer" and "getter" function pointers).

For both OS-based and #[thread_local] TLS, there is a hidden static holding the __KeyInner, which does indeed have a significant address and interior mutability. I was never talking about that one in this thread.
That static doesn't need to be changed, it's perfect for all I care.

@alexcrichton
Copy link
Member

I've updated the description of this issue with a more detailed explanation about what I believe the solution here is (basically just a summary of the discussion @eddyb and I had a year ago I believe)

@DemiMarie
Copy link
Contributor

Just how fast can TLS get? I thought that on x86-64 it could indirect off of a segment register or something like that, and be really fast.

@cyplo
Copy link
Contributor

cyplo commented Mar 4, 2017

Hi, any consensus on this one ?

@alexcrichton
Copy link
Member

@cyplo no change to the updated description, which I believe is still the best solution

@Mark-Simulacrum Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jul 22, 2017
@eddyb
Copy link
Member

eddyb commented Aug 10, 2017

Rvalue promotion is getting close to stabilization (#38865 (comment)), so we'll use that + const.

The other concerns raised in the past seem inconsequential now - allowing references to statics in consts will be trivial once we start using miri, for example - there's nothing wrong with it, we can even allow reading a (non-#[thread_local]!) static from a const - it'd just observe the initializer.

@rasky
Copy link

rasky commented Jan 4, 2019

I'm writing an application whose performance is impacted by the fact that TLS accesses are not generated using a segment register, but go through a function call and are then not subject to usual optimizations like CSE. I found this issue that seems the culprit. It looks like #50252 fixed the problem for Linux and Mac but not Windows (as far as I can tell?). Given that 9 months are passed, would it be acceptable to refresh #50252 and activate it only for Linux/Mac?

@jonas-schievink jonas-schievink added A-thread-locals Area: Thread local storage (TLS) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 31, 2020
bors added a commit to rust-lang-ci/rust that referenced this issue May 19, 2021
…ss-crate, r=Mark-Simulacrum

std: Attempt again to inline thread-local-init across crates

Issue rust-lang#25088 has been part of `thread_local!` for quite some time now.
Historical attempts have been made to add `#[inline]` to `__getit`
in rust-lang#43931, rust-lang#50252, and rust-lang#59720, but these attempts ended up not landing
at the time due to segfaults on Windows.

In the interim though with `const`-initialized thread locals AFAIK this
is the only remaining bug which is why you might want to use
`#[thread_local]` over `thread_local!`. As a result I figured it was
time to resubmit this and see how it fares on CI and if I can help
debugging any issues that crop up.

Closes rust-lang#25088
@bors bors closed this as completed in 641d3b0 May 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-thread-locals Area: Thread local storage (TLS) C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants