TLS for MacOS 64-bit #1568
I think OS X uses GS exclusively for user-space TLS on x86_64 and FS is not used, so setting FS shouldn't be required. The rest of my comment will talk about reading GS. Can you expand upon your other questions above in case I can help? FWIW it doesn't look like (RD|WR)(FS|GS)BASE are enabled in userspace on OS X. afaict TLS on OS X by default is managed in dyld and libpthread sources on opensource.apple.com, which ends up in the /usr/lib/system/libdyld.dylib and /usr/lib/dyld binaries. You can read the GS_base on x86_64 on 10.11.6 using this nasty hack: https://github.com/lunixbochs/precorn/blob/master/src/osx/x86_64.c#L44 All of the pthread structs are private and they removed the get_cthread MDEP syscalls on x86_64, so I'm not sure if there's an actual portable way to do this. The only forward-compat problem I'm actually worried about with this method is the magic offset into the pthread_t struct (28 * 8). This can be made more robust as follows:
If you want to make it even more robust, read ~32 bytes to grab the first few slots and do a larger memory search, or link libpthread, reset one of the slots to a magic value, and search for it. I went through all this mess to make QEMU work on x86_64 OS X in a similar way to DynamoRIO, only to run into the fact that QEMU has terrible AVX2 support, so now I'm rooting for you :)
Thank you for the information. The bigger problem than reading the %gs base, though, is that there's no way to set the %fs base, requiring implementation of a different scheme for TLS from what we use on other unix-ish x86 platforms. We are short on manpower for Mac work, unfortunately. We would welcome contributions.
I found some notes from a while back that seem to be an expansion of the initial entry's notes. Pasting here for reference:
option #1 for DR: is there some free padding space in TLS mmap?
  Maybe beyond pthread data structs, since stack beyond that is page-aligned?
WINNER option #2 for DR: early injection and use privlib w/ larger mmap + app mangling?
  Add extra page to TLS mmap, maybe to the left so out of way (16-bit offs
  We'd need to mangle the app's references even w/o priv loader.
option #3 for DR: like Windows, can we steal some slots from app's TLS?
  Like Windows, request official TLS slots like app would and take ~20 (and
  Qin: on Linux static TLS is directly addressable so app could take it all
  could have DR loaded by ld.so request a bunch of static
option #4: table lookup by thread id: too slow though
  xref tls_table in os.c
option #5: replace TLS mmap at DR init permanently w/ a larger one
  first part is copy of original
  Qin: 1st TLS may be brk not mmap, and later ones combine TLS and stack.
option #6: steal register? later update: leverage ARM code?
  Though this only helps for the code cache and our gencode: in our C code
  Stealing a register is more work than the other options.
for priv libs, have to mangle app's refs
  If go w/ option #2, have to do so even w/o priv libs.
Re-visiting the options here. Unlike Linux, where each thread's TLS is allocated in its own mmap and it's easy to add some space in order to have DR share the priv lib's layout, the Mac TLS is combined with the pthread library (initial thread) or thread stack (new threads):
That makes it harder to adjust the sizes. Given the extra work in making our own loader, I'm looking at reviving @shawndenbow's PR #2293 approach of implementing #3 above and stealing some slots from the app. This is definitely the simplest solution to make progress w/o having to first or simultaneously solve complex problems outside of TLS.
Following Qin's suggestion: since we're injecting late anyway and libpthread is always there, we could have DR depend on libpthread and invoke pthread_key_create a bunch of times to reserve TLS slots: i.e., actually use the user-mode interfaces to get our resources. Long-term we'd like to be independent of the user libs but that will require more developer effort than we currently have.
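To make the suggestion above concrete, here is a minimal sketch of reserving slots through the public pthread interface. The helper names (`reserve_tls_keys`, `release_tls_keys`, `keys_are_contiguous`) are hypothetical and are not DR's code. The contiguity check assumes `pthread_key_t` is an integer slot index, which POSIX does not guarantee.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not DR's code: reserve up to `count` TLS keys via
 * pthread_key_create(). Returns how many keys were actually created. */
size_t
reserve_tls_keys(pthread_key_t *keys, size_t count)
{
    assert(keys != NULL);
    size_t i;
    for (i = 0; i < count; i++) {
        /* No destructor: the reserved slots are managed by the caller. */
        if (pthread_key_create(&keys[i], NULL) != 0)
            break; /* e.g. EAGAIN once the process is out of keys */
    }
    return i;
}

/* Give back keys obtained from reserve_tls_keys(). */
void
release_tls_keys(const pthread_key_t *keys, size_t count)
{
    for (size_t i = 0; i < count; i++)
        pthread_key_delete(keys[i]);
}

/* ASSUMPTION: pthread_key_t is an integer index into the slot array.
 * POSIX treats it as opaque, so this check is platform-specific. */
bool
keys_are_contiguous(const pthread_key_t *keys, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if (keys[i] != keys[i - 1] + 1)
            return false;
    }
    return true;
}
```

A caller would reserve the keys once at init time, verify the run with `keys_are_contiguous()`, and fall back to some other scheme if the check fails.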
We could use the same approach with private libraries: load a private pthread lib and invoke its pthread_key_create. Unlike Windows with its 64 dynamic slots, on Mac there are 768. The Mac system libraries reserve 256, leaving 512: presumably we could steal some from the tool, which should not be using a crazy amount of non-system libs. That's not as bad as Windows where we steal them from the app. That leaves the question of what to do with early injection and no client, where normally we wouldn't bother trying to load any private libs: if it doesn't complicate the code too much we could just do an mmap there and set up the gsbase ourselves.
For 64-bit MacOS, there is no way to set the %fs base, which stops us from using DR's scheme used on other unix platforms. This commit provides initial support to MacOS 64-bit by stealing a TLS slot from the app for DR's TLS base.
+ implement is_thread_tls_initialized for MacOS 64-bit
+ implement tls_thread_init and tls_thread_free
+ set MACOS64 define in cmake script
+ add WRITE_TLS_SLOT_IMM etc. for MacOS 64-bit
+ add read_thread_register for MacOS 64-bit to get pthread_t base
Issue: #1568, #1979
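As a rough illustration of the "one slot holds the TLS base" idea in the commit message above, the sketch below keeps a pointer to a per-thread state struct in a single pthread key. It is not the commit's code: the commit message mentions slot read/write macros such as WRITE_TLS_SLOT_IMM, whereas this sketch swaps in the portable `pthread_setspecific`/`pthread_getspecific` calls. The `sketch_` function names and the `local_state_t` layout are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for DR's per-thread state; the real struct differs. */
typedef struct {
    void *self;     /* points back at this struct, used as a sanity check */
    int thread_tag; /* illustrative payload only */
} local_state_t;

static pthread_key_t state_key; /* the single slot we take */
static pthread_once_t state_key_once = PTHREAD_ONCE_INIT;
static bool state_key_ok;

static void
create_state_key(void)
{
    state_key_ok = (pthread_key_create(&state_key, NULL) == 0);
}

/* Allocate this thread's state and store its base in the slot. */
bool
sketch_tls_thread_init(int tag)
{
    pthread_once(&state_key_once, create_state_key);
    if (!state_key_ok)
        return false;
    local_state_t *st = calloc(1, sizeof(*st));
    if (st == NULL)
        return false;
    st->self = st;
    st->thread_tag = tag;
    if (pthread_setspecific(state_key, st) != 0) {
        free(st);
        return false;
    }
    return true;
}

/* Read the base pointer back out of the slot (NULL if never set). */
local_state_t *
sketch_get_local_state(void)
{
    pthread_once(&state_key_once, create_state_key);
    return state_key_ok ? pthread_getspecific(state_key) : NULL;
}

bool
sketch_is_thread_tls_initialized(void)
{
    return sketch_get_local_state() != NULL;
}

/* Clear the slot and free this thread's state. */
void
sketch_tls_thread_free(void)
{
    local_state_t *st = sketch_get_local_state();
    if (st == NULL)
        return;
    assert(st->self == st); /* the slot should still hold our struct */
    pthread_setspecific(state_key, NULL);
    free(st);
}
```

The extra pointer hop on every access is the cost of taking only one slot; the benefit is that the state struct can be any size.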
Uses pthread_key_create() to allocate enough contiguous and aligned TLS slots to fit our os_local_state_t struct. This makes it easier to share Linux code for Mac64. Keeps the scheme from ce8e803 of storing a pointer to the base of os_local_state_t in TLS slot 6. This is indirection we don't need with the entire os_local_state_t struct in TLS but it is not clear we can take that many TLS slots for large applications, so I'm leaving this mixture until we're sure which direction to go in. Disables the options -mangle_app_seg and -safe_read_tls_init for Mac64. Issue: #1568, #1979
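The slot budget for fitting a whole struct into contiguous, aligned TLS slots can be estimated with some simple arithmetic. The sketch below is only that arithmetic, under two stated assumptions: each slot is one pointer wide, and the slot array itself starts on a boundary at least as strict as the requested alignment. The function names are hypothetical and the real size of os_local_state_t is not assumed here.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: each TLS slot holds exactly one pointer. */
#define SLOT_SIZE sizeof(void *)

/* Number of slots needed to hold `bytes` bytes, rounded up. */
size_t
slots_needed(size_t bytes)
{
    return (bytes + SLOT_SIZE - 1) / SLOT_SIZE;
}

/* First slot index >= first_free whose byte offset from the start of the
 * slot array is a multiple of align_bytes (a power of two >= SLOT_SIZE).
 * ASSUMPTION: the slot array base is itself aligned to align_bytes. */
size_t
first_aligned_slot(size_t first_free, size_t align_bytes)
{
    assert(align_bytes >= SLOT_SIZE && (align_bytes & (align_bytes - 1)) == 0);
    size_t slots_per_unit = align_bytes / SLOT_SIZE;
    return (first_free + slots_per_unit - 1) / slots_per_unit * slots_per_unit;
}

/* Worst-case number of contiguous slots to request so that an aligned run
 * of slots_needed(bytes) slots is guaranteed to fit somewhere inside it. */
size_t
slots_to_request(size_t bytes, size_t align_bytes)
{
    assert(align_bytes >= SLOT_SIZE && (align_bytes & (align_bytes - 1)) == 0);
    return slots_needed(bytes) + align_bytes / SLOT_SIZE - 1;
}
```

For example, a struct of ten pointers aligned to eight pointers' worth of bytes needs ten slots, but up to seventeen must be requested to cover the worst-case padding, which is why taking this many slots from a large application is a concern.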
I put in the pthread_key_create approach in #3832. It works for small apps at least.
Split from #58 as this may have some overlap with ARM work. I'm going to paste my notes here:
** TODO 64-bit: can only set gs, not fs, and can't read gs
For 64-bit: thread_fast_set_cthread_self64 which sets MSR_IA32_KERNEL_GS_BASE.
May not be a way to set MSR for FS: no reference to MSR_IA32_KERNEL_FS_BASE in
xnu sources.
*** TODO option #1 for DR: is there some free padding space in TLS mmap?
Maybe beyond pthread data structs, since stack beyond that is page-aligned?
*** TODO option #2 for DR: early injection and use privlib w/ larger mmap + app mangling?
Add extra page to TLS mmap, maybe to the left so out of way (16-bit offs
will still reach).
We'd need to mangle the app's references even w/o priv loader.
*** TODO option #3 for DR: like Windows, can we steal some slots from app's TLS?
Like Windows, request official TLS slots like app would and take ~20 (and
ensure directly addressable)?
*** TODO for priv libs, have to mangle app's refs
If go w/ option #2, have to do so even w/o priv libs.
*** TODO how read current gs?
*** TODO on Ivybridge+, use OP_wrfsbase or OP_wrgsbase?!
*** TODO steal register? later update: leverage ARM code?
Though this only helps for the code cache and our gencode: in our C code
we need a separate TLS mechanism.
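
For the wrfsbase/wrgsbase TODO above, the first thing to check is whether the CPU advertises the FSGSBASE feature at all, which is CPUID leaf 7, sub-leaf 0, EBX bit 0. The sketch below does only that check, using `__get_cpuid_count` from the GCC/Clang `<cpuid.h>` header; the function name `cpu_has_fsgsbase` is hypothetical. Note that the CPUID bit alone is not sufficient: the kernel must also set CR4.FSGSBASE before the instructions can be executed from user mode, and a comment in this thread reports that they do not appear to be enabled in userspace on OS X.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#    include <cpuid.h>
#endif

/* Returns true if the CPU reports FSGSBASE support
 * (CPUID.(EAX=7,ECX=0):EBX bit 0). This says nothing about whether the
 * OS has enabled the instructions for user mode via CR4.FSGSBASE. */
bool
cpu_has_fsgsbase(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid_count() returns 0 when leaf 7 is not supported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx & 1u) != 0;
#else
    /* Not an x86 build: the feature does not apply. */
    return false;
#endif
}
```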