Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__errno_location crash after call to __cxa_guard_acquire #2268

Closed
MagnusS opened this issue Aug 28, 2024 · 8 comments
Closed

__errno_location crash after call to __cxa_guard_acquire #2268

MagnusS opened this issue Aug 28, 2024 · 8 comments

Comments

@MagnusS
Copy link
Member

MagnusS commented Aug 28, 2024

When enabling SMP the initial call to RNG::get().init() in apic_revenant crashes. RNG::get().init() returns a static variable that is initialised by the first call to get() (here). After stepping through the code I see that the crash is caused by a surrounding call to __cxa_guard_acquire that protects the initialisation of the static variable, which then calls a syscall that returns ENOSYS. This value is written to __errno_location by musl, which then triggers a CPU exception.

There seems to be multiple issues here:

__cxa_guard_acquire is added by the compiler to protect the initialisation of static variables at runtime (such as in RNG). The implementation is here. Depending on whether we have threads enabled in LIBCXX it will call various locking mechanisms in libc/musl, which have to be initialised first. If this happens early in the boot process, the kernel/musl may be unable to handle them. I'm not sure what the best approach to fix this would be, but ideally we shouldn't use code that relied on this feature until we get far enough in the boot process.

__errno_location isn't set up properly, at least not on other cores than core 0, and syscalls will write to the memory location it points to. In combination with the __cxa_guard_acquire issue above, this may happen unexpectedly. After a call to syscall, musl's syscall_ret writes errno to the location stored in __errno_location function from the pthread struct. The struct is retrieved from %fs:0 here. It looks like TLS may not be properly set up, there's a function in IncludeOS to do it here -- but it doesn't look like it's called from apic_revenant.

I'm not sure why this has worked previously, but perhaps the new toolchain/build system happens to make errno_val point outside valid memory, so we now get a crash instead of potential instability? Could this have been causing issues such as #2252?

To reproduce, first enable SMP, then run nix-shell --argstr unikernel test/kernel/integration/smp --run ./test.py

(this is on v0.16.0-release branch)

@fwsGonzo
Copy link
Member

You can compile a module with -fno-threadsafe-statics to solve the constructor issue.

@MagnusS
Copy link
Member Author

MagnusS commented Aug 28, 2024

You can compile a module with -fno-threadsafe-statics to solve the constructor issue.

I guess that could at least avoid the calls during the boot process, we just have to be careful about potential races.

@fwsGonzo do you know why TLS may not be initialised properly? Was it removed at some point?

@fwsGonzo
Copy link
Member

Sadly, I don't remember - just keep in mind that global construction order is one of the big fiascos of C++:

https://en.cppreference.com/w/cpp/language/siof

Maybe the RNG can be initialized lazily. BSS will at least be zero, and since we have to re-seed it occasionally, just making it have to reseed on first use may be a way forward?

@alfreb
Copy link
Contributor

alfreb commented Aug 29, 2024

I like that idea of lazy RNG - we might even want to make it a plugin or driver. I think in our case the global construction order is not the main issue, but rather that we can't depend on any constructors being called until libc is fully initialized - and we now know that cxa_guard is one of the reasons. We do decide when to initialize, and can also initialize different translation units in the order we prefer, via the linker script trick of putting them in different sections. But yea, I think we need to extend the rules:

  1. No syscalls can be allowed before libc is initialized
  2. No dependency on static class members or global constructors until libc is initialized
  3. No exceptions can be thrown before libc is initialized (I think they depend on global ctors, but definitely on allocation)

There may be more rules - and I don't think we can enforce them automatically - but it should be a small subset of the kernel features that needs to be ready before we can initialize libc, so we should make sure to clearly identify that. One way of doing it would be to only allow pure C, but I think we can use a pretty decent subset of C++, including PMR allocators.

@MagnusS
Copy link
Member Author

MagnusS commented Sep 4, 2024

After thinking a bit more about this I think it would make sense to compile the kernel with -fno-threadsafe-statics to avoid the lock guard for now. There's only a single user until we enable SMP (but then we have to lock manually). Perhaps the service could still be compiled with threadsafety on?

I added this in #2273 (see PR and commits for more details). We could change this if there's a better approach for delaying initialisation later.

@fwsGonzo
Copy link
Member

fwsGonzo commented Sep 4, 2024

No need to compile the entire OS with no-threadsafe-statics. You can use this in CMake:

set_source_files_properties(file.cpp PROPERTIES
	COMPILE_DEFINITIONS MYDEF=1)

And

set_source_files_properties(file.cpp
	PROPERTIES COMPILE_FLAGS -fno-builtin)

I think a lot of footguns have been removed when syscalls panic during boot, so being selective here is fine.

@MagnusS
Copy link
Member Author

MagnusS commented Sep 4, 2024

My reasoning for doing it for the whole kernel in the PR was that if we are disabling it manually for classes that do this anyway, then it's better to not have it on. But I don't have a very strong preference.

If we can turn it on for the service, do you think it would be beneficial to have it selectively in the kernel too? It wouldn't work for anything that is started before libc (and after that we don't have full futex).

@MagnusS
Copy link
Member Author

MagnusS commented Sep 7, 2024

This should be fixed now that #2273 has been merged as we no longer use thread safe statics and disallow syscalls until musl has been initialised.

@MagnusS MagnusS closed this as completed Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants