fix crashes with not-yet fully-initialized stackprof global state #226
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We've been observing rare crashes in several of our Ruby services that seem to implicate races in the stackprof extension. This PR proposes a fix that we think would eliminate at least one of the races.
We sample a percentage of requests with stackprof in wall mode, and noticed that crashes happened more consistently on requests that had stackprof sampling turned on. When we turned stackprof off, the crashes went away. We looked more closely at the crashes, and saw native C frames (printed via Ruby's segfault handler) at the interesting point that looked like:
This is extremely peculiar: we are executing
ppoll
, we get a signal, wepthread_kill
from that signal, and we SIGSEGV inpthread_kill
.Inspecting the register state for the SIGSEGV says that
pthread_kill
is receiving aNULL
pthread_t
and is trying to pass along SIGALRM. The only place we can find in our Ruby application that usespthread_kill
and would be passing SIGALRM are these lines from the stackprof extension:stackprof/ext/stackprof/stackprof.c
Lines 757 to 767 in ebdd3af
Notice that this code only triggers when we are profiling in wall mode, which is consistent with the failure mode that we described above.
So our hypothesis is that we're getting a stackprof signal, we forward the signal on, but somehow we are passing a
NULL
pthread_t
here. (Technically there's no such thing, but this is how the implementation works on Linux. Note too that we cannot portably test forNULL
pthread_t
...but I guess we could consider doing that as a follow-on?) But how would_stackprof.target_thread
beNULL
? We looked at the initialization code:stackprof/ext/stackprof/stackprof.c
Lines 220 to 228 in ebdd3af
Notice that we set
_stackprof.running
prior to the initialization of_stackprof.target_thread
. So the code in the signal handler:stackprof/ext/stackprof/stackprof.c
Lines 743 to 757 in ebdd3af
can see that we are running, despite the rest of the fields in
_stackprof
not yet having been initialized, includingtarget_thread
.That is, the sequence of events looks like:
ITIMER_REAL
timer._stackprof.running
._stackprof.target_thread
._stackprof.running
is true._stackprof.target_thread
, as it hasn't been set yet._stackprof.target_thread
is not a valid thread ID.pthread_kill
attempts to load data off a null pointer, and we crash.This would be a pretty rare race condition, but it very much lines up with the facts that we have.
The fix proposed here is in two parts: we need to not set
_stackprof.running
until all the relevant state has been initialized. But we need to go a little further than that, because there's no relation between the effects of the store to_stackprof.running
and the store to_stackprof.target_thread
. So we need to ensure that all prior stores are visible to all threads after_stackprof.running
has been set; otherwise, we could still observe a true_stackprof.running
and a not-yet-initialized_stackprof.target_thread
on some other thread. We can only ensure the visibility of those stores with atomic instructions, and since this gem purports to support many different Ruby versions, we have to roll our own support for atomics. And finally, setting_stackprof.running
with an atomic instruction effectively makes the type of that field an atomic type, so we should only access it via atomic instructions where supported.Tested on arm64 darwin;
rake test
appears to hang, but I verified that it hangs in the same place (via--seed=
) before and after this fix, so I don't think this fix changes anything.