fix crashes with not-yet fully-initialized stackprof global state #226

froydnj · 2024-02-05T15:50:13Z

We've been observing rare crashes in several of our Ruby services that seem to implicate races in the stackprof extension. This PR proposes a fix that we think would eliminate at least one of the races.

We sample a percentage of requests with stackprof in wall mode, and noticed that crashes happened more consistently on requests that had stackprof sampling turned on. When we turned stackprof off, the crashes went away. We looked more closely at the crashes, and saw native C frames (printed via Ruby's segfault handler) at the interesting point that looked like:

/path/to/ruby/bin/../lib/libruby.so.3.1(sigsegv+0x4d) [0x7f5034fe53ed] signal.c:964
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7f5034e01420]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_kill+0xb) [0x7f5034dfdf3b]
/lib/x86_64-linux-gnu/libpthread.so.0(0x14420) [0x7f5034e01420]
/lib/x86_64-linux-gnu/libc.so.6(ppoll+0xa6) [0x7f5034aedcb6]
/path/to/ruby/bin/../lib/libruby.so.3.1(native_ppoll_sleep+0x162) [0x7f5035020162] thread_pthread.c:2200

This is extremely peculiar: we are executing ppoll, we get a signal, we pthread_kill from that signal, and we SIGSEGV in pthread_kill.

Inspecting the register state for the SIGSEGV says that pthread_kill is receiving a NULL pthread_t and is trying to pass along SIGALRM. The only place we can find in our Ruby application that uses pthread_kill and would be passing SIGALRM are these lines from the stackprof extension:

stackprof/ext/stackprof/stackprof.c

Lines 757 to 767 in ebdd3af

    
           if (_stackprof.mode == sym_wall) { 
        
               // In "wall" mode, the SIGALRM signal will arrive at an arbitrary thread. 
        
               // In order to provide more useful results, especially under threaded web 
        
               // servers, we want to forward this signal to the original thread 
        
               // StackProf was started from. 
        
               // According to POSIX.1-2008 TC1 pthread_kill and pthread_self should be 
        
               // async-signal-safe. 
        
               if (pthread_self() != _stackprof.target_thread) { 
        
                   pthread_kill(_stackprof.target_thread, sig); 
        
                   return; 
        
               }

Notice that this code only triggers when we are profiling in wall mode, which is consistent with the failure mode that we described above.

So our hypothesis is that we're getting a stackprof signal, we forward the signal on, but somehow we are passing a NULL pthread_t here. (Technically there's no such thing, but this is how the implementation works on Linux. Note too that we cannot portably test for NULL pthread_t...but I guess we could consider doing that as a follow-on?) But how would _stackprof.target_thread be NULL? We looked at the initialization code:

stackprof/ext/stackprof/stackprof.c

Lines 220 to 228 in ebdd3af

    
           _stackprof.running = 1; 
        
           _stackprof.raw = raw; 
        
           _stackprof.aggregate = aggregate; 
        
           _stackprof.mode = mode; 
        
           _stackprof.interval = interval; 
        
           _stackprof.ignore_gc = ignore_gc; 
        
           _stackprof.metadata = metadata; 
        
           _stackprof.out = out; 
        
           _stackprof.target_thread = pthread_self();

Notice that we set _stackprof.running prior to the initialization of _stackprof.target_thread. So the code in the signal handler:

stackprof/ext/stackprof/stackprof.c

Lines 743 to 757 in ebdd3af

    
           static void 
        
           stackprof_signal_handler(int sig, siginfo_t *sinfo, void *ucontext) 
        
           { 
        
               static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; 
        
               _stackprof.overall_signals++; 
        
               if (!_stackprof.running) return; 
        
               // There's a possibility that the signal handler is invoked *after* the Ruby 
        
               // VM has been shut down (e.g. after ruby_cleanup(0)). In this case, things 
        
               // that rely on global VM state (e.g. rb_during_gc) will segfault. 
        
               if (!ruby_vm_running) return; 
        
               if (_stackprof.mode == sym_wall) {

can see that we are running, despite the rest of the fields in _stackprof not yet having been initialized, including target_thread.

That is, the sequence of events looks like:

We register a signal handler and set a ITIMER_REAL timer.
We set _stackprof.running.
The timer goes off at some point prior to setting _stackprof.target_thread.
We deliver the signal to some thread; it doesn't matter what thread this is.
The signal handler sees _stackprof.running is true.
The signal handler sees that we are not on _stackprof.target_thread, as it hasn't been set yet.
The signal handler forwards the signal, but _stackprof.target_thread is not a valid thread ID.
pthread_kill attempts to load data off a null pointer, and we crash.

This would be a pretty rare race condition, but it very much lines up with the facts that we have.

The fix proposed here is in two parts: we need to not set _stackprof.running until all the relevant state has been initialized. But we need to go a little further than that, because there's no relation between the effects of the store to _stackprof.running and the store to _stackprof.target_thread. So we need to ensure that all prior stores are visible to all threads after _stackprof.running has been set; otherwise, we could still observe a true _stackprof.running and a not-yet-initialized _stackprof.target_thread on some other thread. We can only ensure the visibility of those stores with atomic instructions, and since this gem purports to support many different Ruby versions, we have to roll our own support for atomics. And finally, setting _stackprof.running with an atomic instruction effectively makes the type of that field an atomic type, so we should only access it via atomic instructions where supported.

Tested on arm64 darwin; rake test appears to hang, but I verified that it hangs in the same place (via --seed=) before and after this fix, so I don't think this fix changes anything.

froydnj · 2024-02-06T17:45:45Z

This looks like it triggers an rb_bug in Ruby 3.1.4. :( Unsure whether that's a problem in the PR, or a latent bug in Ruby itself.

froydnj · 2024-02-20T21:12:15Z

After the fixed commit last week, this passes tests on Ruby 3.1 on OS X.

froydnj · 2024-02-20T21:48:13Z

I can see the same intermittent failure as "CI / platforms (macos, 3.0)" on my local arm64 laptop (though I more often see test_raw failing)

I'm not sure what's going on with the failure in "CI / rubies (ruby-head)", unless ruby HEAD has changed how it reports allocations. I very rarely see that same failure locally on my laptop (maybe 1 out of every 200-300 rake test runs).

don't set running until all relevant state is initialized

08b5127

be more diligent about atomic operations

4e504d3

froydnj force-pushed the froydnj-fix-timing-crash branch from 268fd02 to 4e504d3 Compare February 16, 2024 19:49

XrXr mentioned this pull request Dec 11, 2024

Fix race with timer signal in stackprof_start() #233

Closed

tenderlove approved these changes Dec 18, 2024

View reviewed changes

tenderlove merged commit d90ad35 into tmm1:master Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix crashes with not-yet fully-initialized stackprof global state #226

fix crashes with not-yet fully-initialized stackprof global state #226

Uh oh!

froydnj commented Feb 5, 2024

Uh oh!

froydnj commented Feb 6, 2024

Uh oh!

froydnj commented Feb 20, 2024

Uh oh!

froydnj commented Feb 20, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	if (_stackprof.mode == sym_wall) {
	// In "wall" mode, the SIGALRM signal will arrive at an arbitrary thread.
	// In order to provide more useful results, especially under threaded web
	// servers, we want to forward this signal to the original thread
	// StackProf was started from.
	// According to POSIX.1-2008 TC1 pthread_kill and pthread_self should be
	// async-signal-safe.
	if (pthread_self() != _stackprof.target_thread) {
	pthread_kill(_stackprof.target_thread, sig);
	return;
	}

	_stackprof.running = 1;
	_stackprof.raw = raw;
	_stackprof.aggregate = aggregate;
	_stackprof.mode = mode;
	_stackprof.interval = interval;
	_stackprof.ignore_gc = ignore_gc;
	_stackprof.metadata = metadata;
	_stackprof.out = out;
	_stackprof.target_thread = pthread_self();

	static void
	stackprof_signal_handler(int sig, siginfo_t sinfo, void ucontext)
	{
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

	_stackprof.overall_signals++;

	if (!_stackprof.running) return;

	// There's a possibility that the signal handler is invoked after the Ruby
	// VM has been shut down (e.g. after ruby_cleanup(0)). In this case, things
	// that rely on global VM state (e.g. rb_during_gc) will segfault.
	if (!ruby_vm_running) return;

	if (_stackprof.mode == sym_wall) {

fix crashes with not-yet fully-initialized stackprof global state #226

fix crashes with not-yet fully-initialized stackprof global state #226

Uh oh!

Conversation

froydnj commented Feb 5, 2024

Uh oh!

froydnj commented Feb 6, 2024

Uh oh!

froydnj commented Feb 20, 2024

Uh oh!

froydnj commented Feb 20, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants