Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Re-add zombie leader on exit, gdb/linux
The current zombie leader detection code in linux-nat.c has a race -- if a multi-threaded inferior exits just before check_zombie_leaders finds that the leader is now zombie via checking /proc/PID/status, check_zombie_leaders deletes the leader, assuming we won't get an event for that exit (which we won't in some scenarios, but not in this one). That might seem mostly harmless, but it has some downsides: - later when we continue pulling events out of the kernel, we will collect the exit event of the non-leader threads, and once we see the last lwp in our list exit, we return _that_ lwp's exit code as whole-process exit code to infrun, instead of the leader's exit code. - this can cause a hang in stop_all_threads in infrun.c. Say there are 2 threads in the process. stop_all_threads stops each of those threads, and then waits for two stop or exit events, one for each thread. If the whole process exits, and check_zombie_leaders hits the false-positive case, linux-nat.c will only return one event to GDB (the whole-process exit returned when we see the last thread, the non-leader thread, exit), making stop_all_threads hang forever waiting for a second event that will never come. However, in this false-positive scenario, where the whole process is exiting, as opposed to just the leader (with pthread_exit(), for example), we _will_ get an exit event shortly for the leader, after we collect the exit event of all the other non-leader threads. Or put another way, we _always_ get an event for the leader after we see it become zombie. I tried a number of approaches to fix this: #1 - My first thought to address the race was to make GDB always report the whole-process exit status for the leader thread, not for whatever is the last lwp in the list. We _always_ get a final exit (or exec) event for the leader, and when the race triggers, we're not collecting it. #2 - My second thought was to try to plug the race in the first place. I thought of making GDB call waitpid/WNOHANG for all non-leader threads immediately when the zombie leader is detected, assuming there would be an exit event pending for each of them waiting to be collected. Turns out that that doesn't work -- you can see the leader become zombie _before_ the kernel kills all other threads. Waitpid in that small time window returns 0, indicating no-event. Thankfully we hit that race window all the time, which avoided trading one race for another. Looking at the non-leader thread's status in /proc doesn't help either, the threads are still in running state for a bit, for the same reason. bminor#3 - My next attempt, which seemed promising, was to synchronously stop and wait for the stop for each of the non-leader threads. For the scenario in question, this will collect all the exit statuses of the non-leader threads. Then, if we are left with only the zombie leader in the lwp list, it means we either have a normal while-process exit or an exec, in which case we should not delete the leader. If _only_ the leader exited, like in gdb.threads/leader-exit.exp, then after pausing threads, we will still have at least one live non-leader thread in the list, and so we delete the leader lwp. I got this working and polished, and it was only after staring at the kernel code to convince myself that this would really work (and it would, for the scenario I considered), that I realized I had failed to account for one scenario -- if any non-leader thread is _already_ stopped when some thread triggers a group exit, like e.g., if you have some threads stopped and then resume just one thread with scheduler-locking or non-stop, and that thread exits the process. I also played with PTRACE_EVENT_EXIT, see if it would help in any way to plug the race, and I couldn't find a way that it would result in any practical difference compared to looking at /proc/PID/status, with respect to having a race. So I concluded that there's no way to plug the race, we just have to deal with it. Which means, going back to approach #1. That is the approach taken by this patch. Change-Id: I6309fd4727da8c67951f9cea557724b77e8ee979
- Loading branch information