proposal: Add a Partial Deadlock Detector For Go #13759
Here is one approach that doesn't perturb the GC and its data structures:

- Distinguish all objects being waited on and their associated Goroutines.
- Do a mark pass rooted only at the Goroutines that are not blocked. At this point untraced Goroutines are blocked on objects that are only reachable from other blocked Goroutines.
- These Goroutines and associated objects are revealed to the user.
- To continue, trace the perma-blocked Goroutines and finish the GC cycle.

This approach solves the problem of having to maintain all paths to the objects being waited on.

Using GC for these types of things is a great idea.
Years ago I worked on a Java system that detected vanilla deadlocks on the fly. In a green-threads-like system (which Go also resembles), when a thread is about to enter a wait state because of a lock, if the lock records the thread that owns it (this was true in our implementation), the about-to-wait thread can trace from lock, to thread, to lock, to thread, etc., as long as the threads are in a wait state. If this search runs "too far" (e.g., more than 16 steps), the algorithm switches to serious deadlock detection: it notes the current thread, and then the 32nd, 64th, etc., and checks each new thread encountered against those recorded, searching for a duplicate. This is something that can take the place of spinning for a lock; our system didn't do any spinning by default, and enabling the deadlock detector tended to improve performance slightly.

The sort of information that we found useful in a deadlock was a stack trace for each thread involved. I think (not sure) that we actually marked threads on the fly while searching for deadlocks; a "searching" thread would abandon a search if it encountered some other thread's mark, and a thread would report a cycle if it encountered its own mark. However, I think that this was vulnerable to sometimes missing deadlocks in a race-to-deadlock, and also required more expensive load/store barriers.
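A minimal Go sketch of the chain walk described above, assuming each lock records its owning thread; the `thread` and `lock` types are hypothetical bookkeeping, not real Go runtime structures (a real implementation would also have to tolerate the chain mutating underneath it):

```go
// Hypothetical sketch of the lock -> owner -> lock chain walk. The types
// stand in for runtime bookkeeping; they are not Go runtime APIs.
package main

type lock struct {
	owner *thread // thread currently holding this lock; nil if free
}

type thread struct {
	waitingOn *lock // lock this thread is blocked on; nil if runnable
}

// wouldDeadlock reports whether blocking cur on l closes a waits-for
// cycle. It walks lock -> owner -> lock -> owner, recording checkpoint
// threads at steps 32, 64, 128, ... so that any cycle, however deep in
// the chain, is eventually revisited via a recorded thread.
func wouldDeadlock(cur *thread, l *lock) bool {
	recorded := map[*thread]bool{cur: true}
	checkpoint := 32 // below this, the walk is cheap and records nothing
	for steps := 1; l != nil; steps++ {
		t := l.owner
		if t == nil {
			return false // lock is free: the chain can make progress
		}
		if recorded[t] {
			return true // reached a recorded thread: waits-for cycle
		}
		if t.waitingOn == nil {
			return false // chain ends at a runnable thread: no deadlock
		}
		if steps == checkpoint {
			recorded[t] = true
			checkpoint *= 2
		}
		l = t.waitingOn
	}
	return false
}
```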
This proposal is okay in principle; it just needs someone to write a detailed proposal document that explains how it will be implemented.
Ping?
The current proposed solution relies on a stop-the-world pause to calculate the set of deadlocked goroutines. This might be undesirable given our current plan, which aims to reduce STW pauses as much as possible. If one does write a detailed implementation proposal, please be sure to consider whether it's possible to do this fully concurrently, without an STW pause.
This issue requires more feedback from people who work on the runtime. Is this desirable? Or is it impractical given existing constraints?
I'm not sure I fully understand what's being proposed. Traditional waits-for analysis doesn't require any sort of integration into the marking process, so I'm not sure what you're getting at there (I've done waits-for analysis on runtime locks using just a few hooks and a totally lame GDB script). I'm also not clear what edges you're adding to the waits-for graph in your detailed steps.

Channels

But my main concern is that most synchronization in Go is done with channels, and channels don't fit into traditional waits-for analysis and deadlock detection. For example, if a goroutine is blocked on receiving from a channel, it could become unblocked at some unknown point in the future if the send side of that channel is reachable by any non-blocked goroutine. But reachability is far from a sufficient condition for a send to happen. Channels don't meet the "necessary conditions" you linked to. You could in principle trace the reachability of channels (perhaps this is what you're getting at by integrating it into the garbage collector?), but that has both false positives and false negatives. A channel may be reachable by a goroutine, but that goroutine will never send on it, in which case we cannot detect a deadlock even if there is one (as a simple case, the channel could be globally reachable). On the flip side, it's perfectly valid (if a bit odd) to do something like …

Mutexes

But even if we restrict ourselves to sync.Mutex, we still have to consider reachability, because sync.Mutex can be unlocked by any goroutine, not just the one that last locked it (perhaps that was a mistake).

Considering reachability

Can we account for reachability? I think we could, but it's actually quite onerous, since you have to know that the mutex (or channel) is reachable specifically from a non-blocked goroutine. Currently, the garbage collector has no idea what's reachable from what. Distinguishing multiple types of reachability means it would have to flood a secondary property through the heap graph. Propagating that information requires either performing a second full mark phase, or dividing the mark phase into two strictly separate sub-phases, where the first starts from just non-blocked goroutines and global roots and finishes a complete graph flood before starting a second flood from blocked goroutines.

Counter-proposal

Of course, I agree that it would be a great debugging tool to be able to find deadlocks. Here's a counter-proposal: rather than trying to do this online while a program is running, do it offline, perhaps as a heap dump analysis (once we have a library for that). The runtime could help out by recording which goroutine holds each mutex. I posit that most of the time you know a deadlock has happened, and the tricky part is figuring out exactly what's involved in the deadlock. Deadlocks, conveniently, sit around waiting to be debugged, so you could capture a heap dump and let a tool find cycles for you. These cycles could even be potential cycles (flagged as such), where the channel or mutex is still reachable from a non-blocked goroutine, but maybe that goroutine is never going to send/unlock. A channel or mutex that's only globally reachable is probably a good signal here (and requires yet another reachability analysis).
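To make the "globally reachable" false negative concrete, here is an illustrative example (mine, not from the thread): the receive below never completes, yet the channel stays reachable from the running main goroutine, so a reachability-based detector cannot flag it:

```go
// Illustrates the false negative described above: ch is reachable from a
// non-blocked goroutine (main, via the package-level variable), so a
// reachability-based detector must assume a send could still happen --
// but no send ever does, and the receiver leaks forever.
package main

import "time"

var ch = make(chan int) // globally reachable; nothing ever sends on it

func main() {
	go func() {
		<-ch // blocked forever, yet not provably deadlocked by reachability
	}()
	for {
		time.Sleep(time.Second) // main stays runnable, keeping ch "live"
	}
}
```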
I feel this is a bit of a semantic disagreement. A goroutine that will never proceed but cannot be garbage collected is "deadlocked" to me. We can detect a subset of goroutines that will never proceed. I don't think anyone would expect us to detect all goroutines that will never proceed (isn't that the halting problem?). We could certainly collect these goroutines, and that may be very valuable. I was, however, concerned about the performance impact if we change the GC to detect them. Like the race detector, I had hoped this feature would be low overhead, not no overhead.
Yes, reachability is why I proposed linking into the garbage collector.
My proposal was focused on the two-phase mark, though I see I did not make that clear. In practice the number of roots (driven mostly by the number of goroutines) far exceeds the number of cores we have to mark with anyway, so prioritizing unblocked goroutines shouldn't hurt performance too badly. I'm not familiar with the details of the GC (especially the 1.8 and 1.9 GCs), so I can't estimate the amount of work needed to divide the mark into two phases. Keep in mind the intention is that this, like the race detector, would not be enabled by default, to avoid performance penalties.
I've actually done this, and we have used it internally. The problem is that, as with races, one generally doesn't know to start looking until the application has died in production, and then you can't fetch a heap dump. Deadlocks and leaked goroutines have even led some to decry their use. Anecdotally, I have heard similar sentiments from engineers at GopherCon. This would also obviate the need for features like #6705 and lots of related test code.
This sounds like a great idea. The devil is in the details. We'll mark this proposal accepted to reflect that we're certainly in favor of having this kind of functionality, but we are not ourselves planning to do the work. If you would like to do this and can make it work, that would be great.
The Proposal
Deadlocks make programs fail. Partial deadlocks (where a subset of the goroutines in a system are deadlocked) are difficult to find and debug. While predicting the occurrence of these deadlocks is not computable, detecting them when they occur is. A partial deadlock detector would be a powerful tool for debugging our concurrent programs.
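For concreteness, a minimal illustration (not from the original proposal) of a partial deadlock that today's runtime cannot report, because its built-in check only fires when all goroutines are asleep:

```go
// A deterministic partial deadlock: g1 and g2 each wait to receive from
// the other, so neither can ever proceed -- while main keeps running, so
// Go's built-in "all goroutines are asleep" check never fires.
package main

import "time"

func main() {
	c1, c2 := make(chan int), make(chan int)
	go func() { <-c1; c2 <- 1 }() // g1: waits on c1 before sending on c2
	go func() { <-c2; c1 <- 1 }() // g2: waits on c2 before sending on c1
	time.Sleep(time.Minute) // main stays alive; g1 and g2 leak silently
}
```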
Like the race detector, a flag-enabled deadlock detector would use environment variables to control whether it logs or exits on detected deadlocks. The deadlock detector would not detect all deadlocks, or all potential deadlocks, only a subset of those that are actually occurring.
The goal of this proposal is to see whether such a feature is desirable, and to present a plausible implementation. If it is desirable, a detailed design doc will follow.
Is this the Halting Problem?
No. If the necessary conditions are met, the affected goroutines will never continue. The detector only reports deadlocks that have already occurred; it makes no attempt to predict whether a deadlock will occur, which is what would be uncomputable.
Proposed Implementation
High Level
In short, the proposed implementation is to hook into the garbage collector and use its mark phase to build a wait-for graph, then use Tarjan's algorithm to find deadlock cycles: that is, the strongly connected components with no edges to non-deadlocked components.
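As a concrete illustration of this step, here is a self-contained sketch (the graph encoding and names are assumptions, not runtime code): Tarjan's algorithm over an already-extracted wait-for graph, keeping only components that wait on something and whose every outgoing edge stays within already-deadlocked nodes:

```go
// Sketch of the cycle-finding step: goroutine IDs are nodes, and an edge
// g -> h means g is waiting on h. Tarjan's algorithm emits components in
// reverse topological order, so by the time a component is classified,
// everything it can reach has already been classified.
package main

import "fmt"

type graph map[int][]int // adjacency list: node -> nodes it waits on

// sccs returns the strongly connected components of g.
func sccs(g graph) [][]int {
	index, low := map[int]int{}, map[int]int{}
	onStack := map[int]bool{}
	var stack []int
	var comps [][]int
	next := 0

	var connect func(v int)
	connect = func(v int) {
		index[v], low[v] = next, next
		next++
		stack = append(stack, v)
		onStack[v] = true
		for _, w := range g[v] {
			if _, seen := index[w]; !seen {
				connect(w)
				if low[w] < low[v] {
					low[v] = low[w]
				}
			} else if onStack[w] && index[w] < low[v] {
				low[v] = index[w]
			}
		}
		if low[v] == index[v] { // v roots a component: pop it off the stack
			var comp []int
			for {
				w := stack[len(stack)-1]
				stack = stack[:len(stack)-1]
				onStack[w] = false
				comp = append(comp, w)
				if w == v {
					break
				}
			}
			comps = append(comps, comp)
		}
	}
	for v := range g {
		if _, seen := index[v]; !seen {
			connect(v)
		}
	}
	return comps
}

// deadlocked marks a component when it waits on something (has at least
// one edge) and no edge escapes to a node that can still make progress.
func deadlocked(g graph, comps [][]int) map[int]bool {
	dead := map[int]bool{}
	for _, comp := range comps { // reverse topological order
		in := map[int]bool{}
		for _, v := range comp {
			in[v] = true
		}
		waits, escapes := false, false
		for _, v := range comp {
			for _, w := range g[v] {
				waits = true
				if !in[w] && !dead[w] {
					escapes = true // waits on something still live
				}
			}
		}
		if waits && !escapes {
			for _, v := range comp {
				dead[v] = true
			}
		}
	}
	return dead
}

func main() {
	// 1 and 2 wait on each other; 3 waits on 1; 4 waits on 5, which is
	// runnable. Expect 1, 2, and 3 reported as deadlocked; 4 and 5 not.
	g := graph{1: {2}, 2: {1}, 3: {1}, 4: {5}, 5: nil}
	fmt.Println(deadlocked(g, sccs(g)))
}
```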
More Detailed
As Go uses a mark-sweep garbage collector, most of the computationally expensive work necessary for deadlock detection is already being done. Here is how the proposed detector would work (in very high-level terms); a sketch follows.
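A hedged sketch of that flow, assuming the two-phase mark discussed earlier; every type and function below is a hypothetical stand-in for runtime bookkeeping, not an actual API:

```go
// Hypothetical sketch, not runtime code. The idea: run the mark in two
// sub-phases -- phase 1 floods from global roots and non-blocked
// goroutines only; phase 2 floods from blocked goroutines to finish the
// GC cycle. Any goroutine blocked on an object phase 1 never reached can
// only be woken by other blocked goroutines, so it is reported.
package main

import "fmt"

type object struct {
	name         string
	markedPhase1 bool // reached from a non-blocked goroutine or global root
}

type G struct {
	id      int
	blocked bool
	waitOn  *object // channel or mutex this goroutine is blocked on
}

// suspects runs between the two mark sub-phases and collects goroutines
// whose wait object is unreachable from anything that can still run. Per
// the discussion above, this is necessary but not sufficient evidence,
// so false negatives remain (e.g. a globally reachable channel).
func suspects(gs []*G) []*G {
	var dead []*G
	for _, g := range gs {
		if g.blocked && g.waitOn != nil && !g.waitOn.markedPhase1 {
			dead = append(dead, g) // in practice: record a stack trace too
		}
	}
	return dead
}

func main() {
	shared := &object{name: "ch1", markedPhase1: true}  // held by a runner
	orphan := &object{name: "ch2", markedPhase1: false} // blocked-only
	gs := []*G{
		{id: 1, blocked: true, waitOn: shared},
		{id: 2, blocked: true, waitOn: orphan},
		{id: 3, blocked: false},
	}
	for _, g := range suspects(gs) {
		fmt.Printf("goroutine %d deadlocked on %s\n", g.id, g.waitOn.name)
	}
}
```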
Notes
I have intentionally glossed over a number of details for simplicity's sake.