There was a deadlock issue we fixed under hashicorp/consul#3230, and then a follow-up issue was discovered under hashicorp/consul#3545. This PR ports over those fixes, and also makes the revoke actions run only if leadership was actually established. This brings the Nomad leader loop in line with Consul's.
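For reference, here is a minimal, self-contained Go sketch of that guard; the helper names (establishLeadership, revokeLeadership) and the surrounding loop are illustrative stand-ins, not the actual Nomad code. The idea is simply to remember whether leadership setup succeeded and to run the revoke path only in that case.

```go
package main

import (
	"errors"
	"log"
)

// Hypothetical stand-in for the work done when leadership is won
// (barrier write, reconciliation setup, and so on).
func establishLeadership() error {
	return errors.New("simulated disk write error")
}

// Hypothetical stand-in for tearing down leader-only state.
func revokeLeadership() error {
	return nil
}

func leaderLoop(stopCh <-chan struct{}) {
	establishedLeader := false

	if err := establishLeadership(); err != nil {
		log.Printf("[ERR] failed to establish leadership: %v", err)
	} else {
		establishedLeader = true
	}

	<-stopCh // run as leader until asked to step down

	// Only run the revoke path if leadership was actually established,
	// so we never tear down state we did not set up.
	if establishedLeader {
		if err := revokeLeadership(); err != nil {
			log.Printf("[ERR] failed to revoke leadership: %v", err)
		}
	}
}

func main() {
	stopCh := make(chan struct{})
	close(stopCh) // step down immediately for the demo
	leaderLoop(stopCh)
}
```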
consul version for both Client and Server
Client: 0.8.5
Server: 0.8.5
Operating system and Environment details
Tested on Linux, but this should be reproducible in other environments.
Description of the Issue (and unexpected/desired result)
Reproduction steps
I found this when working on #1744.
Run a Consul server with bootstrap-expect=1 and make it lose leadership by filling up the disk (any exhausted resource would do, not just disk). After a few iterations, leader election messages stop being logged, and hitting debug/pprof/goroutine?debug=1 shows the deadlocked goroutines. When the node gives up leadership because of a disk write error, Raft's runLeader is waiting for the reconcile channel to be free so it can write to it, but Consul's monitorLeadership method is unable to read from the reconcile channel to drain it because it's waiting for the barrier write to finish. The barrier write blocks because, even though the node's state has been set to follower, it waits until the runLeader loop finishes so that the runFollower goroutine can process the apply channel containing the barrier write. Thus the deadlock!
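To make the circular wait easier to see, here is a toy, self-contained Go sketch of the same shape; the channel names and goroutine roles are illustrative stand-ins, not Consul's actual code. One goroutine blocks handing off a reconcile request, while the other only drains that channel after the first goroutine has exited, so neither side can ever make progress.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	reconcileCh := make(chan string)      // raft -> server reconcile requests
	leaderLoopDone := make(chan struct{}) // closed when the "leader loop" exits

	// Stand-in for raft's runLeader: it tries to hand off a reconcile
	// request and cannot exit (and close leaderLoopDone) until that
	// send succeeds.
	go func() {
		reconcileCh <- "reconcile member" // blocks: nobody is reading
		close(leaderLoopDone)
	}()

	// Stand-in for monitorLeadership: it is stuck waiting for the
	// "barrier write", which here can only finish once the leader loop
	// has exited, so it never gets around to draining reconcileCh.
	go func() {
		<-leaderLoopDone // blocks forever
		for range reconcileCh {
			// reconcile requests would be drained here
		}
	}()

	time.Sleep(500 * time.Millisecond)
	fmt.Println("both goroutines are still blocked on each other: deadlock")
}
```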
Discussed solution ideas with @slackpad; the simplest thing to do is to not let the barrier write block forever. It should time out after a conservative period and return, rather than keep waiting, if there is a barrier error.
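A minimal sketch of that mitigation, assuming hashicorp/raft's Barrier(timeout) API; the helper name, timeout value, and logging here are assumptions, not the actual change. With a nonzero timeout, a barrier that cannot be enqueued in time surfaces an error instead of blocking forever, and the caller can give up and return.

```go
package leader

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// Assumed value: a conservative upper bound instead of waiting forever.
const barrierWriteTimeout = 2 * time.Minute

// waitForBarrier issues the barrier write with a timeout and, on error,
// logs and returns rather than blocking indefinitely.
// (Illustrative helper name and wiring, not the actual Consul change.)
func waitForBarrier(r *raft.Raft, logger *log.Logger) bool {
	barrier := r.Barrier(barrierWriteTimeout)
	if err := barrier.Error(); err != nil {
		logger.Printf("[ERR] consul: failed to wait for barrier: %v", err)
		return false
	}
	return true
}
```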
Log Fragments or Link to gist
(see goroutine dump attachment)
Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist.
TIP: Use -log-level=TRACE on the client and server to capture the maximum log detail.