There was a deadlock issue we fixed under hashicorp/consul#3230, and then a follow-up issue was discovered under hashicorp/consul#3545. This PR ports over those fixes, and also makes the revoke actions run only if leadership was actually established. This brings the Nomad leader loop in line with Consul's.
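For reference, here is a minimal, self-contained Go sketch of that guard; the helper names (establishLeadership, revokeLeadership) and the surrounding loop are illustrative stand-ins, not the actual Nomad code. The idea is simply to remember whether leadership setup succeeded and to run the revoke path only in that case.

```go
package main

import (
	"errors"
	"log"
)

// Hypothetical stand-in for the work done when leadership is won
// (barrier write, reconciliation setup, and so on).
func establishLeadership() error {
	return errors.New("simulated disk write error")
}

// Hypothetical stand-in for tearing down leader-only state.
func revokeLeadership() error {
	return nil
}

func leaderLoop(stopCh <-chan struct{}) {
	establishedLeader := false

	if err := establishLeadership(); err != nil {
		log.Printf("[ERR] failed to establish leadership: %v", err)
	} else {
		establishedLeader = true
	}

	<-stopCh // run as leader until asked to step down

	// Only run the revoke path if leadership was actually established,
	// so we never tear down state we did not set up.
	if establishedLeader {
		if err := revokeLeadership(); err != nil {
			log.Printf("[ERR] failed to revoke leadership: %v", err)
		}
	}
}

func main() {
	stopCh := make(chan struct{})
	close(stopCh) // step down immediately for the demo
	leaderLoop(stopCh)
}
```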
consul version for both Client and Server
Client: 0.8.5
Server: 0.8.5
Operating system and Environment details
Tested on Linux, but this should be reproducible in other environments.
Description of the Issue (and unexpected/desired result)
Reproduction steps
I found this when working on #1744.
Run a Consul server with bootstrap-expect=1 and make it lose leadership by filling up the disk (any exhausted resource would do, not just disk). After a few iterations, leader election messages stop being logged, and hitting debug/pprof/goroutine?debug=1 shows the deadlocked goroutines. When the node gives up leadership because of a disk write error, Raft's runLeader is waiting for the reconcile channel to be free so it can write to it, but Consul's monitorLeadership method is unable to read from the reconcile channel to drain it because it's waiting for the barrier write to finish. The barrier write blocks because, even though the node's state has been set to follower, it waits until the runLeader loop finishes so that the runFollower goroutine can process the apply channel containing the barrier write. Thus the deadlock!
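To make the circular wait easier to see, here is a toy, self-contained Go sketch of the same shape; the channel names and goroutine roles are illustrative stand-ins, not Consul's actual code. One goroutine blocks handing off a reconcile request, while the other only drains that channel after the first goroutine has exited, so neither side can ever make progress.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	reconcileCh := make(chan string)      // raft -> server reconcile requests
	leaderLoopDone := make(chan struct{}) // closed when the "leader loop" exits

	// Stand-in for raft's runLeader: it tries to hand off a reconcile
	// request and cannot exit (and close leaderLoopDone) until that
	// send succeeds.
	go func() {
		reconcileCh <- "reconcile member" // blocks: nobody is reading
		close(leaderLoopDone)
	}()

	// Stand-in for monitorLeadership: it is stuck waiting for the
	// "barrier write", which here can only finish once the leader loop
	// has exited, so it never gets around to draining reconcileCh.
	go func() {
		<-leaderLoopDone // blocks forever
		for range reconcileCh {
			// reconcile requests would be drained here
		}
	}()

	time.Sleep(500 * time.Millisecond)
	fmt.Println("both goroutines are still blocked on each other: deadlock")
}
```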
Discussed solution ideas with @slackpad; the simplest thing to do is to not let the barrier write block forever. It should time out after a conservative period and return, rather than keep waiting, if there is a barrier error.
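A minimal sketch of that mitigation, assuming hashicorp/raft's Barrier(timeout) API; the helper name, timeout value, and logging here are assumptions, not the actual change. With a nonzero timeout, a barrier that cannot be enqueued in time surfaces an error instead of blocking forever, and the caller can give up and return.

```go
package leader

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// Assumed value: a conservative upper bound instead of waiting forever.
const barrierWriteTimeout = 2 * time.Minute

// waitForBarrier issues the barrier write with a timeout and, on error,
// logs and returns rather than blocking indefinitely.
// (Illustrative helper name and wiring, not the actual Consul change.)
func waitForBarrier(r *raft.Raft, logger *log.Logger) bool {
	barrier := r.Barrier(barrierWriteTimeout)
	if err := barrier.Error(); err != nil {
		logger.Printf("[ERR] consul: failed to wait for barrier: %v", err)
		return false
	}
	return true
}
```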
Log Fragments or Link to gist
(see goroutine dump attachment)
Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist.
TIP: Use -log-level=TRACE on the client and server to capture the maximum log detail.