Consul fails to restart after data-dir disk capacity reaches 100%
#3207
Labels
theme/internal-cleanup
Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization
Description of the Issue (and unexpected/desired result)
After a machine's disk reaches 100% capacity, Consul fails to restart even after space has been freed.
The cause appears to be a corrupt state file that is saved to disk and then cannot be read on restart.
Error message:
This may be a duplicate of #3030, but I wasn't 100% sure and I didn't want to "me too" the report.
What I expect to happen
I expect Consul to start normally after disk space has been freed up.
Potential solution
Note: I am currently running v0.8.3. I believe this bug to exist in master.
The problem may be with the writeFileAtomic function. If fh.Write fails with a partial write, due to disk space running out, an error is returned but the partial file still exists. On restart of the agent, the loadServices function reads the directory contents and errors on the partially written file (invalid JSON).
Perhaps a modified writeFileAtomic function might look something like:
persistCheckState may also need a similar update, but I'm not 100% sure of the impact this might have.
Workaround
Remove the offending service file and restart Consul agent.
Related but separate issues
Note that I also have an issue with the Consul agent being unable to save its serf snapshot, but this is a separate issue and not directly related (#2236 and hashicorp/serf#428).
consul version for both Client and Server
Client:
Consul v0.8.3
Server:
Consul v0.8.3
consul info for both Client and Server
Client:
Server:
Note: The raft version is currently pinned to version 1. This is intentional and has no effect on this bug report.
Operating system and Environment details
Clients and servers are running Ubuntu 14.04 in AWS.
Reproduction steps
Reliable reproduction is difficult because the disk has to fill up at the moment the service "cache" is written to disk.