Consul fails to restart after `data-dir` disk capacity reaches 100% #3207

agy · 2017-06-29T21:57:35Z

Description of the Issue (and unexpected/desired result)

After a machine's disk reaches 100% capacity, Consul fails to restart even after space has been freed.

The cause appears to be a corrupt (partially written) state file that was saved to disk while it was full and cannot be decoded on restart.

Error message:

==> Error starting agent: failed decoding service file "/var/lib/consul/services/10ebf31202f269c5aaf1a0299b36cba6-a1c08189-2bbc-dfa6-d465-df9a6c086b05.tmp": unexpected end of JSON input

This may be a duplicate of #3030, but I wasn't 100% sure and I didn't want to "me too" the report.

What I expect to happen

I expect Consul to start normally after disk space has been freed up.

Potential solution

Note: I am currently running v0.8.3, but I believe this bug also exists in master.

The problem may be with the writeFileAtomic function.

If fh.Write fails with a partial write because the disk has run out of space, an error is returned but the partially written temp file is left behind. On restart of the agent, the loadServices function reads the directory contents and errors on the partially written file (invalid JSON).
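To illustrate the failure mode described above, here is a minimal sketch showing how a truncated JSON payload fails to decode. The `serviceStub` type and `decodeService` helper are hypothetical stand-ins for illustration only; they are not Consul's actual types or loader.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serviceStub is a hypothetical stand-in for a persisted service definition.
type serviceStub struct {
	ID   string
	Port int
}

// decodeService mimics the load-time step: decode one file's bytes as JSON.
func decodeService(buf []byte) error {
	var svc serviceStub
	return json.Unmarshal(buf, &svc)
}

func main() {
	full := []byte(`{"ID":"web","Port":8080}`)
	// Simulate a write that was cut short when the disk filled up.
	partial := full[:len(full)/2]

	fmt.Println("full:   ", decodeService(full))
	fmt.Println("partial:", decodeService(partial))
	fmt.Println("empty:  ", decodeService(nil))
}
```

Both the truncated and the empty payload fail with `unexpected end of JSON input`, which matches the message in the error above. A zero-byte temp file is therefore just as fatal as a half-written one.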

Perhaps a modified writeFileAtomic function might look something like:

// Caveat lector: This is untested code
func writeFileAtomic(path string, contents []byte) error {
	uuid, err := uuid.GenerateUUID()
	if err != nil {
		return err
	}
	tempPath := fmt.Sprintf("%s-%s.tmp", path, uuid)

	if err := os.MkdirAll(filepath.Dir(path), 0700); err != nil {
		return err
	}
	fh, err := os.OpenFile(tempPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
	if err != nil {
		return err
	}
	// Note to reviewer: multierror could potentially be used to wrap
	// multiple errors, however I didn't want to complicate things and
	// potentially return errors which don't really add anything.
	if _, err := fh.Write(contents); err != nil {
		// We failed to write some or all of the contents.
		// Attempt to clean up after ourselves and return the error.
		fh.Close()
		os.Remove(tempPath)
		return err
	}
	if err := fh.Sync(); err != nil {
		// We failed to sync our writes to disk.
		// Attempt to clean up after ourselves and return the error.
		fh.Close()
		os.Remove(tempPath)
		return err
	}
	if err := fh.Close(); err != nil {
		// Attempt to clean up after ourselves and return an error
		os.Remove(tempPath)
		return err
	}
	return os.Rename(tempPath, path)
}

The persistCheckState function may also want to be similarly updated, but I'm not 100% sure of the impact this might have.

Workaround

Remove the offending service file and restart Consul agent.

Related but separate issues

Note that I also have an issue with the Consul agent being unable to save its serf snapshot, but this is a separate issue and not directly related (#2236 and hashicorp/serf#428).

`consul version` for both Client and Server

Client: Consul v0.8.3
Server: Consul v0.8.3

`consul info` for both Client and Server

Client:

agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	known_servers = 5
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 73
	max_procs = 4
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 2873
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 157
	member_time = 171466
	members = 4318
	query_queue = 0
	query_time = 39

Server:

agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 2
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = xxx.xxx.xxx.xxx:8300
	server = true
raft:
	applied_index = 71679317
	commit_index = 71679317
	fsm_pending = 0
	last_contact = 39.076134ms
	last_log_index = 71679318
	last_log_term = 10483
	last_snapshot_index = 71673263
	last_snapshot_term = 10483
	latest_configuration = [{Suffrage:Voter ID:xxx.xxx.xxx.xxx:8300 Address:xxx.xxx.xxx.xxx:8300} {Suffrage:Voter ID:yyy.yyy.yyy.yyy:8300 Address:yyy.yyy.yyy.yyy:8300} {Suffrage:Voter ID:zzz.zzz.zzz.zzz:8300 Address:zzz.zzz.zzz.zzz:8300} {Suffrage:Voter ID:aaa.aaa.aaa.aaa:8300 Address:aaa.aaa.aaa.aaa:8300} {Suffrage:Voter ID:bbb.bbb.bbb.bbb:8300 Address:bbb.bbb.bbb.bbb:8300}]
	latest_configuration_index = 53970496
	num_peers = 4
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 10483
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 14817
	max_procs = 8
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 2873
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 156
	member_time = 171466
	members = 4317
	query_queue = 0
	query_time = 39
serf_wan:
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 7037
	members = 5
	query_queue = 0
	query_time = 1

Note: The raft version is currently pinned to version 1. This is intentional and has no effect on this bug report.

Operating system and Environment details

Clients and servers are running Ubuntu 14.04 in AWS.

Reproduction steps

Reliable reproduction is difficult, since the disk needs to fill up at the moment the service "cache" is being written to disk.

preetapan self-assigned this Jun 29, 2017

preetapan added the theme/internal-cleanup label Jun 29, 2017

preetapan pushed a commit that referenced this issue Jul 24, 2017

Clean up temporary files on write errors, and ignore any temporary service files on load with a warning. This fixes #3207 (c26fd66)

preetapan mentioned this issue Jul 24, 2017

Clean up temporary files on write errors, and ignore any temporary se… #3318

Merged

preetapan closed this as completed in #3318 Jul 25, 2017
