
Consul fails to restart after data-dir disk capacity reaches 100% #3207

Closed
agy opened this issue Jun 29, 2017 · 0 comments · Fixed by #3318
agy commented Jun 29, 2017

Description of the Issue (and unexpected/desired result)

After a machine's disk reaches 100% capacity, Consul fails to restart even after capacity is made available again.

The issue appears to be caused by a corrupt service state file being saved to disk and then failing to be read on restart.

Error message:

==> Error starting agent: failed decoding service file "/var/lib/consul/services/10ebf31202f269c5aaf1a0299b36cba6-a1c08189-2bbc-dfa6-d465-df9a6c086b05.tmp": unexpected end of JSON input

This may be a duplicate of #3030, but I wasn't 100% sure and didn't want to "me too" that report.

What I expect to happen

I expect Consul to start normally after disk space has been freed up.

Potential solution

Note: I am currently running v0.8.3, but I believe this bug also exists in master.

The problem may be with the writeFileAtomic function.

If fh.Write fails with a partial write because the disk has run out of space, an error is returned but the partial file is left behind. When the agent restarts, the loadServices function reads the directory contents and errors out on the partially written file (invalid JSON).
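For reference, the same decode error can be reproduced with any truncated JSON document; a minimal standalone snippet (my own illustration, not Consul code):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A service definition cut off mid-write, as happens when the disk fills up.
	truncated := []byte(`{"Service": {"ID": "redis", "Service": "redis", "Port": 80`)

	var svc map[string]interface{}
	err := json.Unmarshal(truncated, &svc)
	fmt.Println(err) // unexpected end of JSON input
}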

Perhaps a modified writeFileAtomic function might look something like:

// Caveat lector: This is untested code.
// Assumes the usual imports: "fmt", "os", "path/filepath", and
// "github.com/hashicorp/go-uuid" for uuid.GenerateUUID.
func writeFileAtomic(path string, contents []byte) error {
	uuid, err := uuid.GenerateUUID()
	if err != nil {
		return err
	}
	tempPath := fmt.Sprintf("%s-%s.tmp", path, uuid)

	if err := os.MkdirAll(filepath.Dir(path), 0700); err != nil {
		return err
	}
	fh, err := os.OpenFile(tempPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
	if err != nil {
		return err
	}
	// Note to reviewer: multierror could potentially be used to wrap
	// multiple errors, but I didn't want to complicate things and
	// potentially return errors which don't really add anything.
	if _, err := fh.Write(contents); err != nil {
		// We failed to write some or all of the contents.
		// Attempt to clean up after ourselves and return an error.
		fh.Close()
		os.Remove(tempPath)
		return err
	}
	if err := fh.Sync(); err != nil {
		// Syncing the writes to disk failed, so they may not be durable.
		// Attempt to clean up after ourselves and return an error.
		fh.Close()
		os.Remove(tempPath)
		return err
	}
	if err := fh.Close(); err != nil {
		// Attempt to clean up after ourselves and return an error.
		os.Remove(tempPath)
		return err
	}
	return os.Rename(tempPath, path)
}

The persistCheckState function may also want to be similarly updated, but I'm not 100% sure of the impact this might have.
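An alternative would be to tolerate such files at load time: skip partially written service files with a warning instead of refusing to start (this is the approach the commit referenced at the bottom of this issue eventually took). A rough, untested sketch of that idea, using illustrative names (persistedService, loadPersistedServices) rather than Consul's actual code:

package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
)

// persistedService is a placeholder for the agent's persisted service struct.
type persistedService struct {
	Token   string
	Service json.RawMessage
}

// loadPersistedServices reads every file in svcDir, skipping (with a warning)
// any file that fails to decode, e.g. one truncated by a full disk.
func loadPersistedServices(svcDir string) ([]*persistedService, error) {
	entries, err := ioutil.ReadDir(svcDir)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, err
	}

	var out []*persistedService
	for _, fi := range entries {
		if fi.IsDir() {
			continue
		}
		file := filepath.Join(svcDir, fi.Name())
		buf, err := ioutil.ReadFile(file)
		if err != nil {
			return nil, fmt.Errorf("failed reading service file %q: %s", file, err)
		}

		var p persistedService
		if err := json.Unmarshal(buf, &p); err != nil {
			// Don't abort startup on a corrupt/partial file; warn and move on.
			log.Printf("[WARN] agent: skipping partially written service file %q: %s", file, err)
			continue
		}
		out = append(out, &p)
	}
	return out, nil
}

func main() {
	svcs, err := loadPersistedServices("/var/lib/consul/services")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d service definitions\n", len(svcs))
}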

Workaround

Remove the offending service file and restart the Consul agent.

Related but separate issues

Note that I also have an issue with the Consul agent not being able to save its serf snapshot, but that is a separate issue and not directly related (#2236 and hashicorp/serf#428).

consul version for both Client and Server

Client: Consul v0.8.3
Server: Consul v0.8.3

consul info for both Client and Server

Client:

agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	known_servers = 5
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 73
	max_procs = 4
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 2873
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 157
	member_time = 171466
	members = 4318
	query_queue = 0
	query_time = 39

Server:

agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 2
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = xxx.xxx.xxx.xxx:8300
	server = true
raft:
	applied_index = 71679317
	commit_index = 71679317
	fsm_pending = 0
	last_contact = 39.076134ms
	last_log_index = 71679318
	last_log_term = 10483
	last_snapshot_index = 71673263
	last_snapshot_term = 10483
	latest_configuration = [{Suffrage:Voter ID:xxx.xxx.xxx.xxx:8300 Address:xxx.xxx.xxx.xxx:8300} {Suffrage:Voter ID:yyy.yyy.yyy.yyy:8300 Address:yyy.yyy.yyy.yyy:8300} {Suffrage:Voter ID:zzz.zzz.zzz.zzz:8300 Address:zzz.zzz.zzz.zzz:8300} {Suffrage:Voter ID:aaa.aaa.aaa.aaa:8300 Address:aaa.aaa.aaa.aaa:8300} {Suffrage:Voter ID:bbb.bbb.bbb.bbb:8300 Address:bbb.bbb.bbb.bbb:8300}]
	latest_configuration_index = 53970496
	num_peers = 4
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 10483
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 14817
	max_procs = 8
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 2873
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 156
	member_time = 171466
	members = 4317
	query_queue = 0
	query_time = 39
serf_wan:
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 7037
	members = 5
	query_queue = 0
	query_time = 1

Note: The raft version is currently pinned to version 1. This is intentional and has no effect on this bug report.

Operating system and Environment details

Clients and servers are running Ubuntu 14.04 in AWS.

Reproduction steps

Reproduction is not 100% reliable, since the disk needs to fill up at the moment the service "cache" is written to disk.

@preetapan preetapan self-assigned this Jun 29, 2017
@preetapan preetapan added the theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization label Jun 29, 2017
preetapan pushed a commit that referenced this issue Jul 24, 2017
…rvice files on load with a warning. This fixes #3207