Snapshot open file descriptor permanently unusable after "filesystem full" condition #1744

Closed
jjones-smug opened this issue Feb 20, 2016 · 9 comments
Assignees
Labels
theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization type/bug Feature does not function as expected

Comments

@jjones-smug

We've run into an interesting error scenario where an attempt to write to the local snapshot file fails with a "no space left on device" error. The process seems to hold on to that error status and short-circuits all subsequent write requests indefinitely. So even after the filesystem issue is resolved, the consul process continues to log "no space left on device" errors to SYSLOG for each "tryAppend" call.

Running "strace" on the process shows that no write() system calls are made to the file descriptor open against the snapshot file. Each call just results in a write to SYSLOG, with an error returned to the application.
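
One plausible explanation for the missing write() calls, assuming the snapshot file is written through a Go `bufio.Writer` (an assumption, not something confirmed in this thread): `bufio.Writer` documents that once a write to the underlying writer fails, it accepts no more data and every later write and `Flush` returns the same error. The sketch below reproduces that pattern with a hypothetical `flakyDisk` type standing in for the snapshot file. The names `flakyDisk`, `appendLine`, and `stickyDemo` are illustrative only and are not taken from the Consul or Serf sources.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
)

// flakyDisk is a hypothetical stand-in for the snapshot file. It fails
// while full is true and counts every Write call that reaches it.
type flakyDisk struct {
	full  bool
	calls int
}

var errNoSpace = errors.New("no space left on device")

func (d *flakyDisk) Write(p []byte) (int, error) {
	d.calls++
	if d.full {
		return 0, errNoSpace
	}
	return len(p), nil
}

// appendLine buffers one line and flushes it to the underlying writer.
func appendLine(w *bufio.Writer, line string) error {
	if _, err := w.WriteString(line); err != nil {
		return err
	}
	return w.Flush()
}

// stickyDemo writes once while the disk is full, once after space is
// freed, and once after resetting the buffered writer.
func stickyDemo() (fullErr, staleErr error, staleWrites int, resetErr error) {
	disk := &flakyDisk{full: true}
	w := bufio.NewWriter(disk)

	fullErr = appendLine(w, "alive: node1\n") // fails: disk is full

	disk.full = false // space has been freed
	before := disk.calls
	staleErr = appendLine(w, "alive: node2\n") // still fails: error is sticky
	staleWrites = disk.calls - before          // no Write reached the disk

	w.Reset(disk) // discards buffered data and clears the stored error
	resetErr = appendLine(w, "alive: node3\n")
	return
}

func main() {
	fullErr, staleErr, staleWrites, resetErr := stickyDemo()
	fmt.Println("while full:         ", fullErr)
	fmt.Println("after cleanup:      ", staleErr)
	fmt.Println("underlying writes:  ", staleWrites)
	fmt.Println("after Writer.Reset: ", resetErr)
}
```

If this is what is happening, the error would persist until the buffered writer is reset or recreated, which is consistent with the error clearing only after a process restart.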

@jjones-smug
Author

Bumping this one... Any thoughts?

@slackpad slackpad added the type/bug Feature does not function as expected label Mar 11, 2016
@slackpad
Contributor

@jjones-smug sorry, I haven't had a chance to chase this down yet.

@jjones-smug
Author

Bumping again, since I have a ticket open on this one that I'd like to close out.

@jjones-smug
Author

@kunitake

kunitake commented Aug 3, 2016

Hi,

I'll test it.

I added the following line to /etc/fstab:

tmpfs   /opt/diskfull    tmpfs   defaults,noatime,size=250M 0 0

Then I set consul's data_dir to /opt/diskfull/consul and remounted:

# systemctl stop consul
# mount -a
# systemctl start consul

And...

# dd if=/dev/zero of=/opt/diskfull/dummy.img bs=512 count=9999999999
dd: error writing ‘dummy.img’: No space left on device
511745+0 records in
511744+0 records out
262012928 bytes (262 MB) copied, 0.582111 s, 450 MB/s

It takes some time to use up the reserved area of the local.snapshot file.
After reproducing the issue, I freed the space:

# rm /opt/diskfull/dummy.img

Although space is free again, the error messages keep being output.

I hope the problem will be fixed by the following patch:
-> #2236

best regards,

@SunSparc

SunSparc commented Dec 19, 2016

I have a system that just ran into this problem as well:

consul: 2016/12/19 17:35:21 [ERR] serf: Failed to update snapshot: write /var/lib/consul/serf/local.snapshot: no space left on device

A df -h showed:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        72G   30G   39G  44% /

A day or so ago that same machine did indeed run out of space, but that space has obviously since been cleared up.

We ran systemctl restart consul and consul is happy again.

Consul version 0.7.1

@memelet

memelet commented Mar 27, 2017

Just hit this (still) with 0.7.5, where the agent never recovers after a disk-full condition.

I also see this error in the log:

==> Error starting agent: failed decoding service file "/var/lib/consul/services/00ea4b7117da453e50d3972b86dad0f8-648e8828-9969-2139-bf2a-bca0d3b17d27.tmp": unexpected end of JSON input

And the agent will not start.
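
For context, "unexpected end of JSON input" is the message Go's `encoding/json` package returns when the input is empty or cut off before the JSON value is complete. A `.tmp` file left empty or partially written when the disk filled up would presumably produce it, though the thread does not confirm the file's contents. A minimal sketch, where `decodeService` is a hypothetical helper and not Consul's actual loader:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeService is a hypothetical stand-in for decoding a persisted
// service definition. It only reports whether the bytes parse as JSON.
func decodeService(raw []byte) error {
	var svc map[string]interface{}
	return json.Unmarshal(raw, &svc)
}

func main() {
	// Empty file, e.g. created but never written because the disk was full.
	fmt.Println(decodeService([]byte(``)))
	// Truncated file, e.g. a write that was cut short.
	fmt.Println(decodeService([]byte(`{"ID": "web"`)))
	// Complete file.
	fmt.Println(decodeService([]byte(`{"ID": "web"}`)))
}
```

The first two calls print `unexpected end of JSON input`, and the third prints `<nil>`.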

@slackpad slackpad removed this from the Triaged milestone Apr 18, 2017
@kaskavalci
Contributor

Is there a workaround for this issue?

@slackpad slackpad added the theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization label May 25, 2017
@preetapan preetapan self-assigned this Jun 29, 2017
@preetapan
Contributor

Fixed in #3236

7 participants